This makes it possible to use dm stats with DM multipath.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Add an option to dm statistics to collect and report a histogram of
IO latencies.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Make it possible to use precise timestamps with nanosecond granularity
in dm statistics.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
If the number_of_areas argument was zero the kernel would crash on
div-by-zero. Add better input validation.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # v3.12+
The Stochastic multiqueue (SMQ) policy (vs MQ) offers the promise of
less memory utilization, improved performance and increased adaptability
in the face of changing workloads. SMQ also does not have any
cumbersome tuning knobs.
Users may switch from "mq" to "smq" simply by appropriately reloading a
DM table that is using the cache target. Doing so will cause all of the
mq policy's hints to be dropped. Also, performance of the cache may
degrade slightly until smq recalculates the origin device's hotspots
that should be cached.
In the future the "mq" policy will just silently make use of "smq" and
the mq code will be removed.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
The metadata space map has a simplified 'bootstrap' mode that is
operational when extending the space maps. Whilst in this mode it's
possible for some refcount decrement operations to become queued (eg, as
a result of shadowing one of the bitmap indexes). These decrements were
not being applied when switching out of bootstrap mode.
The effect of this bug was the leaking of a 4k metadata block. This is
detected by the latest version of thin_check as a non fatal error.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Warning like this:
drivers/md/md.c: In function "update_array_info":
drivers/md/md.c:6394:26: warning: logical not is only applied
to the left hand side of comparison [-Wlogical-not-parentheses]
!mddev->persistent != info->not_persistent||
Fix it as Neil Brown said:
mddev->persistent != !info->not_persistent ||
Signed-off-by: Firo Yang <firogm@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
conf->released_stripes list isn't always related to where there are
free stripes pending. Active stripes can be in the list too.
And even free stripes were active very recently.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.de>
I noticed heavy spin lock contention at get_active_stripe() with fsmark
multiple thread write workloads.
Here is how this hot contention comes from. We have limited stripes, and
it's a multiple thread write workload. Hence, those stripes will be taken
soon, which puts later processes to sleep for waiting free stripes. When
enough stripes(>= 1/4 total stripes) are released, all process are woken,
trying to get the lock. But there is one only being able to get this lock
for each hash lock, making other processes spinning out there for acquiring
the lock.
Thus, it's effectiveless to wakeup all processes and let them battle for
a lock that permits one to access only each time. Instead, we could make
it be a exclusive wake up: wake up one process only. That avoids the heavy
spin lock contention naturally.
To do the exclusive wake up, we've to split wait_for_stripe into multiple
wait queues, to make it per hash value, just like the hash lock.
Here are some test results I have got with this patch applied(all test run
3 times):
`fsmark.files_per_sec'
=====================
next-20150317 this patch
------------------------- -------------------------
metric_value ±stddev metric_value ±stddev change testbox/benchmark/testcase-params
------------------------- ------------------------- -------- ------------------------------
25.600 ±0.0 92.700 ±2.5 262.1% ivb44/fsmark/1x-64t-4BRD_12G-RAID5-btrfs-4M-30G-fsyncBeforeClose
25.600 ±0.0 77.800 ±0.6 203.9% ivb44/fsmark/1x-64t-9BRD_6G-RAID5-btrfs-4M-30G-fsyncBeforeClose
32.000 ±0.0 93.800 ±1.7 193.1% ivb44/fsmark/1x-64t-4BRD_12G-RAID5-ext4-4M-30G-fsyncBeforeClose
32.000 ±0.0 81.233 ±1.7 153.9% ivb44/fsmark/1x-64t-9BRD_6G-RAID5-ext4-4M-30G-fsyncBeforeClose
48.800 ±14.5 99.667 ±2.0 104.2% ivb44/fsmark/1x-64t-4BRD_12G-RAID5-xfs-4M-30G-fsyncBeforeClose
6.400 ±0.0 12.800 ±0.0 100.0% ivb44/fsmark/1x-64t-3HDD-RAID5-btrfs-4M-40G-fsyncBeforeClose
63.133 ±8.2 82.800 ±0.7 31.2% ivb44/fsmark/1x-64t-9BRD_6G-RAID5-xfs-4M-30G-fsyncBeforeClose
245.067 ±0.7 306.567 ±7.9 25.1% ivb44/fsmark/1x-64t-4BRD_12G-RAID5-f2fs-4M-30G-fsyncBeforeClose
17.533 ±0.3 21.000 ±0.8 19.8% ivb44/fsmark/1x-1t-3HDD-RAID5-xfs-4M-40G-fsyncBeforeClose
188.167 ±1.9 215.033 ±3.1 14.3% ivb44/fsmark/1x-1t-4BRD_12G-RAID5-btrfs-4M-30G-NoSync
254.500 ±1.8 290.733 ±2.4 14.2% ivb44/fsmark/1x-1t-9BRD_6G-RAID5-btrfs-4M-30G-NoSync
`time.system_time'
=====================
next-20150317 this patch
------------------------- -------------------------
metric_value ±stddev metric_value ±stddev change testbox/benchmark/testcase-params
------------------------- ------------------------- -------- ------------------------------
7235.603 ±1.2 185.163 ±1.9 -97.4% ivb44/fsmark/1x-64t-4BRD_12G-RAID5-btrfs-4M-30G-fsyncBeforeClose
7666.883 ±2.9 202.750 ±1.0 -97.4% ivb44/fsmark/1x-64t-9BRD_6G-RAID5-btrfs-4M-30G-fsyncBeforeClose
14567.893 ±0.7 421.230 ±0.4 -97.1% ivb44/fsmark/1x-64t-3HDD-RAID5-btrfs-4M-40G-fsyncBeforeClose
3697.667 ±14.0 148.190 ±1.7 -96.0% ivb44/fsmark/1x-64t-4BRD_12G-RAID5-xfs-4M-30G-fsyncBeforeClose
5572.867 ±3.8 310.717 ±1.4 -94.4% ivb44/fsmark/1x-64t-9BRD_6G-RAID5-ext4-4M-30G-fsyncBeforeClose
5565.050 ±0.5 313.277 ±1.5 -94.4% ivb44/fsmark/1x-64t-4BRD_12G-RAID5-ext4-4M-30G-fsyncBeforeClose
2420.707 ±17.1 171.043 ±2.7 -92.9% ivb44/fsmark/1x-64t-9BRD_6G-RAID5-xfs-4M-30G-fsyncBeforeClose
3743.300 ±4.6 379.827 ±3.5 -89.9% ivb44/fsmark/1x-64t-3HDD-RAID5-ext4-4M-40G-fsyncBeforeClose
3308.687 ±6.3 363.050 ±2.0 -89.0% ivb44/fsmark/1x-64t-3HDD-RAID5-xfs-4M-40G-fsyncBeforeClose
Where,
1x: where 'x' means iterations or loop, corresponding to the 'L' option of fsmark
1t, 64t: where 't' means thread
4M: means the single file size, corresponding to the '-s' option of fsmark
40G, 30G, 120G: means the total test size
4BRD_12G: BRD is the ramdisk, where '4' means 4 ramdisk, and where '12G' means
the size of one ramdisk. So, it would be 48G in total. And we made a
raid on those ramdisk
As you can see, though there are no much performance gain for hard disk
workload, the system time is dropped heavily, up to 97%. And as expected,
the performance increased a lot, up to 260%, for fast device(ram disk).
v2: use bits instead of array to note down wait queue need to wake up.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
I noticed heavy spin lock contention at get_active_stripe(), introduced
at being wake up stage, where a bunch of processes try to re-hold the
spin lock again.
After giving some thoughts on this issue, I found the lock could be
relieved(and even avoided) if we turn the wait_for_stripe to per
waitqueue for each lock hash and make the wake up exclusive: wake up
one process each time, which avoids the lock contention naturally.
Before go hacking with wait_for_stripe, I found it actually has 2
usages: for the array to enter or leave the quiescent state, and also
to wait for an available stripe in each of the hash lists.
So this patch splits the first usage off into a separate wait_queue,
wait_for_quiescent, and the next patch will turn the second usage into
one waitqueue for each hash value, and make it exclusive, to relieve
the lock contention.
v2: wake_up(wait_for_quiescent) when (active_stripes == 0)
Commit log refactor suggestion from Neil.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Refactor sync_request_write() of md/raid10 to use bio_copy_data()
instead of open coding bio_vec iterations.
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Neil Brown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lin <mlin@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
MD_RECOVERY_DONE is normally cleared by md_check_recovery after a
resync etc finished. However it is possible for raid5_start_reshape
to race and start a reshape before MD_RECOVERY_DONE is cleared. This
can lean to multiple reshapes running at the same time, which isn't
good.
To make sure it is cleared before starting a reshape, and also clear
it when reaping a thread, just to be safe.
Signed-off-by: NeilBrown <neilb@suse.de>
Checking ->sync_thread without holding the mddev_lock()
isn't really safe, even after flushing the workqueue which
ensures md_start_sync() has been run.
While this code is waiting for the lock, md_check_recovery could reap
the thread itself, and then start another thread (e.g. recovery might
finish, then reshape starts). When this thread gets the lock
md_start_sync() hasn't run so it doesn't get reaped, but
MD_RECOVERY_RUNNING gets cleared. This allows two threads to start
which leads to confusion.
So don't both if MD_RECOVERY_RUNNING isn't set, but if it is do
the flush and the test and the reap all under the mddev_lock to
avoid any race with md_check_recovery.
Signed-off-by: NeilBrown <neilb@suse.de>
Fixes: 6791875e2e ("md: make reconfig_mutex optional for writes to md sysfs files.")
Cc: stable@vger.kernel.org (v4.0+)
Returning zero from a 'store' function is bad.
The return value should be either len length of the string
or an error.
So use 'len' if 'err' is zero.
Fixes: 6791875e2e ("md: make reconfig_mutex optional for writes to md sysfs files.")
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@vger.kernel (v4.0+)
In dm_thin_find_block() the ->fail_io flag was checked outside the
metadata device's root_lock, causing dm_thin_find_block() to race with
the setting of this flag.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Use EOPNOTSUPP, rather than EINVAL, error code when user attempts to
send the pool a message. Otherwise usespace is led to believe the
message failed due to invalid argument.
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Previously REQ_DISCARD bios have been split into block sized chunks
before submission to the thin target. There are a couple of issues with
this:
- If the block size is small, a large discard request can
get broken up into a great many bios which is both slow and causes
a lot of memory pressure.
- The thin pool block size and the discard granularity for the
underlying data device need to be compatible if we want to passdown
the discard.
This patch relaxes the block size granularity for thin devices. It
makes use of the recent range locking added to the bio_prison to
quiesce a whole range of thin blocks before unmapping them. Once a
thin range has been unmapped the discard can then be passed down to
the data device for those sub ranges where the data blocks are no
longer used (ie. they weren't shared in the first place).
This patch also doesn't make any apologies about open-coding portions
of block core as a means to supporting async discard completions in the
near-term -- if/when late bio splitting lands it'll all get cleaned up.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Retrieve the next run of contiguously mapped blocks. Useful for working
out where to break up IO.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The policy tick() method is normally called from interrupt context.
Both the mq and smq policies do some bottom half work for the tick
method in their map functions. However if no IO is going through the
cache, then that bottom half work doesn't occur. With these policies
this means recently hit entries do not age and do not get written
back as early as we'd like.
Fix this by introducing a new 'can_block' parameter to the tick()
method. When this is set the bottom half work occurs immediately.
'can_block' is set when the tick method is called every second by the
core target (not in interrupt context).
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
If a cache metadata operation fails (e.g. transaction commit) the
cache's metadata device will abort the current transaction, set a new
needs_check flag, and the cache will transition to "read-only" mode. If
aborting the transaction or setting the needs_check flag fails the cache
will transition to "fail-io" mode.
Once needs_check is set the cache device will not be allowed to
activate. Activation requires write access to metadata. Future work is
needed to add proper support for running the cache in read-only mode.
Once in fail-io mode the cache will report a status of "Fail".
Also, add commit() wrapper that will disallow commits if in read_only or
fail mode.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
When the cache is idle, writeback work was only being issued every
second. With this change outstanding writebacks are streamed
constantly. This offers a writeback performance improvement.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The stochastic-multi-queue (smq) policy addresses some of the problems
with the current multiqueue (mq) policy.
Memory usage
------------
The mq policy uses a lot of memory; 88 bytes per cache block on a 64
bit machine.
SMQ uses 28bit indexes to implement it's data structures rather than
pointers. It avoids storing an explicit hit count for each block. It
has a 'hotspot' queue rather than a pre cache which uses a quarter of
the entries (each hotspot block covers a larger area than a single
cache block).
All these mean smq uses ~25bytes per cache block. Still a lot of
memory, but a substantial improvement nontheless.
Level balancing
---------------
MQ places entries in different levels of the multiqueue structures
based on their hit count (~ln(hit count)). This means the bottom
levels generally have the most entries, and the top ones have very
few. Having unbalanced levels like this reduces the efficacy of the
multiqueue.
SMQ does not maintain a hit count, instead it swaps hit entries with
the least recently used entry from the level above. The over all
ordering being a side effect of this stochastic process. With this
scheme we can decide how many entries occupy each multiqueue level,
resulting in better promotion/demotion decisions.
Adaptability
------------
The MQ policy maintains a hit count for each cache block. For a
different block to get promoted to the cache it's hit count has to
exceed the lowest currently in the cache. This means it can take a
long time for the cache to adapt between varying IO patterns.
Periodically degrading the hit counts could help with this, but I
haven't found a nice general solution.
SMQ doesn't maintain hit counts, so a lot of this problem just goes
away. In addition it tracks performance of the hotspot queue, which
is used to decide which blocks to promote. If the hotspot queue is
performing badly then it starts moving entries more quickly between
levels. This lets it adapt to new IO patterns very quickly.
Performance
-----------
In my tests SMQ shows substantially better performance than MQ. Once
this matures a bit more I'm sure it'll become the default policy.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
With the planned cgroup writeback support, backing-dev related
declarations will be more widely used across block and cgroup;
unfortunately, including backing-dev.h from include/linux/blkdev.h
makes cyclic include dependency quite likely.
This patch separates out backing-dev-defs.h which only has the
essential definitions and updates blkdev.h to include it. c files
which need access to more backing-dev details now include
backing-dev.h directly. This takes backing-dev.h off the common
include dependency chain making it a lot easier to use it across block
and cgroup.
v2: fs/fat build failure fixed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
and the role of the separation is unclear. For cgroup support for
writeback IOs, a bdi will be updated to host multiple wb's where each
wb serves writeback IOs of a different cgroup on the bdi. To achieve
that, a wb should carry all states necessary for servicing writeback
IOs for a cgroup independently.
This patch moves bdi->state into wb.
* enum bdi_state is renamed to wb_state and the prefix of all enums is
changed from BDI_ to WB_.
* Explicit zeroing of bdi->state is removed without adding zeoring of
wb->state as the whole data structure is zeroed on init anyway.
* As there's still only one bdi_writeback per backing_dev_info, all
uses of bdi->state are mechanically replaced with bdi->wb.state
introducing no behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: drbd-dev@lists.linbit.com
Cc: Neil Brown <neilb@suse.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
testing from Junichi Nomura and Bart Van Assche.
Also fix a casting bug in dm_merge_bvec() that could cause only a single
page to be added to a bio (Joe identified this while testing dm-cache
writeback).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJVaLGPAAoJEMUj8QotnQNa1pcH/3TumkM0rvl1DK7DfWvjgfpp
+h+whWn79lOUzan/AszRvRU4Pwr7symndnJzJ/tqQ76fm8x84obvKa0jFfB9rmlM
sgw45alqB+AKcKAvocsuDIdzy2/K8FHBihccPWhe87AtJRlj57nTOJWO06gzExOp
/AJ/0Vgv3pxUCL2M4XucpQOV8eUg+gpwjDMxuMEsPVzBVYJfuRvNKoERtd9LSCSV
Y5iqdHCJiW0SWesZNrfthAAlReFzqb1c3mq55G0q0C+Z77Jf3TIPBhVfouM4NOKt
wjQoPavzyAEiKfLFy6QFCjhSOXKCI3klqO7zcyzyygVue5UTHICYRJKn7TFtbfU=
=xGAV
-----END PGP SIGNATURE-----
Merge tag 'dm-4.1-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device-mapper fixes from Mike Snitzer:
"Quite a few fixes for DM's blk-mq support thanks to extra DM multipath
testing from Junichi Nomura and Bart Van Assche.
Also fix a casting bug in dm_merge_bvec() that could cause only a
single page to be added to a bio (Joe identified this while testing
dm-cache writeback)"
* tag 'dm-4.1-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm: fix casting bug in dm_merge_bvec()
dm: fix reload failure of 0 path multipath mapping on blk-mq devices
dm: fix false warning in free_rq_clone() for unmapped requests
dm: requeue from blk-mq dm_mq_queue_rq() using BLK_MQ_RQ_QUEUE_BUSY
dm mpath: fix leak of dm_mpath_io structure in blk-mq .queue_rq error path
dm: fix NULL pointer when clone_and_map_rq returns !DM_MAPIO_REMAPPED
dm: run queue on re-queue
When considering whether to move a block to the cache we already give
preferential treatment to discarded blocks, since they are cheap to
promote (no read of the origin required since the data is junk).
The same is true of blocks that are about to be completely
overwritten, so we likewise boost their promotion chances.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Currently individual bios are deferred to the worker thread if they
cannot be processed immediately (eg, a block is in the process of
being moved to the fast device).
This patch passes whole cells across to the worker. This saves
reaquiring the cell, and also collects bios destined for the same block
together, which allows them to be mapped with a single look up to the
policy. This reduces the overhead of using dm-cache.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Rather than always releasing the prisoners in a cell, the client may
want to promote one of them to be the new holder. There is a race here
though between releasing an empty cell, and other threads adding new
inmates. So this function makes the decision with its lock held.
This function can have two outcomes:
i) An inmate is promoted to be the holder of the cell (return value of 0).
ii) The cell has no inmate for promotion and is released (return value of 1).
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
We only allow non critical writeback if the origin is idle. It is up
to the policy to decide what writeback work is critical.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
A little class that keeps track of the volume of io that is in flight,
and the length of time that a device has been idle for.
FIXME: rather than jiffes, may be best to use ktime_t (to support faster
devices).
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
There is a race between a policy deciding to replace a cache entry,
the core target writing back any dirty data from this block, and other
IO threads doing IO to the same block.
This sort of problem is avoided most of the time by the core target
grabbing a bio prison cell before making the request to the policy.
But for a demotion the core target doesn't know which block will be
demoted, so can't do this in advance.
Fix this demotion race by introducing a callback to the policy interface
that allows the policy to grab the cell on behalf of the core target.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
A crypto driver can process requests synchronously or asynchronously
and can use an internal driver queue to backlog requests.
Add some comments to clarify internal logic and completion return codes.
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Currently if there is a leg failure, the bio will be put into the hold
list until userspace does a remove/replace on the leg. Doing so in a
cluster config (clvmd) is problematic because there may be a temporary
path failure that results in cluster raid1 remove/replace. Such
recovery takes a long time due to a full resync.
Update dm-raid1 to optionally ignore these failures so bios continue
being issued without interrupton. To enable this feature userspace
must pass "keep_log" when creating the dm-raid1 device.
Signed-off-by: Lidong Zhong <lzhong@suse.com>
Tested-by: Liuhua Wang <lwang@suse.com>
Acked-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
On 32-bit:
drivers/md/dm-log-writes.c: In function ‘log_super’:
drivers/md/dm-log-writes.c:323: warning: integer constant is too large for ‘long’ type
Add a ULL suffix to WRITE_LOG_MAGIC to fix this.
Also add a ULL suffix to WRITE_LOG_VERSION as it's stored in a __le64
field.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Add dm-raid access to the MD RAID0 personality to enable single zone
striping.
The following changes enable that access:
- add type definition to raid_types array
- make bitmap creation conditonal in super_validate(), because
bitmaps are not allowed in raid0
- set rdev->sectors to the data image size in super_validate()
to allow the raid0 personality to calculate the MD array
size properly
- use mdddev(un)lock() functions instead of direct mutex_(un)lock()
(wrapped in here because it's a trivial change)
- enhance raid_status() to always report full sync for raid0
so that userspace checks for 100% sync will succeed and allow
for resize (and takeover/reshape once added in future paches)
- enhance raid_resume() to not load bitmap in case of raid0
- add merge function to avoid data corruption (seen with readahead)
that resulted from bio payloads that grew too large. This problem
did not occur with the other raid levels because it either did not
apply without striping (raid1) or was avoided via stripe caching.
- raise version to 1.7.0 because of the raid0 API change
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Reviewed-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
- ensure maximum device limit in superblock
- rename DMPF_* (print flags) to CTR_FLAG_* (constructor flags)
and their respective struct raid_set member
- use strcasecmp() in raid10_format_to_md_layout() as in the constructor
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Reviewed-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Remove comment above parse_raid_params() that claims
"devices_handle_discard_safely" is a table line argument when it is
actually is a module parameter.
Also, backfill dm-raid target version 1.6.0 documentation.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Reviewed-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Leverage the block manager's read_only flag instead of duplicating it;
access with new dm_bm_is_read_only() method.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The overwrite has only ever about optimizing away the need to zero a
block if the entire block was being overwritten. As such it is only
relevant when zeroing is enabled.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Introduce a single common method for cleaning up a DM device's
mapped_device. No functional change, just eliminates duplication of
delicate mapped_device cleanup code.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
More often than not a request that is requeued _is_ mapped (meaning the
clone request is allocated and clone->q is initialized). Rename
dm_requeue_unmapped_original_request() to avoid potential confusion due
to function name containing "unmapped".
Also, remove dm_requeue_unmapped_request() since callers can easily call
the dm_requeue_original_request() directly.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Do not allocate the io_pool mempool for blk-mq request-based DM
(DM_TYPE_MQ_REQUEST_BASED) in dm_alloc_rq_mempools().
Also refine __bind_mempools() to have more precise awareness of which
mempools each type of DM device uses -- avoids mempool churn when
reloading DM tables (particularly for DM_TYPE_REQUEST_BASED).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
dm_merge_bvec() was originally added in f6fccb ("dm: introduce
merge_bvec_fn"). In that commit a value in sectors is converted to
bytes using << 9, and then assigned to an int. This code made
assumptions about the value of BIO_MAX_SECTORS.
A later commit 148e51 ("dm: improve documentation and code clarity in
dm_merge_bvec") was meant to have no functional change but it removed
the use of BIO_MAX_SECTORS in favor of using queue_max_sectors(). At
this point the cast from sector_t to int resulted in a zero value. The
fallout being dm_merge_bvec() would only allow a single page to be added
to a bio.
This interim fix is minimal for the benefit of stable@ because the more
comprehensive cleanup of passing a sector_t to all DM targets' merge
function will impact quite a few DM targets.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.19+
dm-multipath accepts 0 path mapping.
# echo '0 2097152 multipath 0 0 0 0' | dmsetup create newdev
Such a mapping can be used to release underlying devices while still
holding requests in its queue until working paths come back.
However, once the multipath device is created over blk-mq devices,
it rejects reloading of 0 path mapping:
# echo '0 2097152 multipath 0 0 1 1 queue-length 0 1 1 /dev/sda 1' \
| dmsetup create mpath1
# echo '0 2097152 multipath 0 0 0 0' | dmsetup load mpath1
device-mapper: reload ioctl on mpath1 failed: Invalid argument
Command failed
With following kernel message:
device-mapper: ioctl: can't change device type after initial table load.
DM tries to inherit the current table type using dm_table_set_type()
but it doesn't work as expected because of unnecessary check about
whether the target type is hybrid or not.
Hybrid type is for targets that work as either request-based or bio-based
and not required for blk-mq or non blk-mq checking.
Fixes: 65803c2059 ("dm table: train hybrid target type detection to select blk-mq if appropriate")
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Unfortunately this functionality was merged a little prematurely.
The necessary testing and code review is now complete (or as
complete as it can be) and to code passes a variety of tests
and looks quite sensible.
Also a fix for some recent locking changes - a race was introduced
which causes a reshape request to sometimes fail. No data safety issues.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIVAwUAVWf/sDnsnt1WYoG5AQJ8hQ/+KIGUijacXXUXBE4QuO1DMTkltV61bk6E
TJQ6fTuMvXeOuyGm+BoSFTJrOJiP6/PVxl4jnAkjLlvAK/JVKekG0PXv2flmD9EJ
udK/g8d2k+4L2O0uiGdGSfOQaEaQ4OQvNmQOP9GF/FXNdyYfZbSJnxG+kzWnStGZ
3LNEMoDok9TiUDVSJ3PgibnUHYr3zNJFjBGszfRW0HqXBRWM5TI6HQ0bWwrm61mQ
sIOvFeS7CVOBQWW7zkY3uvz/g7dpuPlXqmDOomF+prKlU320SrpSDDBD2Qg56rXh
8YGAzLPV8R6xB5hjGFnoHtvxF/f5Fntb3WbC5az0zv+q/phDYA9Nd2UN5APemyGB
PJuxW4Ojq2DWIAvmf0HQEkvjJlqeugCgIQXJJ8yvIaBXJJjit1jMSEXjolM4vlLh
h6Su/hwoyTi9NxdYpFeR6JuyHjzTyrjyBkbW8y12wVQjmDncBdKtieYZX4TvPxVz
n7Qrk2bpFhR/icP6eYWCvt6iwU1e+5lXNb/18AYm9bJe5BE5/N1X0azrxbdZT4cl
1DvQw2HAMBGp+nSr+R1lqO4yX+busBZUTYsaGvH4T7Ubs+UjwgTE3tPoevj6w829
0/7r/UPfSn0XFbd5rrPY+bOBsAOIMDG5g3mj7K7+38sVeX9VOVN4sGftS5dWTr9e
RQBTZAK0+qI=
=Y0Vm
-----END PGP SIGNATURE-----
Merge tag 'md/4.1-rc5-fixes' of git://neil.brown.name/md
Pull m,ore md bugfixes gfrom Neil Brown:
"Assorted fixes for new RAID5 stripe-batching functionality.
Unfortunately this functionality was merged a little prematurely. The
necessary testing and code review is now complete (or as complete as
it can be) and to code passes a variety of tests and looks quite
sensible.
Also a fix for some recent locking changes - a race was introduced
which causes a reshape request to sometimes fail. No data safety
issues"
* tag 'md/4.1-rc5-fixes' of git://neil.brown.name/md:
md: fix race when unfreezing sync_action
md/raid5: break stripe-batches when the array has failed.
md/raid5: call break_stripe_batch_list from handle_stripe_clean_event
md/raid5: be more selective about distributing flags across batch.
md/raid5: add handle_flags arg to break_stripe_batch_list.
md/raid5: duplicate some more handle_stripe_clean_event code in break_stripe_batch_list
md/raid5: remove condition test from check_break_stripe_batch_list.
md/raid5: Ensure a batch member is not handled prematurely.
md/raid5: close race between STRIPE_BIT_DELAY and batching.
md/raid5: ensure whole batch is delayed for all required bitmap updates.
When stacking request-based dm device on non blk-mq device and
device-mapper target could not map the request (error target is used,
multipath target with all paths down, etc), the WARN_ON_ONCE() in
free_rq_clone() will trigger when it shouldn't.
The warning was added by commit aa6df8d ("dm: fix free_rq_clone() NULL
pointer when requeueing unmapped request"). But free_rq_clone() with
clone->q == NULL is valid usage for the case where
dm_kill_unmapped_request() initiates request cleanup.
Fix this false warning by just removing the WARN_ON -- it only generated
false positives and was never useful in catching the intended case
(completing clone request not being mapped e.g. clone->q being NULL).
Fixes: aa6df8d ("dm: fix free_rq_clone() NULL pointer when requeueing unmapped request")
Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reported-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
A recent change removed the need for locking around writing
to "sync_action" (and various other places), but introduced a
subtle race.
When e.g. setting 'reshape' on a 'frozen' array, the 'frozen'
flag is cleared before 'reshape' is set, so the md thread can
get in and start trying recovery - which isn't wanted.
So instead of clearing MD_RECOVERY_FROZEN for any command
except 'frozen', only clear it when each specific command
is parsed. This allows the handling of 'reshape' to clear
the bit while a lock is held.
Also remove some places where we set MD_RECOVERY_NEEDED,
as it is always set on non-error exit of the function.
Signed-off-by: NeilBrown <neilb@suse.de>
Fixes: 6791875e2e ("md: make reconfig_mutex optional for writes to md sysfs files.")
Now that the code in break_stripe_batch_list() is nearly identical
to the end of handle_stripe_clean_event, replace the later
with a function call.
The only remaining difference of any interest is the masking that is
applieds to dev[i].flags copied from head_sh.
R5_WriteError certainly isn't wanted as it is set per-stripe, not
per-patch. R5_Overlap isn't wanted as it is explicitly handled.
Signed-off-by: NeilBrown <neilb@suse.de>
When a batch of stripes is broken up, we keep some of the flags
that were per-stripe, and copy other flags from the head to all
others.
This only happens while a stripe is being handled, so many of the
flags are irrelevant.
The "SYNC_FLAGS" (which I've renamed to make it clear there are
several) and STRIPE_DEGRADED are set per-stripe and so need to be
preserved. STRIPE_INSYNC is the only flag that is set on the head
that needs to be propagated to all others.
For safety, add a WARN_ON if others are set, except:
STRIPE_HANDLE - this is safe and per-stripe and we are going to set
in several cases anyway
STRIPE_INSYNC
STRIPE_IO_STARTED - this is just a hint and doesn't hurt.
STRIPE_ON_PLUG_LIST
STRIPE_ON_RELEASE_LIST - It is a point pointless for a batched
stripe to be on one of these lists, but it can happen
as can be safely ignored.
Signed-off-by: NeilBrown <neilb@suse.de>
When we break a stripe_batch_list we sometimes want to set
STRIPE_HANDLE on the individual stripes, and sometimes not.
So pass a 'handle_flags' arg. If it is zero, always set STRIPE_HANDLE
(on non-head stripes). If not zero, only set it if any of the given
flags are present.
Signed-off-by: NeilBrown <neilb@suse.de>
break_stripe_batch list didn't clear head_sh->batch_head.
This was probably a bug.
Also clear all R5_Overlap flags and if any were cleared, wake up
'wait_for_overlap'.
This isn't always necessary but the worst effect is a little
extra checking for code that is waiting on wait_for_overlap.
Also, don't use wake_up_nr() because that does the wrong thing
if 'nr' is zero, and it number of flags cleared doesn't
strongly correlate with the number of threads to wake.
Signed-off-by: NeilBrown <neilb@suse.de>
handle_stripe_clean_event() contains a chunk of code very
similar to check_break_stripe_batch_list().
If we make the latter more like the former, we can end up
with just one copy of this code.
This first step removed the condition (and the 'check_') part
of the name. This has the added advantage of making it clear
what check is being performed at the point where the function is
called.
Signed-off-by: NeilBrown <neilb@suse.de>
If a stripe is a member of a batch, but not the head, it must
not be handled separately from the rest of the batch.
'clear_batch_ready()' handles this requirement to some
extent but not completely. If a member is passed to handle_stripe()
a second time it returns '0' indicating the stripe can be handled,
which is wrong.
So add an extra test.
Signed-off-by: NeilBrown <neilb@suse.de>
When we add a write to a stripe we need to make sure the bitmap
bit is set. While doing that the stripe is not locked so it could
be added to a batch after which further changes to STRIPE_BIT_DELAY
and ->bm_seq are ineffective.
So we need to hold off adding to a stripe until bitmap_startwrite has
completed at least once, and we need to avoid further changes to
STRIPE_BIT_DELAY once the stripe has been added to a batch.
If a bitmap_startwrite() completes after the stripe was added to a
batch, it will not have set the bit, only incremented a counter, so no
extra delay of the stripe is needed.
Reported-by: Shaohua Li <shli@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
When we add a stripe to a batch, we need to be sure that
head stripe will wait for the bitmap update required for the new
stripe.
Signed-off-by: NeilBrown <neilb@suse.de>
Use BLK_MQ_RQ_QUEUE_BUSY to requeue a blk-mq request directly from the
DM blk-mq device's .queue_rq. This cleans up the previous convoluted
handling of request requeueing that would return BLK_MQ_RQ_QUEUE_OK
(even though it wasn't) and then run blk_mq_requeue_request() followed
by blk_mq_kick_requeue_list().
Also, document that DM blk-mq ontop of old request_fn devices cannot
fail in clone_rq() since the clone request is preallocated as part of
the pdu.
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
When stacking request-based DM on blk_mq device, request cloning and
remapping are done in a single call to target's clone_and_map_rq().
The clone is allocated and valid only if clone_and_map_rq() returns
DM_MAPIO_REMAPPED.
The "IS_ERR(clone)" check in map_request() does not cover all the
!DM_MAPIO_REMAPPED cases that are possible (E.g. if underlying devices
are not ready or unavailable, clone_and_map_rq() may return
DM_MAPIO_REQUEUE without ever having established an ERR_PTR). Fix this
by explicitly checking for a return that is not DM_MAPIO_REMAPPED in
map_request().
Without this fix, DM core may call setup_clone() for a NULL clone
and oops like this:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000068
IP: [<ffffffff81227525>] blk_rq_prep_clone+0x7d/0x137
...
CPU: 2 PID: 5793 Comm: kdmwork-253:3 Not tainted 4.0.0-nm #1
...
Call Trace:
[<ffffffffa01d1c09>] map_tio_request+0xa9/0x258 [dm_mod]
[<ffffffff81071de9>] kthread_worker_fn+0xfd/0x150
[<ffffffff81071cec>] ? kthread_parkme+0x24/0x24
[<ffffffff81071cec>] ? kthread_parkme+0x24/0x24
[<ffffffff81071fdd>] kthread+0xe6/0xee
[<ffffffff81093a59>] ? put_lock_stats+0xe/0x20
[<ffffffff81071ef7>] ? __init_kthread_worker+0x5b/0x5b
[<ffffffff814c2d98>] ret_from_fork+0x58/0x90
[<ffffffff81071ef7>] ? __init_kthread_worker+0x5b/0x5b
Fixes: e5863d9ad ("dm: allocate requests in target when stacking on blk-mq devices")
Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 4.0+
Without kicking queue, requeued request may stay forever in
the queue if there are no other I/O activities to the device.
The original error had been in v2.6.39 with commit 7eaceaccab
("block: remove per-queue plugging"), which replaced conditional
plugging by periodic runqueue.
Commit 9d1deb83d4 in v4.1-rc1 removed the periodic runqueue
and the problem started to manifest.
Fixes: 9d1deb83d4 ("dm: don't schedule delayed run of the queue if nothing to do")
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
- one serious RAID0 data corruption - caused by recent bugfix that wasn't
reviewed properly.
- one raid5 fix in new code (a couple more of those to come).
- one little fix to stop static analysis complaining about silly rcu
annotation.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIVAwUAVV6rmDnsnt1WYoG5AQICpQ//VleOWnZERlVQN1nYzGHqIyPUz+dV6x8W
RfzccMzUWqaHpZY/BaTMqJxPWgt6lxqqtN5F3u+XMdM+L2Pl0NXoRTqB+FzpLucR
d0p76FwbW2IWnycxXUZtGjm8xT942KPdqXIP+LQRKaIhguV9P9th9WbPRvlXFfB6
pO1h1AdxVlsfn6qgBpe49pYoUuNmpwSTTJ8Gvb7YRsHXnmB/lsjQvVsWLiMqPwug
+bR0gM07OvTfYenrY82ztu42+xAoVoeu/XhHmPPLEV3S/SA/9kVLgBcWjq2bBjw0
REzR2i+exEtCRe2yvetDX8320IKZcZSOE8BwBB2iUm8OVjplPxw83Bk/kB2+Zr8n
VxAa9/vCZPLNCWyzOSi2V471NeFBys9PVkVIroHT5nQZGadkw0JYt1cEbcHSBKS/
NYmJCu1IM0o9DPzn8jW7sXjDcB0WE8VwFMJ09QSiEYQu2bAb7j2f1pcoweN2MeNF
qOitGj/luygXg2TBvema9qPL533DUj+eSYyyO7OQBsfkfZSDM5di4bGoZBVw118Q
kARaJ3IZffT8gsVpgSTCsviDv5QUcs0izVS1sMKB6/SDJjtLXBMERW+LkrmjBsnB
6+CMmwNZUIzugzBgSPBVozdNQQYjAXL7LNV6Q9YQhE0RK0+10fLLxmpovJR62bRf
PWlg23+n4+E=
=jXKM
-----END PGP SIGNATURE-----
Merge tag 'md/4.1-rc4-fixes' of git://neil.brown.name/md
Pull md bugfixes from Neil Brown:
"I have a few more raid5 bugfixes pending, but I want them to get a bit
more review first. In the meantime:
- one serious RAID0 data corruption - caused by recent bugfix that
wasn't reviewed properly.
- one raid5 fix in new code (a couple more of those to come).
- one little fix to stop static analysis complaining about silly rcu
annotation"
* tag 'md/4.1-rc4-fixes' of git://neil.brown.name/md:
md/bitmap: remove rcu annotation from pointer arithmetic.
md/raid0: fix restore to sector variable in raid0_make_request
raid5: fix broken async operation chain
Currently dm-multipath has to clone the bios for every request sent
to the lower devices, which wastes cpu cycles and ties down memory.
This patch instead adds a new REQ_CLONE flag that instructs req_bio_endio
to not complete bios attached to a request, which we set on clone
requests similar to bios in a flush sequence. With this change I/O
errors on a path failure only get propagated to dm-multipath, which
can then either resubmit the I/O or complete the bios on the original
request.
I've done some basic testing of this on a Linux target with ALUA support,
and it survives path failures during I/O nicely.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for
non-chains") regressed all existing callers that followed this pattern:
1) saving a bio's original bi_end_io
2) wiring up an intermediate bi_end_io
3) restoring the original bi_end_io from intermediate bi_end_io
4) calling bio_endio() to execute the restored original bi_end_io
The regression was due to BIO_CHAIN only ever getting set if
bio_inc_remaining() is called. For the above pattern it isn't set until
step 3 above (step 2 would've needed to establish BIO_CHAIN). As such
the first bio_endio(), in step 2 above, never decremented __bi_remaining
before calling the intermediate bi_end_io -- leaving __bi_remaining with
the value 1 instead of 0. When bio_inc_remaining() occurred during step
3 it brought it to a value of 2. When the second bio_endio() was
called, in step 4 above, it should've called the original bi_end_io but
it didn't because there was an extra reference that wasn't dropped (due
to atomic operations being optimized away since BIO_CHAIN wasn't set
upfront).
Fix this issue by removing the __bi_remaining management complexity for
all callers that use the above pattern -- bio_chain() is the only
interface that _needs_ to be concerned with __bi_remaining. For the
above pattern callers just expect the bi_end_io they set to get called!
Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls
that aren't associated with the bio_chain() interface.
Also, the bio_inc_remaining() interface has been moved local to bio.c.
Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Evaluating "&mddev->disks" is simple pointer arithmetic, so
it does not need 'rcu' annotations - no dereferencing is happening.
Also enhance the comment to explain that 'rdev' in that case
is not actually a pointer to an rdev.
Reported-by: Patrick Marlier <patrick.marlier@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
The variable "sector" in "raid0_make_request()" was improperly updated
by a call to "sector_div()" which modifies its first argument in place.
Commit 47d68979cc restored this variable
after the call for later re-use. Unfortunetly the restore was done after
the referenced variable "bio" was advanced. This lead to the original
value and the restored value being different. Here we move this line to
the proper place.
One observed side effect of this bug was discarding a file though
unlinking would cause an unrelated file's contents to be discarded.
Signed-off-by: NeilBrown <neilb@suse.de>
Fixes: 47d68979cc ("md/raid0: fix bug with chunksize not a power of 2.")
Cc: stable@vger.kernel.org (any that received above backport)
URL: https://bugzilla.kernel.org/show_bug.cgi?id=98501
ops_run_reconstruct6() doesn't correctly chain asyn operations. The tx returned
by async_gen_syndrome should be added as the dependent tx of next stripe.
The issue is introduced by commit 59fc630b8b
RAID5: batch adjacent full stripe write
Reported-and-tested-by: Maxime Ripard <maxime.ripard@free-electrons.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Most of these are related to the new "batched stripe writeout",
but there are a few others.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIVAwUAVVAgfznsnt1WYoG5AQKNsBAAw+yXVnRlLfzTPOhiIolvdmdmck6ghpV7
eTHTbDEVkDo+a+caU01mhkHtL2wEuxGqr4XHTASIM9Xjn2YLhH14HOtBPl6+IddR
F58X+SN8/pK4tzc5P42AAbfGfqeJSBPSl49L94+Pd9BsagNYZOUx7c/WK5lKay6O
ZQRvI3Bafg7VTzHCw931sN11yY/PS+45zpOentk5TlavRPcZfuVTh1QzSyAUvbtN
sAFK2+ISflj8+TW0rLo44c3kkC7ARFKcq5lJt4WsQUntJTjiIROvqvaEP+IUZkGr
y+yvCJbskJ6B03XrcuvdkFrp3Qe0yWtHCN6qKhc8AI/c/pcuz24xTYQ6V6LDyAy8
N2z6m1Kri97qxZpOONNh31WOCmVyFmwiz7+54soEeTDg3xgTb60UiZe8T+9ZcxoY
sN8ksVSGGapgQUUGYcidbZla6uSWNjZq9hQgDJFUwG3HhvT3E/qiUT9kqAxfPsLE
Vxw/KQ5GkW+7ldaPmGuv29hc3THjsiLY5Bx0kBlmgao3fcT61WbNcZPJOQfMxlT7
6qoJSiNoq612TaHj3X06FXhMGE/fwjwjOO+PpULxM47giHcKrVR/qlG4kREYa2vj
UgV05rP/S+k3Mce4dHzvUuE57t0ADff0d/mZvhfzERjNdwTMdWgia89XLg7D06rk
UGB/o2IwWug=
=tapF
-----END PGP SIGNATURE-----
Merge tag 'md/4.1-rc3-fixes' of git://neil.brown.name/md
Pull md bugfixes from Neil Brown:
"A few fixes for md.
Most of these are related to the new "batched stripe writeout", but
there are a few others"
* tag 'md/4.1-rc3-fixes' of git://neil.brown.name/md:
md/raid5: fix handling of degraded stripes in batches.
md/raid5: fix allocation of 'scribble' array.
md/raid5: don't record new size if resize_stripes fails.
md/raid5: avoid reading parity blocks for full-stripe write to degraded array
md/raid5: more incorrect BUG_ON in handle_stripe_fill.
md/raid5: new alloc_stripe() to allocate an initialize a stripe.
md-raid0: conditional mddev->queue access to suit dm-raid
window: The first reverts a dm-crypt change that wasn't correct. The
second fixes a device format regression that impacted userspace.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJVTRZOAAoJEMUj8QotnQNaW7EIAJlHn+S8czm1Cb4gWBn7kg+X
vzH5NIzr/SpDX8o3R8NBdrB8rgqTm4jQZrptbmgLG+j9XoaQupuFyNCiaAw47v2G
P/WYlodNwTkb3I48XjwCRo00MtR3cEJ8ywNzvEJUvgPkgMMIzhieHsVT9L8bZv3n
XDs8JzZyF966U0BeCjF4oDAazUrpEvWf0h4C5L47g8C0UQI7aGwYKoSvZm3DAImP
awbJbnqtQuoRcI0HISHrjYi1vghgnmJY6aSx3tYSJPTNRkFNqgap7eZrUacicnOH
bUVL3snBVebK3JMJhJXgfGW/FeeP9juhEY08JNTOZ5wa6BNuru0GHeqKuI3arHY=
=jlAN
-----END PGP SIGNATURE-----
Merge tag 'dm-4.1-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
"Two additional fixes for changes introduced via DM during the 4.1
merge window.
The first reverts a dm-crypt change that wasn't correct. The second
fixes a device format regression that impacted userspace"
* tag 'dm-4.1-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
init: fix regression by supporting devices with major:minor:offset format
Revert "dm crypt: fix deadlock when async crypto algorithm returns -EBUSY"
Pull block fixes from Jens Axboe:
"A collection of fixes since the merge window;
- fix for a double elevator module release, from Chao Yu. Ancient bug.
- the splice() MORE flag fix from Christophe Leroy.
- a fix for NVMe, fixing a patch that went in in the merge window.
From Keith.
- two fixes for blk-mq CPU hotplug handling, from Ming Lei.
- bdi vs blockdev lifetime fix from Neil Brown, fixing and oops in md.
- two blk-mq fixes from Shaohua, fixing a race on queue stop and a
bad merge issue with FUA writes.
- division-by-zero fix for writeback from Tejun.
- a block bounce page accounting fix, making sure we inc/dec after
bouncing so that pre/post IO pages match up. From Wang YanQing"
* 'for-linus' of git://git.kernel.dk/linux-block:
splice: sendfile() at once fails for big files
blk-mq: don't lose requests if a stopped queue restarts
blk-mq: fix FUA request hang
block: destroy bdi before blockdev is unregistered.
block:bounce: fix call inc_|dec_zone_page_state on different pages confuse value of NR_BOUNCE
elevator: fix double release of elevator module
writeback: use |1 instead of +1 to protect against div by zero
blk-mq: fix CPU hotplug handling
blk-mq: fix race between timeout and CPU hotplug
NVMe: Fix VPD B0 max sectors translation
There is no need for special handling of stripe-batches when the array
is degraded.
There may be if there is a failure in the batch, but STRIPE_DEGRADED
does not imply an error.
So don't set STRIPE_BATCH_ERR in ops_run_io just because the array is
degraded.
This actually causes a bug: the STRIPE_DEGRADED flag gets cleared in
check_break_stripe_batch_list() and so the bitmap bit gets cleared
when it shouldn't.
So in check_break_stripe_batch_list(), split the batch up completely -
again STRIPE_DEGRADED isn't meaningful.
Also don't set STRIPE_BATCH_ERR when there is a write error to a
replacement device. This simply removes the replacement device and
requires no extra handling.
Signed-off-by: NeilBrown <neilb@suse.de>
As the new 'scribble' array is sized based on chunk size,
we need to make sure the size matches the largest of 'old'
and 'new' chunk sizes when the array is undergoing reshape.
We also potentially need to resize it even when not resizing
the stripe cache, as chunk size can change without changing
number of devices.
So move the 'resize' code into a separate function, and
consider old and new sizes when allocating.
Signed-off-by: NeilBrown <neilb@suse.de>
Fixes: 46d5b78562 ("raid5: use flex_array for scribble data")
If any memory allocation in resize_stripes fails we will return
-ENOMEM, but in some cases we update conf->pool_size anyway.
This means that if we try again, the allocations will be assumed
to be larger than they are, and badness results.
So only update pool_size if there is no error.
This bug was introduced in 2.6.17 and the patch is suitable for
-stable.
Fixes: ad01c9e375 ("[PATCH] md: Allow stripes to be expanded in preparation for expanding an array")
Cc: stable@vger.kernel.org (v2.6.17+)
Signed-off-by: NeilBrown <neilb@suse.de>
When performing a reconstruct write, we need to read all blocks
that are not being over-written .. except the parity (P and Q) blocks.
The code currently reads these (as they are not being over-written!)
unnecessarily.
Signed-off-by: NeilBrown <neilb@suse.de>
Fixes: ea664c8245 ("md/raid5: need_this_block: tidy/fix last condition.")
It is not incorrect to call handle_stripe_fill() when
a batch of full-stripe writes is active.
It is, however, a BUG if fetch_block() then decides
it needs to actually fetch anything.
So move the 'BUG_ON' to where it belongs.
Signed-off-by: NeilBrown <neilb@suse.de>
Fixes: 59fc630b8b ("RAID5: batch adjacent full stripe write")
The new batch_lock and batch_list fields are being initialized in
grow_one_stripe() but not in resize_stripes(). This causes a crash
on resize.
So separate the core initialization into a new function and call it
from both allocation sites.
Signed-off-by: NeilBrown <neilb@suse.de>
Fixes: 59fc630b8b ("RAID5: batch adjacent full stripe write")
This patch is a prerequisite for dm-raid "raid0" support to allow
dm-raid to access the MD RAID0 personality doing unconditional
accesses to mddev->queue, which is NULL in case of dm-raid stacked on
top of MD.
Most of the conditional mddev->queue accesses made it to upstream but
this missing one, which prohibits md raid0 to set disk stack limits
(being done in dm core in case of md underneath dm).
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Tested-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Struct bio has a reference count that controls when it can be freed.
Most uses cases is allocating the bio, which then returns with a
single reference to it, doing IO, and then dropping that single
reference. We can remove this atomic_dec_and_test() in the completion
path, if nobody else is holding a reference to the bio.
If someone does call bio_get() on the bio, then we flag the bio as
now having valid count and that we must properly honor the reference
count when it's being put.
Tested-by: Robert Elliott <elliott@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Struct bio has an atomic ref count for chained bio's, and we use this
to know when to end IO on the bio. However, most bio's are not chained,
so we don't need to always introduce this atomic operation as part of
ending IO.
Add a helper to elevate the bi_remaining count, and flag the bio as
now actually needing the decrement at end_io time. Rename the field
to __bi_remaining to catch any current users of this doing the
incrementing manually.
For high IOPS workloads, this reduces the overhead of bio_endio()
substantially.
Tested-by: Robert Elliott <elliott@hp.com>
Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
This reverts Linux 4.1-rc1 commit 0618764cb2.
The problem which that commit attempts to fix actually lies in the
Freescale CAAM crypto driver not dm-crypt.
dm-crypt uses CRYPTO_TFM_REQ_MAY_BACKLOG. This means the the crypto
driver should internally backlog requests which arrive when the queue is
full and process them later. Until the crypto hw's queue becomes full,
the driver returns -EINPROGRESS. When the crypto hw's queue if full,
the driver returns -EBUSY, and if CRYPTO_TFM_REQ_MAY_BACKLOG is set, is
expected to backlog the request and process it when the hardware has
queue space. At the point when the driver takes the request from the
backlog and starts processing it, it calls the completion function with
a status of -EINPROGRESS. The completion function is called (for a
second time, in the case of backlogged requests) with a status/err of 0
when a request is done.
Crypto drivers for hardware without hardware queueing use the helpers,
crypto_init_queue(), crypto_enqueue_request(), crypto_dequeue_request()
and crypto_get_backlog() helpers to implement this behaviour correctly,
while others implement this behaviour without these helpers (ccp, for
example).
dm-crypt (before the patch that needs reverting) uses this API
correctly. It queues up as many requests as the hw queues will allow
(i.e. as long as it gets back -EINPROGRESS from the request function).
Then, when it sees at least one backlogged request (gets -EBUSY), it
waits till that backlogged request is handled (completion gets called
with -EINPROGRESS), and then continues. The references to
af_alg_wait_for_completion() and af_alg_complete() in that commit's
commit message are irrelevant because those functions only handle one
request at a time, unlink dm-crypt.
The problem is that the Freescale CAAM driver, which that commit
describes as having being tested with, fails to implement the
backlogging behaviour correctly. In cam_jr_enqueue(), if the hardware
queue is full, it simply returns -EBUSY without backlogging the request.
What the observed deadlock was is not described in the commit message
but it is obviously the wait_for_completion() in crypto_convert() where
dm-crypto would wait for the completion being called with -EINPROGRESS
in the case of backlogged requests. This completion will never be
completed due to the bug in the CAAM driver.
Commit 0618764cb2 incorrectly made dm-crypt wait for every request,
even when the driver/hardware queues are not full, which means that
dm-crypt will never see -EBUSY. This means that that commit will cause
a performance regression on all crypto drivers which implement the API
correctly.
Revert it. Correct backlog handling should be implemented in the CAAM
driver instead.
Cc'ing stable purely because commit 0618764cb2 did. If for some reason
a stable@ kernel did pick up commit 0618764cb2 it should get reverted.
Signed-off-by: Rabin Vincent <rabin.vincent@axis.com>
Reviewed-by: Horia Geanta <horia.geanta@freescale.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Commit 022333427a ("dm: optimize dm_mq_queue_rq to _not_ use kthread if
using pure blk-mq") mistakenly removed free_rq_clone()'s clone->q check
before testing clone->q->mq_ops. It was an oversight to discontinue
that check for 1 of the 2 use-cases for free_rq_clone():
1) free_rq_clone() called when an unmapped original request is requeued
2) free_rq_clone() called in the request-based IO completion path
The clone->q check made sense for case #1 but not for #2. However, we
cannot just reinstate the check as it'd mask a serious bug in the IO
completion case #2 -- no in-flight request should have an uninitialized
request_queue (basic block layer refcounting _should_ ensure this).
The NULL pointer seen for case #1 is detailed here:
https://www.redhat.com/archives/dm-devel/2015-April/msg00160.html
Fix this free_rq_clone() NULL pointer by simply checking if the
mapped_device's type is DM_TYPE_MQ_REQUEST_BASED (clone's queue is
blk-mq) rather than checking clone->q->mq_ops. This avoids the need to
dereference clone->q, but a WARN_ON_ONCE is added to let us know if an
uninitialized clone request is being completed.
Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Commit bfebd1cdb4 ("dm: add full blk-mq support to request-based DM")
didn't properly account for the need to short-circuit re-initializing
DM's blk-mq request_queue if it was already initialized.
Otherwise, reloading a blk-mq request-based DM table (either manually
or via multipathd) resulted in errors, see:
https://www.redhat.com/archives/dm-devel/2015-April/msg00132.html
Fix is to only initialize the request_queue on the initial table load
(when the mapped_device type is assigned).
This is better than having dm_init_request_based_blk_mq_queue() return
early if the queue was already initialized because it elevates the
constraint to a more meaningful location in DM core. As such the
pre-existing early return in dm_init_request_based_queue() can now be
removed.
Fixes: bfebd1cdb4 ("dm: add full blk-mq support to request-based DM")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Because of the peculiar way that md devices are created (automatically
when the device node is opened), a new device can be created and
registered immediately after the
blk_unregister_region(disk_devt(disk), disk->minors);
call in del_gendisk().
Therefore it is important that all visible artifacts of the previous
device are removed before this call. In particular, the 'bdi'.
Since:
commit c4db59d31e
Author: Christoph Hellwig <hch@lst.de>
fs: don't reassign dirty inodes to default_backing_dev_info
moved the
device_unregister(bdi->dev);
call from bdi_unregister() to bdi_destroy() it has been quite easy to
lose a race and have a new (e.g.) "md127" be created after the
blk_unregister_region() call and before bdi_destroy() is ultimately
called by the final 'put_disk', which must come after del_gendisk().
The new device finds that the bdi name is already registered in sysfs
and complains
> [ 9627.630029] WARNING: CPU: 18 PID: 3330 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x5a/0x70()
> [ 9627.630032] sysfs: cannot create duplicate filename '/devices/virtual/bdi/9:127'
We can fix this by moving the bdi_destroy() call out of
blk_release_queue() (which can happen very late when a refcount
reaches zero) and into blk_cleanup_queue() - which happens exactly when the md
device driver calls it.
Then it is only necessary for md to call blk_cleanup_queue() before
del_gendisk(). As loop.c devices are also created on demand by
opening the device node, we make the same change there.
Fixes: c4db59d31e
Reported-by: Azat Khuzhin <a3at.mail@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org (v4.0)
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Highlights:
- "experimental" code for managing md/raid1 across a cluster using
DLM. Code is not ready for general use and triggers a WARNING if used.
However it is looking good and mostly done and having in mainline
will help co-ordinate development.
- RAID5/6 can now batch multiple (4K wide) stripe_heads so as to
handle a full (chunk wide) stripe as a single unit.
- RAID6 can now perform read-modify-write cycles which should
help performance on larger arrays: 6 or more devices.
- RAID5/6 stripe cache now grows and shrinks dynamically. The value
set is used as a minimum.
- Resync is now allowed to go a little faster than the 'mininum' when
there is competing IO. How much faster depends on the speed of the
devices, so the effective minimum should scale with device speed to
some extent.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIVAwUAVTbIxDnsnt1WYoG5AQIgAA//Z9FlEpkHcJJ75WGjrXgGJPyNTfEZOkoz
jnD8PpBY2Afp341vMatd0XKErdEAhuPQmJMAa+tbxht6pk77X3fSzXghGA8FEafg
yazn5pfBt6NXmepV4vhl/+LNYuWRxxLSA9EDm7wEg+tiO0UEts+a0w2TSXzcT1w+
30Yi1EjAlTaJ5yjlBHUOtTXWE43D6RKnUr6FMy2dRlnzlFyDlezRMChWo9v6OkQF
YqJ20FcvmJdLHY/6Yif3jvpm7eQecMqdCZENTvW/mJI86zqf6E+ToCYS1VNjfDnK
ud61iU9eshu4WtNWBG6KLuHBD0grO1NaEL7/S16w1KdNJMhYgiK8WussvIAJEesA
5SlETM7Y/1XFq8puwlAq2/tuPfhZ+TFxnAwce/C3hMTDcYAACnS/R6INFQXqGvy3
nX1NLogrCycX8oqxv3jTFKLVqIVwlkSlHcUGzIWjcfCF37StcXFKI5q862agyg2+
NNocFMuXhPPM1YcB9JJSo2nCsor4e9tTdVEZlFm2B3cc8LJ9BLWUMSoi1h7VK/1g
P7psnPIjz7/cdI2TZTFjGTZ0Kvhx/NTYp41AZealDNxeGWUNM+5xGZnUF8QRBc/E
0dGHtEAah834BDQFvNnJtuuh/s+KwbvswjNP+njoBsHjIQIvngDABpOwpIkdqF6r
diQ2gUPnHN0=
=OHG6
-----END PGP SIGNATURE-----
Merge tag 'md/4.1' of git://neil.brown.name/md
Pull md updates from Neil Brown:
"More updates that usual this time. A few have performance impacts
which hould mostly be positive, but RAID5 (in particular) can be very
work-load ensitive... We'll have to wait and see.
Highlights:
- "experimental" code for managing md/raid1 across a cluster using
DLM. Code is not ready for general use and triggers a WARNING if
used. However it is looking good and mostly done and having in
mainline will help co-ordinate development.
- RAID5/6 can now batch multiple (4K wide) stripe_heads so as to
handle a full (chunk wide) stripe as a single unit.
- RAID6 can now perform read-modify-write cycles which should help
performance on larger arrays: 6 or more devices.
- RAID5/6 stripe cache now grows and shrinks dynamically. The value
set is used as a minimum.
- Resync is now allowed to go a little faster than the 'mininum' when
there is competing IO. How much faster depends on the speed of the
devices, so the effective minimum should scale with device speed to
some extent"
* tag 'md/4.1' of git://neil.brown.name/md: (58 commits)
md/raid5: don't do chunk aligned read on degraded array.
md/raid5: allow the stripe_cache to grow and shrink.
md/raid5: change ->inactive_blocked to a bit-flag.
md/raid5: move max_nr_stripes management into grow_one_stripe and drop_one_stripe
md/raid5: pass gfp_t arg to grow_one_stripe()
md/raid5: introduce configuration option rmw_level
md/raid5: activate raid6 rmw feature
md/raid6 algorithms: xor_syndrome() for SSE2
md/raid6 algorithms: xor_syndrome() for generic int
md/raid6 algorithms: improve test program
md/raid6 algorithms: delta syndrome functions
raid5: handle expansion/resync case with stripe batching
raid5: handle io error of batch list
RAID5: batch adjacent full stripe write
raid5: track overwrite disk count
raid5: add a new flag to track if a stripe can be batched
raid5: use flex_array for scribble data
md raid0: access mddev->queue (request queue member) conditionally because it is not set when accessed from dm-raid
md: allow resync to go faster when there is competing IO.
md: remove 'go_faster' option from ->sync_request()
...
When array is degraded, read data landed on failed drives will result in
reading rest of data in a stripe. So a single sequential read would
result in same data being read twice.
This patch is to avoid chunk aligned read for degraded array. The
downside is to involve stripe cache which means associated CPU overhead
and extra memory copy.
Test Results:
Following test are done on a enterprise storage node with Seagate 6T SAS
drives and Xeon E5-2648L CPU (10 cores, 1.9Ghz), 10 disks MD RAID6 8+2,
chunk size 128 KiB.
I use FIO, using direct-io with various bs size, enough queue depth,
tested sequential and 100% random read against 3 array config:
1) optimal, as baseline;
2) degraded;
3) degraded with this patch.
Kernel version is 4.0-rc3.
Each individual test I only did once so there might be some variations,
but we just focus on big trend.
Sequential Read:
bs=(KiB) optimal(MiB/s) degraded(MiB/s) degraded-with-patch (MiB/s)
1024 1608 656 995
512 1624 710 956
256 1635 728 980
128 1636 771 983
64 1612 1119 1000
32 1580 1420 1004
16 1368 688 986
8 768 647 953
4 411 413 850
Random Read:
bs=(KiB) optimal(IOPS) degraded(IOPS) degraded-with-patch (IOPS)
1024 163 160 156
512 274 273 272
256 426 428 424
128 576 592 591
64 726 724 726
32 849 848 837
16 900 970 971
8 927 940 929
4 948 940 955
Some notes:
* In sequential + optimal, as bs size getting smaller, the FIO thread
become CPU bound.
* In sequential + degraded, there's big increase when bs is 64K and
32K, I don't have explanation.
* In sequential + degraded-with-patch, the MD thread mostly become CPU
bound.
If you want to we can discuss specific data point in those data. But in
general it seems with this patch, we have more predictable and in most
cases significant better sequential read performance when array is
degraded, and almost no noticeable impact on random read.
Performance is a complicated thing, the patch works well for this
particular configuration, but may not be universal. For example I
imagine testing on all SSD array may have very different result. But I
personally think in most cases IO bandwidth is more scarce resource than
CPU.
Signed-off-by: Eric Mei <eric.mei@seagate.com>
Signed-off-by: NeilBrown <neilb@suse.de>
The default setting of 256 stripe_heads is probably
much too small for many configurations. So it is best to make it
auto-configure.
Shrinking the cache under memory pressure is easy. The only
interesting part here is that we put a fairly high cost
('seeks') on shrinking the cache as the cost is greater than
just having to read more data, it reduces parallelism.
Growing the cache on demand needs to be done carefully. If we allow
fast growth, that can upset memory balance as lots of dirty memory can
quickly turn into lots of memory queued in the stripe_cache.
It is important for the raid5 block device to appear congested to
allow write-throttling to work.
So we only add stripes slowly. We set a flag when an allocation
fails because all stripes are in use, allocate at a convenient
time when that flag is set, and don't allow it to be set again
until at least one stripe_head has been released for re-use.
This means that a spurt of requests will only cause one stripe_head
to be allocated, but a steady stream of requests will slowly
increase the cache size - until memory pressure puts it back again.
It could take hours to reach a steady state.
The value written to, and displayed in, stripe_cache_size is
used as a minimum. The cache can grow above this and shrink back
down to it. The actual size is not directly visible, though it can
be deduced to some extent by watching stripe_cache_active.
Signed-off-by: NeilBrown <neilb@suse.de>
Rather than adjusting max_nr_stripes whenever {grow,drop}_one_stripe()
succeeds, do it inside the functions.
Also choose the correct hash to handle next inside the functions.
This removes duplication and will help with future new uses of
{grow,drop}_one_stripe.
This also fixes a minor bug where the "md/raid:%md: allocate XXkB"
message always said "0kB".
Signed-off-by: NeilBrown <neilb@suse.de>
Depending on the available coding we allow optimized rmw logic for write
operations. To support easier testing this patch allows manual control
of the rmw/rcw descision through the interface /sys/block/mdX/md/rmw_level.
The configuration can handle three levels of control.
rmw_level=0: Disable rmw for all RAID types. Hardware assisted P/Q
calculation has no implementation path yet to factor in/out chunks of
a syndrome. Enforcing this level can be benefical for slow CPUs with
hardware syndrome support and fast SSDs.
rmw_level=1: Estimate rmw IOs and rcw IOs. Execute rmw only if we will
save IOs. This equals the "old" unpatched behaviour and will be the
default.
rmw_level=2: Execute rmw even if calculated IOs for rmw and rcw are
equal. We might have higher CPU consumption because of calculating the
parity twice but it can be benefical otherwise. E.g. RAID4 with fast
dedicated parity disk/SSD. The option is implemented just to be
forward-looking and will ONLY work with this patch!
Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NeilBrown <neilb@suse.de>
Glue it altogehter. The raid6 rmw path should work the same as the
already existing raid5 logic. So emulate the prexor handling/flags
and split functions as needed.
1) Enable xor_syndrome() in the async layer.
2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome
at the start of a rmw run as we did it before for the single parity.
3) Take care of rmw run in ops_run_reconstruct6(). Again process only
the changed pages to get syndrome back into sync.
4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw
run. The lower layers will calculate start & end pages from that and
call the xor_syndrome() correspondingly.
5) Adapt the several places where we ignored Q handling up to now.
Performance numbers for a single E5630 system with a mix of 10 7200k
desktop/server disks. 300 seconds random write with 8 threads onto a
3,2TB (10*400GB) RAID6 64K chunk without spare (group_thread_cnt=4)
bsize rmw_level=1 rmw_level=0 rmw_level=1 rmw_level=0
skip_copy=1 skip_copy=1 skip_copy=0 skip_copy=0
4K 115 KB/s 141 KB/s 165 KB/s 140 KB/s
8K 225 KB/s 275 KB/s 324 KB/s 274 KB/s
16K 434 KB/s 536 KB/s 640 KB/s 534 KB/s
32K 751 KB/s 1,051 KB/s 1,234 KB/s 1,045 KB/s
64K 1,339 KB/s 1,958 KB/s 2,282 KB/s 1,962 KB/s
128K 2,673 KB/s 3,862 KB/s 4,113 KB/s 3,898 KB/s
256K 7,685 KB/s 7,539 KB/s 7,557 KB/s 7,638 KB/s
512K 19,556 KB/s 19,558 KB/s 19,652 KB/s 19,688 Kb/s
Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NeilBrown <neilb@suse.de>
expansion/resync can grab a stripe when the stripe is in batch list. Since all
stripes in batch list must be in the same state, we can't allow some stripes
run into expansion/resync. So we delay expansion/resync for stripe in batch
list.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If io error happens in any stripe of a batch list, the batch list will be
split, then normal process will run for the stripes in the list.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
stripe cache is 4k size. Even adjacent full stripe writes are handled in 4k
unit. Idealy we should use big size for adjacent full stripe writes. Bigger
stripe cache size means less stripes runing in the state machine so can reduce
cpu overhead. And also bigger size can cause bigger IO size dispatched to under
layer disks.
With below patch, we will automatically batch adjacent full stripe write
together. Such stripes will be added to the batch list. Only the first stripe
of the list will be put to handle_list and so run handle_stripe(). Some steps
of handle_stripe() are extended to cover all stripes of the list, including
ops_run_io, ops_run_biodrain and so on. With this patch, we have less stripes
running in handle_stripe() and we send IO of whole stripe list together to
increase IO size.
Stripes added to a batch list have some limitations. A batch list can only
include full stripe write and can't cross chunk boundary to make sure stripes
have the same parity disks. Stripes in a batch list must be in the same state
(no written, toread and so on). If a stripe is in a batch list, all new
read/write to add_stripe_bio will be blocked to overlap conflict till the batch
list is handled. The limitations will make sure stripes in a batch list be in
exactly the same state in the life circly.
I did test running 160k randwrite in a RAID5 array with 32k chunk size and 6
PCIe SSD. This patch improves around 30% performance and IO size to under layer
disk is exactly 32k. I also run a 4k randwrite test in the same array to make
sure the performance isn't changed with the patch.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Track overwrite disk count, so we can know if a stripe is a full stripe write.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
A freshly new stripe with write request can be batched. Any time the stripe is
handled or new read is queued, the flag will be cleared.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Use flex_array for scribble data. Next patch will batch several stripes
together, so scribble data should be able to cover several stripes, so this
patch also allocates scribble data for stripes across a chunk.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
The patch makes 3 references to mddev->queue in the raid0 personality
conditional in order to allow for it to be accessed from dm-raid.
Mandatory, because md instances underneath dm-raid don't manage
a request queue of their own which'd lead to oopses without the patch.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Tested-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When md notices non-sync IO happening while it is trying
to resync (or reshape or recover) it slows down to the
set minimum.
The default minimum might have made sense many years ago
but the drives have become faster. Changing the default
to match the times isn't really a long term solution.
This patch changes the code so that instead of waiting until the speed
has dropped to the target, it just waits until pending requests
have completed.
This means that the delay inserted is a function of the speed
of the devices.
Testing shows that:
- for some loads, the resync speed is unchanged. For those loads
increasing the minimum doesn't change the speed either.
So this is a good result. To increase resync speed under such
loads we would probably need to increase the resync window
size.
- for other loads, resync speed does increase to a reasonable
fraction (e.g. 20%) of maximum possible, and throughput of
the load only drops a little bit (e.g. 10%)
- for other loads, throughput of the non-sync load drops quite a bit
more. These seem to be latency-sensitive loads.
So it isn't a perfect solution, but it is mostly an improvement.
Signed-off-by: NeilBrown <neilb@suse.de>
This option is not well justified and testing suggests that
it hardly ever makes any difference.
The comment suggests there might be a need to wait for non-resync
activity indicated by ->nr_waiting, however raise_barrier()
already waits for all of that.
So just remove it to simplify reasoning about speed limiting.
This allows us to remove a 'FIXME' comment from raid5.c as that
never used the flag.
Signed-off-by: NeilBrown <neilb@suse.de>
There is really no need for sync_min to be a multiple of
chunk_size, and values read from here often aren't.
That means you cannot read a value and expect to be able
to write it back later.
So remove the chunk_size check, and round down to a multiple
of 4K, to be sure everything works with 4K-sector devices.
Signed-off-by: NeilBrown <neilb@suse.de>
When "re-add" is writted to /sys/block/mdXX/md/dev-YYY/state,
the clustered md:
1. Sends RE_ADD message with the desc_nr. Nodes receiving the message
clear the Faulty bit in their respective rdev->flags.
2. The node initiating re-add, gathers the bitmaps of all nodes
and copies them into the local bitmap. It does not clear the bitmap
from which it is copying.
3. Initiating node schedules a md recovery to sync the devices.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
This adds the capability of re-adding a failed disk by
writing "re-add" to /sys/block/mdXX/md/dev-YYY/state.
This facilitates adding disks which have encountered a temporary
error such as a network disconnection/hiccup in an iSCSI device,
or a SAN cable disconnection which has been restored. In such
a situation, you do not need to remove and re-add the device.
Writing re-add to the failed device's state would add it again
to the array and perform the recovery of only the blocks which
were written after the device failed.
This works for generic md, and is not related to clustering. However,
this patch is to ease re-add operations listed above in clustering
environments.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
This adds "remove" capabilities for the clustered environment.
When a user initiates removal of a device from the array, a
REMOVE message with disk number in the array is sent to all
the nodes which kick the respective device in their own array.
This facilitates the removal of failed devices.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
This is required by the clustering module (patches to follow) to
find the device to remove or re-add.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
This export is required for clustering module in order to
co-ordinate remove/readd a rdev from all nodes.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Since the node num of md-cluster is from zero, and
cinfo->slot_number represents the slot num of dlm,
no need to check for equality.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
full blk-mq support to request-based DM.
- disabled by default but user can opt-in with CONFIG_DM_MQ_DEFAULT
- depends on some blk-mq changes from Jens' for-4.1/core branch so
that explains why this pull is built on linux-block.git
- Update DM to use name_to_dev_t() rather than open-coding a less
capable device parser.
- includes a couple small improvements to name_to_dev_t() that offer
stricter constraints that DM's code provided.
- Improvements to the dm-cache "mq" cache replacement policy.
- A DM crypt crypt_ctr() error path fix and an async crypto deadlock fix.
- A small efficiency improvement for DM crypt decryption by leveraging
immutable biovecs.
- Add error handling modes for corrupted blocks to DM verity.
- A new "log-writes" DM target from Josef Bacik that is meant for
file system developers to test file system integrity at particular
points in the life of a file system.
- A few DM log userspace cleanups and fixes.
- A few Documentation fixes (for thin, cache, crypt and switch).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJVMGvhAAoJEMUj8QotnQNaOKgH/iuWyEZ+ndWbeEBZs9ZJPEgU
sRiKjSaPujkdB+78o+2DfAYjwCchqH1Oy781PqG85asTENVkcjD7LBPaIom/A2ml
Z9ZeCzAKbrtB3lZR8QAg4u7+90kkuZv1SeOPMWBZCEV4GYLEpV1J4/f+RgdEs1kT
upQcH1fFoQdyHGikwAKXf85Gmi4OrjVb19yoxu2xZnfwT2sCPq0okU4yBxb1B9mL
X/FcHUeLS9LJGewEqTD3AvWrk+J5pqj+1EeKY2kRxisp1065TaljYwetgR76Vnfb
o0J5eovJVL3AKRTYxvr5JABWD09xq9nG1hVBiuQMN5VmmvxjDKdbBPaKI4dZTt0=
=uVjp
-----END PGP SIGNATURE-----
Merge tag 'dm-4.1-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:
- the most extensive changes this cycle are the DM core improvements to
add full blk-mq support to request-based DM.
- disabled by default but user can opt-in with CONFIG_DM_MQ_DEFAULT
- depends on some blk-mq changes from Jens' for-4.1/core branch so
that explains why this pull is built on linux-block.git
- update DM to use name_to_dev_t() rather than open-coding a less
capable device parser.
- includes a couple small improvements to name_to_dev_t() that offer
stricter constraints that DM's code provided.
- improvements to the dm-cache "mq" cache replacement policy.
- a DM crypt crypt_ctr() error path fix and an async crypto deadlock
fix
- a small efficiency improvement for DM crypt decryption by leveraging
immutable biovecs
- add error handling modes for corrupted blocks to DM verity
- a new "log-writes" DM target from Josef Bacik that is meant for file
system developers to test file system integrity at particular points
in the life of a file system
- a few DM log userspace cleanups and fixes
- a few Documentation fixes (for thin, cache, crypt and switch)
* tag 'dm-4.1-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (34 commits)
dm crypt: fix missing error code return from crypt_ctr error path
dm crypt: fix deadlock when async crypto algorithm returns -EBUSY
dm crypt: leverage immutable biovecs when decrypting on read
dm crypt: update URLs to new cryptsetup project page
dm: add log writes target
dm table: use bool function return values of true/false not 1/0
dm verity: add error handling modes for corrupted blocks
dm thin: remove stale 'trim' message documentation
dm delay: use msecs_to_jiffies for time conversion
dm log userspace base: fix compile warning
dm log userspace transfer: match wait_for_completion_timeout return type
dm table: fall back to getting device using name_to_dev_t()
init: stricter checking of major:minor root= values
init: export name_to_dev_t and mark name argument as const
dm: add 'use_blk_mq' module param and expose in per-device ro sysfs attr
dm: optimize dm_mq_queue_rq to _not_ use kthread if using pure blk-mq
dm: add full blk-mq support to request-based DM
dm: impose configurable deadline for dm_request_fn's merge heuristic
dm sysfs: introduce ability to add writable attributes
dm: don't start current request if it would've merged with the previous
...
I suspect this doesn't show up for most anyone because software
algorithms typically don't have a sense of being too busy. However,
when working with the Freescale CAAM driver it will return -EBUSY on
occasion under heavy -- which resulted in dm-crypt deadlock.
After checking the logic in some other drivers, the scheme for
crypt_convert() and it's callback, kcryptd_async_done(), were not
correctly laid out to properly handle -EBUSY or -EINPROGRESS.
Fix this by using the completion for both -EBUSY and -EINPROGRESS. Now
crypt_convert()'s use of completion is comparable to
af_alg_wait_for_completion(). Similarly, kcryptd_async_done() follows
the pattern used in af_alg_complete().
Before this fix dm-crypt would lockup within 1-2 minutes running with
the CAAM driver. Fix was regression tested against software algorithms
on PPC32 and x86_64, and things seem perfectly happy there as well.
Signed-off-by: Ben Collins <ben.c@servergy.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Commit 003b5c571 ("block: Convert drivers to immutable biovecs")
stopped short of changing dm-crypt to leverage the fact that the biovec
array of a bio will no longer be modified.
Switch to using bio_clone_fast() when cloning bios for decryption after
read.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cryptsetup home page moved to GitLab.
Also remove link to abandonded Truecrypt page.
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Introduce a new target that is meant for file system developers to test file
system integrity at particular points in the life of a file system. We capture
all write requests and associated data and log them to a separate device
for later replay. There is a userspace utility to do this replay. The
idea behind this is to give file system developers a tool to verify that
the file system is always consistent.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Zach Brown <zab@zabbo.net>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Add device specific modes to dm-verity to specify how corrupted
blocks should be handled. The following modes are defined:
- DM_VERITY_MODE_EIO is the default behavior, where reading a
corrupted block results in -EIO.
- DM_VERITY_MODE_LOGGING only logs corrupted blocks, but does
not block the read.
- DM_VERITY_MODE_RESTART calls kernel_restart when a corrupted
block is discovered.
In addition, each mode sends a uevent to notify userspace of
corruption and to allow further recovery actions.
The driver defaults to previous behavior (DM_VERITY_MODE_EIO)
and other modes can be enabled with an additional parameter to
the verity table.
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Converting milliseconds to jiffies by "val * HZ / 1000" is technically
OK but msecs_to_jiffies(val) is the cleaner solution and handles all
corner cases correctly.
Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
This fixes up a compile warning [-Wunused-but-set-variable] - given the
comment in userspace_set_region_sync() the non-reporting of errors is
intentional so the return value can be dropped to make gcc happy.
Also, fix typo in comment.
Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Return type of wait_for_completion_timeout() is unsigned long not int.
An appropriately named unsigned long is added and the assignment fixed.
Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
If a device is used as the root filesystem, it can't be built
off of devices which are within the root filesystem (just like
command line arguments to root=). For this reason, Linux has a
pseudo-filesystem for root= and MD initialization (based on the
function name_to_dev_t) which handles different ways of specifying
devices including PARTUUID and major:minor.
Switch to using name_to_dev_t() in dm_get_device(). Rather than
having DM assume that all things which are not major:minor are paths in
an already-mounted filesystem, change dm_get_device() to first attempt
to look up the device in the filesystem, and if not found it will fall
back to using name_to_dev_t().
In terms of backwards compatibility, there are some cases where
behavior will be different:
- If you have a file in the current working directory named 1:2 and
you initialze DM there, then it will try to use that file rather
than the disk with that major:minor pair as a backing device.
- Similarly for other bdev types which name_to_dev_t() knows how to
interpret, the previous behavior was to repeatedly check for the
existence of the file (e.g., while waiting for rootfs to come up)
but the new behavior is to use the name_to_dev_t() interpretation.
For example, if you have a file named /dev/ubiblock0_0 which is
a symlink to /dev/sda3, but it is not yet present when DM starts
to initialize, then the name_to_dev_t() interpretation will take
precedence.
These incompatibilities would only show up in really strange setups
with bad practices so we shouldn't have to worry about them.
Signed-off-by: Dan Ehrenberg <dehrenberg@chromium.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Request-based DM's blk-mq support defaults to off; but a user can easily
change the default using the dm_mod.use_blk_mq module/boot option.
Also, you can check what mode a given request-based DM device is using
with: cat /sys/block/dm-X/dm/use_blk_mq
This change enabled further cleanup and reduced work (e.g. the
md->io_pool and md->rq_pool isn't created if using blk-mq).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
dm_mq_queue_rq() is in atomic context so care must be taken to not
sleep -- as such GFP_ATOMIC is used for the md->bs bioset allocations
and dm-mpath's call to blk_get_request(). In the future the bioset
allocations will hopefully go away (by removing support for partial
completions of bios in a cloned request).
Also prepare for supporting DM blk-mq ontop of old-style request_fn
device(s) if a new dm-mod 'use_blk_mq' parameter is set. The kthread
will still be used to queue work if blk-mq is used ontop of old-style
request_fn device(s).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Commit e5863d9ad ("dm: allocate requests in target when stacking on
blk-mq devices") served as the first step toward fully utilizing blk-mq
in request-based DM -- it enabled stacking an old-style (request_fn)
request_queue ontop of the underlying blk-mq device(s). That first step
didn't improve performance of DM multipath ontop of fast blk-mq devices
(e.g. NVMe) because the top-level old-style request_queue was severely
limited by the queue_lock.
The second step offered here enables stacking a blk-mq request_queue
ontop of the underlying blk-mq device(s). This unlocks significant
performance gains on fast blk-mq devices, Keith Busch tested on his NVMe
testbed and offered this really positive news:
"Just providing a performance update. All my fio tests are getting
roughly equal performance whether accessed through the raw block
device or the multipath device mapper (~470k IOPS). I could only push
~20% of the raw iops through dm before this conversion, so this latest
tree is looking really solid from a performance standpoint."
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Tested-by: Keith Busch <keith.busch@intel.com>
Otherwise, for sequential workloads, the dm_request_fn can allow
excessive request merging at the expense of increased service time.
Add a per-device sysfs attribute to allow the user to control how long a
request, that is a reasonable merge candidate, can be queued on the
request queue. The resolution of this request dispatch deadline is in
microseconds (ranging from 1 to 100000 usecs), to set a 20us deadline:
echo 20 > /sys/block/dm-7/dm/rq_based_seq_io_merge_deadline
The dm_request_fn's merge heuristic and associated extra accounting is
disabled by default (rq_based_seq_io_merge_deadline is 0).
This sysfs attribute is not applicable to bio-based DM devices so it
will only ever report 0 for them.
By allowing a request to remain on the queue it will block others
requests on the queue. But introducing a short dequeue delay has proven
very effective at enabling certain sequential IO workloads on really
fast, yet IOPS constrained, devices to build up slightly larger IOs --
yielding 90+% throughput improvements. Having precise control over the
time taken to wait for larger requests to build affords control beyond
that of waiting for certain IO sizes to accumulate (which would require
a deadline anyway). This knob will only ever make sense with sequential
IO workloads and the particular value used is storage configuration
specific.
Given the expected niche use-case for when this knob is useful it has
been deemed acceptable to expose this relatively crude method for
crafting optimal IO on specific storage -- especially given the solution
is simple yet effective. In the context of DM multipath, it is
advisable to tune this sysfs attribute to a value that offers the best
performance for the common case (e.g. if 4 paths are expected active,
tune for that; if paths fail then performance may be slightly reduced).
Alternatives were explored to have request-based DM autotune this value
(e.g. if/when paths fail) but they were quickly deemed too fragile and
complex to warrant further design and development time. If this problem
proves more common as faster storage emerges we'll have to look at
elevating a generic solution into the block core.
Tested-by: Shiva Krishna Merla <shivakrishna.merla@netapp.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Request-based DM's dm_request_fn() is so fast to pull requests off the
queue that steps need to be taken to promote merging by avoiding request
processing if it makes sense.
If the current request would've merged with previous request let the
current request stay on the queue longer.
Suggested-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Commit 7eaceaccab ("block: remove per-queue plugging") didn't justify
DM's use of a 100ms delay; such an extended delay is a liability when
there is reason to re-kick the queue.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
On really fast storage it can be beneficial to delay running the
request_queue to allow the elevator more opportunity to merge requests.
Otherwise, it has been observed that requests are being sent to
q->request_fn much quicker than is ideal on IOPS-bound backends.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The old dm_request() method used for q->make_request_fn had a branch for
request-based DM support but it isn't needed given that
dm_init_request_based_queue() sets it to the standard blk_queue_bio()
anyway.
Cleanup dm_init_md_queue() to be DM device-type agnostic and have
dm_setup_md_queue() properly finish queue setup based on DM device-type
(bio-based vs request-based).
A followup block patch can be made to remove the export for
blk_queue_bio() now that DM no longer calls it directly.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Since commit 20d0189b10
in v3.14-rc1 RAID0 has performed incorrect calculations
when the chunksize is not a power of 2.
This happens because "sector_div()" modifies its first argument, but
this wasn't taken into account in the patch.
So restore that first arg before re-using the variable.
Reported-by: Joe Landman <joe.landman@gmail.com>
Reported-by: Dave Chinner <david@fromorbit.com>
Fixes: 20d0189b10
Cc: stable@vger.kernel.org (3.14 and later).
Signed-off-by: NeilBrown <neilb@suse.de>
Simon reported the md io stats accounting issue:
"
I'm seeing "iostat -x -k 1" print this after a RAID1 rebuild on 4.0-rc5.
It's not abnormal other than it's 3-disk, with one being SSD (sdc) and
the other two being write-mostly:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 345.00 0.00 0.00 0.00 0.00 100.00
md2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 58779.00 0.00 0.00 0.00 0.00 100.00
md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 12.00 0.00 0.00 0.00 0.00 100.00
"
The cause is commit "18c0b223cf9901727ef3b02da6711ac930b4e5d4" uses the
generic_start_io_acct to account the disk stats rather than the open code,
but it also introduced the increase to .in_flight[rw] which is needless to
md. So we re-use the open code here to fix it.
Reported-by: Simon Kirby <sim@hostway.ca>
Cc: <stable@vger.kernel.org> 3.19
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: NeilBrown <neilb@suse.de>
DM multipath is the only caller of blk_lld_busy() -- which calls a
queue's lld_busy_fn hook. Request-based DM doesn't support stacking
multipath devices so there is no reason to register the lld_busy_fn hook
on a multipath device's queue using blk_queue_lld_busy().
As such, remove functions dm_lld_busy and dm_table_any_busy_target.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
__dm_get_module_param() could be useful for future DM module parameters
besides those related to "reserved_ios".
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Writeback takes out a lock on the cache block, so will increase the
latency for any concurrent io.
This patch works by placing 2 sentinel objects on each level of the
multiqueues. Every WRITEBACK_PERIOD the oldest sentinel gets moved to
the newest end of the queue level.
When looking for writeback work:
if less than 25% of the cache is clean:
we select the oldest object with the lowest hit count
otherwise:
we select the oldest object that is not past a writeback sentinel.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
A sentinel object is placed on each level of the multiqueues. When an
object is hit it is requeued behind the sentinel. When the tick is
incremented we iterate through all objects behind the sentinel and
update the hit_count, then reposition the sentinel at the very back.
This saves memory by avoiding tracking the tick explicitly for every
struct entry object in the multiqueues.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
queue_shift_down() didn't adjust the hit_counts to the new levels, so it
just had the effect of scrambling levels.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Small optimisation, now queue_empty() doesn't need to walk all levels of
the multiqueue.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Use a single slab cache to allocate a mempool for each dirty-log.
This _should_ eliminate DM's need for io_schedule_timeout() in
mempool_alloc(); so io_schedule() should be sufficient now.
Also, rename struct flush_entry to dm_dirty_log_flush_entry to allow
KMEM_CACHE() to create a meaningful global name for the slab cache.
Also, eliminate some holes in struct log_c by rearranging members.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Heinz Mauelshagen <heinzm@redhat.com>
exposed by the bdi changes that were introduced during the 4.0 merge.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJVExTfAAoJEMUj8QotnQNaMMAIAOT0IjDILfxuP/m6DLl+Qq89
SWrKWJrTDc4j8yjaehELH5Vm5Y8W2qlxQAmwritOEm+53HdODSv9I8WpG/6eCEkt
sx+zwQ0wRF0TK6wrqtXe44WU2Zgi9KTj31Y5nTuOAcEqR7S4JHYmbvuzFv/kSO7T
GusA+A1HAp0klGgqxT/mzjz61J/frOCDIXWGFwGPapD+6ZLcHfnCz+Ss4CWhdzvo
xCMXkeoCgvT5dGj5HQQ3RbucaeUIIeYsja2IT2h572GRiUkau+7qtg/YyBvdMnw+
mEcdMcIg+Higc9z3ZSFu/U/QxGwRF/AycXp4zziuiW8KHfRa2BA7+iyQ0odUHlA=
=Ikkj
-----END PGP SIGNATURE-----
Merge tag 'dm-4.0-fix-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fix from Mike Snitzer:
"Fix DM core device cleanup regression -- due to a latent race that was
exposed by the bdi changes that were introduced during the 4.0 merge"
* tag 'dm-4.0-fix-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm: fix add_disk() NULL pointer due to race with free_dev()
The calculations of bitmap offset is incorrect with respect to bits to bytes
conversion.
Also, remove an irrelevant duplicate message.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Commit c4db59d31e ("fs: don't reassign dirty inodes to
default_backing_dev_info") exposed DM to a latent race in free_dev() vs
add_disk() in relation to management of the device's minor number.
Fix this by refactoring free_dev() to match cleanup order of the
alloc_dev() error path. Move cleanup of the gendisk, queue, and bdev
to _before_ the cleanup of the idr managed minor number.
Also, purely due to cleanup that fell out during the free_dev() audit:
- adjust dm_blk_close() to access the gendisk's private_data under
the _minor_lock spinlock.
- move __dm_destroy()'s dm_get_live_table() call out from under the
_minor_lock spinlock.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1202449
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Reported-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Regression in recent patch causes crash on error path.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIVAwUAVQyiMDnsnt1WYoG5AQKV4A//SwSZQmnbSTDZ1H5DyERXBkNNKkyFkZCL
9jwl60HaVA5U0N/fP97ku/757QtODYjjHLzlkYqgiiqMBGPFFIKa78D6Su0Z8uJK
FtMJQaalhsWOCXHUz2R6DPHR1e9x2mlrJgqvnIbnbFXym1PZJ/fo3ccsLX3qqmNJ
F4oFYpy79+yMCUSTn8Sv9czzSE19k5XC/Hz14drWi5ox9zpZmpFbMbQdNaKFa2Uy
kphgEdvFNF9C2X+rJeeY91ofz/keSsZVL8HqxLThjxKXsZ+krpywsXa2nX7W00ya
zt49VK78PL3v2eIAgf1XaBYO3oyjyzEaVuAkFOktv/LuLKAbFbup2Yi5JtHzieor
NoaV9hfgCq1OqxkY8zI8hLcL6HS2vvk8vUXbb2Xqa6X5w6xuyjRMzzAyxqXuG4qm
16jAU2RvITfdwAeNacyzq/xAWBg+jUgeg+lO/Z71O0qS7chGUNzve+hwQOtoeHzP
fmEetiZa+gbOQEBZT3Ubsw4L9CqCaYVx7KOiE8Py5nbUVPMz2nF6jjt6WSPOgyWv
JrflLIoEE5l6QffnEzcJb3OR920TKZx17/87KvBwct9XH3R0+UgJLFggHrYQAiUA
54CnXVPl0MPq5vtROviX2ZWMThOk1s5HAsMhCBuTdqghtQQdNP7DWmGfg7ys5uUq
152cPd+xGw4=
=PM5i
-----END PGP SIGNATURE-----
Merge tag 'md/4.0-rc4-fix' of git://neil.brown.name/md
Pull bugfix for md from Neil Brown:
"One fix for md in 4.0-rc4
Regression in recent patch causes crash on error path"
* tag 'md/4.0-rc4-fix' of git://neil.brown.name/md:
md: fix problems with freeing private data after ->run failure.
- fix thin target to always zero-fill reads to unprovisioned blocks
- fix to interlock device destruction's suspend from internal suspends
- fix 2 snapshot exception store handover bugs
- fix dm-io to cope with DISCARD and WRITE_SAME capabilities changing
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJVDIJ8AAoJEMUj8QotnQNaynEIAK4Q3mvOeI6PIwuC2iOWRglJ
C0xzTKMR6VDQeM/uMRLeiiqxdPs6b89OdSmJHKJYOrdsusjG0a4IBAO9vDayb6sW
AxOlbnpXdoiH3cqQ4dISJIfS9qM49qGM40CAflcLKYsGfLnstVLt+4l9HaWaRrHj
b2kAC+I6PDDybFlKD5KsMB56sjzhhqg/lnnwkY2omV4vLHf6RrhVhK5gcEPF7VN0
SJ5HKSNBHQnpYSEECHXkxGvZ/ZKTe9n7q0t6Q3nnTSUFTXS6esgTsK1q2AykADDR
3W1LkrAAFP31AWKPkUwCEe3C8Rk3rCsrq2H14woWX1/UxSXNPlMYCU50ak4TVmc=
=Svyk
-----END PGP SIGNATURE-----
Merge tag 'dm-4.0-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull devicemapper fixes from Mike Snitzer:
"A handful of stable fixes for DM:
- fix thin target to always zero-fill reads to unprovisioned blocks
- fix to interlock device destruction's suspend from internal
suspends
- fix 2 snapshot exception store handover bugs
- fix dm-io to cope with DISCARD and WRITE_SAME capabilities changing"
* tag 'dm-4.0-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm io: deal with wandering queue limits when handling REQ_DISCARD and REQ_WRITE_SAME
dm snapshot: suspend merging snapshot when doing exception handover
dm snapshot: suspend origin when doing exception handover
dm: hold suspend_lock while suspending device during device deletion
dm thin: fix to consistently zero-fill reads to unprovisioned blocks
drivers/md/md-cluster.c:190:6: sparse: symbol 'recover_bitmaps' was not declared. Should it be static?
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
A --cluster-confirm without an --add (by another node) can
crash the kernel.
Fix it by guarding it using a state.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If ->run() fails, it can either free the data structures it
allocated, or leave that task to ->free() which will be called
on failures.
However:
md.c calls ->free() even if ->private_data is NULL, which
causes problems in some personalities.
raid0.c frees the data, but doesn't clear ->private_data,
which will become a problem when we fix md.c
So better fix both these issues at once.
Reported-by: Richard W.M. Jones <rjones@redhat.com>
Fixes: 5aa61f427e
URL: https://bugzilla.kernel.org/show_bug.cgi?id=94381
Signed-off-by: NeilBrown <neilb@suse.de>
neilb: modified to not corrupt ->resync_max_sectors.
sector_div usage fixed by Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: NeilBrown <neilb@suse.de>
Since it's possible for the discard and write same queue limits to
change while the upper level command is being sliced and diced, fix up
both of them (a) to reject IO if the special command is unsupported at
the start of the function and (b) read the limits once and let the
commands error out on their own if the status happens to change.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
The "dm snapshot: suspend origin when doing exception handover" commit
fixed a exception store handover bug associated with pending exceptions
to the "snapshot-origin" target.
However, a similar problem exists in snapshot merging. When snapshot
merging is in progress, we use the target "snapshot-merge" instead of
"snapshot-origin". Consequently, during exception store handover, we
must find the snapshot-merge target and suspend its associated
mapped_device.
To avoid lockdep warnings, the target must be suspended and resumed
without holding _origins_lock.
Introduce a dm_hold() function that grabs a reference on a
mapped_device, but unlike dm_get(), it doesn't crash if the device has
the DMF_FREEING flag set, it returns an error in this case.
In snapshot_resume() we grab the reference to the origin device using
dm_hold() while holding _origins_lock (_origins_lock guarantees that the
device won't disappear). Then we release _origins_lock, suspend the
device and grab _origins_lock again.
NOTE to stable@ people:
When backporting to kernels 3.18 and older, use dm_internal_suspend and
dm_internal_resume instead of dm_internal_suspend_fast and
dm_internal_resume_fast.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
In the function snapshot_resume we perform exception store handover. If
there is another active snapshot target, the exception store is moved
from this target to the target that is being resumed.
The problem is that if there is some pending exception, it will point to
an incorrect exception store after that handover, causing a crash due to
dm-snap-persistent.c:get_exception()'s BUG_ON.
This bug can be triggered by repeatedly changing snapshot permissions
with "lvchange -p r" and "lvchange -p rw" while there are writes on the
associated origin device.
To fix this bug, we must suspend the origin device when doing the
exception store handover to make sure that there are no pending
exceptions:
- introduce _origin_hash that keeps track of dm_origin structures.
- introduce functions __lookup_dm_origin, __insert_dm_origin and
__remove_dm_origin that manipulate the origin hash.
- modify snapshot_resume so that it calls dm_internal_suspend_fast() and
dm_internal_resume_fast() on the origin device.
NOTE to stable@ people:
When backporting to kernels 3.12-3.18, use dm_internal_suspend and
dm_internal_resume instead of dm_internal_suspend_fast and
dm_internal_resume_fast.
When backporting to kernels older than 3.12, you need to pick functions
dm_internal_suspend and dm_internal_resume from the commit
fd2ed4d252.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
__dm_destroy() must take the suspend_lock so that its presuspend and
postsuspend calls do not race with an internal suspend.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
It was always intended that a read to an unprovisioned block will return
zeroes regardless of whether the pool is in read-only or read-write
mode. thin_bio_map() was inconsistent with its handling of such reads
when the pool is in read-only mode, it now properly zero-fills the bios
it returns in response to unprovisioned block reads.
Eliminate thin_bio_map()'s special read-only mode handling of -ENODATA
and just allow the IO to be deferred to the worker which will result in
pool->process_bio() handling the IO (which already properly zero-fills
reads to unprovisioned blocks).
Reported-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Recent change to bitmap_create mishandles errors.
In particular a failure doesn't alway cause 'err' to be set.
Signed-off-by: NeilBrown <neilb@suse.de>
Since __ATTR_PREALLOC was introduced in v3.19-rc1~78^2~18
it can now be used by md.
This ensure that writing to these sysfs attributes will never
block due to a memory allocation.
Such blocking could become a deadlock if mdmon is trying to
reconfigure an array after a failure prior to re-enabling writes.
Signed-off-by: NeilBrown <neilb@suse.de>
When we have more than 1 drive failure, it's possible we start
rebuild one drive while leaving another faulty drive in array.
To determine whether array will be optimal after building, current
code only check whether a drive is missing, which could potentially
lead to data corruption. This patch is to add checking Faulty flag.
Signed-off-by: NeilBrown <neilb@suse.de>
When a drive is marked write-mostly it should only be the
target of reads if there is no other option.
This behaviour was broken by
commit 9dedf60313
md/raid1: read balance chooses idlest disk for SSD
which causes a write-mostly device to be *preferred* is some cases.
Restore correct behaviour by checking and setting
best_dist_disk and best_pending_disk rather than best_disk.
We only need to test one of these as they are both changed
from -1 or >=0 at the same time.
As we leave min_pending and best_dist unchanged, any non-write-mostly
device will appear better than the write-mostly device.
Reported-by: Tomáš Hodek <tomas.hodek@volny.cz>
Reported-by: Dark Penguin <darkpenguin@yandex.ru>
Signed-off-by: NeilBrown <neilb@suse.de>
Link: http://marc.info/?l=linux-raid&m=135982797322422
Fixes: 9dedf60313
Cc: stable@vger.kernel.org (3.6+)
Algorithm:
1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues
ioctl(ADD_NEW_DISC with disc.state set to MD_DISK_CLUSTER_ADD)
2. Node 1 sends NEWDISK with uuid and slot number
3. Other nodes issue kobject_uevent_env with uuid and slot number
(Steps 4,5 could be a udev rule)
4. In userspace, the node searches for the disk, perhaps
using blkid -t SUB_UUID=""
5. Other nodes issue either of the following depending on whether the disk
was found:
ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and
disc.number set to slot number)
ioctl(CLUSTERED_DISK_NACK)
6. Other nodes drop lock on no-new-devs (CR) if device is found
7. Node 1 attempts EX lock on no-new-devs
8. If node 1 gets the lock, it sends METADATA_UPDATED after unmarking the disk
as SpareLocal
9. If not (get no-new-dev lock), it fails the operation and sends METADATA_UPDATED
10. Other nodes understand if the device is added or not by reading the superblock again after receiving the METADATA_UPDATED message.
Signed-off-by: Lidong Zhong <lzhong@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
set choose_first true for cluster read in read balance when the area
is resyncing.
Signed-off-by: Lidong Zhong <lzhong@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
If there is a resync going on, all nodes must suspend writes to the
range. This is recorded in the suspend_info/suspend_list.
If there is an I/O within the ranges of any of the suspend_info,
should_suspend will return 1.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
When a RESYNC_START message arrives, the node removes the entry
with the current slot number and adds the range to the
suspend_list.
Simlarly, when a RESYNC_FINISHED message is received, node clears
entry with respect to the bitmap number.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
When a resync is initiated, RESYNCING message is sent to all active
nodes with the range (lo,hi). When the resync is over, a RESYNCING
message is sent with (0,0). A high sector value of zero indicates
that the resync is over.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Re-reads the devices by invalidating the cache.
Since we don't write to faulty devices, this is detected using
events recorded in the devices. If it is old as compared to the mddev
mark it is faulty.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
- request to send a message
- make changes to superblock
- send messages telling everyone that the superblock has changed
- other nodes all read the superblock
- other nodes all ack the messages
- updating node release the "I'm sending a message" resource.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
The sending part is split in two functions to make sure
atomicity of the operations, such as the MD superblock update.
Signed-off-by: Lidong Zhong <lzhong@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
1. receive status
sender receiver receiver
ACK:CR ACK:CR ACK:CR
2. sender get EX of TOKEN
sender get EX of MESSAGE
sender receiver receiver
TOKEN:EX ACK:CR ACK:CR
MESSAGE:EX
ACK:CR
3. sender write LVB.
sender down-convert MESSAGE from EX to CR
sender try to get EX of ACK
[ wait until all receiver has *processed* the MESSAGE ]
[ triggered by bast of ACK ]
receiver get CR of MESSAGE
receiver read LVB
receiver processes the message
[ wait finish ]
receiver release ACK
sender receiver receiver
TOKEN:EX MESSAGE:CR MESSAGE:CR
MESSAGE:CR
ACK:EX
4. sender down-convert ACK from EX to CR
sender release MESSAGE
sender release TOKEN
receiver upconvert to EX of MESSAGE
receiver get CR of ACK
receiver release MESSAGE
sender receiver receiver
ACK:CR ACK:CR ACK:CR
Signed-off-by: Lidong Zhong <lzhong@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
The DLM informs us in case of node failure with the DLM slot number.
cluster_info->recovery_map sets the bit corresponding to the slot number
and wakes up the recovery thread.
The recovery thread:
1. Derives the slot number from the recovery_map
2. Locks the bitmap corresponding to the slot
3. Copies the set bits to the node-local bitmap
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
bitmap_copy_from_slot reads the bitmap from the slot mentioned.
It then copies the set bits to the node local bitmap.
This is helper function for the resync operation on node failure.
bitmap_set_memory_bits() currently assumes it is only run at startup and that
they bitmap is currently empty. So if it finds that a region is already
marked as dirty, it won't mark it dirty again. Change bitmap_set_memory_bits()
to always set the NEEDED_MASK bit if 'needed' is set.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
When a node joins, it does not know of other nodes performing resync.
So, each node keeps the resync information in it's LVB. When a new
node joins, it reads the LVB of each "online" bitmap.
[TODO] The new node attempts to get the PW lock on other bitmap, if
it is successful, it reads the bitmap and performs the resync (if
required) on it's behalf.
If the node does not get the PW, it requests CR and reads the LVB
for the resync information.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
On-disk format:
0 4k 8k 12k
-------------------------------------------------------------------
| idle | md super | bm super [0] + bits |
| bm bits[0, contd] | bm super[1] + bits | bm bits[1, contd] |
| bm super[2] + bits | bm bits [2, contd] | bm super[3] + bits |
| bm bits [3, contd] | | |
Bitmap super has a field nodes, which defines the maximum number
of nodes the device can use. While reading the bitmap super, if
the cluster finds out that the number of nodes is > 0:
1. Requests the md-cluster module.
2. Calls md_cluster_ops->join(), which sets up clustering such as
joining DLM lockspace.
Since the first time, the first bitmap is read. After the call
to the cluster_setup, the bitmap offset is adjusted and the
superblock is re-read. This also ensures the bitmap is read
the bitmap lock (when bitmap lock is introduced in later patches)
Questions:
1. cluster name is repeated in all bitmap supers. Is that okay?
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
DLM offers callbacks when a node fails and the lock remastery
is performed:
1. recover_prep: called when DLM discovers a node is down
2. recover_slot: called when DLM identifies the node and recovery
can start
3. recover_done: called when all nodes have completed recover_slot
recover_slot() and recover_done() are also called when the node joins
initially in order to inform the node with its slot number. These slot
numbers start from one, so we deduct one to make it start with zero
which the cluster-md code uses.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
md_cluster_info stores the cluster information in the MD device.
The join() is called when mddev detects it is a clustered device.
The main responsibilities are:
1. Setup a DLM lockspace
2. Setup all initial locks such as super block locks and bitmap lock (will come later)
The leave() clears up the lockspace and all the locks held.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
A dlm_lock_resource is a structure which contains all information
required for locking using DLM. The init function allocates the
lock and acquires the lock in NL mode. The unlock function
converts the lock resource to NL mode. This is done to preserve
LVB and for faster processing of locks. The lock resource is
DLM unlocked only in the lockres_free function, which is the end
of life of the lock resource.
Signed-off-by: Lidong Zhong <lzhong@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
to changes that enable effective use of an unbound workqueue across
all available CPUs. A large battery of tests were performed to
validate these changes, summary of results is available here:
https://www.redhat.com/archives/dm-devel/2015-February/msg00106.html
- A few additional stable fixes (to DM core, dm-snapshot and dm-mirror)
and a small fix to the dm-space-map-disk.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJU51MsAAoJEMUj8QotnQNan8wH/3Oa2/ofbX8zNYtdDxFBK7W0
pHwVA2CH9yzDDi90X9GsQODmXerMrYf+SHo/RsQTpSnt5WI/4ZVP/1EoNGu6T2Yr
XiaKc/Jwo+ScxdhoHSNtyGL5vKeipX6clgz1wcd/4/UBcBAr0Vxj25s8Ta7naK2V
hZyWnt3MCGEDQ45NVBmLOU78Cl68LGf8JOUY0l3cd8ehQjGyR9a12GXwRktepslp
hw9xu8uYmE93fD+MIe/fUs7W/WvIVgFSskhswlvD3kAFpEAsMczjLGVSRQlBE90L
OK7v1HKxiIUJa17g3LmiCFa59ZjrnSHGpOOv9IPAFU/LzFIU47lcFwRFjgfB8po=
=8vFi
-----END PGP SIGNATURE-----
Merge tag 'dm-3.20-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull more device mapper changes from Mike Snitzer:
- Significant dm-crypt CPU scalability performance improvements thanks
to changes that enable effective use of an unbound workqueue across
all available CPUs. A large battery of tests were performed to
validate these changes, summary of results is available here:
https://www.redhat.com/archives/dm-devel/2015-February/msg00106.html
- A few additional stable fixes (to DM core, dm-snapshot and dm-mirror)
and a small fix to the dm-space-map-disk.
* tag 'dm-3.20-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm snapshot: fix a possible invalid memory access on unload
dm: fix a race condition in dm_get_md
dm crypt: sort writes
dm crypt: add 'submit_from_crypt_cpus' option
dm crypt: offload writes to thread
dm crypt: remove unused io_pool and _crypt_io_pool
dm crypt: avoid deadlock in mempools
dm crypt: don't allocate pages for a partial request
dm crypt: use unbound workqueue for request processing
dm io: reject unsupported DISCARD requests with EOPNOTSUPP
dm mirror: do not degrade the mirror on discard error
dm space map disk: fix sm_disk_count_is_more_than_one()
Pull kconfig updates from Michal Marek:
"Yann E Morin was supposed to take over kconfig maintainership, but
this hasn't happened. So I'm sending a few kconfig patches that I
collected:
- Fix for missing va_end in kconfig
- merge_config.sh displays used if given too few arguments
- s/boolean/bool/ in Kconfig files for consistency, with the plan to
only support bool in the future"
* 'kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
kconfig: use va_end to match corresponding va_start
merge_config.sh: Display usage if given too few arguments
kconfig: use bool instead of boolean for type definition attributes
When the snapshot target is unloaded, snapshot_dtr() waits until
pending_exceptions_count drops to zero. Then, it destroys the snapshot.
Therefore, the function that decrements pending_exceptions_count
should not touch the snapshot structure after the decrement.
pending_complete() calls free_pending_exception(), which decrements
pending_exceptions_count, and then it performs up_write(&s->lock) and it
calls retry_origin_bios() which dereferences s->origin. These two
memory accesses to the fields of the snapshot may touch the dm_snapshot
struture after it is freed.
This patch moves the call to free_pending_exception() to the end of
pending_complete(), so that the snapshot will not be destroyed while
pending_complete() is in progress.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
The function dm_get_md finds a device mapper device with a given dev_t,
increases the reference count and returns the pointer.
dm_get_md calls dm_find_md, dm_find_md takes _minor_lock, finds the
device, tests that the device doesn't have DMF_DELETING or DMF_FREEING
flag, drops _minor_lock and returns pointer to the device. dm_get_md then
calls dm_get. dm_get calls BUG if the device has the DMF_FREEING flag,
otherwise it increments the reference count.
There is a possible race condition - after dm_find_md exits and before
dm_get is called, there are no locks held, so the device may disappear or
DMF_FREEING flag may be set, which results in BUG.
To fix this bug, we need to call dm_get while we hold _minor_lock. This
patch renames dm_find_md to dm_get_md and changes it so that it calls
dm_get while holding the lock.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
yet-another-livelock in raid5, and a problem with write errors
to 4K-block devices.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIVAwUAVOPorznsnt1WYoG5AQKa5A//RmLurotaJvHt8YA8jzlzo2KY/9jxIv8x
jF0CQ0Kva056GG27dlkMPBoddrVk7/WaCoz/ICBKgfSTCpiEFekNlExLkeLXrgUb
mKljS/HKhP4KQjiy519HiT2v9BRTqZD3l6P2405qVAZGTymwhelAp3y1I5ZIRFOr
k2Uo6PNT/XYvzwWAy5iEBrE7zZl3MZJG5RjW2Wq6JKaWMUQaQERu5OhugWEfCI3U
yVomILlwiYqC85WLgrVob/OqCSoA3UsIZkZKvKJCP8Y4C9QJzquEeZy4z0swgJq0
FMsu2WqmI3/ZNFvmlSozN05CEUMkZF6E8hGJcT+f1Wa1NE40zrzDkVsVvsGMD8v0
Ek7PTiA3X7AhqRl4Lt2Gs8kfDJ9MIRXzvXJv9UUO7PtVQGUhVdqmcqsodyqA7+lw
63hgSpEUtQTkpeWwcYQY6Impr/6jGxHwQKzFlLltaDvmAeOBd6gjAq2bdfPO5tox
FiSoClhor4yTc274+s5ORk9xDh61d7ZbnFJ+QN7YioyaSD2LHYIl9EkVFDe4OMXV
q39VzXt+uEYx6hxfmnpjPyPvPiyuDsnXzoTMcaaGsJL4TOazoIEj5AFGmg5i3/yw
6UVjKxJ43SMM7qg3VbgZpenGJwvRkP5mqQwMNDZICXdWulQTMwM5w3AidNqwYAaU
oEPAMfZNGvE=
=1BWV
-----END PGP SIGNATURE-----
Merge tag 'md/3.20-fixes' of git://neil.brown.name/md
Pull md bugfixes from Neil Brown:
"Three bug md fixes for 3.20
yet-another-livelock in raid5, and a problem with write errors to
4K-block devices"
* tag 'md/3.20-fixes' of git://neil.brown.name/md:
md/raid5: Fix livelock when array is both resyncing and degraded.
md/raid10: round up to bdev_logical_block_size in narrow_write_error.
md/raid1: round up to bdev_logical_block_size in narrow_write_error
Commit a7854487cd:
md: When RAID5 is dirty, force reconstruct-write instead of read-modify-write.
Causes an RCW cycle to be forced even when the array is degraded.
A degraded array cannot support RCW as that requires reading all data
blocks, and one may be missing.
Forcing an RCW when it is not possible causes a live-lock and the code
spins, repeatedly deciding to do something that cannot succeed.
So change the condition to only force RCW on non-degraded arrays.
Reported-by: Manibalan P <pmanibalan@amiindia.co.in>
Bisected-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Tested-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Fixes: a7854487cd
Cc: stable@vger.kernel.org (v3.7+)
Write requests are sorted in a red-black tree structure and are
submitted in the sorted order.
In theory the sorting should be performed by the underlying disk
scheduler, however, in practice the disk scheduler only accepts and
sorts a finite number of requests. To allow the sorting of all
requests, dm-crypt needs to implement its own sorting.
The overhead associated with rbtree-based sorting is considered
negligible so it is not used conditionally. Even on SSD sorting can be
beneficial since in-order request dispatch promotes lower latency IO
completion to the upper layers.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>