linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-12 15:11:50 +00:00

Author	SHA1	Message	Date
Linus Torvalds	0a135ba14d	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: percpu: add __percpu sparse annotations to what's left percpu: add __percpu sparse annotations to fs percpu: add __percpu sparse annotations to core kernel subsystems local_t: Remove leftover local.h this_cpu: Remove pageset_notifier this_cpu: Page allocator conversion percpu, x86: Generic inc / dec percpu instructions local_t: Move local.h include to ringbuffer.c and ring_buffer_benchmark.c module: Use this_cpu_xx to dynamically allocate counters local_t: Remove cpu_local_xx macros percpu: refactor the code in pcpu_[de]populate_chunk() percpu: remove compile warnings caused by __verify_pcpu_ptr() percpu: make accessors check for percpu pointer in sparse percpu: add __percpu for sparse. percpu: make access macros universal percpu: remove per_cpu__ prefix.	2010-03-03 07:34:18 -08:00
Martin K. Petersen	8a78362c4e	block: Consolidate phys_segment and hw_segment limits Except for SCSI no device drivers distinguish between physical and hardware segment limits. Consolidate the two into a single segment limit. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-02-26 13:58:08 +01:00
Martin K. Petersen	086fa5ff08	block: Rename blk_queue_max_sectors to blk_queue_max_hw_sectors The block layer calling convention is blk_queue_<limit name>. blk_queue_max_sectors predates this practice, leading to some confusion. Rename the function to appropriately reflect that its intended use is to set max_hw_sectors. Also introduce a temporary wrapper for backwards compability. This can be removed after the merge window is closed. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-02-26 13:58:08 +01:00
Tejun Heo	a29d8b8e2d	percpu: add __percpu sparse annotations to what's left Add __percpu sparse annotations to places which didn't make it in one of the previous patches. All converions are trivial. These annotations are to make sparse consider percpu variables to be in a different address space and warn if accessed without going through percpu accessors. This patch doesn't affect normal builds. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Borislav Petkov <borislav.petkov@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Len Brown <lenb@kernel.org> Cc: Neil Brown <neilb@suse.de>	2010-02-17 11:17:38 +09:00
Alasdair G Kergon	9307f6b19a	dm: sysfs revert add empty release function to avoid debug warning Revert commit `d2bb7df8ca` at Greg's request. Author: Milan Broz <mbroz@redhat.com> Date: Thu Dec 10 23:51:53 2009 +0000 dm: sysfs add empty release function to avoid debug warning This patch just removes an unnecessary warning: kobject: 'dm': does not have a release() function, it is broken and must be fixed. The kobject is embedded in mapped device struct, so code does not need to release memory explicitly here. Cc: Greg KH <gregkh@suse.de> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-02-16 18:43:04 +00:00
Kiyoshi Ueda	9eef87da2a	dm mpath: fix stall when requeueing io This patch fixes the problem that system may stall if target's ->map_rq returns DM_MAPIO_REQUEUE in map_request(). E.g. stall happens on 1 CPU box when a dm-mpath device with queue_if_no_path bounces between all-paths-down and paths-up on I/O load. When target's ->map_rq returns DM_MAPIO_REQUEUE, map_request() requeues the request and returns to dm_request_fn(). Then, dm_request_fn() doesn't exit the I/O dispatching loop and continues processing the requeued request again. This map and requeue loop can be done with interrupt disabled, so 1 CPU system can be stalled if this situation happens. For example, commands below can stall my 1 CPU box within 1 minute or so: # dmsetup table mp mp: 0 2097152 multipath 1 queue_if_no_path 0 1 1 service-time 0 1 2 8:144 1 1 # while true; do dd if=/dev/mapper/mp of=/dev/null bs=1M count=100; done & # while true; do \ > dmsetup message mp 0 "fail_path 8:144" \ > dmsetup suspend --noflush mp \ > dmsetup resume mp \ > dmsetup message mp 0 "reinstate_path 8:144" \ > done To fix the problem above, this patch changes dm_request_fn() to exit the I/O dispatching loop once if a request is requeued in map_request(). Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: stable@kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-02-16 18:43:01 +00:00
Takahiro Yasui	558569aa9d	dm raid1: fix null pointer dereference in suspend When suspending a failed mirror, bios are completed by mirror_end_io() and __rh_lookup() in dm_rh_dec() returns NULL where a non-NULL return value is required by design. Fix this by not changing the state of the recovery failed region from DM_RH_RECOVERING to DM_RH_NOSYNC in dm_rh_recovery_end(). Issue On 2.6.33-rc1 kernel, I hit the bug when I suspended the failed mirror by dmsetup command. BUG: unable to handle kernel NULL pointer dereference at 00000020 IP: [<f94f38e2>] dm_rh_dec+0x35/0xa1 [dm_region_hash] ... EIP: 0060:[<f94f38e2>] EFLAGS: 00010046 CPU: 0 EIP is at dm_rh_dec+0x35/0xa1 [dm_region_hash] EAX: 00000286 EBX: 00000000 ECX: 00000286 EDX: 00000000 ESI: eff79eac EDI: eff79e80 EBP: f6915cd4 ESP: f6915cc4 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 Process dmsetup (pid: 2849, ti=f6914000 task=eff03e80 task.ti=f6914000) ... Call Trace: [<f9530af6>] ? mirror_end_io+0x53/0x1b1 [dm_mirror] [<f9413104>] ? clone_endio+0x4d/0xa2 [dm_mod] [<f9530aa3>] ? mirror_end_io+0x0/0x1b1 [dm_mirror] [<f94130b7>] ? clone_endio+0x0/0xa2 [dm_mod] [<c02d6bcb>] ? bio_endio+0x28/0x2b [<f952f303>] ? hold_bio+0x2d/0x62 [dm_mirror] [<f952f942>] ? mirror_presuspend+0xeb/0xf7 [dm_mirror] [<c02aa3e2>] ? vmap_page_range+0xb/0xd [<f9414c8d>] ? suspend_targets+0x2d/0x3b [dm_mod] [<f9414ca9>] ? dm_table_presuspend_targets+0xe/0x10 [dm_mod] [<f941456f>] ? dm_suspend+0x4d/0x150 [dm_mod] [<f941767d>] ? dev_suspend+0x55/0x18a [dm_mod] [<c0343762>] ? _copy_from_user+0x42/0x56 [<f9417fb0>] ? dm_ctl_ioctl+0x22c/0x281 [dm_mod] [<f9417628>] ? dev_suspend+0x0/0x18a [dm_mod] [<f9417d84>] ? dm_ctl_ioctl+0x0/0x281 [dm_mod] [<c02c3c4b>] ? vfs_ioctl+0x22/0x85 [<c02c422c>] ? do_vfs_ioctl+0x4cb/0x516 [<c02c42b7>] ? sys_ioctl+0x40/0x5a [<c0202858>] ? sysenter_do_call+0x12/0x28 Analysis When recovery process of a region failed, dm_rh_recovery_end() function changes the state of the region from RM_RH_RECOVERING to DM_RH_NOSYNC. When recovery_complete() is executed between dm_rh_update_states() and dm_writes() in do_mirror(), bios are processed with the region state, DM_RH_NOSYNC. However, the region data is freed without checking its pending count when dm_rh_update_states() is called next time. When bios are finished by mirror_end_io(), __rh_lookup() in dm_rh_dec() returns NULL even though a valid return value are expected. Solution Remove the state change of the recovery failed region from DM_RH_RECOVERING to DM_RH_NOSYNC in dm_rh_recovery_end(). We can remove the state change because: - If the region data has been released by dm_rh_update_states(), a new region data is created with the state of DM_RH_NOSYNC, and bios are processed according to the DM_RH_NOSYNC state. - If the region data has not been released by dm_rh_update_states(), a state of the region is DM_RH_RECOVERING and bios are put in the delayed_bio list. The flag change from DM_RH_RECOVERING to DM_RH_NOSYNC in dm_rh_recovery_end() was added in the following commit: dm raid1: handle resync failures author Jonathan Brassow <jbrassow@redhat.com> Thu, 12 Jul 2007 16:29:04 +0000 (17:29 +0100) http://git.kernel.org/linus/f44db678edcc6f4c2779ac43f63f0b9dfa28b724 Signed-off-by: Takahiro Yasui <tyasui@redhat.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-02-16 18:42:58 +00:00
Mikulas Patocka	5528d17de1	dm raid1: fail writes if errors are not handled and log fails If the mirror log fails when the handle_errors option was not selected and there is no remaining valid mirror leg, writes return success even though they weren't actually written to any device. This patch completes them with EIO instead. This code path is taken: do_writes: bio_list_merge(&ms->failures, &sync); do_failures: if (!get_valid_mirror(ms)) (false) else if (errors_handled(ms)) (false) else bio_endio(bio, 0); The logic in do_failures is based on presuming that the write was already tried: if it succeeded at least on one leg (without handle_errors) it is reported as success. Reference: https://bugzilla.redhat.com/show_bug.cgi?id=555197 Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-02-16 18:42:55 +00:00
Jonathan Brassow	ebfd32bba9	dm log: userspace fix overhead_size calcuations This patch fixes two bugs that revolve around the miscalculation and misuse of the variable 'overhead_size'. 'overhead_size' is the size of the various header structures used during communication. The first bug is the use of 'sizeof' with the pointer of a structure instead of the structure itself - resulting in the wrong size being computed. This is then used in a check to see if the payload (data_size) would be to large for the preallocated structure. Since the bug produces a smaller value for the overhead, it was possible for the structure to be breached. (Although the current users of the code do not currently send enough data to trigger this bug.) The second bug is that the 'overhead_size' value is used to compute how much of the preallocated space should be cleared before populating it with fresh data. This should have simply been 'sizeof(struct cn_msg)' not overhead_size. The fact that 'overhead_size' was computed incorrectly made this problem "less bad" - leaving only a pointer's worth of space at the end uncleared. Thus, this bug was never producing a bad result, but still needs to be fixed - especially now that the value is computed correctly. Cc: stable@kernel.org Signed-off-by: Jonathan Brassow <jbrassow@redhat.com Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-02-16 18:42:53 +00:00
Mike Snitzer	55f67f2ded	dm snapshot: persistent annotate work_queue as on stack chunk_io() declares its 'struct mdata_req' on the stack and then initializes its 'struct work_struct' member. Annotate the initialization of this workqueue with INIT_WORK_ON_STACK to suppress a debugobjects warning seen when CONFIG_DEBUG_OBJECTS_WORK is enabled. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-02-16 18:42:51 +00:00
Nikanth Karthikesan	781248c1b5	dm stripe: avoid divide by zero with invalid stripe count If a table containing zero as stripe count is passed into stripe_ctr the code attempts to divide by zero. This patch changes DM_TABLE_LOAD to return -EINVAL if the stripe count is zero. We now get the following error messages: device-mapper: table: 253:0: striped: Invalid stripe count device-mapper: ioctl: error adding target to table Signed-off-by: Nikanth Karthikesan <knikanth@suse.de> Cc: stable@kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-02-16 18:42:47 +00:00
NeilBrown	ef286f6fa6	md: fix some lockdep issues between md and sysfs. ====== This fix is related to http://bugzilla.kernel.org/show_bug.cgi?id=15142 but does not address that exact issue. ====== sysfs does like attributes being removed while they are being accessed (i.e. read or written) and waits for the access to complete. As accessing some md attributes takes the same lock that is held while removing those attributes a deadlock can occur. This patch addresses 3 issues in md that could lead to this deadlock. Two relate to calling flush_scheduled_work while the lock is held. This is probably a bad idea in general and as we use schedule_work to delete various sysfs objects it is particularly bad. In one case flush_scheduled_work is called from md_alloc (called by md_probe) called from do_md_run which holds the lock. This call is only present to ensure that ->gendisk is set. However we can be sure that gendisk is always set (though possibly we couldn't when that code was originally written. This is because do_md_run is called in three different contexts: 1/ from md_ioctl. This requires that md_open has succeeded, and it fails if ->gendisk is not set. 2/ from writing a sysfs attribute. This can only happen if the mddev has been registered in sysfs which happens in md_alloc after ->gendisk has been set. 3/ from autorun_array which is only called by autorun_devices, which checks for ->gendisk to be set before calling autorun_array. So the call to md_probe in do_md_run can be removed, and the check on ->gendisk can also go. In the other case flush_scheduled_work is being called in do_md_stop, purportedly to wait for all md_delayed_delete calls (which delete the component rdevs) to complete. However there really isn't any need to wait for them - they have already been disconnected in all important ways. The third issue is that raid5->stop() removes some attribute names while the lock is held. There is already some infrastructure in place to delay attribute removal until after the lock is released (using schedule_work). So extend that infrastructure to remove the raid5_attrs_group. This does not address all lockdep issues related to the sysfs "s_active" lock. The rest can be address by splitting that lockdep context between symlinks and non-symlinks which hopefully will happen. Signed-off-by: NeilBrown <neilb@suse.de>	2010-02-10 11:26:09 +11:00
NeilBrown	9eb07c2592	md: fix 'degraded' calculation when starting a reshape. This code was written long ago when it was not possible to reshape a degraded array. Now it is so the current level of degraded-ness needs to be taken in to account. Also newly addded devices should only reduce degradedness if they are deemed to be in-sync. In particular, if you convert a RAID5 to a RAID6, and increase the number of devices at the same time, then the 5->6 conversion will make the array degraded so the current code will produce a wrong value for 'degraded' - "-1" to be precise. If the reshape runs to completion end_reshape will calculate a correct new value for 'degraded', but if a device fails during the reshape an incorrect decision might be made based on the incorrect value of "degraded". This patch is suitable for 2.6.32-stable and if they are still open, 2.6.31-stable and 2.6.30-stable as well. Cc: stable@kernel.org Reported-by: Michael Evans <mjevans1983@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2010-02-09 16:34:29 +11:00
Martin K. Petersen	b27d7f16d3	DM: Fix device mapper topology stacking Make DM use bdev_stack_limits() function so that partition offsets get taken into account when calculating alignment. Clarify stacking warnings. Also remove obsolete clearing of final alignment_offset and misalignment flag. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: Alasdair G. Kergon <agk@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-01-11 14:29:20 +01:00
NeilBrown	404e4b43fd	md: allow a resync that is waiting for other resync to complete, to be aborted. If two arrays share a device, then they will not both resync at the same time. One will wait for the other to complete. While waiting, the MD_RECOVERY_INTR flag is not checked so a device failure, which would make the resync pointless, does not cause the resync to abort, so the failed device cannot be removed (as it cannot be remove while a resync is happening). So add a test for MD_RECOVERY_INTR. Reported-by: Brett Russ <bruss@netezza.com> Signed-off-by: NeilBrown <neilb@suse.de>	2009-12-30 15:25:23 +11:00
NeilBrown	7fb9dadc91	md: remove unnecessary code from do_md_run Since commit `dfc7064500`, ->hot_remove_disks has not removed non-failed devices from an array until recovery is no longer possible. So the code in do_md_run to get around the fact that md_check_recovery (which calls ->hot_remove_disks) would remove partially-in-sync devices is no longer needed. So remove it. Signed-off-by: NeilBrown <neilb@suse.de>	2009-12-30 15:20:43 +11:00
Dan Williams	a2d79c324a	md: make recovery started by do_md_run() visible via sync_action By default md_do_sync() will perform recovery if no other actions are specified. However, action_show() relies on MD_RECOVERY_RECOVER to be set otherwise it returns 'idle'. So, add a missing set MD_RECOVERY_RECOVER when starting recovery. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2009-12-30 15:20:31 +11:00
NeilBrown	0f9552b5dc	md: fix small irregularity with start_ro module parameter The start_ro modules parameter can be used to force arrays to be started in 'auto-readonly' in which they are read-only until the first write. This ensures that no resync/recovery happens until something else writes to the device. This is important for resume-from-disk off an md array. However if an array is started 'readonly' (by writing 'readonly' to the 'array_state' sysfs attribute) we want it to be really 'readonly', not 'auto-readonly'. So strengthen the condition to only set auto-readonly if the array is not already read-only. Signed-off-by: NeilBrown <neilb@suse.de>	2009-12-30 15:20:12 +11:00
NeilBrown	cbd1998377	md: Fix unfortunate interaction with evms evms configures md arrays by: open device send ioctl close device for each different ioctl needed. Since 2.6.29, the device can disappear after the 'close' unless a significant configuration has happened to the device. The change made by "SET_ARRAY_INFO" can too minor to stop the device from disappearing, but important enough that losing the change is bad. So: make sure SET_ARRAY_INFO sets mddev->ctime, and keep the device active as long as ctime is non-zero (it gets zeroed with lots of other things when the array is stopped). This is suitable for -stable kernels since 2.6.29. Signed-off-by: NeilBrown <neilb@suse.de> Cc: stable@kernel.org	2009-12-30 12:08:49 +11:00
Linus Torvalds	53365383c4	Merge git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm * git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm: (80 commits) dm snapshot: use merge origin if snapshot invalid dm snapshot: report merge failure in status dm snapshot: merge consecutive chunks together dm snapshot: trigger exceptions in remaining snapshots during merge dm snapshot: delay merging a chunk until writes to it complete dm snapshot: queue writes to chunks being merged dm snapshot: add merging dm snapshot: permit only one merge at once dm snapshot: support barriers in snapshot merge target dm snapshot: avoid allocating exceptions in merge dm snapshot: rework writing to origin dm snapshot: add merge target dm exception store: add merge specific methods dm snapshot: create function for chunk_is_tracked wait dm snapshot: make bio optional in __origin_write dm mpath: reject messages when device is suspended dm: export suspended state to targets dm: rename dm_suspended to dm_suspended_md dm: swap target postsuspend call and setting suspended flag dm crypt: add plain64 iv ...	2009-12-15 09:12:01 -08:00
Joe Perches	7b75c2f8cf	drivers/md/md.c: use %pU to print UUIDs Signed-off-by: Joe Perches <joe@perches.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-15 08:53:33 -08:00
André Goddard Rosa	e7d2860b69	tree-wide: convert open calls to remove spaces to skip_spaces() lib function Makes use of skip_spaces() defined in lib/string.c for removing leading spaces from strings all over the tree. It decreases lib.a code size by 47 bytes and reuses the function tree-wide: text data bss dec hex filename 64688 584 592 65864 10148 (TOTALS-BEFORE) 64641 584 592 65817 10119 (TOTALS-AFTER) Also, while at it, if we see (str && isspace(str)), we can be sure to remove the first condition (str) as the second one (isspace(str)) also evaluates to 0 whenever str == 0, making it redundant. In other words, "a char equals zero is never a space". Julia Lawall tried the semantic patch (http://coccinelle.lip6.fr) below, and found occurrences of this pattern on 3 more files: drivers/leds/led-class.c drivers/leds/ledtrig-timer.c drivers/video/output.c @@ expression str; @@ ( // ignore skip_spaces cases while (str && isspace(str)) { \(str++;\\|++str;\) } \| - str && isspace(*str) ) Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com> Cc: Julia Lawall <julia@diku.dk> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Neil Brown <neilb@suse.de> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Cc: David Howells <dhowells@redhat.com> Cc: <linux-ext4@vger.kernel.org> Cc: Samuel Ortiz <samuel@sortiz.org> Cc: Patrick McHardy <kaber@trash.net> Cc: Takashi Iwai <tiwai@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-15 08:53:32 -08:00
Dan Williams	06e3c817b7	md: add 'recovery_start' per-device sysfs attribute Enable external metadata arrays to manage rebuild checkpointing via a md/dev-XXX/recovery_start attribute which reflects rdev->recovery_offset Also update resync_start_store to allow 'none' to be written, for consistency. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2009-12-14 12:58:57 +11:00
Dan Williams	4e59ca7da0	md: rcu_read_lock() walk of mddev->disks in md_do_sync() Other walks of this list are either under rcu_read_lock() or the list mutation lock (mddev_lock()). This protects against the improbable case of a disk being removed from the array at the start of md_do_sync(). Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2009-12-14 12:57:43 +11:00
NeilBrown	93be75ffde	md: integrate spares into array at earliest opportunity. As v1.x metadata can record that a member of the array is not completely recovered, it make sense to record that a spare has become a regular member of the array at the earliest opportunity. So remove the tests on "recovery_offset > 0" in super_1_sync as they really aren't needed, and schedule a metadata update immediately after adding spares to a degraded array. This means that if a crash happens immediately after a recovery starts, the new device will be included in the array and recovery will continue from wherever it was up to. Previously this didn't happen unless recovery was at least 1/16 of the way through. Signed-off-by: NeilBrown <neilb@suse.de>	2009-12-14 12:51:41 +11:00
Arnd Bergmann	aa98aa3198	md: move compat_ioctl handling into md.c The RAID ioctls are only implemented in md.c, so the handling for them should also be moved there from fs/compat_ioctl.c. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Neil Brown <neilb@suse.de> Cc: Andre Noll <maan@systemlinux.org> Cc: linux-raid@vger.kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2009-12-14 12:51:41 +11:00
NeilBrown	93bd89a6d5	md: revise Kconfig help for MD_MULTIPATH Make it clear in the config message that MD_MULTIPATH is not under active development. Cc: Oren Held <orenhe@il.ibm.com> Signed-off-by: NeilBrown <neilb@suse.de>	2009-12-14 12:51:41 +11:00
NeilBrown	0efb9e6191	md: add MODULE_DESCRIPTION for all md related modules. Suggested by Oren Held <orenhe@il.ibm.com> Signed-off-by: NeilBrown <neilb@suse.de>	2009-12-14 12:51:41 +11:00
Robert Becker	1e50915fe0	raid: improve MD/raid10 handling of correctable read errors. We've noticed severe lasting performance degradation of our raid arrays when we have drives that yield large amounts of media errors. The raid10 module will queue each failed read for retry, and also will attempt call fix_read_error() to perform the read recovery. Read recovery is performed while the array is frozen, so repeated recovery attempts can degrade the performance of the array for extended periods of time. With this patch I propose adding a per md device max number of corrected read attempts. Each rdev will maintain a count of read correction attempts in the rdev->read_errors field (not used currently for raid10). When we enter fix_read_error() we'll check to see when the last read error occurred, and divide the read error count by 2 for every hour since the last read error. If at that point our read error count exceeds the read error threshold, we'll fail the raid device. In addition in this patch I add sysfs nodes (get/set) for the per md max_read_errors attribute, the rdev->read_errors attribute, and added some printk's to indicate when fix_read_error fails to repair an rdev. For testing I used debugfs->fail_make_request to inject IO errors to the rdev while doing IO to the raid array. Signed-off-by: Robert Becker <Rob.Becker@riverbed.com> Signed-off-by: NeilBrown <neilb@suse.de>	2009-12-14 12:51:41 +11:00
Robert Becker	67b8dc4b06	md/raid10: print more useful messages on device failure. When we get a read error on a device in a RAID10, and attempting to repair the error fails, print more useful messages about why it failed. Signed-off-by: Robert Becker <Rob.Becker@riverbed.com> Signed-off-by: NeilBrown <neilb@suse.de>	2009-12-14 12:51:41 +11:00
NeilBrown	ffa23322b1	md/bitmap: update dirty flag when bitmap bits are explicitly set. There is a sysfs file which allows bits in the write-intent bitmap to be explicit set - indicating that the block is thought to be 'dirty'. When this happens we should really set recovery_cp backwards to include the block to reflect this dirtiness. In particular, a 'resync' process will refuse to start if recovery_cp is beyond the end of the array, so this is needed to allow a resync to be triggered. Signed-off-by: NeilBrown <neilb@suse.de>	2009-12-14 12:51:41 +11:00
NeilBrown	ece5cff0da	md: Support write-intent bitmaps with externally managed metadata. In this case, the metadata needs to not be in the same sector as the bitmap. md will not read/write any bitmap metadata. Config must be done via sysfs and when a recovery makes the array non-degraded again, writing 'true' to 'bitmap/can_clear' will allow bits in the bitmap to be cleared again. Signed-off-by: NeilBrown <neilb@suse.de>	2009-12-14 12:51:41 +11:00
NeilBrown	624ce4f565	md/bitmap: move setting of daemon_lastrun out of bitmap_read_sb Setting daemon_lastrun really has nothing to do with reading the bitmap superblock, it just happens to be needed at the same time. bitmap_read_sb is about to become options, so move that code out to after the call to bitmap_read_sb. Signed-off-by: NeilBrown <neilb@suse.de>	2009-12-14 12:51:41 +11:00
NeilBrown	43a705076e	md: support updating bitmap parameters via sysfs. A new attribute directory 'bitmap' in 'md' is created which contains files for configuring the bitmap. 'location' identifies where the bitmap is, either 'none', or 'file' or 'sector offset from metadata'. Writing 'location' can create or remove a bitmap. Adding a 'file' bitmap this way is not yet supported. 'chunksize' and 'time_base' must be set before 'location' can be set. 'chunksize' can be set before creating a bitmap, but is currently always over-ridden by the bitmap superblock. 'time_base' and 'backlog' can be updated at any time. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Andre Noll <maan@systemlinux.org>	2009-12-14 12:51:41 +11:00
NeilBrown	72e02075a3	md: factor out parsing of fixed-point numbers safe_delay_store can parse fixed point numbers (for fractions of a second). We will want to do that for another sysfs file soon, so factor out the code. Signed-off-by: NeilBrown <neilb@suse.de>	2009-12-14 12:51:41 +11:00
NeilBrown	f6af949c56	md: support bitmap offset appropriate for external-metadata arrays. For md arrays were metadata is managed externally, the kernel does not know about a superblock so the superblock offset is 0. If we want to have a write-intent-bitmap near the end of the devices of such an array, we should support sector_t sized offset. We need offset be possibly negative for when the bitmap is before the metadata, so use loff_t instead. Also add sanity check that bitmap does not overlap with data. Signed-off-by: NeilBrown <neilb@suse.de>	2009-12-14 12:51:41 +11:00
NeilBrown	9cd30fdc33	md: remove needless setting of thread->timeout in raid10_quiesce As bitmap_create and bitmap_destroy already set thread->timeout as appropriate, there is no need to do it in raid10_quiesce. There is a possible need to wake the thread after the timeout has been set low, but it is better to do that where the timeout is actually set low, in bitmap_create. Signed-off-by: NeilBrown <neilb@suse.de>	2009-12-14 12:51:41 +11:00
NeilBrown	1b04be96f6	md: change daemon_sleep to be in 'jiffies' rather than 'seconds'. This removes a lot of multiplications by HZ. Signed-off-by: NeilBrown <neilb@suse.de>	2009-12-14 12:51:41 +11:00
NeilBrown	42a04b5078	md: move offset, daemon_sleep and chunksize out of bitmap structure ... and into bitmap_info. These are all configuration parameters that need to be set before the bitmap is created. Signed-off-by: NeilBrown <neilb@suse.de>	2009-12-14 12:51:41 +11:00
NeilBrown	c3d9714e88	md: collect bitmap-specific fields into one structure. In preparation for making bitmap fields configurable via sysfs, start tidying up by making a single structure to contain the configuration fields. Signed-off-by: NeilBrown <neilb@suse.de>	2009-12-14 12:51:41 +11:00
NeilBrown	709ae4879a	md/raid1: add takeover support for raid5->raid1 A 2-device raid5 array can now be converted to raid1. Signed-off-by: NeilBrown <neilb@suse.de>	2009-12-14 12:51:41 +11:00
NeilBrown	6eef4b21ff	md: add honouring of suspend_{lo,hi} to raid1. This will allow us to stop writeout to portions of the array while they are resynced by someone else - e.g. another node in a cluster. Signed-off-by: NeilBrown <neilb@suse.de>	2009-12-14 12:51:40 +11:00
NeilBrown	729a18663a	md/raid5: don't complete make_request on barrier until writes are scheduled The post-barrier-flush is sent by md as soon as make_request on the barrier write completes. For raid5, the data might not be in the per-device queues yet. So for barrier requests, wait for any pre-reading to be done so that the request will be in the per-device queues. We use the 'preread_active' count to check that nothing is still in the preread phase, and delay the decrement of this count until after write requests have been submitted to the underlying devices. Signed-off-by: NeilBrown <neilb@suse.de>	2009-12-14 12:51:40 +11:00
NeilBrown	a2826aa92e	md: support barrier requests on all personalities. Previously barriers were only supported on RAID1. This is because other levels requires synchronisation across all devices and so needed a different approach. Here is that approach. When a barrier arrives, we send a zero-length barrier to every active device. When that completes - and if the original request was not empty - we submit the barrier request itself (with the barrier flag cleared) and then submit a fresh load of zero length barriers. The barrier request itself is asynchronous, but any subsequent request will block until the barrier completes. The reason for clearing the barrier flag is that a barrier request is allowed to fail. If we pass a non-empty barrier through a striping raid level it is conceivable that part of it could succeed and part could fail. That would be way too hard to deal with. So if the first run of zero length barriers succeed, we assume all is sufficiently well that we send the request and ignore errors in the second run of barriers. RAID5 needs extra care as write requests may not have been submitted to the underlying devices yet. So we flush the stripe cache before proceeding with the barrier. Note that the second set of zero-length barriers are submitted immediately after the original request is submitted. Thus when a personality finds mddev->barrier to be set during make_request, it should not return from make_request until the corresponding per-device request(s) have been queued. That will be done in later patches. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Andre Noll <maan@systemlinux.org>	2009-12-14 12:49:49 +11:00
NeilBrown	efa593390e	md: don't reset curr_resync_completed after an interrupted resync If a resync/recovery/check/repair is interrupted for some reason, it can be useful to know exactly where it got up to. So in that case, do not clear curr_resync_completed. Initialise it when starting a resync/recovery/... instead. Signed-off-by: NeilBrown <neilb@suse.de>	2009-12-14 12:49:49 +11:00
NeilBrown	c07b70ad32	md: adjust resync_min usefully when resync aborts. When a 'check' or 'repair' finished we should clear resync_min so that a future check/repair will cover the whole array (by default). However if it is interrupted, we should update resync_min to where we got up to, so that when the check/repair continues it just does the remainder of the array. Signed-off-by: NeilBrown <neilb@suse.de>	2009-12-14 12:49:48 +11:00
NeilBrown	7820f9e1dd	md: remove sparse warning:symbol XXX was not declared. Signed-off-by: NeilBrown <neilb@suse.de>	2009-12-14 12:49:47 +11:00
NeilBrown	8553fe7ec7	md/raid5: remove some sparse warnings. qd_idx is previously declared and given exactly the same value! Signed-off-by: NeilBrown <neilb@suse.de>	2009-12-14 12:49:47 +11:00
NeilBrown	aa5cbd1038	md/bitmap: protect against bitmap removal while being updated. A write intent bitmap can be removed from an array while the array is active. When this happens, all IO is suspended and flushed before the bitmap is removed. However it is possible that bitmap_daemon_work is still running to clear old bits from the bitmap. If it is, it can dereference the bitmap after it has been freed. So introduce a new mutex to protect bitmap_daemon_work and get it before destroying a bitmap. This is suitable for any current -stable kernel. Signed-off-by: NeilBrown <neilb@suse.de> Cc: stable@kernel.org	2009-12-14 12:49:46 +11:00
Mikulas Patocka	d2fdb776e0	dm snapshot: use merge origin if snapshot invalid If the snapshot we are merging became invalid (e.g. it ran out of space) redirect all I/O directly to the origin device. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:36 +00:00
Mike Snitzer	d8ddb1cfff	dm snapshot: report merge failure in status Set 'merge_failed' flag if a snapshot fails to merge. Update snapshot_status() to report "Merge failed" if 'merge_failed' is set. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:35 +00:00
Mike Snitzer	8a2d528620	dm snapshot: merge consecutive chunks together s->store->type->prepare_merge returns the number of chunks that can be copied linearly working backwards from the returned chunk number. For example, if it returns 3 chunks with old_chunk == 10 and new_chunk == 20, then chunk 20 can be copied to 10, chunk 19 to 9 and 18 to 8. Until now kcopyd only copied one chunk at a time. This patch now copies the full set at once. Consequently, snapshot_merge_process() needs to delay the merging of all chunks if any have writes in progress, not just the first chunk in the region that is to be merged. snapshot-merge's performance is now comparable to the original snapshot-origin target. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:34 +00:00
Mikulas Patocka	73dfd078cf	dm snapshot: trigger exceptions in remaining snapshots during merge When there is one merging snapshot and other non-merging snapshots, snapshot_merge_process() must make exceptions in the non-merging snapshots. Use a sequence count to resolve the race between I/O to chunks that are about to be merged. The count increases each time an exception reallocation finishes. Use wait_event() to wait until the count changes. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:34 +00:00
Mikulas Patocka	17aa03326d	dm snapshot: delay merging a chunk until writes to it complete Track writes to chunks that are currently being merged and delay merging a chunk until all writes to that chunk finish. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:33 +00:00
Mikulas Patocka	9fe8625488	dm snapshot: queue writes to chunks being merged While a set of chunks is being merged, any overlapping writes need to be queued. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:33 +00:00
Mikulas Patocka	1e03f97e43	dm snapshot: add merging Merging is started when origin is resumed and it is stopped when origin is suspended or when the merging snapshot is destroyed or errors are detected. Merging is not yet interlocked with writes: this will be handled in subsequent patches. The code relies on callbacks from a private kcopyd thread. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:32 +00:00
Mikulas Patocka	9d3b15c4c7	dm snapshot: permit only one merge at once Merging more than one snapshot is not supported, so prevent this happening. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:32 +00:00
Mike Snitzer	10b8106a70	dm snapshot: support barriers in snapshot merge target Sets num_flush_requests=2 to support flushing both the origin and cow devices used by the snapshot-merge target. Also, snapshot_ctr() now gets the origin device using FMODE_WRITE if the target is snapshot-merge (which writes to the origin device). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:31 +00:00
Mikulas Patocka	3452c2a1eb	dm snapshot: avoid allocating exceptions in merge The snapshot-merge target should not allocate new exceptions because the intent is to merge all of its exceptions as quickly and safely as possible. This patch introduces the snapshot-merge mapping function and updates __origin_write() so that it doesn't allocate exceptions on any snapshots that are being merged. If a write request to a merging snapshot device is to be dispatched directly to the origin (because the chunk is not remapped or was already merged), snapshot_merge_map() must make exceptions in other snapshots so calls do_origin(). Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:31 +00:00
Mikulas Patocka	515ad66cc4	dm snapshot: rework writing to origin To track the completion of exceptions relating to the same location on the device, the current code selects one exception as primary_pe, links the other exceptions to it and uses reference counting to wait until all the reallocations are complete. It is considered too complicated to extend this code to handle the new snapshot-merge target, where sets of non-overlapping chunks would also need to become linked. Instead, a simpler (but less efficient) approach is taken. Bios are linked to one exception. When it completes, bios are simply retried, and if other related exceptions are still outstanding, they'll get queued again to wait for another one. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:30 +00:00
Mikulas Patocka	d698aa4500	dm snapshot: add merge target The snapshot-merge target allows a snapshot to be merged back into the snapshot's origin device. One anticipated use of snapshot merging is the rollback of filesystems to back out problematic system upgrades. This patch adds snapshot-merge target management to both dm_snapshot_init() and dm_snapshot_exit(). As an initial place-holder, snapshot-merge is identical to the snapshot target. Documentation is provided. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:30 +00:00
Mikulas Patocka	4454a6216f	dm exception store: add merge specific methods Add functions that decide how many consecutive chunks of snapshot to merge back into the origin next and to update the metadata afterwards. prepare_merge provides a pointer to the most recent still-to-be-merged chunk and returns how many previous ones are consecutive and can be processed together. commit_merge removes the nr_merged most-recent chunks permanently from the exception store. The number must not exceed that returned by prepare_merge. Introduce NUM_SNAPSHOT_HDR_CHUNKS to show where the snapshot header chunk is accounted for. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:29 +00:00
Mike Snitzer	615d1eb9ca	dm snapshot: create function for chunk_is_tracked wait Move the __chunk_is_tracked() loop into a separate function as we will also need to call it from the write path in the rare case of conflicting writes to the same chunk. Originally introduced in commit `a8d41b59f3` ("dm snapshot: fix race during exception creation"). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:29 +00:00
Mikulas Patocka	9eaae8ffbc	dm snapshot: make bio optional in __origin_write To support the merging of snapshots back into their origin we need to trigger exceptions in other snapshots not being merged without any incoming bio on the origin device. The bio parameter to __origin_write() becomes optional and the sector needs supplying separately. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:28 +00:00
Kiyoshi Ueda	c2f3d24b78	dm mpath: reject messages when device is suspended This patch rejects messages that can generate I/O while the device itself is suspended. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Cc: Mike Anderson <andmike@linux.vnet.ibm.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:27 +00:00
Kiyoshi Ueda	64dbce580d	dm: export suspended state to targets This patch adds the exported dm_suspended() function so that targets can check whether or not they are suspended. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: Mike Anderson <andmike@linux.vnet.ibm.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:27 +00:00
Kiyoshi Ueda	4f186f8bbf	dm: rename dm_suspended to dm_suspended_md This patch renames dm_suspended() to dm_suspended_md() and keeps it internal to dm. No functional change. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: Mike Anderson <andmike@linux.vnet.ibm.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:26 +00:00
Kiyoshi Ueda	4d4471cb5c	dm: swap target postsuspend call and setting suspended flag This patch moves DMF_SUSPENDED flag set before postsuspend. No one should care about the ordering, because the flag set and the postsuspend are protected by a single lock, md->suspend_lock, and all strict flag-checkers take the lock. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: Mike Anderson <andmike@linux.vnet.ibm.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:26 +00:00
Milan Broz	61afef614b	dm crypt: add plain64 iv The default plain IV is 32-bit only. This plain64 IV provides a compatible mode for encrypted devices bigger than 4TB. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:25 +00:00
Jun'ichi Nomura	6db4ccd635	dm: trace request based remapping This patch adds a remapping trace to request-based dm. BIO-based dm already has the equivalent tracepoint. For example, under this dm stack (linear LV on multipath): # dmsetup ls --tree -o ascii vg-lv0 (253:1) `-mpath0 (253:0) \|- (8:160) \|- (66:80) \|- (65:176) `- (65:160) Trace of 'dd of=/dev/vg/lv0 bs=128k count=1 oflag=direct' looks like this: without the patch: dd-6674 [000] 539.727384: block_bio_queue: 253,1 WS 0 + 256 [dd] dd-6674 [000] 539.727392: block_remap: 253,0 WS 384 + 256 <- (253,1) 0 dd-6674 [000] 539.727394: block_bio_queue: 253,0 WS 384 + 256 [dd] dd-6674 [000] 539.727405: block_getrq: 253,0 WS 384 + 256 [dd] dd-6674 [000] 539.727409: block_plug: [dd] dd-6674 [000] 539.727410: block_rq_insert: 253,0 W 0 () 384 + 256 [dd] dd-6674 [000] 539.727416: block_rq_issue: 253,0 W 0 () 384 + 256 [dd] dd-6674 [000] 539.727426: block_rq_insert: 65,176 W 0 () 384 + 256 [dd] dd-6674 [000] 539.727427: block_rq_issue: 65,176 W 0 () 384 + 256 [dd] ... and with the patch: (the line with '' is the trace added by this patch) dd-6617 [002] 162.914301: block_bio_queue: 253,1 WS 0 + 256 [dd] dd-6617 [002] 162.914314: block_remap: 253,0 WS 384 + 256 <- (253,1) 0 dd-6617 [002] 162.914316: block_bio_queue: 253,0 WS 384 + 256 [dd] dd-6617 [002] 162.914331: block_getrq: 253,0 WS 384 + 256 [dd] dd-6617 [002] 162.914335: block_plug: [dd] dd-6617 [002] 162.914337: block_rq_insert: 253,0 W 0 () 384 + 256 [dd] dd-6617 [002] 162.914347: block_rq_issue: 253,0 W 0 () 384 + 256 [dd] dd-6617 [002] 162.914356: block_rq_remap: 65,176 W 384 + 256 <- (253,0) 384 dd-6617 [002] 162.914358: block_rq_insert: 65,176 W 0 () 384 + 256 [dd] dd-6617 [002] 162.914359: block_rq_issue: 65,176 W 0 () 384 + 256 [dd] ... Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:25 +00:00
Mike Snitzer	c1f0c183f6	dm snapshot: allow live exception store handover between tables Permit in-use snapshot exception data to be 'handed over' from one snapshot instance to another. This is a pre-requisite for patches that allow the changes made in a snapshot device to be merged back into its origin device and also allows device resizing. The basic call sequence is: dmsetup load new_snapshot (referencing the existing in-use cow device) - the ctr code detects that the cow is already in use and allows the two snapshot target instances to be linked together dmsetup suspend original_snapshot dmsetup resume new_snapshot - the new_snapshot becomes live, and if anything now tries to access the original one it will receive -EIO dmsetup remove original_snapshot (There can only be two snapshot targets referencing the same cow device simultaneously.) Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:24 +00:00
Alasdair G Kergon	042d2a9bcd	dm: keep old table until after resume succeeded When swapping a new table into place, retain the old table until its replacement is in place. An old check for an empty table is removed because this is enforced in populate_table(). __unbind() becomes redundant when followed by __bind(). Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:24 +00:00
Alasdair G Kergon	a794015597	dm: bind new table before destroying old When replacing a mapped device's table during a 'resume', delay the destruction of the old table until the new one is successfully in place. This will make it easier for a later patch to transfer internal state information from the old table to the new one (something we do not currently support) while giving us more options for reversion if a later part of the operation fails. Devices are always in the suspended state during dm_swap_table(). This patch reinforces the requirement that all I/O must have been flushed from the table targets while in this state (including any in workqueues). In the case of 'noflush' suspending, unprocessed I/O should have been 'pushed back' to the dm core prior to this point, for resubmission after the new table is in place. Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:23 +00:00
Mike Snitzer	1d0f3ce832	dm ioctl: retrieve status from inactive table Add the flag DM_QUERY_INACTIVE_TABLE_FLAG to the ioctls to return infomation about the loaded-but-not-yet-active table instead of the live table. Prior to this patch it was impossible to obtain this information until the device had been 'resumed'. Userspace dmsetup and libdevmapper support the flag as of version 1.02.40. e.g. dmsetup info --inactive vg1-lv1 Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:22 +00:00
Mikulas Patocka	12fc0f49dc	dm io: handle empty barriers Accept empty barriers in dm-io. dm-io will process empty write barrier requests just like the other read/write requests. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:22 +00:00
Mike Anderson	67a46dad25	dm mpath: prevent io from work queue while suspended Reject messages that can generate I/O while the device itself is suspended. Signed-off-by: Mike Anderson <andmike@linux.vnet.ibm.com> Acked-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:21 +00:00
Mike Anderson	6380f26f04	dm mpath: add mutex to synchronize adding and flushing work Add a mutex to allow possible creators of new work to synchronize with flushing work queues. Signed-off-by: Mike Anderson <andmike@linux.vnet.ibm.com> Acked-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:21 +00:00
Mike Anderson	c50abeb380	dm ioctl: forbid messages to devices being deleted Once we begin deleting a device, prevent any further messages being sent to targets of its table (to avoid races). Signed-off-by: Mike Anderson <andmike@linux.vnet.ibm.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:20 +00:00
Mike Anderson	432a212c0d	dm: add dm_deleting_md function Add dm_deleting_md to check whether or not a given mapped device is currently being deleted. Signed-off-by: Mike Anderson <andmike@linux.vnet.ibm.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:20 +00:00
Kiyoshi Ueda	6df400ab64	dm mpath: flush workqueues before suspend completes This patch stops the remaining dm-mpath activity during the suspend sequence by flushing workqueues in postsuspend function. The current dm-mpath target may not be quiet even after suspend completes because some workqueues (e.g. device_handler's work, event handling) are not flushed during the suspend sequence, even though suspended devices/targets are supposed to be quiet in this state. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:19 +00:00
Alasdair G Kergon	7c6664114b	dm: rename dm_get_table to dm_get_live_table Rename dm_get_table to dm_get_live_table. Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:19 +00:00
Kiyoshi Ueda	d0bcb87865	dm: add request based barrier support This patch adds barrier support for request-based dm. CORE DESIGN The design is basically same as bio-based dm, which emulates barrier by mapping empty barrier bios before/after a barrier I/O. But request-based dm has been using struct request_queue for I/O queueing, so the block-layer's barrier mechanism can be used. o Summary of the block-layer's behavior (which is depended by dm-core) Request-based dm uses QUEUE_ORDERED_DRAIN_FLUSH ordered mode for I/O barrier. It means that when an I/O requiring barrier is found in the request_queue, the block-layer makes pre-flush request and post-flush request just before and just after the I/O respectively. After the ordered sequence starts, the block-layer waits for all in-flight I/Os to complete, then gives drivers the pre-flush request, the barrier I/O and the post-flush request one by one. It means that the request_queue is stopped automatically by the block-layer until drivers complete each sequence. o dm-core For the barrier I/O, treats it as a normal I/O, so no additional code is needed. For the pre/post-flush request, flushes caches by the followings: 1. Make the number of empty barrier requests required by target's num_flush_requests, and map them (dm_rq_barrier()). 2. Waits for the mapped barriers to complete (dm_rq_barrier()). If error has occurred, save the error value to md->barrier_error (dm_end_request()). (*) Basically, the first reported error is taken. But -EOPNOTSUPP supersedes any error and DM_ENDIO_REQUEUE follows. 3. Requeue the pre/post-flush request if the error value is DM_ENDIO_REQUEUE. Otherwise, completes with the error value (dm_rq_barrier_work()). The pre/post-flush work above is done in the kernel thread (kdmflush) context, since memory allocation which might sleep is needed in dm_rq_barrier() but sleep is not allowed in dm_request_fn(), which is an irq-disabled context. Also, clones of the pre/post-flush request share an original, so such clones can't be completed using the softirq context. Instead, complete them in the context of underlying device drivers. It should be safe since there is no I/O dispatching during the completion of such clones. For suspend, the workqueue of kdmflush needs to be flushed after the request_queue has been stopped. Otherwise, the next flush work can be kicked even after the suspend completes. TARGET INTERFACE No new interface is added. Just use the existing num_flush_requests in struct target_type as same as bio-based dm. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:18 +00:00
Kiyoshi Ueda	980691e5f3	dm: move dm_end_request This patch moves dm_end_request() to make the next patch more readable. No functional change. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:17 +00:00
Kiyoshi Ueda	11a68244e1	dm: refactor request based completion functions This patch factors out the clone completion code, dm_done(), from dm_softirq_done() in preparation for a subsequent patch. No functional change. dm_done() will be used in barrier completion, which can't use and doesn't need softirq. The softirq_done callback needs to get a clone from an original request but it can't in the case of barrier, where an original request is shared by multiple clones. On the other hand, the completion of barrier clones doesn't involve re-submitting requests, which was the primary reason of the need for softirq. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:17 +00:00
Kiyoshi Ueda	b4324feeae	dm: use md pending for in flight IO counting This patch changes the counter for the number of in_flight I/Os to md->pending from q->in_flight in preparation for a later patch. No functional change. Request-based dm used q->in_flight to count the number of in-flight clones assuming the counter is always incremented for an in-flight original request and original:clone is 1:1 relationship. However, it this no longer true for barrier requests. So use md->pending to count the number of in-flight clones. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:16 +00:00
Kiyoshi Ueda	9f518b27cf	dm: simplify request based suspend The semantics of bio-based dm were changed recently in the case of suspend with "--nolockfs" but without "--noflush". Before 2.6.30, I/Os submitted before the suspend invocation were always flushed. From 2.6.30 onwards, I/Os submitted before the suspend invocation might not be flushed. (For details, see http://marc.info/?t=123994433400003&r=1&w=2) This patch brings the behaviour of request-based dm into line with bio-based dm, simplifying the code and preparing for a subsequent patch that will wait for all in_flight I/Os to complete without stopping request_queue and use dm_wait_for_completion() for it. This change in semantics simplifies the suspend code as follows: o Suspend is implemented as stopping request_queue in request-based dm, and all I/Os are queued in the request_queue even after suspend is invoked. o In the old semantics, we had to track whether I/Os were queued before or after the suspend invocation, so a special barrier-like request called 'suspend marker' was introduced. o With the new semantics, we don't need to flush any I/O so we can remove the marker and the code related to the marker handling and I/O flushing. After removing this codes, the suspend sequence is now: 1. Flush all I/Os by lock_fs() if needed. 2. Stop dispatching any I/O by stopping the request_queue. 3. Wait for all in-flight I/Os to be completed or requeued. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:16 +00:00
Kiyoshi Ueda	6facdaff22	dm: abstract clone_rq This patch factors out the request cloning code in dm_prep_fn() as clone_rq(). No functional change. This patch is a preparation for a later patch in this series which needs to make clones from an original barrier request. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:15 +00:00
Kiyoshi Ueda	0888564393	dm: pass gfp_mask to alloc_rq_tio This patch adds the gfp_mask argument to alloc_rq_tio(). No functional change. This patch is a preparation for a later patch in this series which needs to allocate tio (for barrier I/O) with different allocation flag (GFP_NOIO) from the one in the normal I/O code path. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:15 +00:00
Kiyoshi Ueda	598de40947	dm: use clone in map_request function This patch changes the argument of map_request() to clone request from original request. No functional change. This patch is a preparation for PATCH 9, which needs to use map_request() for clones sharing an original barrier request. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:14 +00:00
Kiyoshi Ueda	90abb8c4ce	dm: abstract dm_in_flight function This patch adds md_in_flight() to get the number of in_flight I/Os. No functional change. This patch is a preparation for a later patch in this series, which changes I/O counter to md->pending from q->in_flight in request-based dm. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:13 +00:00
Mikulas Patocka	9ca170a3c0	dm kcopyd: accept zero size jobs dm-kcopyd: accept zero-size jobs This patch changes dm-kcopyd so that it accepts zero-size jobs and completes them immediatelly via its completion thread. It is needed for multisnapshots snapshot resizing. When we are writing to a chunk beyond origin end, no copying is done. To simplify the code, we submit an empty request to kcopyd and let kcopyd complete it. If we didn't submit a request to kcopyd and called the completion routine immediatelly, it would violate the principle that completion is called only from one thread and it would need additional locking. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:13 +00:00
Mike Snitzer	c26655ca3c	dm snapshot: track suspended state in target Keep track of whether or not the device is suspended within the snapshot target module, the same as we do in dm-raid1. We will use this later to enforce the correct sequence of ioctls to transfer the in-core exceptions from a snapshot target instance in one table to a replacement one capable of merging them back into the origin. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:12 +00:00
Mike Snitzer	fc56f6fbcc	dm snapshot: move cow ref from exception store to snap core Store the reference to the snapshot cow device in the core snapshot code instead of each exception store. It can be accessed through the new function dm_snap_cow(). Exception stores should each now maintain a reference to their parent snapshot struct. This is cleaner and makes part of the forthcoming snapshot merge code simpler. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Reviewed-by: Jonathan Brassow <jbrassow@redhat.com> Cc: Mikulas Patocka <mpatocka@redhat.com>	2009-12-10 23:52:12 +00:00
Mike Snitzer	985903bb3a	dm snapshot: add allocated metadata to snapshot status Add number of sectors used by metadata to the end of the snapshot's status line. Renamed dm_exception_store_type's 'fraction_full' to 'usage'. Renamed arguments to be clearer about what is being returned. Also added 'metadata_sectors'. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:11 +00:00
Jon Brassow	3510cb94ff	dm snapshot: rename exception functions Rename exception functions. Preparing to pull them out of dm-snap.c for broader use. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:11 +00:00
Jon Brassow	191437a53c	dm snapshot: rename exception_table to dm_exception_table Rename exception_table for broader use outside dm-snap.c Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:10 +00:00
Jon Brassow	1d4989c858	dm snapshot: rename dm_snap_exception to dm_exception The exception structure is not necessarily just a snapshot element (especially after we pull it out of dm-snap.c). Renaming appropriately. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:10 +00:00
Jon Brassow	d32a6ea65f	dm snapshot: consolidate insert exception functions Consolidate the insert_*exception functions. 'insert_completed_exception' already contains all the logic to handle 'insert_exception' (via check for a hash_shift of 0), so remove redundant function. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:09 +00:00
Mikulas Patocka	7e201b3513	dm snapshot: abstract minimum_chunk_size fn The origin needs to find minimum chunksize of all snapshots. This logic is moved to a separate function because it will be used at another place in the snapshot merge patches. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:08 +00:00
Mikulas Patocka	102c6ddb1d	dm snapshot: simplify sector_to_chunk expression Removed unnecessary 'and' masking: The right shift discards the lower bits so there is no need to clear them. (A later patch needs this change to support a 32-bit chunk_mask.) Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:08 +00:00
Jon Brassow	f5acc83428	dm snapshot: avoid else clause in persistent_read_metadata Minor code touch-up. We don't need the 'else'. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:07 +00:00
Roel Kluin	a518b86d0b	dm ioctl: prefer strlcpy over strncpy strlcpy() will always null terminate the string. The code should already guarantee this as the last bytes are already NULs and the string lengths were restricted before being stored in hc. Removing the '-1' becomes necessary so strlcpy() doesn't lose the last character of a maximum-length string. - agk Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:07 +00:00
Mikulas Patocka	5339fc2d47	dm raid1: explicitly initialise bio_lists Explicitly initialize bio lists instead of relying on kzalloc. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Takahiro Yasui <tyasui@redhat.com> Tested-by: Takahiro Yasui <tyasui@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:06 +00:00
Mikulas Patocka	929be8fcb4	dm raid1: hold all write bios when leg fails Hold all write bios when leg fails and errors are handled When using a userspace daemon such as dmeventd to handle errors, we must delay completing bios until it has done its job. This patch prevents the following race: - primary leg fails - write "1" fail, the write is held, secondary leg is set default - write "2" goes straight to the secondary leg Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Takahiro Yasui <tyasui@redhat.com> Tested-by: Takahiro Yasui <tyasui@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:06 +00:00
Mikulas Patocka	60f355ead3	dm raid1: hold write bios when errors are handled Hold all write bios when errors are handled. Previously the failures list was used only when handling errors with a userspace daemon such as dmeventd. Now, it is always used for all bios. The regions where some writes failed must be marked as nosync. This can only be done in process context (i.e. in raid1 workqueue), not in the write_callback function. Previously the write would succeed if writing to at least one leg succeeded. This is wrong because data from the failed leg may be replicated to the correct leg. Now, if using a userspace daemon, the write with some failures will be held until the daemon has done its job and reconfigured the array. If not using a daemon, the write still succeeds if at least one leg succeeds. This is bad, but it is consistent with current behavior. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Takahiro Yasui <tyasui@redhat.com> Tested-by: Takahiro Yasui <tyasui@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:05 +00:00
Mikulas Patocka	c58098be97	dm raid1: remove bio_endio from dm_rh_mark_nosync Move bio completion out of dm_rh_mark_nosync in preparation for the next patch. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Takahiro Yasui <tyasui@redhat.com> Tested-by: Takahiro Yasui <tyasui@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:05 +00:00
Mikulas Patocka	87968ddd2f	dm raid1: abstract get_valid_mirror function Move the logic to get a valid mirror leg into a function for re-use in a later patch. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Takahiro Yasui <tyasui@redhat.com> Tested-by: Takahiro Yasui <tyasui@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:04 +00:00
Mikulas Patocka	0f398a8403	dm raid1: use hold framework in do_failures Use the hold framework in do_failures. This patch doesn't change the bio processing logic, it just simplifies failure handling and avoids periodically polling the failures list. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Takahiro Yasui <tyasui@redhat.com> Tested-by: Takahiro Yasui <tyasui@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:04 +00:00
Mikulas Patocka	0478850768	dm raid1: add framework to hold bios during suspend Add framework to delay bios until a suspend and then resubmit them with either DM_ENDIO_REQUEUE (if the suspend was noflush) or complete them with -EIO. I/O barrier support will use this. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Takahiro Yasui <tyasui@redhat.com> Tested-by: Takahiro Yasui <tyasui@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:03 +00:00
Mikulas Patocka	64b30c46e8	dm raid1: report flush errors separately in status Report flush errors as 'F' instead of 'D' for log and mirror devices. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:02 +00:00
Mikulas Patocka	c0da3748b9	dm raid1: implement mirror_flush Implement flush callee. It uses dm_io to send zero-size barrier synchronously and concurrently to all the mirror legs. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:02 +00:00
Mikulas Patocka	076010e2e6	dm log: use flush callback fn Call the flush callback from the log. If flush failed, we have no alternative but to mark the whole log as dirty. Also we set the variable flush_failed to prevent any bits ever being marked as clean again. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:01 +00:00
Mikulas Patocka	87a8f240e9	dm log: add flush callback fn Introduce a callback pointer from the log to dm-raid1 layer. Before some region is set as "in-sync", we need to flush hardware cache on all the disks. But the log module doesn't have access to the mirror_set structure. So it will use this callback. So far the callback is unused, it will be used in further patches. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:01 +00:00
Mikulas Patocka	5adc78d0d2	dm log: introduce flush_failed variable Introduce "flush failed" variable. When a flush before clearing a bit in the log fails, we don't know anything about which which regions are in-sync and which not. So we need to set all regions as not-in-sync and set the variable "flush_failed" to prevent setting the in-sync bit in the future. A target reload is the only way to get out of this situation. The variable will be set in following patches. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:00 +00:00
Mikulas Patocka	20a34a8ecc	dm log: add flush_header function Introduce flush_header and use it to flush the log device. Note that we don't have to flush if all the regions transition from "dirty" to "clean" state. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:52:00 +00:00
Mikulas Patocka	b09acf1aa7	dm raid1: split touched state into two Split the variable "touched" into two, "touched_dirtied" and "touched_cleaned", set when some region was dirtied or cleaned. This will be used to optimize flushes. After a transition from "dirty" to "clean" state we don't have flush hardware cache on the log device. After a transition from "clean" to "dirty" the cache must be flushed. Before a transition from "clean" to "dirty" state we don't have to flush all the raid legs. Before a transition from "dirty" to "clean" we must flush all the legs to make sure that they are really in sync. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:51:59 +00:00
Mikulas Patocka	4184153f9e	dm raid1: support flush Flush support for dm-raid1. When it receives an empty barrier, submit it to all the devices via dm-io. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:51:59 +00:00
Mikulas Patocka	f1e5398746	dm io: remove extra bi_io_vec region hack Remove the hack where we allocate an extra bi_io_vec to store additional private data. This hack prevents us from supporting barriers in dm-raid1 without first making another little block layer change. Instead of doing that, this patch eliminates the bi_io_vec abuse by storing the region number directly in the low bits of bi_private. We need to store two things for each bio, the pointer to the main io structure and, if parallel writes were requested, an index indicating which of these writes this bio belongs to. There can be at most BITS_PER_LONG regions - 32 or 64. The index (region number) was stored in the last (hidden) bio vector and the pointer to struct io was stored in bi_private. This patch now aligns "struct io" on BITS_PER_LONG bytes and stores the region number in the low BITS_PER_LONG bits of bi_private. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:51:58 +00:00
Mikulas Patocka	952b355760	dm io: use slab for struct io Allocate "struct io" from a slab. This patch changes dm-io, so that "struct io" is allocated from a slab cache. It used to be allocated with kmalloc. Allocating from a slab will be needed for the next patch, because it requires a special alignment of "struct io" and kmalloc cannot meet this alignment. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:51:57 +00:00
Milan Broz	542da31766	dm crypt: make wipe message also wipe essiv key The "wipe key" message is used to wipe the volume key from memory temporarily, for example when suspending to RAM. But the initialisation vector in ESSIV mode is calculated from the hashed volume key, so the wipe message should wipe this IV key too and reinitialise it when the volume key is reinstated. This patch adds an IV wipe method called from a wipe message callback. ESSIV is then reinitialised using the init function added by the last patch. Cc: stable@kernel.org Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:51:57 +00:00
Milan Broz	b95bf2d3d5	dm crypt: separate essiv allocation from initialisation This patch separates the construction of IV from its initialisation. (For ESSIV it is a hash calculation based on volume key.) Constructor code now preallocates hash tfm and salt array and saves it in a private IV structure. The next patch requires this to reinitialise the wiped IV without reallocating memory when resuming a suspended device. Cc: stable@kernel.org Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:51:56 +00:00
Milan Broz	5861f1be00	dm crypt: restructure essiv error path Use kzfree for salt deallocation because it is derived from the volume key. Use a common error path in ESSIV constructor. Required by a later patch which fixes the way key material is wiped from memory. Cc: stable@kernel.org Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:51:56 +00:00
Milan Broz	6047359277	dm crypt: move private iv fields to structs Define private structures for IV so it's easy to add further attributes in a following patch which fixes the way key material is wiped from memory. Also move ESSIV destructor and remove unnecessary 'status' operation. There are no functional changes in this patch. Cc: stable@kernel.org Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:51:55 +00:00
Milan Broz	0b4309581b	dm crypt: make wipe message also wipe tfm key The "wipe key" message is used to wipe a volume key from memory temporarily, for example when suspending to RAM. There are two instances of the key in memory (inside crypto tfm) but only one got wiped. This patch wipes them both. Cc: stable@kernel.org Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:51:55 +00:00
Mikulas Patocka	8e87b9b81b	dm snapshot: cope with chunk size larger than origin Under some special conditions the snapshot hash_size is calculated as zero. This patch instead sets a minimum value of 64, the same as for the pending exception table. rounddown_pow_of_two(0) is an undefined operation (it expands to shift by -1). init_exception_table with an argument of 0 would fail with -ENOMEM. The way to trigger the problem is to create a snapshot with a chunk size that is larger than the origin device. Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:51:54 +00:00
Mikulas Patocka	94e76572b5	dm snapshot: only take lock for statustype info not table Take snapshot lock only for STATUSTYPE_INFO, not STATUSTYPE_TABLE. Commit `4c6fff445d` (dm-snapshot-lock-snapshot-while-supplying-status.patch) introduced this use of the lock, but userspace applications using libdevmapper have been found to request STATUSTYPE_TABLE while the device is suspended and the lock is already held, leading to deadlock. Since the lock is not necessary in this case, don't try to take it. Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:51:53 +00:00
Milan Broz	d2bb7df8ca	dm: sysfs add empty release function to avoid debug warning This patch just removes an unnecessary warning: kobject: 'dm': does not have a release() function, it is broken and must be fixed. The kobject is embedded in mapped device struct, so code does not need to release memory explicitly here. Cc: stable@kernel.org Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:51:53 +00:00
Julia Lawall	613978f871	dm exception store: free tmp_store on persistent flag error Error handling code following a kmalloc should free the allocated data. Cc: stable@kernel.org Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:51:52 +00:00
Mikulas Patocka	6076905b5e	dm: avoid _hash_lock deadlock Fix a reported deadlock if there are still unprocessed multipath events on a device that is being removed. _hash_lock is held during dev_remove while trying to send the outstanding events. Sending the events requests the _hash_lock again in dm_copy_name_and_uuid. This patch introduces a separate lock around regions that modify the link to the hash table (dm_set_mdptr) or the name or uuid so that dm_copy_name_and_uuid no longer needs _hash_lock. Additionally, dm_copy_name_and_uuid can only be called if md exists so we can drop the dm_get() and dm_put() which can lead to a BUG() while md is being freed. The deadlock: #0 [ffff8106298dfb48] schedule at ffffffff80063035 #1 [ffff8106298dfc20] __down_read at ffffffff8006475d #2 [ffff8106298dfc60] dm_copy_name_and_uuid at ffffffff8824f740 #3 [ffff8106298dfc90] dm_send_uevents at ffffffff88252685 #4 [ffff8106298dfcd0] event_callback at ffffffff8824c678 #5 [ffff8106298dfd00] dm_table_event at ffffffff8824dd01 #6 [ffff8106298dfd10] __hash_remove at ffffffff882507ad #7 [ffff8106298dfd30] dev_remove at ffffffff88250865 #8 [ffff8106298dfd60] ctl_ioctl at ffffffff88250d80 #9 [ffff8106298dfee0] do_ioctl at ffffffff800418c4 #10 [ffff8106298dff00] vfs_ioctl at ffffffff8002fab9 #11 [ffff8106298dff40] sys_ioctl at ffffffff8004bdaf #12 [ffff8106298dff80] tracesys at ffffffff8005d28d (via system_call) Cc: stable@kernel.org Reported-by: guy keren <choo@actcom.co.il> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-12-10 23:51:52 +00:00
Linus Torvalds	4ef58d4e2a	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (42 commits) tree-wide: fix misspelling of "definition" in comments reiserfs: fix misspelling of "journaled" doc: Fix a typo in slub.txt. inotify: remove superfluous return code check hdlc: spelling fix in find_pvc() comment doc: fix regulator docs cut-and-pasteism mtd: Fix comment in Kconfig doc: Fix IRQ chip docs tree-wide: fix assorted typos all over the place drivers/ata/libata-sff.c: comment spelling fixes fix typos/grammos in Documentation/edac.txt sysctl: add missing comments fs/debugfs/inode.c: fix comment typos sgivwfb: Make use of ARRAY_SIZE. sky2: fix sky2_link_down copy/paste comment error tree-wide: fix typos "couter" -> "counter" tree-wide: fix typos "offest" -> "offset" fix kerneldoc for set_irq_msi() spidev: fix double "of of" in comment comment typo fix: sybsystem -> subsystem ...	2009-12-09 19:43:33 -08:00
Linus Torvalds	382f51fe2f	Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (222 commits) [SCSI] zfcp: Remove flag ZFCP_STATUS_FSFREQ_TMFUNCNOTSUPP [SCSI] zfcp: Activate fc4s attributes for zfcp in FC transport class [SCSI] zfcp: Block scsi_eh thread for rport state BLOCKED [SCSI] zfcp: Update FSF error reporting [SCSI] zfcp: Improve ELS ADISC handling [SCSI] zfcp: Simplify handling of ct and els requests [SCSI] zfcp: Remove ZFCP_DID_MASK [SCSI] zfcp: Move WKA port to zfcp FC code [SCSI] zfcp: Use common code definitions for FC CT structs [SCSI] zfcp: Use common code definitions for FC ELS structs [SCSI] zfcp: Update FCP protocol related code [SCSI] zfcp: Dont fail SCSI commands when transitioning to blocked fc_rport [SCSI] zfcp: Assign scheduled work to driver queue [SCSI] zfcp: Remove STATUS_COMMON_REMOVE flag as it is not required anymore [SCSI] zfcp: Implement module unloading [SCSI] zfcp: Merge trace code for fsf requests in one function [SCSI] zfcp: Access ports and units with container_of in sysfs code [SCSI] zfcp: Remove suspend callback [SCSI] zfcp: Remove global config_mutex [SCSI] zfcp: Replace local reference counting with common kref ...	2009-12-09 19:42:25 -08:00
Linus Torvalds	1557d33007	Merge git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/sysctl-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/sysctl-2.6: (43 commits) security/tomoyo: Remove now unnecessary handling of security_sysctl. security/tomoyo: Add a special case to handle accesses through the internal proc mount. sysctl: Drop & in front of every proc_handler. sysctl: Remove CTL_NONE and CTL_UNNUMBERED sysctl: kill dead ctl_handler definitions. sysctl: Remove the last of the generic binary sysctl support sysctl net: Remove unused binary sysctl code sysctl security/tomoyo: Don't look at ctl_name sysctl arm: Remove binary sysctl support sysctl x86: Remove dead binary sysctl support sysctl sh: Remove dead binary sysctl support sysctl powerpc: Remove dead binary sysctl support sysctl ia64: Remove dead binary sysctl support sysctl s390: Remove dead sysctl binary support sysctl frv: Remove dead binary sysctl support sysctl mips/lasat: Remove dead binary sysctl support sysctl drivers: Remove dead binary sysctl support sysctl crypto: Remove dead binary sysctl support sysctl security/keys: Remove dead binary sysctl support sysctl kernel: Remove binary sysctl logic ...	2009-12-08 07:38:50 -08:00
Jiri Kosina	d014d04386	Merge branch 'for-next' into for-linus Conflicts: kernel/irq/chip.c	2009-12-07 18:36:35 +01:00
Chandra Seetharaman	3ae31f6a7b	[SCSI] scsi_dh: Change the scsidh_activate interface to be asynchronous Make scsi_dh_activate() function asynchronous, by taking in two additional parameters, one is the callback function and the other is the data to call the callback function with. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: James Bottomley <James.Bottomley@suse.de>	2009-12-04 12:00:46 -06:00
NeilBrown	d0e260782c	md: revert incorrect fix for read error handling in raid1. commit `4706b349f` was a forward port of a fix that was needed for SLES10. But in fact it is not needed in mainline because the earlier commit `dd00a99e7a` fixes the same problem in a better way. Further, this commit introduces a bug in the way it interacts with the automatic read-error-correction. If, after a read error is successfully corrected, the same disk is chosen to re-read - the re-read won't be attempted but an error will be returned instead. After reverting that commit, there is the possibility that a read error on a read-only array (where read errors cannot be corrected as that requires a write) will repeatedly read the same device and continue to get an error. So in the "Array is readonly" case, fail the drive immediately on a read error. Signed-off-by: NeilBrown <neilb@suse.de> Cc: stable@kernel.org	2009-12-01 17:30:59 +11:00
Eric W. Biederman	6d4561110a	sysctl: Drop & in front of every proc_handler. For consistency drop & in front of every proc_handler. Explicity taking the address is unnecessary and it prevents optimizations like stubbing the proc_handlers to NULL. Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Joe Perches <joe@perches.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2009-11-18 08:37:40 -08:00
Eric W. Biederman	bb9074ff58	Merge commit 'v2.6.32-rc7' Resolve the conflict between v2.6.32-rc7 where dn_def_dev_handler gets a small bug fix and the sysctl tree where I am removing all sysctl strategy routines.	2009-11-17 01:01:34 -08:00
NeilBrown	c148ffdcda	md/raid5: Allow dirty-degraded arrays to be assembled when only party is degraded. Normally is it not safe to allow a raid5 that is both dirty and degraded to be assembled without explicit request from that admin, as it can cause hidden data corruption. This is because 'dirty' means that the parity cannot be trusted, and 'degraded' means that the parity needs to be used. However, if the device that is missing contains only parity, then there is no issue and assembly can continue. This particularly applies when a RAID5 is being converted to a RAID6 and there is an unclean shutdown while the conversion is happening. So check for whether the degraded space only contains parity, and in that case, allow the assembly. Signed-off-by: NeilBrown <neilb@suse.de>	2009-11-13 17:47:00 +11:00
NeilBrown	7ef90146a1	Don't unconditionally set in_sync on newly added device in raid5_reshape When a reshape finds that it can add spare devices into the array, those devices might already be 'in_sync' if they are beyond the old size of the array, or they might not if they are within the array. The first case happens when we change an N-drive RAID5 to an N+1-drive RAID5. The second happens when we convert an N-drive RAID5 to an N+1-drive RAID6. So set the flag more carefully. Also, ->recovery_offset is only meaningful when the flag is clear, so only set it in that case. This change needs the preceding two to ensure that the non-in_sync device doesn't get evicted from the array when it is stopped, in the case where v0.90 metadata is used. Signed-off-by: NeilBrown <neilb@suse.de>	2009-11-13 17:40:51 +11:00
NeilBrown	0261cd9f1c	md: allow v0.91 metadata to record devices as being active but not in-sync. This is a combination that didn't really make sense before. However when a reshape is converting e.g. raid5 -> raid6, the extra device is not fully in-sync, but is certainly active and contains important data. So allow that start to be meaningful and in particular get the 'recovery_offset' value (which is needed for any non-in-sync active device) from the reshape_position. Signed-off-by: NeilBrown <neilb@suse.de>	2009-11-13 17:40:48 +11:00
Eric W. Biederman	894d249115	sysctl drivers: Remove dead binary sysctl support Now that sys_sysctl is a wrapper around /proc/sys all of the binary sysctl support elsewhere in the tree is dead code. Cc: Jens Axboe <axboe@kernel.dk> Cc: Corey Minyard <minyard@acm.org> Cc: Greg Kroah-Hartman <gregkh@suse.de> Cc: Matt Mackall <mpm@selenic.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Neil Brown <neilb@suse.de> Cc: "James E.J. Bottomley" <James.Bottomley@suse.de> Acked-by: Clemens Ladisch <clemens@ladisch.de> for drivers/char/hpet.c Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2009-11-12 02:04:58 -08:00
NeilBrown	5e8651060c	md: factor out updating of 'recovery_offset'. Each device has its own 'recovery_offset' showing how far recovery has progressed on the device. As the only real significance of this is that fact that it can be stored in the metadata and recovered at restart, and as only 1.x metadata can do this, we were only updating 'recovery_offset' to 'curr_resync_completed' when updating v1.x metadata. But this is wrong, and we will shortly make limited use of this field in v0.90 metadata. So move the update into common code. Signed-off-by: NeilBrown <neilb@suse.de>	2009-11-12 12:08:04 +11:00
Dirk Hohndel	06fe9fb418	tree-wide: fix a very frequent spelling mistake something-bility is spelled as something-blity so a grep for 'blit' would find these lines this is so trivial that I didn't split it by subsystem / copy additional maintainers - all changes are to comments The only purpose is to get fewer false positives when grepping around the kernel sources. Signed-off-by: Dirk Hohndel <hohndel@infradead.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2009-11-09 09:40:54 +01:00
NeilBrown	8dee721146	md/raid5: make sure curr_sync_completes is uptodate when reshape starts This value is visible through sysfs and is used by mdadm when it manages a reshape (backing up data that is about to be rearranged). So it is important that it is always correct. Current it does not get updated properly when a reshape starts which can cause problems when assembling an array that is in the middle of being reshaped. This is suitable for 2.6.31.y stable kernels. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2009-11-06 14:59:29 +11:00
NeilBrown	24395a85d8	md: don't clear endpoint for resync when resync is interrupted. If a 'sync_max' has been set (via sysfs), it is wrong to clear it until a resync (or reshape or recovery ...) actually reached that point. So if a resync is interrupted (e.g. by device failure), leave 'resync_max' unchanged. This is particularly important for 'reshape' operations that do not change the size of the array. For such operations mdadm needs to monitor the reshape taking rolling backups of the section being reshaped. If resync_max gets cleared, the reshape can get ahead of mdadm and then the backups that mdadm creates are useless. This is suitable for 2.6.31.y stable kernels. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2009-11-06 14:59:27 +11:00
Linus Torvalds	bf699c9bac	Merge branch 'for-linus' of git://neil.brown.name/md * 'for-linus' of git://neil.brown.name/md: async_tx: fix asynchronous raid6 recovery for ddf layouts async_pq: rename scribble page async_pq: kill a stray dma_map() call and other cleanups md/raid6: kill a gcc-4.0.1 'uninitialized variable' warning raid6/async_tx: handle holes in block list in async_syndrome_val md/async: don't pass a memory pointer as a page pointer. md: Fix handling of raid5 array which is being reshaped to fewer devices. md: fix problems with RAID6 calculations for DDF. md/raid456: downlevel multicore operations to raid_run_ops md: drivers/md/unroll.pl replaced with awk analog md: remove clumsy usage of do_sync_mapping_range from bitmap code md: raid1/raid10: handle allocation errors during array setup. md/raid5: initialize conf->device_lock earlier md/raid1/raid10: add a cond_resched Revert "md: do not progress the resync process if the stripe was blocked"	2009-10-31 12:12:19 -07:00
David Woodhouse	e5d84970a5	async_tx: Move ASYNC_RAID6_TEST option to crypto/async_tx/, fix dependencies Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>	2009-10-29 16:41:49 +00:00
David Woodhouse	f5e70d0fe3	md: Factor out RAID6 algorithms into lib/ We'll want to use these in btrfs too. Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>	2009-10-29 14:38:47 +00:00
Dan Williams	6629542e79	md/raid6: kill a gcc-4.0.1 'uninitialized variable' warning Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2009-10-19 18:09:41 -07:00
Mikulas Patocka	c1cc65caa1	dm snapshot: allow chunk size to be less than page size Allow the snapshot chunk size to be smaller than the page size The code is now capable of handling this due to some previous fixes and enhancements. As the page size varies between computers, prior to this patch, the chunk size of a snapshot dictated which machines could read it: Snapshots created on one machine might not be readable on another. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-10-16 23:18:22 +01:00
Mikulas Patocka	df96eee679	dm snapshot: use unsigned integer chunk size Use unsigned integer chunk size. Maximum chunk size is 512kB, there won't ever be need to use 4GB chunk size, so the number can be 32-bit. This fixes compiler failure on 32-bit systems with large block devices. Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-10-16 23:18:17 +01:00
Mikulas Patocka	4c6fff445d	dm snapshot: lock snapshot while supplying status This patch locks the snapshot when returning status. It fixes a race when it could return an invalid number of free chunks if someone was simultaneously modifying it. Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-10-16 23:18:16 +01:00
Mikulas Patocka	0e8c4e4e3e	dm exception store: fix failed set_chunk_size error path Properly close the device if failing because of an invalid chunk size. Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-10-16 23:18:16 +01:00
Mikulas Patocka	3f2412dc85	dm snapshot: require non zero chunk size by end of ctr If we are creating snapshot with memory-stored exception store, fail if the user didn't specify chunk size. Zero chunk size would probably crash a lot of places in the rest of snapshot code. Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Jonathan Brassow <jbrassow@redhat.com> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-10-16 23:18:16 +01:00
Kiyoshi Ueda	f88fb98118	dm: dec_pending needs locking to save error value Multiple instances of dec_pending() can run concurrently so a lock is needed when it saves the first error code. I have never experienced actual problem without locking and just found this during code inspection while implementing the barrier support patch for request-based dm. This patch adds the locking. I've done compile, boot and basic I/O testings. Cc: stable@kernel.org Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-10-16 23:18:15 +01:00
Zdenek Kabelac	03022c54b9	dm: add missing del_gendisk to alloc_dev error path Add missing del_gendisk() to error path when creation of workqueue fails. Otherwice there is a resource leak and following warning is shown: WARNING: at fs/sysfs/dir.c:487 sysfs_add_one+0xc5/0x160() sysfs: cannot create duplicate filename '/devices/virtual/block/dm-0' Cc: stable@kernel.org Signed-off-by: Zdenek Kabelac <zkabelac@redhat.com> Reviewed-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-10-16 23:18:15 +01:00
Andrew Morton	bca915aae8	dm log: userspace fix incorrect luid cast in userspace_ctr mips: drivers/md/dm-log-userspace-base.c: In function `userspace_ctr': drivers/md/dm-log-userspace-base.c:159: warning: cast from pointer to integer of different size Cc: stable@kernel.org Cc: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-10-16 23:18:15 +01:00
Jonathan Brassow	034a186d29	dm snapshot: free exception store on init failure While initializing the snapshot module, if we fail to register the snapshot target then we must back-out the exception store module initialization. Cc: stable@kernel.org Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-10-16 23:18:14 +01:00
Mikulas Patocka	6d45d93ead	dm snapshot: sort by chunk size to fix race Avoid a race causing corruption when snapshots of the same origin have different chunk sizes by sorting the internal list of snapshots by chunk size, largest first. https://bugzilla.redhat.com/show_bug.cgi?id=182659 For example, let's have two snapshots with different chunk sizes. The first snapshot (1) has small chunk size and the second snapshot (2) has large chunk size. Let's have chunks A, B, C in these snapshots: snapshot1: ====A==== ====B==== snapshot2: ==========C========== (Chunk size is a power of 2. Chunks are aligned.) A write to the origin at a position within A and C comes along. It triggers reallocation of A, then reallocation of C and links them together using A as the 'primary' exception. Then another write to the origin comes along at a position within B and C. It creates pending exception for B. C already has a reallocation in progress and it already has a primary exception (A), so nothing is done to it: B and C are not linked. If the reallocation of B finishes before the reallocation of C, because there is no link with the pending exception for C it does not know to wait for it and, the second write is dispatched to the origin and causes data corruption in the chunk C in snapshot2. To avoid this situation, we maintain snapshots sorted in descending order of chunk size. This leads to a guaranteed ordering on the links between the pending exceptions and avoids the problem explained above - both A and B now get linked to C. Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-10-16 23:18:14 +01:00
NeilBrown	5dd33c9a4c	md/async: don't pass a memory pointer as a page pointer. md/raid6 passes a list of 'struct page *' to the async_tx routines, which then either DMA map them for offload, or take the page_address for CPU based calculations. For RAID6 we sometime leave 'blanks' in the list of pages. For CPU based calcs, we want to treat theses as a page of zeros. For offloaded calculations, we simply don't pass a page to the hardware. Currently the 'blanks' are encoded as a pointer to raid6_empty_zero_page. This is a 4096 byte memory region, not a 'struct page'. This is mostly handled correctly but is rather ugly. So change the code to pass and expect a NULL pointer for the blanks. When taking page_address of a page, we need to check for a NULL and in that case use raid6_empty_zero_page. Signed-off-by: NeilBrown <neilb@suse.de>	2009-10-16 16:40:25 +11:00
NeilBrown	5e5e3e78ed	md: Fix handling of raid5 array which is being reshaped to fewer devices. When a raid5 (or raid6) array is being reshaped to have fewer devices, conf->raid_disks is the latter and hence smaller number of devices. However sometimes we want to use a number which is the total number of currently required devices - the larger of the 'old' and 'new' sizes. Before we implemented reducing the number of devices, this was always 'new' i.e. ->raid_disks. Now we need max(raid_disks, previous_raid_disks) in those places. This particularly affects assembling an array that was shutdown while in the middle of a reshape to fewer devices. md.c needs a similar fix when interpreting the md metadata. Signed-off-by: NeilBrown <neilb@suse.de>	2009-10-16 16:35:30 +11:00
NeilBrown	e4424fee18	md: fix problems with RAID6 calculations for DDF. Signed-off-by: NeilBrown <neilb@suse.de>	2009-10-16 16:27:34 +11:00
Dan Williams	417b8d4ac8	md/raid456: downlevel multicore operations to raid_run_ops The percpu conversion allowed a straightforward handoff of stripe processing to the async subsytem that initially showed some modest gains (+4%). However, this model is too simplistic and leads to stripes bouncing between raid5d and the async thread pool for every invocation of handle_stripe(). As reported by Holger this can fall into a pathological situation severely impacting throughput (6x performance loss). By downleveling the parallelism to raid_run_ops the pathological stripe_head bouncing is eliminated. This version still exhibits an average 11% throughput loss for: mdadm --create /dev/md0 /dev/sd[b-q] -n 16 -l 6 echo 1024 > /sys/block/md0/md/stripe_cache_size dd if=/dev/zero of=/dev/md0 bs=1024k count=2048 ...but the results are at least stable and can be used as a base for further multicore experimentation. Reported-by: Holger Kiehl <Holger.Kiehl@dwd.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2009-10-16 16:25:22 +11:00
Vladimir Dronnikov	dce3a7a42d	md: drivers/md/unroll.pl replaced with awk analog drivers/md/unroll.pl replaced by awk script to drop build-time dependency on perl Signed-off-by: Vladimir Dronnikov <dronnikov@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2009-10-16 16:25:19 +11:00
NeilBrown	ae8fa2831b	md: remove clumsy usage of do_sync_mapping_range from bitmap code and replace with vfs_fsync which is much neater (but wasn't exported, or even in existence at the time the code was written). Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.de>	2009-10-16 15:56:01 +11:00
NeilBrown	ed9bfdf1a4	md: raid1/raid10: handle allocation errors during array setup. Both raid1 and raid10 create a mempool during startup. If the 'alloc' function for this mempool fails, unplug_slaves is called. If that happens when the pool is being initialised, unplug_slaves will try to use the 'conf' structure that isn't filled in yet, and badness will happen. So ensure that unplug_slaves doesn't get called unless we know that the conf structure if fully initialised. Signed-off-by: NeilBrown <neilb@suse.de>	2009-10-16 15:55:44 +11:00
Dan Williams	f5efd45ae5	md/raid5: initialize conf->device_lock earlier Deallocating a raid5_conf_t structure requires taking 'device_lock'. Ensure it is initialized before it is used, i.e. initialize the lock before attempting any further initializations that might fail. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2009-10-16 15:55:38 +11:00
NeilBrown	1d9d52416c	md/raid1/raid10: add a cond_resched During 'check' of a raid1 or raid10 it is possible for the management thread to spend a lot of time running 'memcmp' on blocks from different devices, so make sure the thread has a chance to schedule. raid5d already has a cond_resched (in process_stripe). Reported-By: Lee Howard <faxguy@howardsilvan.com> Signed-off-by: NeilBrown <neilb@suse.de>	2009-10-16 15:55:32 +11:00
NeilBrown	1442577bf6	Revert "md: do not progress the resync process if the stripe was blocked" This reverts commit `df10cfbc4d`. This patch was based on a misunderstanding and risks introducing a busy-wait loop. So revert it. Acked-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2009-10-16 15:55:25 +11:00
Nikanth Karthikesan	316d315bff	block: Seperate read and write statistics of in_flight requests v2 Commit `a9327cac44` added seperate read and write statistics of in_flight requests. And exported the number of read and write requests in progress seperately through sysfs. But Corrado Zoccolo <czoccolo@gmail.com> reported getting strange output from "iostat -kx 2". Global values for service time and utilization were garbage. For interval values, utilization was always 100%, and service time is higher than normal. So this was reverted by commit `0f78ab9899` The problem was in part_round_stats_single(), I missed the following: if (now == part->stamp) return; - if (part->in_flight) { + if (part_in_flight(part)) { __part_stat_add(cpu, part, time_in_queue, part_in_flight(part) * (now - part->stamp)); __part_stat_add(cpu, part, io_ticks, (now - part->stamp)); With this chunk included, the reported regression gets fixed. Signed-off-by: Nikanth Karthikesan <knikanth@suse.de> -- Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-06 20:16:55 +02:00
Linus Torvalds	58e57fbd1c	Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block * 'for-linus' of git://git.kernel.dk/linux-2.6-block: (41 commits) Revert "Seperate read and write statistics of in_flight requests" cfq-iosched: don't delay async queue if it hasn't dispatched at all block: Topology ioctls cfq-iosched: use assigned slice sync value, not default cfq-iosched: rename 'desktop' sysfs entry to 'low_latency' cfq-iosched: implement slower async initiate and queue ramp up cfq-iosched: delay async IO dispatch, if sync IO was just done cfq-iosched: add a knob for desktop interactiveness Add a tracepoint for block request remapping block: allow large discard requests block: use normal I/O path for discard requests swapfile: avoid NULL pointer dereference in swapon when s_bdev is NULL fs/bio.c: move EXPORT* macros to line after function Add missing blk_trace_remove_sysfs to be in pair with blk_trace_init_sysfs cciss: fix build when !PROC_FS block: Do not clamp max_hw_sectors for stacking devices block: Set max_sectors correctly for stacking devices cciss: cciss_host_attr_groups should be const cciss: Dynamically allocate the drive_info_struct for each logical drive. cciss: Add usage_count attribute to each logical drive in /sys ...	2009-10-04 12:39:14 -07:00
Jens Axboe	0f78ab9899	Revert "Seperate read and write statistics of in_flight requests" This reverts commit `a9327cac44`. Corrado Zoccolo <czoccolo@gmail.com> reports: "with 2.6.32-rc1 I started getting the following strange output from "iostat -kx 2": Linux 2.6.31bisect (et2) 04/10/2009 _i686_ (2 CPU) avg-cpu: %user %nice %system %iowait %steal %idle 10,70 0,00 3,16 15,75 0,00 70,38 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util sda 18,22 0,00 0,67 0,01 14,77 0,02 43,94 0,01 10,53 39043915,03 2629219,87 sdb 60,89 9,68 50,79 3,04 1724,43 50,52 65,95 0,70 13,06 488437,47 2629219,87 avg-cpu: %user %nice %system %iowait %steal %idle 2,72 0,00 0,74 0,00 0,00 96,53 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util sda 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 100,00 sdb 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 100,00 avg-cpu: %user %nice %system %iowait %steal %idle 6,68 0,00 0,99 0,00 0,00 92,33 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util sda 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 100,00 sdb 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 100,00 avg-cpu: %user %nice %system %iowait %steal %idle 4,40 0,00 0,73 1,47 0,00 93,40 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util sda 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 100,00 sdb 0,00 4,00 0,00 3,00 0,00 28,00 18,67 0,06 19,50 333,33 100,00 Global values for service time and utilization are garbage. For interval values, utilization is always 100%, and service time is higher than normal. I bisected it down to: [`a9327cac44`] Seperate read and write statistics of in_flight requests and verified that reverting just that commit indeed solves the issue on 2.6.32-rc1." So until this is debugged, revert the bad commit. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-04 21:04:38 +02:00
Philipp Reisner	24836479a1	dm/connector: Only process connector packages from privileged processes Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-02 10:54:10 -07:00
Philipp Reisner	18366b05a0	connector/dm: Fixed a compilation warning Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Acked-by: Lars Ellenberg <lars.ellenberg@linbit.com> Acked-by: Evgeniy Polyakov <zbr@ioremap.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-02 10:54:04 -07:00
Philipp Reisner	7069331dbe	connector: Provide the sender's credentials to the callback Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Acked-by: Lars Ellenberg <lars.ellenberg@linbit.com> Acked-by: Evgeniy Polyakov <zbr@ioremap.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-02 10:54:01 -07:00
NeilBrown	4b3df5668c	Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/async_tx into for-linus	2009-09-23 18:31:11 +10:00
Dmitry Monakhov	1ef04fefe2	md: raid-1/10: fix RW bits manipulation Recently Jens has changed bio_rw_flagged() logic by following commit `1f98a13f62`. Now it returns bool instead of int. This broke raid1/raid10 RW bits manipulation logic. One of visible result is BUG_ON triggering due to empty barrier here scsi_lib.c:1108 scsi_setup_fs_cmnd() Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-09-23 18:20:15 +10:00
NeilBrown	f28f4e2728	md: remove unnecessary memset from multipath. Recent commit `bbba809e96` replaced mempool_create_kzalloc_pool with mempool_create_kmalloc_pool plus a memset. This memset is not needed (and we didn't need kzalloc in the first place). Ever field of the allocated structure (struct multipath_bh) is initialised immediately except retry_list, and memset does not initial a list_head anyway. To remove the memset. Signed-off-by: NeilBrown <neilb@suse.de>	2009-09-23 18:16:31 +10:00
NeilBrown	3fa841d7e7	md: report device as congested when suspended This should writeback from coming when the device is temporarily suspended. Signed-off-by: NeilBrown <neilb@suse.de>	2009-09-23 18:10:29 +10:00
NeilBrown	0da3c6194e	md: Improve name of threads created by md_register_thread The management thread for raid4,5,6 arrays are all called mdX_raid5, independent of the actual raid level, which is wrong and can be confusion. So change md_register_thread to use the name from the personality unless no alternate name (like 'resync' or 'reshape') is given. This is simpler and more correct. Cc: Jinzc <zhenchengjin@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2009-09-23 18:09:45 +10:00
NeilBrown	ee305acef5	md: remove sparse warnings about lock context. There was a real error here on a failure path where we incorrectly call rcu_read_unlock. Signed-off-by: NeilBrown <neilb@suse.de>	2009-09-23 18:06:44 +10:00
NeilBrown	a9f326ebf2	md: remove sparse waring "symbol xxx shadows an earlier one" Rename some variable and remove some duplicate definitions to avoid there warnings. None of them are actual errors. Signed-off-by: NeilBrown <neilb@suse.de>	2009-09-23 18:06:41 +10:00
Linus Torvalds	342ff1a1b5	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits) trivial: fix typo in aic7xxx comment trivial: fix comment typo in drivers/ata/pata_hpt37x.c trivial: typo in kernel-parameters.txt trivial: fix typo in tracing documentation trivial: add __init/__exit macros in drivers/gpio/bt8xxgpio.c trivial: add __init macro/ fix of __exit macro location in ipmi_poweroff.c trivial: remove unnecessary semicolons trivial: Fix duplicated word "options" in comment trivial: kbuild: remove extraneous blank line after declaration of usage() trivial: improve help text for mm debug config options trivial: doc: hpfall: accept disk device to unload as argument trivial: doc: hpfall: reduce risk that hpfall can do harm trivial: SubmittingPatches: Fix reference to renumbered step trivial: fix typos "man[ae]g?ment" -> "management" trivial: media/video/cx88: add __init/__exit macros to cx88 drivers trivial: fix typo in CONFIG_DEBUG_FS in gcov doc trivial: fix missing printk space in amd_k7_smp_check trivial: fix typo s/ketymap/keymap/ in comment trivial: fix typo "to to" in multiple files trivial: fix typos in comments s/DGBU/DBGU/ ...	2009-09-22 07:51:45 -07:00
Sage Weil	bbba809e96	md: avoid use of broken kzalloc mempool The kzalloc mempool does not re-zero items that have been used and then returned to the pool. Manually zero the allocated multipath_bh instead. Acked-by: Neil Brown <neilb@suse.de> Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:35 -07:00
Alexey Dobriyan	83d5cde47d	const: make block_device_operations const Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:25 -07:00
Anand Gadiyar	411c940385	trivial: fix typo "for for" in multiple files trivial: fix typo "for for" in multiple files Signed-off-by: Anand Gadiyar <gadiyar@ti.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2009-09-21 15:14:54 +02:00
Kay Sievers	e454cea20b	Driver-Core: extend devnode callbacks to provide permissions This allows subsytems to provide devtmpfs with non-default permissions for the device node. Instead of the default mode of 0600, null, zero, random, urandom, full, tty, ptmx now have a mode of 0666, which allows non-privileged processes to access standard device nodes in case no other userspace process applies the expected permissions. This also fixes a wrong assignment in pktcdvd and a checkpatch.pl complain. Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2009-09-19 12:50:38 -07:00
Dan Williams	6c910a78e4	md/raid6: cleanup ops_run_compute6_2 Neil says: "It is correct as it stands, but the fact that every branch in the 'if' part ends with a 'return' isn't immediately obvious, so it is clearer if we are explicit about the if / then / else structure." Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2009-09-16 12:24:54 -07:00
Dan Williams	2d6e4ecc87	md/raid6: eliminate BUG_ON with side effect As pointed out by Neil it should be possible to build a driver with all BUG_ON statements deleted. It's bad form to have a BUG_ON with a side effect. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2009-09-16 12:11:54 -07:00
Linus Torvalds	355bbd8cb8	Merge branch 'for-2.6.32' of git://git.kernel.dk/linux-2.6-block * 'for-2.6.32' of git://git.kernel.dk/linux-2.6-block: (29 commits) block: use blkdev_issue_discard in blk_ioctl_discard Make DISCARD_BARRIER and DISCARD_NOBARRIER writes instead of reads block: don't assume device has a request list backing in nr_requests store block: Optimal I/O limit wrapper cfq: choose a new next_req when a request is dispatched Seperate read and write statistics of in_flight requests aoe: end barrier bios with EOPNOTSUPP block: trace bio queueing trial only when it occurs block: enable rq CPU completion affinity by default cfq: fix the log message after dispatched a request block: use printk_once cciss: memory leak in cciss_init_one() splice: update mtime and atime on files block: make blk_iopoll_prep_sched() follow normal 0/1 return convention cfq-iosched: get rid of must_alloc flag block: use interrupts disabled version of raise_softirq_irqoff() block: fix comment in blk-iopoll.c block: adjust default budget for blk-iopoll block: fix long lines in block/blk-iopoll.c block: add blk-iopoll, a NAPI like approach for block devices ...	2009-09-14 17:55:15 -07:00
Linus Torvalds	39695224bd	Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (209 commits) [SCSI] fix oops during scsi scanning [SCSI] libsrp: fix memory leak in srp_ring_free() [SCSI] libiscsi, bnx2i: make bound ep check common [SCSI] libiscsi: add completion function for drivers that do not need pdu processing [SCSI] scsi_dh_rdac: changes for rdac debug logging [SCSI] scsi_dh_rdac: changes to collect the rdac debug information during the initialization [SCSI] scsi_dh_rdac: move the init code from rdac_activate to rdac_bus_attach [SCSI] sg: fix oops in the error path in sg_build_indirect() [SCSI] mptsas : Bump version to 3.04.12 [SCSI] mptsas : FW event thread and scsi mid layer deadlock in SYNCHRONIZE CACHE command [SCSI] mptsas : Send DID_NO_CONNECT for pending IOs of removed device [SCSI] mptsas : PAE Kernel more than 4 GB kernel panic [SCSI] mptsas : NULL pointer on big endian systems causing Expander not to tear off [SCSI] mptsas : Sanity check for phyinfo is added [SCSI] scsi_dh_rdac: Add support for Sun StorageTek ST2500, ST2510 and ST2530 [SCSI] pmcraid: PMC-Sierra MaxRAID driver to support 6Gb/s SAS RAID controller [SCSI] qla2xxx: Update version number to 8.03.01-k6. [SCSI] qla2xxx: Properly delete rports attached to a vport. [SCSI] qla2xxx: Correct various NPIV issues. [SCSI] qla2xxx: Correct qla2x00_eh_wait_on_command() to wait correctly. ...	2009-09-14 17:53:36 -07:00
Martin K. Petersen	3c5820c743	block: Optimal I/O limit wrapper Implement blk_limits_io_opt() and make blk_queue_io_opt() a wrapper around it. DM needs this to avoid poking at the queue_limits directly. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-14 08:24:52 +02:00
Nikanth Karthikesan	a9327cac44	Seperate read and write statistics of in_flight requests Currently, there is a single in_flight counter measuring the number of requests in the request_queue. But some monitoring tools would like to know how many read requests and write requests are in progress. Split the current in_flight counter into two seperate counters for read and write. This information is exported as a sysfs attribute, as changing the currently available stat files would break the existing tools. Signed-off-by: Nikanth Karthikesan <knikanth@suse.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-14 08:24:52 +02:00
Jens Axboe	1f98a13f62	bio: first step in sanitizing the bio->bi_rw flag testing Get rid of any functions that test for these bits and make callers use bio_rw_flagged() directly. Then it is at least directly apparent what variable and flag they check. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-11 14:33:31 +02:00
Geert Uytterhoeven	0d03d59d9b	md: Fix "strchr" [drivers/md/dm-log-userspace.ko] undefined! Commit `b8313b6da7` ("dm log: remove incorrect field from userspace table output") added a call to strstr() with a single-character "needle" string parameter. Unfortunately some versions of gcc replace such calls to strstr() by calls to strchr() behind our back. This causes linking errors if strchr() is defined as an inline function in <asm/string.h> (e.g. on m68k): \| WARNING: "strchr" [drivers/md/dm-log-userspace.ko] undefined! Avoid this by explicitly calling strchr() instead. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-10 14:55:01 -07:00
Dan Williams	9134d02bc0	Merge commit 'md/for-linus' into async-tx-next Conflicts: drivers/md/raid5.c	2009-09-08 17:55:54 -07:00
Dan Williams	bbb20089a3	Merge branch 'dmaengine' into async-tx-next Conflicts: crypto/async_tx/async_xor.c drivers/dma/ioat/dma_v2.h drivers/dma/ioat/pci.c drivers/md/raid5.c	2009-09-08 17:55:21 -07:00
Dan Williams	0403e38277	dmaengine: add fence support Some engines optimize operation by reading ahead in the descriptor chain such that descriptor2 may start execution before descriptor1 completes. If descriptor2 depends on the result from descriptor1 then a fence is required (on descriptor2) to disable this optimization. The async_tx api could implicitly identify dependencies via the 'depend_tx' parameter, but that would constrain cases where the dependency chain only specifies a completion order rather than a data dependency. So, provide an ASYNC_TX_FENCE to explicitly identify data dependencies. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2009-09-08 17:42:50 -07:00
Dan Williams	f9dd213437	Merge branch 'md-raid6-accel' into ioat3.2 Conflicts: include/linux/dmaengine.h	2009-09-08 17:42:29 -07:00
Mikulas Patocka	ae0b7448e9	dm snapshot: fix on disk chunk size validation Fix some problems seen in the chunk size processing when activating a pre-existing snapshot. For a new snapshot, the chunk size can either be supplied by the creator or a default value can be used. For an existing snapshot, the chunk size in the snapshot header on disk should always be used. If someone attempts to load an existing snapshot and has the 'default chunk size' option set, the kernel uses its default value even when it is incorrect for the snapshot being loaded. This patch ensures the correct on-disk value is always used. Secondly, when the code does use the chunk size stored on the disk it is prudent to revalidate it, so the code can exit cleanly if it got corrupted as happened in https://bugzilla.redhat.com/show_bug.cgi?id=461506 . Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-09-04 20:40:43 +01:00
Mikulas Patocka	2defcc3fb4	dm exception store: split set_chunk_size Break the function set_chunk_size to two functions in preparation for the fix in the following patch. Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-09-04 20:40:41 +01:00
Mikulas Patocka	61578dcd3f	dm snapshot: fix header corruption race on invalidation If a persistent snapshot fills up, a race can corrupt the on-disk header which causes a crash on any future attempt to activate the snapshot (typically while booting). This patch fixes the race. When the snapshot overflows, __invalidate_snapshot is called, which calls snapshot store method drop_snapshot. It goes to persistent_drop_snapshot that calls write_header. write_header constructs the new header in the "area" location. Concurrently, an existing kcopyd job may finish, call copy_callback and commit_exception method, that goes to persistent_commit_exception. persistent_commit_exception doesn't do locking, relying on the fact that callbacks are single-threaded, but it can race with snapshot invalidation and overwrite the header that is just being written while the snapshot is being invalidated. The result of this race is a corrupted header being written that can lead to a crash on further reactivation (if chunk_size is zero in the corrupted header). The fix is to use separate memory areas for each. See the bug: https://bugzilla.redhat.com/show_bug.cgi?id=461506 Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-09-04 20:40:39 +01:00
Mikulas Patocka	02d2fd31de	dm snapshot: refactor zero_disk_area to use chunk_io Refactor chunk_io to prepare for the fix in the following patch. Pass an area pointer to chunk_io and simplify zero_disk_area to use chunk_io. No functional change. Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-09-04 20:40:37 +01:00
Jonathan Brassow	7ec23d5094	dm log: userspace add luid to distinguish between concurrent log instances Device-mapper userspace logs (like the clustered log) are identified by a universally unique identifier (UUID). This identifier is used to associate requests from the kernel to a specific log in userspace. The UUID must be unique everywhere, since multiple machines may use this identifier when communicating about a particular log, as is the case for cluster logs. Sometimes, device-mapper/LVM may re-use a UUID. This is the case during pvmoves, when moving from one segment of an LV to another, or when resizing a mirror, etc. In these cases, a new log is created with the same UUID and loaded in the "inactive" slot. When a device-mapper "resume" is issued, the "live" table is deactivated and the new "inactive" table becomes "live". (The "inactive" table can also be removed via a device-mapper 'clear' command.) The above two issues were colliding. More than one log was being created with the same UUID, and there was no way to distinguish between them. So, sometimes the wrong log would be swapped out during the exchange. The solution is to create a locally unique identifier, 'luid', to go along with the UUID. This new identifier is used to determine exactly which log is being referenced by the kernel when the log exchange is made. The identifier is not universally safe, but it does not need to be, since create/destroy/suspend/resume operations are bound to a specific machine; and these are the operations that make up the exchange. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-09-04 20:40:34 +01:00
Jonathan Brassow	d2b698644c	dm raid1: do not allow log_failure variable to unset after being set This patch fixes a bug which was triggering a case where the primary leg could not be changed on failure even when the mirror was in-sync. The case involves the failure of the primary device along with the transient failure of the log device. The problem is that bios can be put on the 'failures' list (due to log failure) before 'fail_mirror' is called due to the primary device failure. Normally, this is fine, but if the log device failure is transient, a subsequent iteration of the work thread, 'do_mirror', will reset 'log_failure'. The 'do_failures' function then resets the 'in_sync' variable when processing bios on the failures list. The 'in_sync' variable is what is used to determine if the primary device can be switched in the event of a failure. Since this has been reset, the primary device is incorrectly assumed to be not switchable. The case has been seen in the cluster mirror context, where one machine realizes the log device is dead before the other machines. As the responsibilities of the server migrate from one node to another (because the mirror is being reconfigured due to the failure), the new server may think for a moment that the log device is fine - thus resetting the 'log_failure' variable. In any case, it is inappropiate for us to reset the 'log_failure' variable. The above bug simply illustrates that it can actually hurt us. Cc: stable@kernel.org Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-09-04 20:40:32 +01:00
Jonathan Brassow	b8313b6da7	dm log: remove incorrect field from userspace table output The output of 'dmsetup table' includes an internal field that should not be there. This patch removes it. To make the fix simpler, we first reorder a constructor argument The 'device size' argument is generated internally. Currently it is placed as the last space-separated word of the constructor string. However, we need to use a version of the string without this word, so we move it to the beginning instead so it is trivial to skip past it. We keep a copy of the arguments passed to userspace for creating a log, just in case we need to resend them. These are the same arguments that are desired in the STATUSTYPE_TABLE request, except for one. When creating the userspace log, the userspace daemon must know the size of the mirror, so that is added to the arguments given in the constructor table. We were printing this extra argument out as well, which is a mistake. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-09-04 20:40:30 +01:00
Jonathan Brassow	4142a96917	dm log: fix userspace status output Fix 'dmsetup table' output. There is a missing ' ' at the end of the string causing two words to run together. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-09-04 20:40:28 +01:00
Mike Snitzer	40bea43127	dm stripe: expose correct io hints Set sensible I/O hints for striped DM devices in the topology infrastructure added for 2.6.31 for userspace tools to obtain via sysfs. Add .io_hints to 'struct target_type' to allow the I/O hints portion (io_min and io_opt) of the 'struct queue_limits' to be set by each target and implement this for dm-stripe. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-09-04 20:40:25 +01:00
Mike Snitzer	a963a95622	dm table: add more context to terse warning messages A couple of recent warning messages make it difficult for the reader to determine exactly what is wrong. This patch adds more information to those messages. The messages were added by these commits: `5dea271b6d` ("dm table: pass correct dev area size to device_area_is_valid") `ea9df47cc9` ("dm table: fix blk_stack_limits arg to use bytes not sectors") The patch also corrects references to logical_block_size in printk format strings from %hu to %u. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-09-04 20:40:24 +01:00
Mikulas Patocka	f6a1ed1086	dm table: fix queue_limit checking device iterator The logic to check for valid device areas is inverted relative to proper use with iterate_devices. The iterate_devices method calls its callback for every underlying device in the target. If any callback returns non-zero, iterate_devices exits immediately. But the callback device_area_is_valid() returns 0 on error and 1 on success. The overall effect without is that an error is issued only if every device is invalid. This patch renames device_area_is_valid to device_area_is_invalid and inverts the logic so that one invalid device is sufficient to raise an error. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-09-04 20:40:22 +01:00
Mike Snitzer	8811f46c1f	dm snapshot: implement iterate devices Implement the .iterate_devices for the origin and snapshot targets. dm-snapshot's lack of .iterate_devices resulted in the inability to properly establish queue_limits for both targets. With 4K sector drives: an unfortunate side-effect of not establishing proper limits in either targets' DM device was that IO to the devices would fail even though both had been created without error. Commit `af4874e03e` ("dm target:s introduce iterate devices fn") in 2.6.31-rc1 should have implemented .iterate_devices for dm-snap.c's origin and snapshot targets. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-09-04 20:40:19 +01:00
Kiyoshi Ueda	a77e28c7e1	dm multipath: fix oops when request based io fails when no paths The patch posted at http://marc.info/?l=dm-devel&m=124539787228784&w=2 which was merged into `cec47e3d4a` ("dm: prepare for request based option") introduced a regression in request-based dm. If map_request() calls dm_kill_unmapped_request() to complete a cloned bio without dispatching it, clone->bio is still set when dm_end_request() is called and the BUG_ON(clone->bio) is incorrect. The patch fixes this bug by freeing bio in dm_end_request() if the clone has bio. I've redone my tests to cover all I/O paths and confirmed there's no other regression. Here is the oops I hit in request-based dm when I do I/O to a multipath device which doesn't have any active path nor queue_if_no_path setting: ------------[ cut here ]------------ kernel BUG at /root/2.6.31-rc4.rqdm/drivers/md/dm.c:828! invalid opcode: 0000 [#1] SMP last sysfs file: /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map CPU 1 Modules linked in: autofs4 sunrpc cpufreq_ondemand acpi_cpufreq dm_mirror dm_region_hash dm_log dm_service_time dm_multipath scsi_dh dm_mod video output sbs sbshc battery ac sg sr_mod e1000e button cdrom serio_raw rtc_cmos rtc_core rtc_lib piix lpfc scsi_transport_fc ata_piix libata megaraid_sas sd_mod scsi_mod crc_t10dif ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unloaded: microcode] Pid: 7, comm: ksoftirqd/1 Not tainted 2.6.31-rc4.rqdm #1 Express5800/120Lj [N8100-1417] RIP: 0010:[<ffffffffa023629d>] [<ffffffffa023629d>] dm_softirq_done+0xbd/0x100 [dm_mod] RSP: 0018:ffff8800280a1f08 EFLAGS: 00010282 RAX: ffffffffa02544e0 RBX: ffff8802aa1111d0 RCX: ffff8802aa1111e0 RDX: ffff8802ab913e70 RSI: 0000000000000000 RDI: ffff8802ab913e70 RBP: ffff8800280a1f28 R08: ffffc90005457040 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000000 R12: 00000000fffffffb R13: ffff8802ab913e88 R14: ffff8802ab9c1438 R15: 0000000000000100 FS: 0000000000000000(0000) GS:ffff88002809e000(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 0000003d54a98640 CR3: 000000029f0a1000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process ksoftirqd/1 (pid: 7, threadinfo ffff8802ae50e000, task ffff8802ae4f8040) Stack: ffff8800280a1f38 0000000000000020 ffffffff814f30a0 0000000000000004 <0> ffff8800280a1f58 ffffffff8116b245 ffff8800280a1f38 ffff8800280a1f38 <0> ffff8800280a1f58 0000000000000001 ffff8800280a1fa8 ffffffff810477bc Call Trace: <IRQ> [<ffffffff8116b245>] blk_done_softirq+0x75/0x90 [<ffffffff810477bc>] __do_softirq+0xcc/0x210 [<ffffffff81047170>] ? ksoftirqd+0x0/0x110 [<ffffffff8100ce7c>] call_softirq+0x1c/0x50 <EOI> [<ffffffff8100e785>] do_softirq+0x65/0xa0 [<ffffffff81047170>] ? ksoftirqd+0x0/0x110 [<ffffffff810471e0>] ksoftirqd+0x70/0x110 [<ffffffff81059559>] kthread+0x99/0xb0 [<ffffffff8100cd7a>] child_rip+0xa/0x20 [<ffffffff8100c73c>] ? restore_args+0x0/0x30 [<ffffffff810594c0>] ? kthread+0x0/0xb0 [<ffffffff8100cd70>] ? child_rip+0x0/0x20 Code: 44 89 e6 48 89 df e8 23 fb f2 e0 be 01 00 00 00 4c 89 f7 e8 f6 fd ff ff 5b 41 5c 41 5d 41 5e c9 c3 4c 89 ef e8 85 fe ff ff eb ed <0f> 0b eb fe 41 8b 85 dc 00 00 00 48 83 bb 10 01 00 00 00 89 83 RIP [<ffffffffa023629d>] dm_softirq_done+0xbd/0x100 [dm_mod] RSP <ffff8800280a1f08> ---[ end trace 16af0a1d8542da55 ]--- Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-09-04 20:40:16 +01:00
Dan Williams	07a3b417dc	md/raid456: distribute raid processing over multiple cores Now that the resources to handle stripe_head operations are allocated percpu it is possible for raid5d to distribute stripe handling over multiple cores. This conversion also adds a call to cond_resched() in the non-multicore case to prevent one core from getting monopolized for raid operations. Cc: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2009-08-29 19:13:13 -07:00
Yuri Tikhonov	b774ef491b	md/raid6: remove synchronous infrastructure These routines have been replaced by there asynchronous counterparts. Signed-off-by: Yuri Tikhonov <yur@emcraft.com> Signed-off-by: Ilya Yanok <yanok@emcraft.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2009-08-29 19:13:13 -07:00
Yuri Tikhonov	6c0069c0ae	md/raid6: asynchronous handle_stripe6 1/ Use STRIPE_OP_BIOFILL to offload completion of read requests to raid_run_ops 2/ Implement a handler for sh->reconstruct_state similar to the raid5 case (adds handling of Q parity) 3/ Prevent handle_parity_checks6 from running concurrently with 'compute' operations 4/ Hook up raid_run_ops Signed-off-by: Yuri Tikhonov <yur@emcraft.com> Signed-off-by: Ilya Yanok <yanok@emcraft.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2009-08-29 19:13:13 -07:00
Dan Williams	d82dfee0ad	md/raid6: asynchronous handle_parity_check6 [ Based on an original patch by Yuri Tikhonov ] Implement the state machine for handling the RAID-6 parities check and repair functionality. Note that the raid6 case does not need to check for new failures, like raid5, as it will always writeback the correct disks. The raid5 case can be updated to check zero_sum_result to avoid getting confused by new failures rather than retrying the entire check operation. Signed-off-by: Yuri Tikhonov <yur@emcraft.com> Signed-off-by: Ilya Yanok <yanok@emcraft.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2009-08-29 19:13:13 -07:00
Yuri Tikhonov	a9b39a741a	md/raid6: asynchronous handle_stripe_dirtying6 In the synchronous implementation of stripe dirtying we processed a degraded stripe with one call to handle_stripe_dirtying6(). I.e. compute the missing blocks from the other drives, then copy in the new data and reconstruct the parities. In the asynchronous case we do not perform stripe operations directly. Instead, operations are scheduled with flags to be later serviced by raid_run_ops. So, for the degraded case the final reconstruction step can only be carried out after all blocks have been brought up to date by being read, or computed. Like the raid5 case schedule_reconstruction() sets STRIPE_OP_RECONSTRUCT to request a parity generation pass and through operation chaining can handle compute and reconstruct in a single raid_run_ops pass. [dan.j.williams@intel.com: fixup handle_stripe_dirtying6 gating] Signed-off-by: Yuri Tikhonov <yur@emcraft.com> Signed-off-by: Ilya Yanok <yanok@emcraft.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2009-08-29 19:13:12 -07:00
Yuri Tikhonov	5599becca4	md/raid6: asynchronous handle_stripe_fill6 Modify handle_stripe_fill6 to work asynchronously by introducing fetch_block6 as the raid6 analog of fetch_block5 (schedule compute operations for missing/out-of-sync disks). [dan.j.williams@intel.com: compute D+Q in one pass] Signed-off-by: Yuri Tikhonov <yur@emcraft.com> Signed-off-by: Ilya Yanok <yanok@emcraft.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2009-08-29 19:13:12 -07:00
Yuri Tikhonov	c0f7bddbe6	md/raid5,6: common schedule_reconstruction for raid5/6 Extend schedule_reconstruction5 for reuse by the raid6 path. Add support for generating Q and BUG() if a request is made to perform 'prexor'. Signed-off-by: Yuri Tikhonov <yur@emcraft.com> Signed-off-by: Ilya Yanok <yanok@emcraft.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2009-08-29 19:13:12 -07:00
Dan Williams	ac6b53b6e6	md/raid6: asynchronous raid6 operations [ Based on an original patch by Yuri Tikhonov ] The raid_run_ops routine uses the asynchronous offload api and the stripe_operations member of a stripe_head to carry out xor+pq+copy operations asynchronously, outside the lock. The operations performed by RAID-6 are the same as in the RAID-5 case except for no support of STRIPE_OP_PREXOR operations. All the others are supported: STRIPE_OP_BIOFILL - copy data into request buffers to satisfy a read request STRIPE_OP_COMPUTE_BLK - generate missing blocks (1 or 2) in the cache from the other blocks STRIPE_OP_BIODRAIN - copy data out of request buffers to satisfy a write request STRIPE_OP_RECONSTRUCT - recalculate parity for new data that has entered the cache STRIPE_OP_CHECK - verify that the parity is correct The flow is the same as in the RAID-5 case, and reuses some routines, namely: 1/ ops_complete_postxor (renamed to ops_complete_reconstruct) 2/ ops_complete_compute (updated to set up to 2 targets uptodate) 3/ ops_run_check (renamed to ops_run_check_p for xor parity checks) [neilb@suse.de: fixes to get it to pass mdadm regression suite] Reviewed-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Yuri Tikhonov <yur@emcraft.com> Signed-off-by: Ilya Yanok <yanok@emcraft.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2009-08-29 19:13:12 -07:00
Dan Williams	4e7d2c0aef	md/raid5: factor out mark_uptodate from ops_complete_compute5 ops_complete_compute5 can be reused in the raid6 path if it is updated to generically handle a second target. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2009-08-29 19:13:11 -07:00
Dan Williams	cb3c82992f	async_tx: raid6 recovery self test Port drivers/md/raid6test/test.c to use the async raid6 recovery routines. This is meant as a unit test for raid6 acceleration drivers. In addition to the 16-drive test case this implements tests for the 4-disk and 5-disk special cases (dma devices can not generically handle less than 2 sources), and adds a test for the D+Q case. Reviewed-by: Andre Noll <maan@systemlinux.org> Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2009-08-29 19:09:28 -07:00
Dan Williams	ad283ea4a3	async_tx: add sum check flags Replace the flat zero_sum_result with a collection of flags to contain the P (xor) zero-sum result, and the soon to be utilized Q (raid6 reed solomon syndrome) zero-sum result. Use the SUM_CHECK_ namespace instead of DMA_ since these flags will be used on non-dma-zero-sum enabled platforms. Reviewed-by: Andre Noll <maan@systemlinux.org> Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2009-08-29 19:09:26 -07:00
Dan Williams	d6f38f31f3	md/raid5,6: add percpu scribble region for buffer lists Use percpu memory rather than stack for storing the buffer lists used in parity calculations. Include space for dma address conversions and pass that to async_tx via the async_submit_ctl.scribble pointer. [ Impact: move memory pressure from stack to heap ] Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2009-08-29 19:09:26 -07:00
Dan Williams	36d1c6476b	md/raid6: move the spare page to a percpu allocation In preparation for asynchronous handling of raid6 operations move the spare page to a percpu allocation to allow multiple simultaneous synchronous raid6 recovery operations. Make this allocation cpu hotplug aware to maximize allocation efficiency. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2009-08-29 19:09:26 -07:00
Chandra Seetharaman	2bfd2e1337	[SCSI] scsi_dh: Use scsi_dh_set_params() in multipath. Use scsi_dh_set_params() set parameters provided. Save the parameters in parse_hw_handler() and use it in parse_path(). Reported-by: Eddie Williams <Eddie.Williams@steeleye.com> Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Tested-by: Eddie Williams <Eddie.Williams@steeleye.com> Cc: Alasdair G Kergon <agk@redhat.com> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: James Bottomley <James.Bottomley@suse.de>	2009-08-22 17:52:15 -05:00
Linus Torvalds	435a71d9ef	Merge branch 'for-linus' of git://neil.brown.name/md * 'for-linus' of git://neil.brown.name/md: Fix new incorrect error return from do_md_stop.	2009-08-18 13:54:08 -07:00
NeilBrown	80ffb3ccea	Fix new incorrect error return from do_md_stop. Recent commit `c8c00a6915` changed the exit paths in do_md_stop and was not quite careful enough. There is one path were 'err' now needs to be cleared but it isn't. So setting an array to readonly (with mdadm --readonly) will work, but will incorrectly report and error: ENXIO. Signed-off-by: NeilBrown <neilb@suse.de>	2009-08-18 10:35:26 +10:00
Randy Dunlap	894ef820b1	dm-log-userspace: fix printk format warning drivers/md/dm-log-userspace-transfer.c:110: warning: format '%lu' expects type 'long unsigned int', but argument 4 has type 'size_t' Previously posted and acked, but apparently lost. http://lkml.indiana.edu/hypermail/linux/kernel/0906.2/02074.html Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: dm-devel@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-08-16 08:35:58 -07:00
NeilBrown	4d484a4a7a	md: allow upper limit for resync/reshape to be set when array is read-only Normally we only allow the upper limit for a reshape to be decreased when the array not performing a sync/recovery/reshape, otherwise there could be races. But if an array is part-way through a reshape when it is assembled the reshape is started immediately leaving no window to set an upper bound. If the array is started read-only, the reshape will be suspended until the array becomes writable, so that provides a window during which it is perfectly safe to reduce the upper limit of a reshape. So: allow the upper limit (sync_max) to be reduced even if the reshape thread is running, as long as the array is still read-only. Signed-off-by: NeilBrown <neilb@suse.de>	2009-08-13 10:41:50 +10:00
NeilBrown	1a67dde0ab	md/raid5: Properly remove excess drives after shrinking a raid5/6 We were removing the drives, from the array, but not removing symlinks from /sys/.... and not marking the device as having been removed. Signed-off-by: NeilBrown <neilb@suse.de>	2009-08-13 10:41:49 +10:00
NeilBrown	a639755cf8	md/raid5: make sure a reshape restarts at the correct address. This "if" don't allow for the possibility that the number of devices doesn't change, and so sector_nr isn't set correctly in that case. So change '>' to '>='. Signed-off-by: NeilBrown <neilb@suse.de>	2009-08-13 10:13:00 +10:00
NeilBrown	67ac6011db	md/raid5: allow new reshape modes to be restarted in the middle. md/raid5 doesn't allow a reshape to restart if it involves writing over the same part of disk that it would be reading from. This happens at the beginning of a reshape that increases the number of devices, at the end of a reshape that decreases the number of devices, and continuously for a reshape that does not change the number of devices. The current code is correct for the "increase number of devices" case as the critical section at the start is handled by userspace performing a backup. It does not work for reducing the number of devices, or the no-change case. For 'reducing', we need to invert the test. For no-change we cannot really be sure things will be safe, so simply require the array to be read-only, which is how the user-space code which carefully starts such arrays works. Signed-off-by: NeilBrown <neilb@suse.de>	2009-08-13 10:06:24 +10:00
NeilBrown	51d5668cb2	md: never advance 'events' counter by more than 1. When assembling arrays, md allows two devices to have different event counts as long as the difference is only '1'. This is to cope with a system failure between updating the metadata on two difference devices. However there are currently times when we update the event count by 2. This was done to keep the event count even when the array is clean and odd when it is dirty, which allows us to avoid writing common update to spare devices and so allow those spares to go to sleep. This is bad for the above reason. So change it to never increase by two. This means that the alignment between 'odd/even' and 'clean/dirty' might take a little longer to attain, but that is only a small cost. The spares will get a few more updates but that will still be spared (;-) most updates and can still go to sleep. Prior to this patch there was a small chance that after a crash an array would fail to assemble due to the overly large event count mismatch. Signed-off-by: NeilBrown <neilb@suse.de>	2009-08-13 09:54:02 +10:00
NeilBrown	c8c00a6915	Remove deadlock potential in md_open A recent commit: commit `449aad3e25` introduced the possibility of an A-B/B-A deadlock between bd_mutex and reconfig_mutex. __blkdev_get holds bd_mutex while calling md_open which takes reconfig_mutex, do_md_run is always called with reconfig_mutex held, and it now takes bd_mutex in the call the revalidate_disk. This potential deadlock was not caught by lockdep due to the use of mutex_lock_interruptible_nexted which was introduced by commit `d63a5a74de` do avoid a warning of an impossible deadlock. It is quite possible to split reconfig_mutex in to two locks. One protects the array data structures while it is being reconfigured, the other ensures that an array is never even partially open while it is being deactivated. In particular, the second lock prevents an open from completing between the time when do_md_stop checks if there are any active opens, and the time when the array is either set read-only, or when ->pers is set to NULL. So we can be certain that no IO is in flight as the array is being destroyed. So create a new lock, open_mutex, just to ensure exclusion between 'open' and 'stop'. This avoids the deadlock and also avoids the lockdep warning mentioned in commit `d63a5a74d` Reported-by: "Mike Snitzer" <snitzer@gmail.com> Reported-by: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: NeilBrown <neilb@suse.de>	2009-08-10 12:50:52 +10:00
NeilBrown	449aad3e25	md: Use revalidate_disk to effect changes in size of device. As revalidate_disk calls check_disk_size_change, it will cause any capacity change of a gendisk to be propagated to the blockdev inode. So use that instead of mucking about with locks and i_size_write. Also add a call to revalidate_disk in do_md_run and a few other places where the gendisk capacity is changed. Signed-off-by: NeilBrown <neilb@suse.de>	2009-08-03 10:59:58 +10:00
NeilBrown	64bd660b51	md: allow raid5_quiesce to work properly when reshape is happening. The ->quiesce method is not supposed to stop resync/recovery/reshape, just normal IO. But in raid5 we don't have a way to know which stripes are being used for normal IO and which for resync etc, so we need to wait for all stripes to be idle to be sure that all writes have completed. However reshape keeps at least some stripe busy for an extended period of time, so a call to raid5_quiesce can block for several seconds needlessly. So arrange for reshape etc to pause briefly while raid5_quiesce is trying to quiesce the array so that the active_stripes count can drop to zero. Signed-off-by: NeilBrown <neilb@suse.de>	2009-08-03 10:59:58 +10:00
NeilBrown	e516402c0d	md/raid5: set reshape_position correctly when reshape starts. As the internal reshape_progress counter is the main driver for reshape, the fact that reshape_position sometimes starts with the wrong value has minimal effect. It is visible in sysfs and that is all. Signed-off-by: NeilBrown <neilb@suse.de>	2009-08-03 10:59:57 +10:00
NeilBrown	70471dafe3	md: Handle growth of v1.x metadata correctly. The v1.x metadata does not have a fixed size and can grow when devices are added. If it grows enough to require an extra sector of storage, we need to update the 'sb_size' to match. Without this, md can write out an incomplete superblock with a bad checksum, which will be rejected when trying to re-assemble the array. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2009-08-03 10:59:57 +10:00
NeilBrown	3673f305fa	md: avoid array overflow with bad v1.x metadata We trust the 'desc_nr' field in v1.x metadata enough to use it as an index in an array. This isn't really safe. So range-check the value first. Signed-off-by: NeilBrown <neilb@suse.de>	2009-08-03 10:59:56 +10:00
NeilBrown	3a981b03f3	md: when a level change reduces the number of devices, remove the excess. When an array is changed from RAID6 to RAID5, fewer drives are needed. So any device that is made superfluous by the level conversion must be marked as not-active. For the RAID6->RAID5 conversion, this will be a drive which only has 'Q' blocks on it. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2009-08-03 10:59:55 +10:00
Andre Noll	ac5e7113e7	md: Push down data integrity code to personalities. This patch replaces md_integrity_check() by two new public functions: md_integrity_register() and md_integrity_add_rdev() which are both personality-independent. md_integrity_register() is called from the ->run and ->hot_remove methods of all personalities that support data integrity. The function iterates over the component devices of the array and determines if all active devices are integrity capable and if their profiles match. If this is the case, the common profile is registered for the mddev via blk_integrity_register(). The second new function, md_integrity_add_rdev() is called from the ->hot_add_disk methods, i.e. whenever a new device is being added to a raid array. If the new device does not support data integrity, or has a profile different from the one already registered, data integrity for the mddev is disabled. For raid0 and linear, only the call to md_integrity_register() from the ->run method is necessary. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-08-03 10:59:47 +10:00
Dan Williams	95fc17aac4	md/raid6: release spare page at ->stop() Add missing call to safe_put_page from stop() by unifying open coded raid5_conf_t de-allocation under free_conf(). Cc: <stable@kernel.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2009-07-31 12:39:15 +10:00
Mike Snitzer	5dea271b6d	dm table: pass correct dev area size to device_area_is_valid Incorrect device area lengths are being passed to device_area_is_valid(). The regression appeared in 2.6.31-rc1 through commit `754c5fc7eb`. With the dm-stripe target, the size of the target (ti->len) was used instead of the stripe_width (ti->len/#stripes). An example of a consequent incorrect error message is: device-mapper: table: 254:0: sdb too small for target Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-07-23 20:30:42 +01:00
Mike Snitzer	a732c207d1	dm: remove queue next_ordered workaround for barriers This patch removes DM's bio-based vs request-based conditional setting of next_ordered. For bio-based DM the next_ordered check is no longer a concern (as that check is now in the __make_request path). For request-based DM the default of QUEUE_ORDERED_NONE is now appropriate. bio-based DM was changed to work-around the previously misplaced next_ordered check with this commit: `99360b4c18` request-based DM does not yet support barriers but reacted to the above bio-based DM change with this commit: `5d67aa2366` The above changes are no longer needed given Neil Brown's recent fix to put the next_ordered check in the __make_request path: `db64f680ba` Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: NeilBrown <neilb@suse.de> Acked-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-07-23 20:30:40 +01:00
Mikulas Patocka	69885683d2	dm raid1: wake kmirrord when requeueing delayed bios after remote recovery The recent commit `7513c2a761` (dm raid1: add is_remote_recovering hook for clusters) changed do_writes() to update the ms->writes list but forgot to wake up kmirrord to process it. The rule is that when anything is being added on ms->reads, ms->writes or ms->failures and the list was empty before we must call wakeup_mirrord (for immediate processing) or delayed_wake (for delayed processing). Otherwise the bios could sit on the list indefinitely. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> CC: stable@kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-07-23 20:30:37 +01:00
Dan Williams	a11034b428	md/raid6: release spare page at ->stop() Add missing call to safe_put_page from stop() by unifying open coded raid5_conf_t de-allocation under free_conf(). Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2009-07-14 11:48:16 -07:00
Jens Axboe	8aa7e847d8	Fix congestion_wait() sync/async vs read/write confusion Commit `1faa16d228` accidentally broke the bdi congestion wait queue logic, causing us to wait on congestion for WRITE (== 1) when we really wanted BLK_RW_ASYNC (== 0) instead. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-07-10 20:31:53 +02:00
Joe Perches	ad361c9884	Remove multiple KERN_ prefixes from printk formats Commit `5fd29d6ccb` ("printk: clean up handling of log-levels and newlines") changed printk semantics. printk lines with multiple KERN_<level> prefixes are no longer emitted as before the patch. <level> is now included in the output on each additional use. Remove all uses of multiple KERN_<level>s in formats. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-07-08 10:30:03 -07:00
Linus Torvalds	2027bd9f92	Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block * 'for-linus' of git://git.kernel.dk/linux-2.6-block: cfq-iosched: remove redundant check for NULL cfqq in cfq_set_request() blocK: Restore barrier support for md and probably other virtual devices. block: get rid of queue-private command filter block: Create bip slabs with embedded integrity vectors cfq-iosched: get rid of the need for __GFP_NOFAIL in cfq_find_alloc_queue() cfq-iosched: move cfqq initialization out of cfq_find_alloc_queue() Trivial typo fixes in Documentation/block/data-integrity.txt.	2009-07-01 10:41:09 -07:00

... 3 4 5 6 7 ...

1794 Commits