linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-23 12:42:02 +00:00

Author	SHA1	Message	Date
Shaohua Li	bd18f6462f	md: skip resync for raid array with journal If a raid array has a journal, the journal can guarantee consistency, so we can skip the resync after an unclean shutdown. The exceptions are raid creation and user-initiated resync, for which we still do a raid resync. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:26 +11:00
Shaohua Li	3069aa8def	md: override md superblock recovery_offset for journal device Journal device stores data in a log structure. We need to record the log start, so we override the md superblock recovery_offset for this purpose. This field is otherwise meaningless for a journal device. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-24 17:16:18 +11:00
Song Liu	bac624f3f8	MD: add a new disk role to present write journal device The next patches will use a disk for raid5/6 journaling. We need a new disk role to represent the journal device, and we add MD_FEATURE_JOURNAL to feature_map for backward compatibility. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-24 17:16:18 +11:00
Goldwyn Rodrigues	70bcecdb15	md-cluster: Improve md_reload_sb to be less error prone md_reload_sb is too simplistic and it explicitly needs to determine the changes made by the writing node. However, there are multiple areas where a simple reload could fail. Instead, read the superblock of one of the "good" rdevs and update the necessary information: - read the superblock into a newly allocated page, by temporarily swapping out rdev->sb_page and calling ->load_super. - if that fails return - if it succeeds, call check_sb_changes 1. iterates over list of active devices and checks the matching dev_roles[] value. If that is 'faulty', the device must be marked as faulty - call md_error to mark the device as faulty. Make sure not to set CHANGE_DEVS and wakeup mddev->thread or else it would initiate a resync process, which is the responsibility of the "primary" node. - clear the Blocked bit - Call remove_and_add_spares() to hot remove the device. If the device is 'spare': - call remove_and_add_spares() to get the number of spares added in this operation. - Reduce mddev->degraded to mark the array as not degraded. 2. reset recovery_cp - read the rest of the rdevs to update recovery_offset. If recovery_offset is equal to MaxSector, call spare_active() to set it In_sync This required that recovery_offset be initialized to MaxSector, as opposed to zero so as to communicate the end of sync for a rdev. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>	2015-10-12 01:34:48 -05:00
Kent Overstreet	8ae126660f	block: kill merge_bvec_fn() completely As generic_make_request() is now able to handle arbitrarily sized bios, it's no longer necessary for each individual block driver to define its own ->merge_bvec_fn() callback. Remove every invocation completely. Cc: Jens Axboe <axboe@kernel.dk> Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: drbd-user@lists.linbit.com Cc: Jiri Kosina <jkosina@suse.cz> Cc: Yehuda Sadeh <yehuda@inktank.com> Cc: Sage Weil <sage@inktank.com> Cc: Alex Elder <elder@kernel.org> Cc: ceph-devel@vger.kernel.org Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Cc: Neil Brown <neilb@suse.de> Cc: linux-raid@vger.kernel.org Cc: Christoph Hellwig <hch@infradead.org> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Acked-by: NeilBrown <neilb@suse.de> (for the 'md' bits) Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> [dpark: also remove ->merge_bvec_fn() in dm-thin as well as dm-era-target, and resolve merge conflicts] Signed-off-by: Dongsu Park <dpark@posteo.net> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-08-13 12:31:57 -06:00
Tejun Heo	66114cad64	writeback: separate out include/linux/backing-dev-defs.h With the planned cgroup writeback support, backing-dev related declarations will be more widely used across block and cgroup; unfortunately, including backing-dev.h from include/linux/blkdev.h makes cyclic include dependency quite likely. This patch separates out backing-dev-defs.h which only has the essential definitions and updates blkdev.h to include it. c files which need access to more backing-dev details now include backing-dev.h directly. This takes backing-dev.h off the common include dependency chain making it a lot easier to use it across block and cgroup. v2: fs/fat build failure fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 08:33:34 -06:00
NeilBrown	09314799e4	md: remove 'go_faster' option from ->sync_request() This option is not well justified and testing suggests that it hardly ever makes any difference. The comment suggests there might be a need to wait for non-resync activity indicated by ->nr_waiting, however raise_barrier() already waits for all of that. So just remove it to simplify reasoning about speed limiting. This allows us to remove a 'FIXME' comment from raid5.c as that never used the flag. Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:40 +10:00
Goldwyn Rodrigues	57d051dcca	md: Export and rename find_rdev_nr_rcu This is required by the clustering module (patches to follow) to find the device to remove or re-add. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 07:59:39 +10:00
Goldwyn Rodrigues	fb56dfef4e	md: Export and rename kick_rdev_from_array This export is required for clustering module in order to co-ordinate remove/readd a rdev from all nodes. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 07:59:39 +10:00
Goldwyn Rodrigues	1aee41f637	Add new disk to clustered array Algorithm: 1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CLUSTER_ADD) 2. Node 1 sends NEWDISK with uuid and slot number 3. Other nodes issue kobject_uevent_env with uuid and slot number (Steps 4,5 could be a udev rule) 4. In userspace, the node searches for the disk, perhaps using blkid -t SUB_UUID="" 5. Other nodes issue either of the following depending on whether the disk was found: ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and disc.number set to slot number) ioctl(CLUSTERED_DISK_NACK) 6. Other nodes drop lock on no-new-devs (CR) if device is found 7. Node 1 attempts EX lock on no-new-devs 8. If node 1 gets the lock, it sends METADATA_UPDATED after unmarking the disk as SpareLocal 9. If not (get no-new-dev lock), it fails the operation and sends METADATA_UPDATED 10. Other nodes understand if the device is added or not by reading the superblock again after receiving the METADATA_UPDATED message. Signed-off-by: Lidong Zhong <lzhong@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>	2015-02-23 09:59:07 -06:00
Goldwyn Rodrigues	1d7e3e9611	Reload superblock if METADATA_UPDATED is received Re-reads the devices by invalidating the cache. Since we don't write to faulty devices, a faulty device is detected using the events count recorded on each device: if it is older than that of the mddev, the device is marked as faulty. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>	2015-02-23 09:59:06 -06:00
Goldwyn Rodrigues	cf921cc19c	Add node recovery callbacks DLM offers callbacks when a node fails and the lock remastery is performed: 1. recover_prep: called when DLM discovers a node is down 2. recover_slot: called when DLM identifies the node and recovery can start 3. recover_done: called when all nodes have completed recover_slot recover_slot() and recover_done() are also called when the node joins initially in order to inform the node with its slot number. These slot numbers start from one, so we deduct one to make it start with zero which the cluster-md code uses. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>	2015-02-23 07:30:11 -06:00
Goldwyn Rodrigues	c4ce867fda	Introduce md_cluster_info md_cluster_info stores the cluster information in the MD device. The join() is called when mddev detects it is a clustered device. The main responsibilities are: 1. Setup a DLM lockspace 2. Setup all initial locks such as super block locks and bitmap lock (will come later) The leave() clears up the lockspace and all the locks held. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>	2015-02-23 07:28:42 -06:00
Goldwyn Rodrigues	edb39c9ded	Introduce md_cluster_operations to handle cluster functions This allows dynamic registering of cluster hooks. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>	2015-02-23 07:28:42 -06:00
Hannes Reinecke	dfe15ac1c6	md: wakeup thread upon rdev_dec_pending() After each call to rdev_dec_pending() we should wakeup the md thread if the device is found to be faulty. Otherwise we'll incur heavy delays on failing devices. Signed-off-by: Neil Brown <nfbrown@suse.de> Signed-off-by: Hannes Reinecke <hare@suse.de>	2015-02-06 09:32:57 +11:00
NeilBrown	5c47daf6e7	md: move mddev_lock and related to md.h The one which is not inline (mddev_unlock) gets EXPORTed. This makes the locking available to personality modules so that it doesn't have to be imposed upon them. Signed-off-by: NeilBrown <neilb@suse.de>	2015-02-06 09:32:56 +11:00
NeilBrown	23da422b19	md: use mddev->lock to protect updates to resync_{min,max}. There are interdependencies between these two sysfs attributes and whether a resync is currently running. Rather than depending on reconfig_mutex to ensure no races when testing these interdependencies are met, use the spinlock. This will allow the mutex to be removed from protecting this code in a subsequent patch. Signed-off-by: NeilBrown <neilb@suse.de>	2015-02-06 09:32:56 +11:00
NeilBrown	4af1a04176	md: move GET_BITMAP_FILE ioctl out from mddev_lock. It makes more sense to report bitmap_info->file, rather than bitmap->file (the latter is only available once the array is active). With that change, use mddev->lock to protect bitmap_info being set to NULL, and we can call get_bitmap_file() without taking the mutex. Signed-off-by: NeilBrown <neilb@suse.de>	2015-02-06 09:32:56 +11:00
NeilBrown	978a7a47ca	md/bitmap: protect clearing of ->bitmap by mddev->lock This makes it safe to inspect the struct while holding only the spinlock. Signed-off-by: NeilBrown <neilb@suse.de>	2015-02-06 09:32:55 +11:00
NeilBrown	36d091f475	md: protect ->pers changes with mddev->lock ->pers is already protected by ->reconfig_mutex, and cannot possibly change when there are threads running or outstanding IO. However there are some places where we access ->pers not in a thread or IO context, and where ->reconfig_mutex is unnecessarily heavy-weight: level_show and md_seq_show(). So protect all changes, and those accesses, with ->lock. This is a step toward taking those accesses out from under reconfig_mutex. [Fixed missing "mddev->pers" -> "pers" conversion, thanks to Dan Carpenter <dan.carpenter@oracle.com>] Signed-off-by: NeilBrown <neilb@suse.de>	2015-02-04 08:35:53 +11:00
NeilBrown	afa0f557cb	md: rename ->stop to ->free Now that the ->stop function only frees the private data, rename it accordingly. Also pass in the private pointer as an arg rather than using mddev->private. This flexibility will be useful in level_store(). Finally, don't clear ->private. It doesn't make sense to clear it seeing that isn't what we free, and it is no longer necessary to clear ->private (it was some time ago before ->to_remove was introduced). Setting ->to_remove in ->free() is a bit of a wart, but not a big problem at the moment. Signed-off-by: NeilBrown <neilb@suse.de>	2015-02-04 08:35:52 +11:00
NeilBrown	64590f45dd	md: make merge_bvec_fn more robust in face of personality changes. There is no locking around calls to merge_bvec_fn(), so it is possible that calls which coincide with a level (or personality) change could go wrong. So create a central dispatch point for these functions and use rcu_read_lock(). If the array is suspended, reject any merge that can be rejected. If not, we know it is safe to call the function. Signed-off-by: NeilBrown <neilb@suse.de>	2015-02-04 08:35:52 +11:00
NeilBrown	5c675f83c6	md: make ->congested robust against personality changes. There is currently no locking around calls to the 'congested' bdi function. If called at an awkward time while an array is being converted from one level (or personality) to another, there is a tiny chance of running code in an unreferenced module etc. So add a 'congested' function to the md_personality operations structure, and call it with appropriate locking from a central 'mddev_congested'. When the array personality is changing the array will be 'suspended' so no IO is processed. If mddev_congested detects this, it simply reports that the array is congested, which is a safe guess. As mddev_suspend calls synchronize_rcu(), mddev_congested can avoid races by including the whole call inside an rcu_read_lock() region. This requires that the congested functions for all subordinate devices can be run under rcu_lock. Fortunately this is the case. Signed-off-by: NeilBrown <neilb@suse.de>	2015-02-04 08:35:52 +11:00
NeilBrown	85572d7c75	md: rename mddev->write_lock to mddev->lock This lock is used for (slightly) more than helping with writing superblocks, and it will soon be extended further. So the name is inappropriate. Also, the _irq variant hasn't been needed since 2.6.37 as it is never taken from interrupt or bh context. So: -rename write_lock to lock -document what it protects -remove _irq ... except in md_flush_request() as there is no wait_event_lock() (with no _irq). This can be cleaned up after appropriate changes to wait.h. Signed-off-by: NeilBrown <neilb@suse.de>	2015-02-04 08:35:52 +11:00
NeilBrown	f72ffdd686	md: remove unwanted white space from md.c My editor shows much of this is RED. Signed-off-by: NeilBrown <neilb@suse.de>	2014-10-14 13:08:29 +11:00
NeilBrown	035328c202	md/bitmap: don't abuse i_writecount for bitmap files. md bitmap code currently tries to use i_writecount to stop any other process from writing to our bitmap file. But that is really an abuse and has bit-rotted so locking is all wrong. So discard that - root should be allowed to shoot self in foot. Still use it in a much less intrusive way to stop the same file being used as bitmap on two different arrays, and apply other checks to ensure the file is at least vaguely usable for bitmap storage (is regular, is open for write. Support for ->bmap is already checked elsewhere). Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: NeilBrown <neilb@suse.de>	2014-04-09 12:26:59 +10:00
Linus Torvalds	d3bad75a6d	Driver core / sysfs patches for 3.14-rc1 Here's the big driver core and sysfs patch set for 3.14-rc1. There's a lot of work here moving sysfs logic out into a "kernfs" to allow other subsystems to also have a virtual filesystem with the same attributes of sysfs (handle device disconnect, dynamic creation / removal as needed / unneeded, etc.). This is primarily being done for the cgroups filesystem, but the goal is to also move debugfs to it when it is ready, solving all of the known issues in that filesystem as well. The code isn't completed yet, but all should be stable now (there is a big section that was reverted due to problems found when testing.) There's also some other smaller fixes, and a driver core addition that allows for a "collection" of objects, that the DRM people will be using soon (it's in this tree to make merges after -rc1 easier.) All of this has been in linux-next with no reported issues. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Merge tag 'driver-core-3.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core / sysfs patches from Greg KH. * tag 'driver-core-3.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (113 commits) kernfs: associate a new kernfs_node with its parent on creation kernfs: add struct dentry declaration in kernfs.h kernfs: fix get_active failure handling in kernfs_seq_() Revert "kernfs: fix get_active failure handling in kernfs_seq_()" Revert "kernfs: replace kernfs_node->u.completion with kernfs_root->deactivate_waitq" Revert "kernfs: remove KERNFS_ACTIVE_REF and add kernfs_lockdep()" Revert "kernfs: remove KERNFS_REMOVED" Revert "kernfs: restructure removal path to fix possible premature return" Revert "kernfs: invoke kernfs_unmap_bin_file() directly from __kernfs_remove()" Revert "kernfs: remove kernfs_addrm_cxt" Revert "kernfs: make kernfs_get_active() block if the node is deactivated but not removed" Revert "kernfs: implement kernfs_{de|re}activate[_self]()" Revert "kernfs, sysfs, driver-core: implement kernfs_remove_self() and its wrappers" Revert "pci: use device_remove_file_self() instead of device_schedule_callback()" Revert "scsi: use device_remove_file_self() instead of device_schedule_callback()" Revert "s390: use device_remove_file_self() instead of device_schedule_callback()" Revert "sysfs, driver-core: remove unused {sysfs|device}_schedule_callback_owner()" Revert "kernfs: remove unnecessary NULL check in __kernfs_remove()" kernfs: remove unnecessary NULL check in __kernfs_remove() drivers/base: provide an infrastructure for componentised subsystems ...	2014-01-20 15:49:44 -08:00
NeilBrown	8313b8e57f	md: fix problem when adding device to read-only array with bitmap. If an array is started degraded, and then the missing device is found it can be re-added and a minimal bitmap-based recovery will bring it fully up-to-date. If the array is read-only a recovery would not be allowed. But also if the array is read-only and the missing device was present very recently, then there could be no need for any recovery at all, so we simply include the device in the read-only array without any recovery. However... if the missing device was removed a little longer ago it could be missing some updates, but if a bitmap is present it will be conditionally accepted pending a bitmap-based update. We don't currently detect this case properly and will include that old device into the read-only array with no recovery even though it really needs a recovery. This patch keeps track of whether a bitmap-based-recovery is really needed or not in the new Bitmap_sync rdev flag. If that is set, then the device will not be added to a read-only array. Cc: Andrei Warkentin <andreiw@vmware.com> Fixes: `d70ed2e4fa` Cc: stable@vger.kernel.org (3.2+) Signed-off-by: NeilBrown <neilb@suse.de>	2014-01-14 16:44:08 +11:00
Tejun Heo	324a56e16e	kernfs: s/sysfs_dirent/kernfs_node/ and rename its friends accordingly kernfs has just been separated out from sysfs and we're already in full conflict mode. Nothing can make the situation any worse. Let's take the chance to name things properly. This patch performs the following renames. * s/sysfs_elem_dir/kernfs_elem_dir/ * s/sysfs_elem_symlink/kernfs_elem_symlink/ * s/sysfs_elem_attr/kernfs_elem_file/ * s/sysfs_dirent/kernfs_node/ * s/sd/kn/ in kernfs proper * s/parent_sd/parent/ * s/target_sd/target/ * s/dir_sd/parent/ * s/to_sysfs_dirent()/rb_to_kn()/ * misc renames of local vars when they conflict with the above Because md, mic and gpio dig into sysfs details, this patch ends up modifying them. All are sysfs_dirent renames and trivial. While we can avoid these by introducing a dummy wrapping struct sysfs_dirent around kernfs_node, given the limited usage outside kernfs and sysfs proper, I don't think such workaround is called for. This patch is strictly rename only and doesn't introduce any functional difference. - mic / gpio renames were missing. Spotted by kbuild test robot. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Neil Brown <neilb@suse.de> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com> Cc: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-12-11 15:28:36 -08:00
Linus Torvalds	0910c0bdf7	Merge branch 'for-3.13/core' of git://git.kernel.dk/linux-block Pull block IO core updates from Jens Axboe: "This is the pull request for the core changes in the block layer for 3.13. It contains: - The new blk-mq request interface. This is a new and more scalable queueing model that marries the best part of the request based interface we currently have (which is fully featured, but scales poorly) and the bio based "interface" which the new drivers for high IOPS devices end up using because it's much faster than the request based one. The bio interface has no block layer support, since it taps into the stack much earlier. This means that drivers end up having to implement a lot of functionality on their own, like tagging, timeout handling, requeue, etc. The blk-mq interface provides all these. Some drivers even provide a switch to select bio or rq and have code to handle both, since things like merging only work in the rq model and hence it is faster for some workloads. This is a huge mess. Conversion of these drivers nets us a substantial code reduction. Initial results on converting SCSI to this model even show an 8x improvement on single queue devices. So while the model was intended to work on the newer multiqueue devices, it has substantial improvements for "classic" hardware as well. This code has gone through extensive testing and development, it's now ready to go. A pull request to convert virtio-blk to this model will be coming as well, with more drivers scheduled for 3.14 conversion. - Two blktrace fixes from Jan and Chen Gang. - A plug merge fix from Alireza Haghdoost. - Conversion of __get_cpu_var() from Christoph Lameter. - Fix for sector_div() with 64-bit divider from Geert Uytterhoeven. - A fix for a race between request completion and the timeout handling from Jeff Moyer. This is what caused the merge conflict with blk-mq/core, in case you are looking at that. - A dm stacking fix from Mike Snitzer.
- A code consolidation fix and duplicated code removal from Kent Overstreet. - A handful of block bug fixes from Mikulas Patocka, fixing a loop crash and memory corruption on blk cg. - Elevator switch bug fix from Tomoki Sekiyama. A heads-up that I had to rebase this branch. Initially the immutable bio_vecs had been queued up for inclusion, but a week later, it became clear that it wasn't fully cooked yet. So the decision was made to pull this out and postpone it until 3.14. It was a straight forward rebase, just pruning out the immutable series and the later fixes of problems with it. The rest of the patches applied directly and no further changes were made" * 'for-3.13/core' of git://git.kernel.dk/linux-block: (31 commits) block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO block: Do not call sector_div() with a 64-bit divisor kernel: trace: blktrace: remove redundent memcpy() in compat_blk_trace_setup() block: Consolidate duplicated bio_trim() implementations block: Use rw_copy_check_uvector() block: Enable sysfs nomerge control for I/O requests in the plug list block: properly stack underlying max_segment_size to DM device elevator: acquire q->sysfs_lock in elevator_change() elevator: Fix a race in elevator switching and md device initialization block: Replace __get_cpu_var uses bdi: test bdi_init failure block: fix a probe argument to blk_register_region loop: fix crash if blk_alloc_queue fails blk-core: Fix memory corruption if blkcg_init_queue fails block: fix race between request completion and timeout handling blktrace: Send BLK_TN_PROCESS events to all running traces blk-mq: don't disallow request merges for req->special being set blk-mq: mq plug list breakage blk-mq: fix for flush deadlock ...	2013-11-14 12:08:14 +09:00
Kent Overstreet	6678d83f18	block: Consolidate duplicated bio_trim() implementations Someone cut and pasted md's md_trim_bio() into xen-blkfront.c. Come on, we should know better than this. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Neil Brown <neilb@suse.de> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-11-08 09:02:31 -07:00
Tejun Heo	388975ccca	sysfs: clean up sysfs_get_dirent() The pre-existing sysfs interfaces which take explicit namespace argument are weird in that they place the optional @ns in front of @name which is contrary to the established convention. For example, we end up forcing vast majority of sysfs_get_dirent() users to do sysfs_get_dirent(parent, NULL, name), which is silly and error-prone especially as @ns and @name may be interchanged without causing compilation warning. This renames sysfs_get_dirent() to sysfs_get_dirent_ns() and swap the positions of @name and @ns, and sysfs_get_dirent() is now a wrapper around sysfs_get_dirent_ns(). This makes confusions a lot less likely. There are other interfaces which take @ns before @name. They'll be updated by following patches. This patch doesn't introduce any functional changes. v2: EXPORT_SYMBOL_GPL() wasn't updated leading to undefined symbol error on module builds. Reported by build test robot. Fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Kay Sievers <kay@vrfy.org> Cc: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-09-26 15:33:18 -07:00
NeilBrown	260fa034ef	md: avoid deadlock when dirty buffers during md_stop. When the last process closes /dev/mdX sync_blockdev will be called so that all buffers get flushed. So if it is then opened for the STOP_ARRAY ioctl to be sent there will be nothing to flush. However if we open /dev/mdX in order to send the STOP_ARRAY ioctl just moments before some other process which was writing closes their file descriptor, then there won't be a 'last close' and the buffers might not get flushed. So do_md_stop() calls sync_blockdev(). However at this point it is holding ->reconfig_mutex. So if the array is currently 'clean' then the writes from sync_blockdev() will not complete until the array can be marked dirty and that won't happen until some other thread can get ->reconfig_mutex. So we deadlock. We need to move the sync_blockdev() call to before we take ->reconfig_mutex. However then some other thread could open /dev/mdX and write to it after we call sync_blockdev() and before we actually stop the array. This can leave dirty data in the page cache which is awkward. So introduce new flag MD_STILL_CLOSED. Set it before calling sync_blockdev(), clear it if anyone does open the file, and abort the STOP_ARRAY attempt if it gets set before we lock against further opens. It is still possible to get problems if you open /dev/mdX, write to it, then issue the STOP_ARRAY ioctl. Just don't do that. Signed-off-by: NeilBrown <neilb@suse.de>	2013-08-27 16:45:00 +10:00
NeilBrown	7a0a5355cb	md: Don't test all of mddev->flags at once. mddev->flags is mostly used to record if an update of the metadata is needed. Sometimes the whole field is tested instead of just the important bits. This makes it difficult to introduce more state bits. So replace all bare tests of mddev->flags with tests for the bits that actually need testing. Signed-off-by: NeilBrown <neilb@suse.de>	2013-08-27 16:28:23 +10:00
Jonathan Brassow	c4a3955145	MD: Remember the last sync operation that was performed This patch adds a field to the mddev structure to track the last sync operation that was performed. This is especially useful when it comes to what is recorded in mismatch_cnt in sysfs. If the last operation was "data-check", then it reports the number of discrepancies found by the user-initiated check. If it was a "repair" operation, then it is reporting the number of discrepancies repaired. etc. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2013-06-26 12:38:24 +10:00
Jonathan Brassow	a91d5ac048	MD: Export 'md_reap_sync_thread' function Make 'md_reap_sync_thread' available to other files, specifically dm-raid.c. - rename reap_sync_thread to md_reap_sync_thread - move the fn after md_check_recovery to match md.h declaration placement - export md_reap_sync_thread Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2013-04-24 11:42:43 +10:00
Jonathan Brassow	90584fc93d	MD: Prevent sysfs operations on uninitialized kobjects Device-mapper does not use sysfs; but when device-mapper is leveraging MD's RAID personalities, MD sometimes attempts to update sysfs. This patch adds checks for 'mddev->kobj.sd' in sysfs_[un]link_rdev to ensure it is about to operate on something valid. This patch also checks for 'mddev->kobj.sd' before calling 'sysfs_notify' in 'remove_and_add_spares'. Although 'sysfs_notify' already makes this check, doing so in 'remove_and_add_spares' prevents an additional mutex operation. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2013-03-20 13:17:57 +11:00
Linus Torvalds	ea88eeac0c	md update for 3.8 Mostly just little fixes. Probably biggest part is AVX accelerated RAID6 calculations. Merge tag 'md-3.8' of git://neil.brown.name/md Pull md update from Neil Brown. * tag 'md-3.8' of git://neil.brown.name/md: md/raid5: add blktrace calls md/raid5: use async_tx_quiesce() instead of open-coding it. md: Use ->curr_resync as last completed request when cleanly aborting resync. lib/raid6: build proper files on corresponding arch lib/raid6: Add AVX2 optimized gen_syndrome functions lib/raid6: Add AVX2 optimized recovery functions md: Update checkpoint of resync/recovery based on time. md:Add place to update ->recovery_cp. md.c: re-indent various 'switch' statements. md: close race between removing and adding a device. md: removed unused variable in calc_sb_1_csm.	2012-12-18 09:32:44 -08:00
majianpeng	0a19caabf0	md: Use ->curr_resync as last completed request when cleanly aborting resync. If a resync is aborted cleanly, ->curr_resync is a reliable record of where we got up to. If there was an error it is less reliable but we always know that ->curr_resync_completed is safe. So add a flag MD_RECOVERY_ERROR to differentiate between these cases and set recovery_cp accordingly. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-12-13 19:52:11 +11:00
Lukas Czerner	eed8c02e68	wait: add wait_event_lock_irq() interface New wait_event{_interruptible}_lock_irq{_cmd} macros added. This commit moves the private wait_event_lock_irq() macro from MD to regular wait includes, introduces a new macro wait_event_lock_irq_cmd() instead of the old, ugly method of omitting the cmd parameter, and makes use of the new macros in MD. It also introduces the _interruptible_ variant. The new interface is for use when one has a special lock to protect data structures used in the condition, or one also needs to invoke "cmd" before putting it to sleep. All new macros are expected to be called with the lock taken. The lock is released before sleep and is reacquired afterwards. We will leave the macro with the lock held. Note to DM: IMO this should also fix theoretical race on waitqueue while using simultaneously wait_event_lock_irq() and wait_event() because of lack of locking around current state setting and wait queue removal. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Cc: Neil Brown <neilb@suse.de> Cc: David Howells <dhowells@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-11-30 11:47:57 +01:00
Jianpeng Ma	7f7583d420	md: change resync_mismatches to atomic64_t to avoid races Now that multiple threads can handle stripes, it is safer to use an atomic64_t for resync_mismatches, to avoid update races. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-10-11 14:17:59 +11:00
Shaohua Li	4ed8731d8e	MD: change the parameter of md thread Change the thread parameter, so the thread can carry extra info. Next patch will use it. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-10-11 13:34:00 +11:00
NeilBrown	74018dc306	blk: pass from_schedule to non-request unplug functions. This will allow md/raid to know why the unplug was called and to act accordingly - if !from_schedule it is safe to perform tasks which could themselves schedule. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-07-31 09:08:15 +02:00
NeilBrown	9cbb175088	blk: centralize non-request unplug handling. Both md and umem have similar code for getting notified on a blk_finish_plug event. Centralize this code in block/ and allow each driver to provide its distinctive difference. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-07-31 09:08:14 +02:00
NeilBrown	0021b7bc04	md: remove plug_cnt feature of plugging. This seemed like a good idea at the time, but after further thought I cannot see it making a difference other than very occasionally, and testing to try to exercise the case it is most likely to help did not show any performance difference from removing it. So remove the counting of active plugs and allow 'pending writes' to be activated at any time, not just when no plugs are active. This is only relevant when there is a write-intent bitmap, and the updating of the bitmap will likely introduce enough delay that the single-threading of bitmap updates will be enough to collect large numbers of updates together. Removing this will make it easier to centralise the unplug code, and will clear the way for other unplug enhancements which have a measurable effect. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-07-31 09:08:14 +02:00
NeilBrown	6409bb05a9	md/bitmap: add new 'space' attribute for bitmaps. If we are to allow bitmaps to be resized when the array is resized, we need to know how much space there is. So create an attribute to store this information and set appropriate defaults. It can be set more precisely via sysfs, or future metadata extensions may allow it to be recorded. Signed-off-by: NeilBrown <neilb@suse.de>	2012-05-22 13:55:07 +10:00
NeilBrown	545c87957f	md: dm-raid should call helper function to clear rdev. dm-raid currently open-codes the freeing of some members of an rdev. It is more maintainable to have it call common code from md.c which does this for all call-sites. So rename free_disk_sb to md_rdev_clear, export it, and use it in dm-raid.c Signed-off-by: NeilBrown <neilb@suse.de>	2012-05-22 13:54:30 +10:00
NeilBrown	c6563a8c38	md: add possibility to change data-offset for devices. When reshaping we can avoid costly intermediate backup by changing the 'start' address of the array on the device (if there is enough room). So as a first step, allow such a change to be requested through sysfs, and recorded in v1.x metadata. (As we didn't previously check that all 'pad' fields were zero, we need a new FEATURE flag for this. Also (belatedly) check that all remaining 'pad' fields are zero to avoid a repeat of this) The new data offset must be requested separately for each device. This allows each to have a different change in the data offset. This is not likely to be used often but as data_offset can be set per-device, new_data_offset should be too. This patch also removes the 'acknowledged' arg to rdev_set_badblocks as it is never used and never will be. At the same time we add a new arg ('in_new') which is currently always zero but will be used more soon. When a reshape finishes we will need to update the data_offset and rdev->sectors. So provide an exported function to do that. Signed-off-by: NeilBrown <neilb@suse.de>	2012-05-21 09:27:00 +10:00
NeilBrown	2c810cddc4	md: allow a reshape operation to be reversed. Currently a reshape operation always progresses from the start of the array to the end unless the number of devices is being reduced, in which case it progresses in the opposite direction. To reverse a partial reshape which changes the number of devices you can stop the array and re-assemble with the raid-disks numbers reversed and it will undo. However for a reshape that does not change the number of devices it is not possible to reverse the reshape in the middle - you have to wait until it completes. So add a 'reshape_direction' attribute which is either 'forwards' or 'backwards' and can be explicitly set when delta_disks is zero. This will become more important when we allow the data_offset to change in a reshape. Then the explicit statement of what direction is being used will be more useful. This can be enabled in raid5 trivially as it already supports reverse reshape and just needs to use a different trigger to request it. Signed-off-by: NeilBrown <neilb@suse.de>	2012-05-21 09:27:00 +10:00
NeilBrown	050b66152f	md/raid10: handle merge_bvec_fn in member devices. Currently we don't honour merge_bvec_fn in member devices so if there is one, we force all requests to be single-page at most. This is not ideal. So enhance the raid10 merge_bvec_fn to check that function in children as well. This introduces a small problem. There is no locking around calls to ->merge_bvec_fn and subsequent calls to ->make_request. So a device added between these could end up getting a request which violates its merge_bvec_fn. Currently the best we can do is synchronize_sched(). This will work providing no preemption happens. If there is preemption, we just have to hope that new devices are largely consistent with old devices. Signed-off-by: NeilBrown <neilb@suse.de>	2012-03-19 12:46:39 +11:00
NeilBrown	dafb20fa34	md: tidy up rdev_for_each usage. md.h has an 'rdev_for_each()' macro for iterating the rdevs in an mddev. However it uses the 'safe' version of list_for_each_entry, and so requires the extra variable, but doesn't include 'safe' in the name, which is useful documentation. Consequently some places use this safe version without needing it, and many use an explicit list_for_each_entry. So: - rename rdev_for_each to rdev_for_each_safe - create a new rdev_for_each which uses the plain list_for_each_entry, - use the 'safe' version only where needed, and convert all other list_for_each_entry calls to use rdev_for_each. Signed-off-by: NeilBrown <neilb@suse.de>	2012-03-19 12:46:39 +11:00
NeilBrown	2d78f8c451	md: create externally visible flags for supporting hot-replace. hot-replace is a feature being added to md which will allow a device to be replaced without removing it from the array first. With hot-replace a spare can be activated and recovery can start while the original device is still in place, thus allowing a transition from an unreliable device to a reliable device without leaving the array degraded during the transition. It can also be used when the original device is still reliable but is not wanted for some reason. This will eventually be supported in RAID4/5/6 and RAID10. This patch adds a super-block flag to distinguish the replacement device. If an old kernel sees this flag it will reject the device. It also adds two per-device flags which are viewable and settable via sysfs. "want_replacement" can be set to request that a device be replaced. "replacement" is set to show that this device is replacing another device. The "rd%d" links in /sys/block/mdXx/md only apply to the original device, not the replacement. We currently don't make links for the replacement - there doesn't seem to be a need. Signed-off-by: NeilBrown <neilb@suse.de>	2011-12-23 10:17:51 +11:00
NeilBrown	b8321b68d1	md: change hot_remove_disk to take an rdev rather than a number. Soon an array will be able to have multiple devices with the same raid_disk number (an original and a replacement). So removing a device based on the number won't work. So pass the actual device handle instead. Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-12-23 10:17:51 +11:00
Linus Torvalds	b4fdcb02f1	Merge branch 'for-3.2/core' of git://git.kernel.dk/linux-block * 'for-3.2/core' of git://git.kernel.dk/linux-block: (29 commits) block: don't call blk_drain_queue() if elevator is not up blk-throttle: use queue_is_locked() instead of lockdep_is_held() blk-throttle: Take blkcg->lock while traversing blkcg->policy_list blk-throttle: Free up policy node associated with deleted rule block: warn if tag is greater than real_max_depth. block: make gendisk hold a reference to its queue blk-flush: move the queue kick into blk-flush: fix invalid BUG_ON in blk_insert_flush block: Remove the control of complete cpu from bio. block: fix a typo in the blk-cgroup.h file block: initialize the bounce pool if high memory may be added later block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown block: drop @tsk from attempt_plug_merge() and explain sync rules block: make get_request[_wait]() fail if queue is dead block: reorganize throtl_get_tg() and blk_throtl_bio() block: reorganize queue draining block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg() block: pass around REQ_* flags instead of broken down booleans during request alloc/free block: move blk_throtl prototypes to block/blk.h block: fix genhd refcounting in blkio_policy_parse_and_set() ... Fix up trivial conflicts due to "mddev_t" -> "struct mddev" conversion and making the request functions be of type "void" instead of "int" in - drivers/md/{faulty.c,linear.c,md.c,md.h,multipath.c,raid0.c,raid1.c,raid10.c,raid5.c} - drivers/staging/zram/zram_drv.c	2011-11-04 17:06:58 -07:00
Jens Axboe	5c04b426f2	Merge branch 'v3.1-rc10' into for-3.2/core Conflicts: block/blk-core.c include/linux/blkdev.h Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-10-19 14:30:42 +02:00
NeilBrown	84fc4b56db	md: rename "mdk_personality" to "md_personality" "mdk" doesn't mean anything any more. Signed-off-by: NeilBrown <neilb@suse.de>	2011-10-11 16:49:58 +11:00
NeilBrown	2b8bf3451d	md: remove typedefs: mdk_thread_t -> struct md_thread Signed-off-by: NeilBrown <neilb@suse.de>	2011-10-11 16:48:23 +11:00
NeilBrown	fd01b88c75	md: remove typedefs: mddev_t -> struct mddev Having mddev_t and 'struct mddev_s' is ugly and not preferred Signed-off-by: NeilBrown <neilb@suse.de>	2011-10-11 16:47:53 +11:00
NeilBrown	3cb0300200	md: removing typedefs: mdk_rdev_t -> struct md_rdev The typedefs are just annoying. 'mdk' probably refers to 'md_k.h' which used to be an include file that defined this thing. Signed-off-by: NeilBrown <neilb@suse.de>	2011-10-11 16:45:26 +11:00
Wang Sheng-Hui	7e84152626	trivial: md_k.h should be md.h in the beginning comment of file md.h Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-09-21 15:37:46 +10:00
NeilBrown	01f96c0a99	md: Avoid waking up a thread after it has been freed. Two related problems: 1/ some error paths call "md_unregister_thread(mddev->thread)" without subsequently clearing ->thread. A subsequent call to mddev_unlock will try to wake the thread, and crash. 2/ Most calls to md_wakeup_thread are protected against the thread disappearing either by: - holding the ->mutex - having an active request, so something else must be keeping the array active. However mddev_unlock calls md_wakeup_thread after dropping the mutex and without any certainty of an active request, so the ->thread could theoretically disappear. So we need a spinlock to provide some protection. So change md_unregister_thread to take a pointer to the thread pointer, and ensure that it always does the required locking, and clears the pointer properly. Reported-by: "Moshe Melnikov" <moshe@zadarastorage.com> Signed-off-by: NeilBrown <neilb@suse.de> cc: stable@kernel.org	2011-09-21 15:30:20 +10:00
Christoph Hellwig	5a7bbad27a	block: remove support for bio remapping from ->make_request There is very little benefit in allowing a ->make_request instance to update the bio's device and sector and loop around it in __generic_make_request when we can achieve the same through calling generic_make_request from the driver and letting the loop in generic_make_request handle it. Note that various drivers got the return value from ->make_request and returned non-zero values for errors. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-09-12 12:12:01 +02:00
NeilBrown	de393cdea6	md: make it easier to wait for bad blocks to be acknowledged. It is only safe to choose not to write to a bad block if that bad block is safely recorded in metadata - i.e. if it has been 'acknowledged'. If it hasn't we need to wait for the acknowledgement. We support that using rdev->blocked wait and md_wait_for_blocked_rdev by introducing a new device flag 'BlockedBadBlocks'. This flag is only advisory. It is cleared whenever we acknowledge a bad block, so that a waiter can re-check the particular bad blocks that it is interested in. It should be set by a caller when they find they need to wait. This (set after test) is inherently racy, but as md_wait_for_blocked_rdev already has a timeout, losing the race will have minimal impact. When we clear "Blocked" we also clear "BlockedBadBlocks" in case it was set incorrectly (see above race). We also modify the way we manage 'Blocked' to fit better with the new handling of 'BlockedBadBlocks' and to make it consistent between externally managed and internally managed metadata. This requires that each raidXd loop checks if the metadata needs to be written and triggers a write (md_check_recovery) if needed. Otherwise a queued write request might cause raidXd to wait for the metadata to write, and only that thread can write it. Before writing metadata, we set FaultRecorded for all devices that are Faulty, then after writing the metadata we clear Blocked for any device for which the Fault was certainly Recorded. The 'faulty' device flag now appears in sysfs if the device is faulty or it has unacknowledged bad blocks. So user-space which does not understand bad blocks can continue to function correctly. User space which does, should not assume a device is faulty until it sees the 'faulty' flag, and then sees the list of unacknowledged bad blocks is empty. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:31:48 +10:00
NeilBrown	d7a9d443bc	md: add 'write_error' flag to component devices. If a device has ever seen a write error, we will want to handle known-bad-blocks differently. So create an appropriate state flag and export it via sysfs. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>	2011-07-28 11:31:48 +10:00
NeilBrown	d2eb35acfd	md/raid1: avoid reading from known bad blocks. Now that we have a bad block list, we should not read from those blocks. There are several main parts to this: 1/ read_balance needs to check for bad blocks, and return not only the chosen device, but also how many good blocks are available there. 2/ fix_read_error needs to avoid trying to read from bad blocks. 3/ read submission must be ready to issue multiple reads to different devices as different bad blocks on different devices could mean that a single large read cannot be served by any one device, but can still be served by the array. This requires keeping count of the number of outstanding requests per bio. This count is stored in 'bi_phys_segments' 4/ retrying a read needs to also be ready to submit a smaller read and queue another request for the rest. This does not yet handle bad blocks when reading to perform resync, recovery, or check. 'md_trim_bio' will also be used for RAID10, so put it in md.c and export it. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:31:48 +10:00
NeilBrown	2699b67223	md: load/store badblock list from v1.x metadata Space must have been allocated when array was created. A feature flag is set when the badblock list is non-empty, to ensure old kernels don't load and trust the whole device. We only update the on-disk badblocklist when it has changed. If the badblocklist (or other metadata) is stored on a bad block, we don't cope very well. If metadata has no room for bad block, flag bad-blocks as disabled, and do the same for 0.90 metadata. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:31:47 +10:00
NeilBrown	2230dfe4cc	md: beginnings of bad block management. This is the first step in allowing md to track bad-blocks per-device so that we can fail individual blocks rather than the whole device. This patch just adds a data structure for recording bad blocks, with routines to add, remove, search the list. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>	2011-07-28 11:31:46 +10:00
Jonathan Brassow	3520fa4db7	MD bitmap: Revert DM dirty log hooks Revert most of commit `e384e58549` md/bitmap: prepare for storing write-intent-bitmap via dm-dirty-log. MD should not need to use DM's dirty log - we decided to use md's bitmaps instead. Keeping the DIV_ROUND_UP clean-ups that were part of commit `e384e58549`, however. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-27 11:00:37 +10:00
NeilBrown	5389042ffa	md: change management of recovery_disabled. If we hit a read error while recovering a mirror, we want to abort the recovery without necessarily failing the disk - as having a disk with a read error is better than not having an array at all. Currently this is managed with a per-array flag "recovery_disabled" and is only implemented for RAID1. For RAID10 we will need finer grained control as we might want to disable recovery for individual devices separately. So push more of the decision making into the personality. 'recovery_disabled' is now a 'cookie' which is copied when the personality wants to disable recovery and is changed when a device is added to the array as this is used as a trigger to 'try recovery again'. This will allow RAID10 to get the control that it needs. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-27 11:00:36 +10:00
Namhyung Kim	36fad858a7	md: introduce link/unlink_rdev() helpers There are places where sysfs links to rdev are handled in a same way. Add the helper functions to consolidate them. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-27 11:00:36 +10:00
Jonathan Brassow	9c81075f43	MD: support initial bitmap creation in-kernel Add bitmap support to the device-mapper specific metadata area. This patch allows the creation of the bitmap metadata area upon initial array creation via device-mapper. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-06-09 11:41:36 +10:00
Jonathan Brassow	076f968b37	MD: add sync_super to mddev_t struct Add the 'sync_super' function pointer to MD array structure (struct mddev_s) If device-mapper (dm-raid.c) is to define its own on-disk superblock and be able to load it, there must still be a way for MD to initiate superblock updates. The simplest way to make this happen is to provide a pointer in the MD array structure that can be set by device-mapper (or other module) with a function to do this. If the function has been set, it will be used; otherwise, the method will be looked up via 'super_types' as usual. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-06-08 15:11:31 +10:00
NeilBrown	97658cdd3a	md: provide generic support for handling unplug callbacks. When an md device adds a request to a queue, it can call mddev_check_plugged. If this succeeds then we know that the md thread will be woken up shortly, and ->plug_cnt will be non-zero until then, so some processing can be delayed. If it fails, then no unplug callback is expected and the make_request function needs to do whatever is required to make the request happen. Signed-off-by: NeilBrown <neilb@suse.de>	2011-04-18 18:25:42 +10:00
NeilBrown	482c083492	md - remove old plugging code. md has some plugging infrastructure for RAID5 to use because the normal plugging infrastructure required a 'request_queue', and when called from dm, RAID5 doesn't have one of those available. This relied on the ->unplug_fn callback which doesn't exist any more. So remove all of that code, both in md and raid5. Subsequent patches will restore the plugging functionality. Signed-off-by: NeilBrown <neilb@suse.de>	2011-04-18 18:25:42 +10:00
Lucas De Marchi	25985edced	Fix common misspellings Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>	2011-03-31 11:26:23 -03:00
NeilBrown	f0b4f7e2f2	md: Fix - again - partition detection when array becomes active Revert `b821eaa572` and `f3b99be19d` When I wrote the first of these I had a wrong idea about the lifetime of 'struct block_device'. It can disappear at any time that the block device is not open if it falls out of the inode cache. So relying on the 'size' recorded with it to detect when the device size has changed and so we need to revalidate, is wrong. Rather, we really do need the 'changed' attribute stored directly in the mddev and set/tested as appropriate. Without this patch, a sequence of: mknod / open / close / unlink (which can cause a block_device to be created and then destroyed) will result in a rescan of the partition table and consequent removal and addition of partitions. Several of these in a row can get udev racing to create and unlink and other code can get confused. With the patch, the rescan is only performed when needed and so there are no races. This is suitable for any stable kernel from 2.6.35. Reported-by: "Wojcik, Krzysztof" <krzysztof.wojcik@intel.com> Signed-off-by: NeilBrown <neilb@suse.de> Cc: stable@kernel.org	2011-02-24 17:26:41 +11:00
NeilBrown	f21e9ff7f7	md: Remove the AllReserved flag for component devices. This flag is not needed and is used badly. Devices that are included in a native-metadata array are reserved exclusively for that array - and currently have AllReserved set. They all are bd_claimed for the rdev and so cannot be shared. Devices that are included in external-metadata arrays can be shared among multiple arrays - providing there is no overlap. These are bd_claimed for md in general - not for a particular rdev. When changing the amount of a device that is used in an array we need to check for overlap. This currently includes a check on AllReserved. So even without overlap, sharing with an AllReserved device is not allowed. However the bd_claim usage already precludes sharing with these devices, so the test on AllReserved is not needed. And in fact it is wrong. As this is the only use of AllReserved, simply remove all usage and definition of AllReserved. Signed-off-by: NeilBrown <neilb@suse.de>	2011-01-31 12:10:09 +11:00
Jonathan Brassow	a6ff7e089c	md: separate meta and data devs Allow the metadata to be on a separate device from the data. This doesn't mean the data and metadata will be on separate physical devices - it simply gives device-mapper and userspace tools more flexibility. Signed-off-by: NeilBrown <neilb@suse.de>	2011-01-14 09:14:34 +11:00
Jonathan Brassow	ccebd4c415	md-new-param-to_sync_page_io Add new parameter to 'sync_page_io'. The new parameter allows us to distinguish between metadata and data operations. This becomes important later when we add the ability to use separate devices for data and metadata. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>	2011-01-14 09:14:33 +11:00
NeilBrown	0ca69886a8	md: Ensure no IO request to get md device before it is properly initialised. When an md device is in the process of coming on line it is possible for an IO request (typically a partition table probe) to get through before the array is fully initialised, which can cause unexpected behaviour (e.g. a crash). So explicitly record when the array is ready for IO and don't allow IO through until then. There is no possibility for a similar problem when the array is going off-line as there must only be one 'open' at that time, and it is busy off-lining the array and so cannot send IO requests. So no memory barrier is needed in md_stop() This has been a bug since commit `409c57f380` in 2.6.30 which introduced md_make_request. Before then, each personality would register its own make_request_fn when it was ready. This is suitable for any stable kernel from 2.6.30.y onwards. Cc: <stable@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Reported-by: "Hawrylewicz Czarnowski, Przemyslaw" <przemyslaw.hawrylewicz.czarnowski@intel.com>	2011-01-14 09:14:33 +11:00
NeilBrown	a167f66324	md: use separate bio pool for each md device. bio_clone and bio_alloc allocate from a common bio pool. If an md device is stacked with other devices that use this pool, or under something like swap which uses the pool, then the multiple calls on the pool can cause deadlocks. So allocate a local bio pool for each md array and use that rather than the common pool. This pool is used both for regular IO and metadata updates. Signed-off-by: NeilBrown <neilb@suse.de>	2010-10-28 17:36:15 +11:00
NeilBrown	2b193363ef	md: change type of first arg to sync_page_io. Currently sync_page_io takes a 'bdev'. Every caller passes 'rdev->bdev'. We will soon want another field out of the rdev in sync_page_io, so just pass the rdev instead of the bdev. Signed-off-by: NeilBrown <neilb@suse.de>	2010-10-28 17:36:11 +11:00
Jens Axboe	fa251f8990	Merge branch 'v2.6.36-rc8' into for-2.6.37/barrier Conflicts: block/blk-core.c drivers/block/loop.c mm/swapfile.c Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-10-19 09:13:04 +02:00
Tejun Heo	e9c7469bb4	md: implement REQ_FLUSH/FUA support This patch converts md to support REQ_FLUSH/FUA instead of now deprecated REQ_HARDBARRIER. In the core part (md.c), the following changes are notable. * Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA don't interfere with processing of other requests and thus there is no reason to mark the queue congested while FLUSH/FUA is in progress. * REQ_FLUSH/FUA failures are final and its users don't need retry logic. Retry logic is removed. * Preflush needs to be issued to all member devices but FUA writes can be handled the same way as other writes - their processing can be deferred to request_queue of member devices. md_barrier_request() is renamed to md_flush_request() and simplified accordingly. For linear, raid0 and multipath, the core changes are enough. raid1, 5 and 10 need the following conversions. * raid1: Handling of FLUSH/FUA bio's can simply be deferred to request_queues of member devices. Barrier related logic removed. * raid5: Queue draining logic dropped. FUA bit is propagated through biodrain and stripe reconstruction such that all the updated parts of the stripe are written out with FUA writes if any of the dirtying writes was FUA. preread_active_stripes handling in make_request() is updated as suggested by Neil Brown. * raid10: FUA bit needs to be propagated to write clones. linear, raid0, 1, 5 and 10 tested. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Neil Brown <neilb@suse.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-09-10 12:35:38 +02:00
NeilBrown	070dc6dd71	md: resolve confusion of MD_CHANGE_CLEAN MD_CHANGE_CLEAN is used for two different purposes and this leads to confusion. One of the purposes is largely mirrored by MD_CHANGE_PENDING which is not used for anything else, so have MD_CHANGE_PENDING take over that purpose fully. The two purposes are: 1/ tell md_update_sb that an update is needed and that it is just a clean/dirty transition. 2/ tell user-space that a transition from clean to dirty is pending (something wants to write), and tell the kernel (by clearing the flag) that the transition is OK. The first purpose remains with MD_CHANGE_CLEAN, the second is moved fully to MD_CHANGE_PENDING. This means that various places which conditionally set or cleared MD_CHANGE_CLEAN no longer need to be conditional. Signed-off-by: NeilBrown <neilb@suse.de>	2010-08-30 18:06:21 +10:00
Linus Torvalds	3d30701b58	Merge branch 'for-linus' of git://neil.brown.name/md * 'for-linus' of git://neil.brown.name/md: (24 commits) md: clean up do_md_stop md: fix another deadlock with removing sysfs attributes. md: move revalidate_disk() back outside open_mutex md/raid10: fix deadlock with unaligned read during resync md/bitmap: separate out loading a bitmap from initialising the structures. md/bitmap: prepare for storing write-intent-bitmap via dm-dirty-log. md/bitmap: optimise scanning of empty bitmaps. md/bitmap: clean up plugging calls. md/bitmap: reduce dependence on sysfs. md/bitmap: white space clean up and similar. md/raid5: export raid5 unplugging interface. md/plug: optionally use plugger to unplug an array during resync/recovery. md/raid5: add simple plugging infrastructure. md/raid5: export is_congested test raid5: Don't set read-ahead when there is no queue md: add support for raising dm events. md: export various start/stop interfaces md: split out md_rdev_init md: be more careful setting MD_CHANGE_CLEAN md/raid5: ensure we create a unique name for kmem_cache when mddev has no gendisk ...	2010-08-10 15:38:19 -07:00
NeilBrown	bb4f1e9d0e	md: fix another deadlock with removing sysfs attributes. Move the deletion of sysfs attributes from reconfig_mutex to open_mutex didn't really help as a process can try to take open_mutex while holding reconfig_mutex, so the same deadlock can happen, just requiring one more process to be involved in the chain. I looks like I cannot easily use locking to wait for the sysfs deletion to complete, so don't. The only things that we cannot do while the deletions are still pending is other things which can change the sysfs namespace: run, takeover, stop. Each of these can fail with -EBUSY. So set a flag while doing a sysfs deletion, and fail run, takeover, stop if that flag is set. This is suitable for 2.6.35.x Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2010-08-08 21:21:27 +10:00
Christoph Hellwig	7b6d91daee	block: unify flags for struct bio and struct request Remove the current bio flags and reuse the request flags for the bio, too. This makes it easier to trace the type of I/O from the filesystem down to the block driver. There were two flags in the bio that were missing in the requests: BIO_RW_UNPLUG and BIO_RW_AHEAD. Also I've renamed two request flags that had a superfluous RW in them. Note that the flags are in bio.h despite having the REQ_ name - as blkdev.h includes bio.h that is the only way to go for now. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-08-07 18:20:39 +02:00
NeilBrown	e384e58549	md/bitmap: prepare for storing write-intent-bitmap via dm-dirty-log. This allows md/raid5 to fully work as a dm target. Normally md uses a 'filemap' which contains a list of pages of bits each of which may be written separately. dm-log uses an all-or-nothing approach to writing the log, so when using a dm-log, ->filemap is NULL and the flags normally stored in filemap_attr are stored in ->logattrs instead. Signed-off-by: NeilBrown <neilb@suse.de>	2010-07-26 13:21:34 +10:00
NeilBrown	b63d7c2e29	md/bitmap: clean up plugging calls. 1/ use md_unplug in bitmap.c as we will soon be using bitmaps under arrays with no queue attached. 2/ Don't bother plugging the queue when we set a bit in the bitmap. The reason for this was to encourage as many bits as possible to get set before we unplug and write stuff out. However every personality already plugs the queue after bitmap_startwrite either directly (raid1/raid10) or by setting STRIPE_BIT_DELAY which causes the queue to be plugged later (raid5). Signed-off-by: NeilBrown <neilb@suse.de>	2010-07-26 13:21:32 +10:00
NeilBrown	ac2f40be46	md/bitmap: white space clean up and similar. Fixes some whitespace problems Fixed some checkpatch.pl complaints. Replaced kmalloc ... memset(0), with kzalloc Fixed an unlikely memory leak on an error path. Reformatted a number of 'if/else' sets, sometimes replacing goto with an else clause. Removed some old comments and commented-out code. Signed-off-by: NeilBrown <neilb@suse.de>	2010-07-26 13:07:22 +10:00
NeilBrown	252ac5221a	md/plug: optionally use plugger to unplug an array during resync/recovery. If an array doesn't have a 'queue' then md_do_sync cannot unplug it. In that case it will have a 'plugger', so make that available to the mddev, and use it to unplug the array if needed. Signed-off-by: NeilBrown <neilb@suse.de>	2010-07-26 12:53:08 +10:00
NeilBrown	2ac8740151	md/raid5: add simple plugging infrastructure. md/raid5 uses the plugging infrastructure provided by the block layer and 'struct request_queue'. However when we plug raid5 under dm there is no request queue so we cannot use that. So create a similar infrastructure that is much lighter weight and use it for raid5. Signed-off-by: NeilBrown <neilb@suse.de>	2010-07-26 12:53:08 +10:00
NeilBrown	768a418db1	md: add support for raising dm events. dm uses scheduled work to raise events to user-space. So allow md device to have work_structs and schedule them on an error. Signed-off-by: NeilBrown <neilb@suse.de>	2010-07-26 12:52:27 +10:00
NeilBrown	390ee602a1	md: export various start/stop interfaces export entry points for starting and stopping md arrays. This will be used by a module to make md/raid5 work under dm. Also stop calling md_stop_writes from md_stop, as that won't work well with dm - it will want to call the two separately. Signed-off-by: NeilBrown <neilb@suse.de>	2010-07-26 12:52:27 +10:00
NeilBrown	e8bb9a839a	md: split out md_rdev_init This functionality will be needed separately in a subsequent patch, so split it into its own exported function. Signed-off-by: NeilBrown <neilb@suse.de>	2010-07-26 12:52:27 +10:00
NeilBrown	00bcb4ac7e	md: reduce dependence on sysfs. We will want md devices to live as dm targets where sysfs is not visible. So allow md to not connect to sysfs. Signed-off-by: NeilBrown <neilb@suse.de>	2010-07-21 13:27:53 +10:00
NeilBrown	e93f68a1fc	md: fix handling of array level takeover that re-arranges devices. Most array level changes leave the list of devices largely unchanged, possibly causing one at the end to become redundant. However conversions between RAID0 and RAID10 need to renumber all devices (except 0). This renumbering is currently being done in the ->run method when the new personality takes over. However this is too late as the common code in md.c might already have invalidated some of the devices if they had a ->raid_disk number that appeared too high. Moving it into the ->takeover method is too early as the array is still active at that time and wrong ->raid_disk numbers could cause confusion. So add a ->new_raid_disk field to mdk_rdev_s and use it to communicate the new raid_disk number. Now the common code knows exactly which devices need to be renumbered, and which can be invalidated, and can do it all at a convenient time when the array is suspended. It can also update some symlinks in sysfs which previously were not being updated correctly. Reported-by: Maciej Trela <maciej.trela@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2010-06-24 13:33:24 +10:00
NeilBrown	a8707c08f4	md: simplify updating of event count to sometimes avoid updating spares. When updating the event count for a simple clean <-> dirty transition, we try to avoid updating the spares so they can safely spin-down. As the event_counts across an array must be +/- 1, this means decrementing the event_count on a dirty->clean transition. This is not always safe and we have to avoid the unsafe time. We currently do this with a misguided idea about it being safe or not depending on whether the event_count is odd or even. This approach only works reliably in a few common instances, but easily falls down. So instead, simply keep internal state concerning whether it is safe or not, and always assume it is not safe when an array is first assembled. Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:28:01 +10:00
NeilBrown	21a52c6d05	md: pass mddev to make_request functions rather than request_queue We used to pass the personality make_request function direct to the block layer so the first argument had to be a queue. But now we have the intermediary md_make_request so it makes a lot more sense to pass a struct mddev_s. It makes it possible to have an mddev without its own queue too. Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:27:55 +10:00
NeilBrown	b821eaa572	md: remove ->changed and related code. We set ->changed to 1 and call check_disk_change at the end of md_open so that bd_invalidated would be set and thus partition rescan would happen appropriately. Now that we call revalidate_disk directly, which sets bd_invalidated, that indirection is no longer needed and can be removed. Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:27:53 +10:00
NeilBrown	c0cc75f84e	md: discard StateChanged device flag. This was needed when sysfs files could only be 'notified' from process context. Now that we have sys_notify_direct, we can call it directly from an interrupt. Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:27:47 +10:00
NeilBrown	ee8b81b03d	md: remove some dead fields from mddev_s These fields have never been used. commit `4b6d287f62` added them, but also added identical fields to bitmap_super_s, and only used the latter. So remove these unused fields. Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:27:45 +10:00
NeilBrown	a64c876fd3	md: manage redundancy group in sysfs when changing level. Some levels expect the 'redundancy group' to be present, others don't. So when we change level of an array we might need to add or remove this group. This requires fixing up the current practice of overloading ->private to indicate (when ->pers == NULL) that something needs to be removed. So create a new ->to_remove to fill that role. When changing levels, we may need to add or remove attributes. When changing RAID5 -> RAID6, we both add and remove the same thing. It is important to catch this and optimise it out as the removal is delayed until a lock is released, so trying to add immediately would cause problems. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-17 14:45:40 +10:00
Robert Becker	1e50915fe0	raid: improve MD/raid10 handling of correctable read errors. We've noticed severe lasting performance degradation of our raid arrays when we have drives that yield large amounts of media errors. The raid10 module will queue each failed read for retry, and also will attempt to call fix_read_error() to perform the read recovery. Read recovery is performed while the array is frozen, so repeated recovery attempts can degrade the performance of the array for extended periods of time. With this patch I propose adding a per md device max number of corrected read attempts. Each rdev will maintain a count of read correction attempts in the rdev->read_errors field (not used currently for raid10). When we enter fix_read_error() we'll check to see when the last read error occurred, and divide the read error count by 2 for every hour since the last read error. If at that point our read error count exceeds the read error threshold, we'll fail the raid device. In addition in this patch I add sysfs nodes (get/set) for the per md max_read_errors attribute, the rdev->read_errors attribute, and added some printk's to indicate when fix_read_error fails to repair an rdev. For testing I used debugfs->fail_make_request to inject IO errors to the rdev while doing IO to the raid array. Signed-off-by: Robert Becker <Rob.Becker@riverbed.com> Signed-off-by: NeilBrown <neilb@suse.de>	2009-12-14 12:51:41 +11:00
NeilBrown	ece5cff0da	md: Support write-intent bitmaps with externally managed metadata. In this case, the metadata needs to not be in the same sector as the bitmap. md will not read/write any bitmap metadata. Config must be done via sysfs and when a recovery makes the array non-degraded again, writing 'true' to 'bitmap/can_clear' will allow bits in the bitmap to be cleared again. Signed-off-by: NeilBrown <neilb@suse.de>	2009-12-14 12:51:41 +11:00
NeilBrown	43a705076e	md: support updating bitmap parameters via sysfs. A new attribute directory 'bitmap' in 'md' is created which contains files for configuring the bitmap. 'location' identifies where the bitmap is, either 'none', or 'file' or 'sector offset from metadata'. Writing 'location' can create or remove a bitmap. Adding a 'file' bitmap this way is not yet supported. 'chunksize' and 'time_base' must be set before 'location' can be set. 'chunksize' can be set before creating a bitmap, but is currently always over-ridden by the bitmap superblock. 'time_base' and 'backlog' can be updated at any time. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Andre Noll <maan@systemlinux.org>	2009-12-14 12:51:41 +11:00
NeilBrown	72e02075a3	md: factor out parsing of fixed-point numbers safe_delay_store can parse fixed point numbers (for fractions of a second). We will want to do that for another sysfs file soon, so factor out the code. Signed-off-by: NeilBrown <neilb@suse.de>	2009-12-14 12:51:41 +11:00
NeilBrown	f6af949c56	md: support bitmap offset appropriate for external-metadata arrays. For md arrays where metadata is managed externally, the kernel does not know about a superblock so the superblock offset is 0. If we want to have a write-intent-bitmap near the end of the devices of such an array, we should support sector_t sized offset. We need the offset to be possibly negative for when the bitmap is before the metadata, so use loff_t instead. Also add sanity check that bitmap does not overlap with data. Signed-off-by: NeilBrown <neilb@suse.de>	2009-12-14 12:51:41 +11:00
NeilBrown	42a04b5078	md: move offset, daemon_sleep and chunksize out of bitmap structure ... and into bitmap_info. These are all configuration parameters that need to be set before the bitmap is created. Signed-off-by: NeilBrown <neilb@suse.de>	2009-12-14 12:51:41 +11:00
NeilBrown	c3d9714e88	md: collect bitmap-specific fields into one structure. In preparation for making bitmap fields configurable via sysfs, start tidying up by making a single structure to contain the configuration fields. Signed-off-by: NeilBrown <neilb@suse.de>	2009-12-14 12:51:41 +11:00
NeilBrown	a2826aa92e	md: support barrier requests on all personalities. Previously barriers were only supported on RAID1. This is because other levels require synchronisation across all devices and so needed a different approach. Here is that approach. When a barrier arrives, we send a zero-length barrier to every active device. When that completes - and if the original request was not empty - we submit the barrier request itself (with the barrier flag cleared) and then submit a fresh load of zero length barriers. The barrier request itself is asynchronous, but any subsequent request will block until the barrier completes. The reason for clearing the barrier flag is that a barrier request is allowed to fail. If we pass a non-empty barrier through a striping raid level it is conceivable that part of it could succeed and part could fail. That would be way too hard to deal with. So if the first run of zero length barriers succeed, we assume all is sufficiently well that we send the request and ignore errors in the second run of barriers. RAID5 needs extra care as write requests may not have been submitted to the underlying devices yet. So we flush the stripe cache before proceeding with the barrier. Note that the second set of zero-length barriers are submitted immediately after the original request is submitted. Thus when a personality finds mddev->barrier to be set during make_request, it should not return from make_request until the corresponding per-device request(s) have been queued. That will be done in later patches. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Andre Noll <maan@systemlinux.org>	2009-12-14 12:49:49 +11:00
NeilBrown	aa5cbd1038	md/bitmap: protect against bitmap removal while being updated. A write intent bitmap can be removed from an array while the array is active. When this happens, all IO is suspended and flushed before the bitmap is removed. However it is possible that bitmap_daemon_work is still running to clear old bits from the bitmap. If it is, it can dereference the bitmap after it has been freed. So introduce a new mutex to protect bitmap_daemon_work and get it before destroying a bitmap. This is suitable for any current -stable kernel. Signed-off-by: NeilBrown <neilb@suse.de> Cc: stable@kernel.org	2009-12-14 12:49:46 +11:00
NeilBrown	3fa841d7e7	md: report device as congested when suspended This should stop writeback from coming when the device is temporarily suspended. Signed-off-by: NeilBrown <neilb@suse.de>	2009-09-23 18:10:29 +10:00
Anand Gadiyar	411c940385	trivial: fix typo "for for" in multiple files trivial: fix typo "for for" in multiple files Signed-off-by: Anand Gadiyar <gadiyar@ti.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2009-09-21 15:14:54 +02:00
NeilBrown	c8c00a6915	Remove deadlock potential in md_open A recent commit: commit `449aad3e25` introduced the possibility of an A-B/B-A deadlock between bd_mutex and reconfig_mutex. __blkdev_get holds bd_mutex while calling md_open which takes reconfig_mutex, do_md_run is always called with reconfig_mutex held, and it now takes bd_mutex in the call to revalidate_disk. This potential deadlock was not caught by lockdep due to the use of mutex_lock_interruptible_nested which was introduced by commit `d63a5a74de` to avoid a warning of an impossible deadlock. It is quite possible to split reconfig_mutex into two locks. One protects the array data structures while it is being reconfigured, the other ensures that an array is never even partially open while it is being deactivated. In particular, the second lock prevents an open from completing between the time when do_md_stop checks if there are any active opens, and the time when the array is either set read-only, or when ->pers is set to NULL. So we can be certain that no IO is in flight as the array is being destroyed. So create a new lock, open_mutex, just to ensure exclusion between 'open' and 'stop'. This avoids the deadlock and also avoids the lockdep warning mentioned in commit `d63a5a74d` Reported-by: "Mike Snitzer" <snitzer@gmail.com> Reported-by: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: NeilBrown <neilb@suse.de>	2009-08-10 12:50:52 +10:00
Andre Noll	ac5e7113e7	md: Push down data integrity code to personalities. This patch replaces md_integrity_check() by two new public functions: md_integrity_register() and md_integrity_add_rdev() which are both personality-independent. md_integrity_register() is called from the ->run and ->hot_remove methods of all personalities that support data integrity. The function iterates over the component devices of the array and determines if all active devices are integrity capable and if their profiles match. If this is the case, the common profile is registered for the mddev via blk_integrity_register(). The second new function, md_integrity_add_rdev() is called from the ->hot_add_disk methods, i.e. whenever a new device is being added to a raid array. If the new device does not support data integrity, or has a profile different from the one already registered, data integrity for the mddev is disabled. For raid0 and linear, only the call to md_integrity_register() from the ->run method is necessary. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-08-03 10:59:47 +10:00
Andre Noll	0894cc3066	md: Move check for bitmap presence to personality code. If the superblock of a component device indicates the presence of a bitmap but the corresponding raid personality does not support bitmaps (raid0, linear, multipath, faulty), then something is seriously wrong and we'd better refuse to run such an array. Currently, this check is performed while the superblocks are examined, i.e. before entering personality code. Therefore the generic md layer must know which raid levels support bitmaps and which do not. This patch avoids this layer violation without adding identical code to various personalities. This is accomplished by introducing a new public function to md.c, md_check_no_bitmap(), which replaces the hard-coded checks in the superblock loading functions. A call to md_check_no_bitmap() is added to the ->run method of each personality which does not support bitmaps and assembly is aborted if at least one component device contains a bitmap. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-18 08:49:23 +10:00
NeilBrown	8190e754e0	md: remove chunksize rounding from common code. It is easiest to round sizes to multiples of chunk size in the personality code for those personalities which care. Those personalities now do the rounding, so we can remove that function from common code. Also remove the upper bound on the size of a chunk, and the lower bound on the size of a device (1 chunk), neither of which really buy us anything. Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-18 08:48:58 +10:00
NeilBrown	50ac168a6e	md: merge reconfig and check_reshape methods. The difference between these two methods is artificial. Both check that a pending reshape is valid, and perform any aspect of it that can be done immediately. 'reconfig' handles chunk size and layout. 'check_reshape' handles raid_disks. So make them just one method. Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-18 08:47:55 +10:00
NeilBrown	597a711b69	md: remove unnecessary arguments from ->reconfig method. Passing the new layout and chunksize as args is not necessary as the mddev has fields for new_chunk and new_layout. This is preparation for combining the check_reshape and reconfig methods. Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-18 08:47:42 +10:00
Andre Noll	664e7c413f	md: Convert mddev->new_chunk to sectors. A straight-forward conversion which gets rid of some multiplications/divisions/shifts. The patch also introduces a couple of new ones, most of which are due to conf->chunk_size still being represented in bytes. This will be cleaned up in subsequent patches. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-18 08:45:27 +10:00
Andre Noll	9d8f036362	md: Make mddev->chunk_size sector-based. This patch renames the chunk_size field to chunk_sectors with the implied change of semantics. Since is_power_of_2(chunk_size) = is_power_of_2(chunk_sectors << 9) = is_power_of_2(chunk_sectors) these bits don't need an adjustment for the shift. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-18 08:45:01 +10:00
Christoph Hellwig	63fe08177f	md: tiny md.h cleanups - update inclusion guard and make sure it covers the whole file - remove superfluous #ifdef CONFIG_BLOCK - make sure all required headers are included so that new users aren't required to include others before Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.de>	2009-04-14 12:01:53 +10:00
NeilBrown	cea9c22800	md: add explicit method to signal the end of a reshape. Currently raid5 (the only module that supports restriping) notices that the reshape has finished by sync_request being given a large value, and handles any cleanup then. This patch changes it so md_check_recovery calls into an explicit finish_reshape method as well. The clean-up from sync_request can do things that need to be done promptly, typically things local to the raid5_conf_t structure. The "finish_reshape" method is called under the mddev_lock so it can do things involving reconfiguring the device. This allows us to get rid of md_set_array_sectors_locked, which would have caused a deadlock if you tried to stop an array while a reshape was happening. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 15:15:05 +11:00
Dan Williams	b522adcde9	md: 'array_size' sysfs attribute Allow userspace to set the size of the array according to the following semantics: 1/ size must be <= to the size returned by mddev->pers->size(mddev, 0, 0) a) If size is set before the array is running, do_md_run will fail if size is greater than the default size b) A reshape attempt that reduces the default size to less than the set array size should be blocked 2/ once userspace sets the size the kernel will not change it 3/ writing 'default' to this attribute returns control of the size to the kernel and reverts to the size reported by the personality Also, convert locations that need to know the default size from directly reading ->array_sectors to <pers>_size. Resync/reshape operations always follow the default size. Finally, fixup other locations that read a number of 1k-blocks from userspace to use strict_blocks_to_sectors() which checks for unsigned long long to sector_t overflow and blocks to sectors overflow. Reviewed-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2009-03-31 15:00:31 +11:00
Dan Williams	1f403624bd	md: centralize ->array_sectors modifications Get personalities out of the business of directly modifying ->array_sectors. Lays groundwork to introduce policy on when ->array_sectors can be modified. Reviewed-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2009-03-31 14:59:03 +11:00
Dan Williams	80c3a6ce4b	md: add 'size' as a personality method In preparation for giving userspace control over ->array_sectors we need to be able to retrieve the 'default' size, and the 'anticipated' size when a reshape is requested. For personalities that do not reshape emit a warning if anything but the default size is requested. In the raid5 case we need to update ->previous_raid_disks to make the new 'default' size available. Reviewed-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2009-03-31 14:57:49 +11:00
NeilBrown	245f46c2c2	md: add ->takeover method to support changing the personality managing an array Implement this for RAID6 to be able to 'takeover' a RAID5 array. The new RAID6 will use a layout which places Q on the last device, and that device will be missing. If there are any available spares, one will immediately have Q recovered onto it. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:39:39 +11:00
NeilBrown	409c57f380	md: enable suspend/resume of md devices. To be able to change the 'level' of an md/raid array, we need to suspend the device so that no requests are active - then move some pointers around etc. The code already keeps counts of active requests and the ->quiesce function can be used to wait until those counts hit zero. However the quiesce function blocks new requests once they are already 'inside' the personality module, and that is too late if we want to replace the personality modules. So make all md requests come in through a common md_make_request function that keeps track of how many requests have entered the modules but may not yet be on the internal reference counts. Allow md_make_request to be blocked when we want to suspend the device, and make it possible to wait for all those in-transit requests to be added to internal lists so that ->quiesce can wait for them. There is still a problem that when a request completes, we drop the ref count inside the personality code so there is a short time between when the refcount hits zero, and when the personality code is no longer being used. The personality code never blocks (schedule or spinlock) between dropping the refcount and exiting the routine, so this should be safe (as put_module calls synchronize_sched() before unmapping the module code). Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:39:39 +11:00
Andre Noll	dd8ac336c1	md: Represent raid device size in sectors. This patch renames the "size" field of struct mdk_rdev_s to "sectors" and changes this field to store sectors instead of blocks. All users of this field, linear.c, raid0.c and md.c, are fixed up accordingly which gets rid of many multiplications and divisions. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:33:13 +11:00
Andre Noll	58c0fed400	md: Make mddev->size sector-based. This patch renames the "size" field of struct mddev_s to "dev_sectors" and stores the number of 512-byte sectors instead of the number of 1K-blocks in it. All users of that field, including raid levels 1,4-6,10, are adjusted accordingly. This simplifies the code a bit because it allows us to get rid of a couple of divisions/multiplications by two. In order to make checkpatch happy, some minor coding style issues have also been addressed. In particular, size_store() now uses strict_strtoull() instead of simple_strtoull(). Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:33:13 +11:00
NeilBrown	97e4f42d62	md: occasionally checkpoint drive recovery to reduce duplicate effort after a crash Version 1.x metadata has the ability to record the status of a partially completed drive recovery. However we only update that record on a clean shutdown. It would be nice to update it on unclean shutdowns too, particularly when using a bitmap that removes much of the 'sync' effort after an unclean shutdown. One complication with checkpointing recovery is that we only know where we are up to in terms of IO requests started, not which ones have completed. And we need to know what has completed to record how much is recovered. So occasionally pause the recovery until all submitted requests are completed, then update the record of where we are up to. When we have a bitmap, we already do that pause occasionally to keep the bitmap up-to-date. So enhance that code to record the recovery offset and schedule a superblock update. And when there is no bitmap, just pause 16 times during the resync to do a checkpoint. '16' is a fairly arbitrary number. But we don't really have any good way to judge how often is acceptable, and it seems like a reasonable number for now. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:33:13 +11:00
NeilBrown	43b2e5d86d	md: move md_k.h from include/linux/raid/ to drivers/md/ It really is nicer to keep related code together.. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:33:13 +11:00
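
The read-error decay described in commit 1e50915fe0 above (halve the per-device count for every hour since the last read error, then compare it with the per-array threshold) can be sketched in plain userspace C. This is only an illustration of the arithmetic in the commit message, not the kernel code: the function names `decay_read_errors` and `should_fail_device` and their parameters are hypothetical, and time is passed in as a plain number of seconds.

```c
#include <stdbool.h>

/*
 * Hypothetical helper (not the kernel function): halve the read-error
 * count once for every full hour since the last read error.
 */
static unsigned int decay_read_errors(unsigned int read_errors,
                                      unsigned long seconds_since_last_error)
{
    unsigned long hours = seconds_since_last_error / 3600;

    /* Shifting by the full width of the type is undefined in C, so clamp. */
    if (hours >= sizeof(read_errors) * 8)
        return 0;

    return read_errors >> hours;
}

/*
 * Hypothetical helper: the device is failed only if the decayed count
 * still exceeds the threshold (max_read_errors in the commit message).
 */
static bool should_fail_device(unsigned int read_errors,
                               unsigned long seconds_since_last_error,
                               unsigned int max_read_errors)
{
    return decay_read_errors(read_errors, seconds_since_last_error)
           > max_read_errors;
}
```

With a threshold of 20, a device that has accumulated 40 errors is failed if the next error arrives within the hour, but not if it arrives after a quiet hour, since the count has then decayed to 20 and no longer exceeds the threshold.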

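
Commit 9d8f036362 above relies on the identity is_power_of_2(chunk_size) = is_power_of_2(chunk_sectors << 9). A minimal userspace sketch of why this holds follows; `is_power_of_2` here is a local re-implementation for illustration, and `sectors_to_bytes` is a hypothetical helper name. The identity assumes the shift does not overflow the type.

```c
#include <stdbool.h>

/* Local illustration: true when exactly one bit is set in n. */
static bool is_power_of_2(unsigned long n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Hypothetical helper: convert 512-byte sectors to bytes. */
static unsigned long sectors_to_bytes(unsigned long sectors)
{
    /*
     * A left shift moves every set bit up by 9 positions without
     * changing how many bits are set (as long as nothing overflows),
     * so the power-of-2 property is preserved in both directions.
     */
    return sectors << 9;
}
```

For example, 128 sectors (64 KiB) is a power of two in either unit, while 96 sectors (48 KiB) is not in either unit.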
284 Commits