linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-05 19:41:54 +00:00

Author	SHA1	Message	Date
Mikulas Patocka	374bf7e7f6	dm: stripe support flush Flush support for the stripe target. This sets ti->num_flush_requests to the number of stripes and remaps individual flush requests to the appropriate stripe devices. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:22 +01:00
Mikulas Patocka	433bcac564	dm: linear support flush Flush support for the linear target. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:22 +01:00
Mikulas Patocka	52b1fd5a27	dm: send empty barriers to targets in dm_flush Pass empty barrier flushes to the targets in dm_flush(). Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:21 +01:00
Alasdair G Kergon	9015df24a8	dm: initialise tio in alloc_tio Move repeated dm_target_io initialisation inside alloc_tio(). Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:21 +01:00
Mikulas Patocka	f9ab94cee3	dm: introduce num_flush_requests Introduce num_flush_requests for a target to set to say how many flush instructions (empty barriers) it wants to receive. These are sent by __clone_and_map_empty_barrier with map_info->flush_request going from 0 to (num_flush_requests - 1). Old targets without flush support won't receive any flush requests. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:20 +01:00
Mikulas Patocka	27eaa14975	dm: remove check that prevents mapping empty bios Remove the check that the size of the cloned bio is not zero because a subsequent patch needs to send zero-sized barriers down this path. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:20 +01:00
Mikulas Patocka	fdb9572b73	dm: remove EOPNOTSUPP for barriers If the underlying device doesn't support barriers and dm receives a barrier, it waits until all requests on that device drain so it no longer needs to report -EOPNOTSUPP to the caller. This patch deals with the confusing situation when moving a volume from one physical device to another triggers an EOPNOTSUPP on a volume that didn't report it before. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:19 +01:00
Mikulas Patocka	5aa2781d96	dm: store only first barrier error With the following patches, more than one error can occur during processing. Change md->barrier_error so that only the first one is recorded and returned to the caller. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:18 +01:00
Mikulas Patocka	2761e95fe4	dm: process requeue in dm_wq_work If barrier request was returned with DM_ENDIO_REQUEUE, requeue it in dm_wq_work instead of dec_pending. This allows us to correctly handle a situation when some targets are asking for a requeue and other targets signal an error. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:18 +01:00
Mikulas Patocka	531fe96364	dm: make dm_flush return void Make dm_flush return void. The first error during flush is stored in md->barrier_error instead. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:17 +01:00
Mikulas Patocka	32a926da5a	dm: always hold bdev reference Fix a potential deadlock when creating multiple snapshots by holding a reference to struct block_device for the whole lifecycle of every dm device instead of obtaining it independently at each point it is needed. bdget_disk() was called while the device was being suspended, in dm_suspend(). However there could be other devices already suspended, for example when creating additional snapshots of a device. bdget_disk() can wait for IO and allocate memory resulting in waiting for the already-suspended device - deadlock. This patch changes the code so that it gets the reference to struct block_device when struct mapped_device is allocated and initialized in alloc_dev() where it is always OK to allocate memory or wait for I/O. It drops the reference when it is destroyed in free_dev(). Thus there is no call to bdget_disk() while any device is suspended. Previously unlock_fs() was called only if bdev was held. Now it is called unconditionally, but the superfluous calls are harmless because it returns immediately if the filesystem was not previously frozen. This patch also now allows the device size to be changed in a noflush suspend because the bdev is held. This has no adverse effect. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:17 +01:00
Mikulas Patocka	db8fef4fab	dm: rename suspended_bdev to bdev Rename suspended_bdev to bdev. This patch doesn't change any functionality, just renames the variable. In the next patch, the variable will be used even for non-suspended device. (Pre-requisite for the per-target barrier support patches.) Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:15 +01:00
Jonathan Brassow	f6bd4eb73c	dm exception store: fix exstore lookup to be case insensitive When snapshots are created using 'p' instead of 'P' as the exception store type, the device-mapper table loading fails. This patch makes the code case insensitive as intended and fixes some regressions reported with device-mapper snapshots. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Cc: stable@kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:15 +01:00
Mikulas Patocka	5657e8fa45	dm: use i_size_read Use i_size_read() instead of reading i_size. If someone changes the size of the device simultaneously, i_size_read is guaranteed to return a valid value (either the old one or the new one). i_size can return some intermediate invalid value (on 32-bit computers with 64-bit i_size, the reads to both halves of i_size can be interleaved with updates to i_size, resulting in garbage being returned). Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:14 +01:00
Mikulas Patocka	8cbeb67ad5	dm: avoid unsupported spanning of md stripe boundaries A bio that has two or more vector entries, size less than or equal to page size, that crosses a stripe boundary of an underlying md device is accepted by device mapper (it conforms to all its limits) but not by the underlying device. The fix is: If device mapper selects the one-page maximum request size, it also needs to set its own q->merge_bvec_fn to reject any bios with multiple vector entries that span more pages. The problem was discovered in the following scenario: * MD - RAID-0 * LV on the top of it (raid1, snapshot or striped with chunk size/stripe larger than RAID-0 stripe) * one of the logical volumes is exported to xen domU * inside xen domU it is partitioned, the key point is that the partition must be unaligned on page boundary (fdisk normally aligns the partition to 63 sectors which will trigger it) * install the system on the partitioned disk in domU This causes I/O failures in dom0. Reference: https://bugzilla.redhat.com/show_bug.cgi?id=223947 Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:14 +01:00
Mikulas Patocka	53b351f972	dm mpath: flush keventd queue in destructor The commit `fe9cf30eb8` moves dm table event submission from kmultipath queue to kernel kevent queue to avoid a deadlock. There is a possibility of race condition because kevent queue is not flushed in the multipath destructor. The scenario is: - some event happens and is queued to keventd - keventd thread is delayed due to scheuling latency or some other work - multipath device is destroyed - keventd now attempts to process work_struct that is residing in already released memory. The patch flushes the keventd queue in multipath constructor. I've already fixed similar bug in dm-raid1. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: stable@kernel.org	2009-06-22 10:12:13 +01:00
Mikulas Patocka	a72986c562	dm raid1: keep retrying alloc if mempool_alloc failed If the code can't handle allocation failures, use __GFP_NOFAIL so that in case of memory pressure the allocator will retry indefinitely and won't return NULL which would cause a crash in the function. This is still not a correct fix, it may cause a classic deadlock when memory manager waits for I/O being done and I/O waits for some free memory. I/O code shouldn't allocate any memory. But in this case it probably doesn't matter much in practice, people usually do not swap on RAID. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:13 +01:00
Chandra Seetharaman	e54f77ddda	dm mpath: call activate fn for each path in pg_init Fixed a problem affecting reinstatement of passive paths. Before we moved the hardware handler from dm to SCSI, it performed a pg_init for a path group and didn't maintain any state about each path in hardware handler code. But in SCSI dh, such state is now maintained, as we want to fail I/O early on a path if it is not the active path. All the hardware handlers have a state now and set to active or some form of inactive. They have prep_fn() which uses this state to fail the I/O without it ever being sent to the device. So in effect when dm-multipath calls scsi_dh_activate(), activate is sent to only one path and the "state" of that path is changed appropriately to "active" while other paths in the same path group are never changed as they never got an "activate". In order make sure all the paths in a path group gets their state set properly when a pg_init happens, we need to call scsi_dh_activate() on all paths in a path group. Doing this at the hardware handler layer is not a good option as we want the multipath layer to define the relationship between path and path groups and not the hardware handler. Attached patch sends an "activate" on each path in a path group when a path group is switched. It also sends an activate when a path is reinstated. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:12 +01:00
Hannes Reinecke	a0cf7ea954	dm mpath: change attached scsi_dh When specifying a different hardware handler via multipath features we should be able to override the built-in defaults. The problem here is the hardware table from scsi_dh is compiled in and cannot be changed from userland. The multipath.conf OTOH is purely user-defined and, what's more, the user might have a valid reason for modifying it. (EG EMC Clariion can well be run in PNR mode even though ALUA is active, or the user might want to try ALUA on any as-of-yet unknown devices) So _not_ allowing multipath to override the device handler setting will just add to the confusion and makes error tracking even more difficult. Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:11 +01:00
Milan Broz	4d89b7b4e4	dm: sysfs skip output when device is being destroyed Do not process sysfs attributes when device is being destroyed. Otherwise code can cause BUG_ON(test_bit(DMF_FREEING, &md->flags)); in dm_put() call. Cc: stable@kernel.org Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:11 +01:00
Mikulas Patocka	e094f4f15f	dm mpath: validate hw_handler argument count Fix arg count parsing error in hw handlers. Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:10 +01:00
Mikulas Patocka	0e0497c0c0	dm mpath: validate table argument count The parser reads the argument count as a number but doesn't check that sufficient arguments are supplied. This command triggers the bug: dmsetup create mpath --table "0 `blockdev --getsize /dev/mapper/cr0` multipath 0 0 2 1 round-robin 1000 0 1 1 /dev/mapper/cr0 round-robin 0 1 1 /dev/mapper/cr1 1000" kernel BUG at drivers/md/dm-mpath.c:530! Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:08:02 +01:00
Linus Torvalds	31583d6acf	Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block * 'for-linus' of git://git.kernel.dk/linux-2.6-block: Fix kernel-doc parameter name typo in blk-settings.c: block: rename CONFIG_LBD to CONFIG_LBDAF block: Fix bounce_pfn setting hd: stop defining MAJOR_NR	2009-06-19 17:43:04 -07:00
Bartlomiej Zolnierkiewicz	90c699a9ee	block: rename CONFIG_LBD to CONFIG_LBDAF Follow-up to "block: enable by default support for large devices and files on 32-bit archs". Rename CONFIG_LBD to CONFIG_LBDAF to: - allow update of existing [def]configs for "default y" change - reflect that it is used also for large files support nowadays Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-06-19 08:08:50 +02:00
Linus Torvalds	9729a6eb58	Merge branch 'for-linus' of git://neil.brown.name/md * 'for-linus' of git://neil.brown.name/md: (39 commits) md/raid5: correctly update sync_completed when we reach max_resync md/raid5: add missing call to schedule() after prepare_to_wait() md/linear: use call_rcu to free obsolete 'conf' structures. md linear: Protecting mddev with rcu locks to avoid races md: Move check for bitmap presence to personality code. md: remove chunksize rounding from common code. md: raid0/linear: ensure device sizes are rounded to chunk size. md: move assignment of ->utime so that it never gets skipped. md: Push down reconstruction log message to personality code. md: merge reconfig and check_reshape methods. md: remove unnecessary arguments from ->reconfig method. md: raid5: check stripe cache is large enough in start_reshape md: raid0: chunk_sectors cleanups. md: fix some comments. md/raid5: Use is_power_of_2() in raid5_reconfig()/raid6_reconfig(). md: convert conf->chunk_size and conf->prev_chunk to sectors. md: Convert mddev->new_chunk to sectors. md: Make mddev->chunk_size sector-based. md: raid0 :Enables chunk size other than powers of 2. md: prepare for non-power-of-two chunk sizes ...	2009-06-18 13:11:50 -07:00
NeilBrown	48606a9f2f	md/raid5: correctly update sync_completed when we reach max_resync At the end of reshape_request we update cyrr_resync_completed if we are about to pause due to reaching resync_max. However we update it to the wrong value. We need to add the "reshape_sectors" that have just been reshaped. Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-18 09:14:12 +10:00
Dan Williams	7a3ab90894	md/raid5: add missing call to schedule() after prepare_to_wait() In the unlikely event that reshape progresses past the current request while it is waiting for a stripe we need to schedule() before retrying for 2 reasons: 1/ Prevent list corruption from duplicated list_add() calls without intervening list_del(). 2/ Give the reshape code a chance to make some progress to resolve the conflict. Cc: <stable@kernel.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-18 08:50:18 +10:00
NeilBrown	495d357301	md/linear: use call_rcu to free obsolete 'conf' structures. Current, when we update the 'conf' structure, when adding a drive to a linear array, we keep the old version around until the array is finally stopped, as it is not safe to free it immediately. Now that we have rcu protection on all accesses to 'conf', we can use call_rcu to free it more promptly. Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-18 08:49:42 +10:00
SandeepKsinha	af11c397fd	md linear: Protecting mddev with rcu locks to avoid races Due to the lack of memory ordering guarantees, we may have races around mddev->conf. In particular, the correct contents of the structure we get from dereferencing ->private might not be visible to this CPU yet, and they might not be correct w.r.t mddev->raid_disks. This patch addresses the problem using rcu protection to avoid such race conditions. Signed-off-by: SandeepKsinha <sandeepksinha@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-18 08:49:35 +10:00
Andre Noll	0894cc3066	md: Move check for bitmap presence to personality code. If the superblock of a component device indicates the presence of a bitmap but the corresponding raid personality does not support bitmaps (raid0, linear, multipath, faulty), then something is seriously wrong and we'd better refuse to run such an array. Currently, this check is performed while the superblocks are examined, i.e. before entering personality code. Therefore the generic md layer must know which raid levels support bitmaps and which do not. This patch avoids this layer violation without adding identical code to various personalities. This is accomplished by introducing a new public function to md.c, md_check_no_bitmap(), which replaces the hard-coded checks in the superblock loading functions. A call to md_check_no_bitmap() is added to the ->run method of each personality which does not support bitmaps and assembly is aborted if at least one component device contains a bitmap. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-18 08:49:23 +10:00
NeilBrown	8190e754e0	md: remove chunksize rounding from common code. It is easiest to round sizes to multiples of chunk size in the personality code for those personalities which care. Those personalities now do the rounding, so we can remove that function from common code. Also remove the upper bound on the size of a chunk, and the lower bound on the size of a device (1 chunk), neither of which really buy us anything. Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-18 08:48:58 +10:00
NeilBrown	13f2682b72	md: raid0/linear: ensure device sizes are rounded to chunk size. This is currently ensured by common code, but it is more reliable to ensure it where it is needed in personality code. All the other personalities that care already round the size to the chunk_size. raid0 and linear are the only hold-outs. Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-18 08:48:55 +10:00
NeilBrown	1b57f13223	md: move assignment of ->utime so that it never gets skipped. Currently the assignment to utime gets skipped for 'external' metadata. So move it to the top of the function so that it always gets effected. This is of largely cosmetic interest. Nothing actually depends on ->utime being right for external arrays. "mdadm --monitor" does use it for 0.90 and 1.x arrays, but with mdadm-3.0, this is not important for external metadata. Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-18 08:48:19 +10:00
Andre Noll	8c6ac868b1	md: Push down reconstruction log message to personality code. Currently, the md layer checks in analyze_sbs() if the raid level supports reconstruction (mddev->level >= 1) and if reconstruction is in progress (mddev->recovery_cp != MaxSector). Move that printk into the personality code of those raid levels that care (levels 1, 4, 5, 6, 10). Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-18 08:48:06 +10:00
NeilBrown	50ac168a6e	md: merge reconfig and check_reshape methods. The difference between these two methods is artificial. Both check that a pending reshape is valid, and perform any aspect of it that can be done immediately. 'reconfig' handles chunk size and layout. 'check_reshape' handles raid_disks. So make them just one method. Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-18 08:47:55 +10:00
NeilBrown	597a711b69	md: remove unnecessary arguments from ->reconfig method. Passing the new layout and chunksize as args is not necessary as the mddev has fields for new_check and new_layout. This is preparation for combining the check_reshape and reconfig methods Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-18 08:47:42 +10:00
NeilBrown	01ee22b496	md: raid5: check stripe cache is large enough in start_reshape In reshape cases that do not change the number of devices, start_reshape is called without first calling check_reshape. Currently, the check that the stripe_cache is large enough is only done in check_reshape. It should be in start_reshape too. Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-18 08:47:20 +10:00
NeilBrown	d6e412eaa5	md: raid0: chunk_sectors cleanups. following the conversion to chunk_sectors, there is room for cleaning up a little. Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-18 08:47:00 +10:00
Andre Noll	cdc2ae6d6a	md: fix some comments. 1/ Raid5 has learned to take over also raid4 and raid6 arrays. 2/ new_chunk in mdp_superblock_1 is in sectors, not bytes. Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-18 08:46:47 +10:00
Andre Noll	0ba459d262	md/raid5: Use is_power_of_2() in raid5_reconfig()/raid6_reconfig(). Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-18 08:46:10 +10:00
Andre Noll	09c9e5fa1b	md: convert conf->chunk_size and conf->prev_chunk to sectors. This kills some more shifts. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-18 08:45:55 +10:00
Andre Noll	664e7c413f	md: Convert mddev->new_chunk to sectors. A straight-forward conversion which gets rid of some multiplications/divisions/shifts. The patch also introduces a couple of new ones, most of which are due to conf->chunk_size still being represented in bytes. This will be cleaned up in subsequent patches. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-18 08:45:27 +10:00
Andre Noll	9d8f036362	md: Make mddev->chunk_size sector-based. This patch renames the chunk_size field to chunk_sectors with the implied change of semantics. Since is_power_of_2(chunk_size) = is_power_of_2(chunk_sectors << 9) = is_power_of_2(chunk_sectors) these bits don't need an adjustment for the shift. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-18 08:45:01 +10:00
Linus Torvalds	6fd03301d7	Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (64 commits) debugfs: use specified mode to possibly mark files read/write only debugfs: Fix terminology inconsistency of dir name to mount debugfs filesystem. xen: remove driver_data direct access of struct device from more drivers usb: gadget: at91_udc: remove driver_data direct access of struct device uml: remove driver_data direct access of struct device block/ps3: remove driver_data direct access of struct device s390: remove driver_data direct access of struct device parport: remove driver_data direct access of struct device parisc: remove driver_data direct access of struct device of_serial: remove driver_data direct access of struct device mips: remove driver_data direct access of struct device ipmi: remove driver_data direct access of struct device infiniband: ehca: remove driver_data direct access of struct device ibmvscsi: gadget: at91_udc: remove driver_data direct access of struct device hvcs: remove driver_data direct access of struct device xen block: remove driver_data direct access of struct device thermal: remove driver_data direct access of struct device scsi: remove driver_data direct access of struct device pcmcia: remove driver_data direct access of struct device PCIE: remove driver_data direct access of struct device ... Manually fix up trivial conflicts due to different direct driver_data direct access fixups in drivers/block/{ps3disk.c,ps3vram.c}	2009-06-16 12:57:37 -07:00
Li Zefan	e212d6f250	block: remove some includings of blktrace_api.h When porting blktrace to tracepoints, we changed to trace/block.h for trace prober declarations. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-06-16 11:19:36 +02:00
raz ben yehuda	fbb704efb7	md: raid0 :Enables chunk size other than powers of 2. Maintain two flows, one for pow2 chunk sizes (which uses masks and shift), and a flow for the general case (which uses sector_div). This is for the sake of performance. - introduce map_sector and is_io_in_chunk_boundary to encapsulate those two flows better for raid0_make_request - fix blk_mergeable to support the two flows. Signed-off-by: raziebe@gmail.com Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-16 17:02:05 +10:00
raz ben yehuda	2ac06c3332	md: prepare for non-power-of-two chunk sizes Remove chunk size check from md as this is now performed in the run function in each personality. Replace chunk size power 2 code calculations by a regular division. Signed-off-by: raziebe@gmail.com Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-16 17:01:42 +10:00
raz ben yehuda	740da44918	md: raid5: chunk size check in setup_conf have raid5 check chunk size in run/reshape method instead of in md Signed-off-by: raziebe@gmail.com Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-16 17:01:36 +10:00
raz ben yehuda	964e7913b0	md: raid10: chunk size check in run have raid10 check chunk size in run method instead of in md Signed-off-by: raziebe@gmail.com Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-16 17:01:22 +10:00
raz ben yehuda	92e59b6ba2	md: raid0: chunk size check in raid0_run have raid0 check chunk size in run method instead of in md. This is part of a series moving the checks from common code to the personalities where they belong. hardsect is short and chunksize is an int, so it is safe to use %. Signed-off-by: raziebe@gmail.com Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-16 17:00:57 +10:00
raz ben yehuda	46994191ae	md: have raid0 report its formation Report to the user what are the raid zones Signed-off-by: raziebe@gmail.com Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-16 17:00:54 +10:00
raz ben yehuda	1b9614291e	md: have raid0 compile with MD_DEBUG on Because of the removal of the device list from the strips raid0 did not compile with MD_DEBUG flag on Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-16 16:57:40 +10:00
Sandeep K Sinha	aece3d1f40	md: Binary search in linear raid Replace the linear search with binary search in which_dev. Signed-off-by: Sandeep K Sinha <sandeepksinha@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-16 16:57:08 +10:00
Sandeep K Sinha	4db7cdc859	md: Removing num_sector and replacing start_sector with end_sector Remove num_sectors from dev_info and replace start_sector with end_sector. This makes a lot of comparisons much simpler. Signed-off-by: Sandeep K Sinha <sandeepksinha@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-16 16:56:13 +10:00
Sandeep K Sinha	45d4582f21	md: Removal of hash table in linear raid Get rid of sector_div and hash table for linear raid and replace with a linear search in which_dev. The hash table adds a lot of complexity for little if any gain. Ultimately a binary search will be used which will have smaller cache foot print, a similar number of memory access, and no divisions. Signed-off-by: Sandeep K Sinha <sandeepksinha@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-16 16:55:26 +10:00
NeilBrown	070ec55d07	md: remove mddev_to_conf "helper" macro Having a macro just to cast a void* isn't really helpful. I would must rather see that we are simply de-referencing ->private, than have to know what the macro does. So open code the macro everywhere and remove the pointless cast. Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-16 16:54:21 +10:00
NeilBrown	a6b3deafe0	md: raid0: remove setting of segment boundary. This setting doesn't seem to make sense (half the chunk size??) and shouldn't be needed. The segment boundary exported by raid0 should simply be the minimum of the segment boundary of all component devices. And we already get that right. Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-16 16:54:07 +10:00
NeilBrown	b414579f45	md: raid0: remove ->dev pointer from strip_zone structure If we treat conf->devlist more like a 2 dimensional array, we can get the devlist for a particular zone simply by indexing that array, so we don't need to store the pointers to subarrays in strip_zone. This makes strip_zone smaller and so (hopefully) searches faster. Signed-of-by: NeilBrown <neilb@suse.de>	2009-06-16 16:50:52 +10:00
NeilBrown	49f357a22b	md: raid0: remove ->sectors from the strip_zone structure. storing ->sectors is redundant as is can be computed from the difference z->zone_end - (z-1)->zone_end The one place where it is used, it is just as efficient to use a zone_end value instead. And removing it makes strip_zone smaller, so they array of these that is searched on every request has a better chance to say in cache. So discard the field and get the value from elsewhere. Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-16 16:50:35 +10:00
Andre Noll	fb5ab4b5d6	md: raid0: Fix a memory leak when stopping a raid0 array. raid0_stop() removes all references to the raid0 configuration but misses to free the ->devlist buffer. This patch closes this leak, removes a pointless initialization and fixes a coding style issue in raid0_stop(). Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-16 16:48:19 +10:00
Andre Noll	ed7b00380d	md: raid0: Allocate all buffers for the raid0 configuration in one function. Currently the raid0 configuration is allocated in raid0_run() while the buffers for the strip_zone and the dev_list arrays are allocated in create_strip_zones(). On errors, all three buffers are freed in raid0_run(). It's easier and more readable to do the allocation and cleanup within a single function. So move that code into create_strip_zones(). Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-16 16:47:36 +10:00
Andre Noll	5568a6035d	md: raid0: Make raid0_run() return a proper error code. Currently raid0_run() always returns -ENOMEM on errors. This is incorrect as running the array might fail for other reasons, for example because not all component devices were available. This patch changes create_strip_zones() so that it returns a proper error code (either -ENOMEM or -EINVAL) rather than 1 on errors and makes raid0_run(), its single caller, return that value instead of -ENOMEM. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-16 16:47:21 +10:00
Andre Noll	8f79cfcdb6	md: raid0: Remove hash spacing and sector shift. The "sector_shift" and "spacing" fields of struct raid0_private_data were only used for the hash table lookups. So the removal of the hash table allows get rid of these fields as well which simplifies create_strip_zones() and raid0_run() quite a bit. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-16 16:47:10 +10:00
Andre Noll	09770e0b6e	md: raid0: Remove hash table. The raid0 hash table has become unused due to the changes in the previous patch. This patch removes the hash table allocation and setup code and kills the hash_table field of struct raid0_private_data. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-16 16:46:48 +10:00
NeilBrown	d27a43abd7	md/raid0: two cleanups in create_stripe_zones. 1/ remove current_start. The same value is available in zone->dev_start and storing it separately doesn't gain anything. 2/ rename curr_zone_start to curr_zone_end as we are now more focused on the 'end' of each zone. We end up storing the same number though - the old name was a little confusing (and what does 'current' mean in this context anyway). Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-16 16:46:46 +10:00
Andre Noll	dc58266385	md: raid0: Replace hash table lookup by looping over all strip_zones. The number of strip_zones of a raid0 array is bounded by the number of drives in the array and is in fact much smaller for typical setups. For example, any raid0 array containing identical disks will have only a single strip_zone. Therefore, the hash tables which are used for quickly finding the strip_zone that holds a particular sector are of questionable value and add quite a bit of unnecessary complexity. This patch replaces the hash table lookup by equivalent code which simply loops over all strip zones to find the zone that holds the given sector. In order to make this loop as fast as possible, the zone->start field of struct strip_zone has been renamed to zone_end, and it now stores the beginning of the next zone in sectors. This allows to save one addition in the loop. Subsequent cleanup patches will remove the hash table structure. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-16 16:18:43 +10:00
Kay Sievers	d405640539	Driver Core: misc: add nodename support for misc devices. This adds support for misc devices to report their requested nodename to userspace. It also updates a number of misc drivers to provide the needed subdirectory and device name to be used for them. Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Jan Blunck <jblunck@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2009-06-15 21:30:25 -07:00
Linus Torvalds	c9059598ea	Merge branch 'for-2.6.31' of git://git.kernel.dk/linux-2.6-block * 'for-2.6.31' of git://git.kernel.dk/linux-2.6-block: (153 commits) block: add request clone interface (v2) floppy: fix hibernation ramdisk: remove long-deprecated "ramdisk=" boot-time parameter fs/bio.c: add missing __user annotation block: prevent possible io_context->refcount overflow Add serial number support for virtio_blk, V4a block: Add missing bounce_pfn stacking and fix comments Revert "block: Fix bounce limit setting in DM" cciss: decode unit attention in SCSI error handling code cciss: Remove no longer needed sendcmd reject processing code cciss: change SCSI error handling routines to work with interrupts enabled. cciss: separate error processing and command retrying code in sendcmd_withirq_core() cciss: factor out fix target status processing code from sendcmd functions cciss: simplify interface of sendcmd() and sendcmd_withirq() cciss: factor out core of sendcmd_withirq() for use by SCSI error handling code cciss: Use schedule_timeout_uninterruptible in SCSI error handling code block: needs to set the residual length of a bidi request Revert "block: implement blkdev_readpages" block: Fix bounce limit setting in DM Removed reference to non-existing file Documentation/PCI/PCI-DMA-mapping.txt ... Manually fix conflicts with tracing updates in: block/blk-sysfs.c drivers/ide/ide-atapi.c drivers/ide/ide-cd.c drivers/ide/ide-floppy.c drivers/ide/ide-tape.c include/trace/events/block.h kernel/trace/blktrace.c	2009-06-11 11:10:35 -07:00
Linus Torvalds	8623661180	Merge branch 'tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (244 commits) Revert "x86, bts: reenable ptrace branch trace support" tracing: do not translate event helper macros in print format ftrace/documentation: fix typo in function grapher name tracing/events: convert block trace points to TRACE_EVENT(), fix !CONFIG_BLOCK tracing: add protection around module events unload tracing: add trace_seq_vprint interface tracing: fix the block trace points print size tracing/events: convert block trace points to TRACE_EVENT() ring-buffer: fix ret in rb_add_time_stamp ring-buffer: pass in lockdep class key for reader_lock tracing: add annotation to what type of stack trace is recorded tracing: fix multiple use of __print_flags and __print_symbolic tracing/events: fix output format of user stack tracing/events: fix output format of kernel stack tracing/trace_stack: fix the number of entries in the header ring-buffer: discard timestamps that are at the start of the buffer ring-buffer: try to discard unneeded timestamps ring-buffer: fix bug in ring_buffer_discard_commit ftrace: do not profile functions when disabled tracing: make trace pipe recognize latency format flag ...	2009-06-10 19:53:40 -07:00
Li Zefan	55782138e4	tracing/events: convert block trace points to TRACE_EVENT() TRACE_EVENT is a more generic way to define tracepoints. Doing so adds these new capabilities to this tracepoint: - zero-copy and per-cpu splice() tracing - binary tracing without printf overhead - structured logging records exposed under /debug/tracing/events - trace events embedded in function tracer output and other plugins - user-defined, per tracepoint filter expressions ... Cons: - no dev_t info for the output of plug, unplug_timer and unplug_io events. no dev_t info for getrq and sleeprq events if bio == NULL. no dev_t info for rq_abort,...,rq_requeue events if rq->rq_disk == NULL. This is mainly because we can't get the deivce from a request queue. But this may change in the future. - A packet command is converted to a string in TP_assign, not TP_print. While blktrace do the convertion just before output. Since pc requests should be rather rare, this is not a big issue. - In blktrace, an event can have 2 different print formats, but a TRACE_EVENT has a unique format, which means we have some unused data in a trace entry. The overhead is minimized by using __dynamic_array() instead of __array(). I've benchmarked the ioctl blktrace vs the splice based TRACE_EVENT tracing: dd dd + ioctl blktrace dd + TRACE_EVENT (splice) 1 7.36s, 42.7 MB/s 7.50s, 42.0 MB/s 7.41s, 42.5 MB/s 2 7.43s, 42.3 MB/s 7.48s, 42.1 MB/s 7.43s, 42.4 MB/s 3 7.38s, 42.6 MB/s 7.45s, 42.2 MB/s 7.41s, 42.5 MB/s So the overhead of tracing is very small, and no regression when using those trace events vs blktrace. And the binary output of TRACE_EVENT is much smaller than blktrace: # ls -l -h -rw-r--r-- 1 root root 8.8M 06-09 13:24 sda.blktrace.0 -rw-r--r-- 1 root root 195K 06-09 13:24 sda.blktrace.1 -rw-r--r-- 1 root root 2.7M 06-09 13:25 trace_splice.out Following are some comparisons between TRACE_EVENT and blktrace: plug: kjournald-480 [000] 303.084981: block_plug: [kjournald] kjournald-480 [000] 303.084981: 8,0 P N [kjournald] unplug_io: kblockd/0-118 [000] 300.052973: block_unplug_io: [kblockd/0] 1 kblockd/0-118 [000] 300.052974: 8,0 U N [kblockd/0] 1 remap: kjournald-480 [000] 303.085042: block_remap: 8,0 W 102736992 + 8 <- (8,8) 33384 kjournald-480 [000] 303.085043: 8,0 A W 102736992 + 8 <- (8,8) 33384 bio_backmerge: kjournald-480 [000] 303.085086: block_bio_backmerge: 8,0 W 102737032 + 8 [kjournald] kjournald-480 [000] 303.085086: 8,0 M W 102737032 + 8 [kjournald] getrq: kjournald-480 [000] 303.084974: block_getrq: 8,0 W 102736984 + 8 [kjournald] kjournald-480 [000] 303.084975: 8,0 G W 102736984 + 8 [kjournald] bash-2066 [001] 1072.953770: 8,0 G N [bash] bash-2066 [001] 1072.953773: block_getrq: 0,0 N 0 + 0 [bash] rq_complete: konsole-2065 [001] 300.053184: block_rq_complete: 8,0 W () 103669040 + 16 [0] konsole-2065 [001] 300.053191: 8,0 C W 103669040 + 16 [0] ksoftirqd/1-7 [001] 1072.953811: 8,0 C N (5a 00 08 00 00 00 00 00 24 00) [0] ksoftirqd/1-7 [001] 1072.953813: block_rq_complete: 0,0 N (5a 00 08 00 00 00 00 00 24 00) 0 + 0 [0] rq_insert: kjournald-480 [000] 303.084985: block_rq_insert: 8,0 W 0 () 102736984 + 8 [kjournald] kjournald-480 [000] 303.084986: 8,0 I W 102736984 + 8 [kjournald] Changelog from v2 -> v3: - use the newly introduced __dynamic_array(). Changelog from v1 -> v2: - use __string() instead of __array() to minimize the memory required to store hex dump of rq->cmd(). - support large pc requests. - add missing blk_fill_rwbs_rq() in block_rq_requeue TRACE_EVENT. - some cleanups. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4A2DF669.5070905@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-06-09 12:34:23 -04:00
NeilBrown	0e6e0271a2	md/raid5: fix bug in reshape code when chunk_size decreases. Now that we support changing the chunksize, we calculate "reshape_sectors" to be the max of number of sectors in old and new chunk size. However there is one please where we still use 'chunksize' rather than 'reshape_sectors'. This causes a reshape that reduces the size of chunks to freeze. Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-09 16:32:22 +10:00
NeilBrown	a8c906ca3f	md/raid5 - avoid deadlocks in get_active_stripe during reshape md has functionality to 'quiesce' and array so that all pending IO completed and no new IO starts. This is used to achieve a stable state before making internal changes. Currently this quiescing applies equally to normal IO, resync IO, and reshape IO. However there is a problem with applying it to reshape IO. Reshape can have multiple 'stripe_heads' that must be active together. If the quiesce come between allocating the first and the last of such a collection, then we deadlock, as the last will not be allocated until the quiesce is lifted, the quiesce will not be lifted until the first (which has been allocated) gets used, and that first cannot be used until the last is allocated. It is not necessary to inhibit reshape IO when a quiesce is requested. Those places in the code that require a full quiesce will ensure the reshape thread is not running at all. So allow reshape requests to get access to new stripe_heads without being blocked by a 'quiesce'. This only affects in-place reshapes (i.e. where the array does not grow or shrink) and these are only newly supported. So this patch is not needed in earlier kernels. Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-09 14:39:59 +10:00
NeilBrown	f001a70cdc	md/raid5: use conf->raid_disks in preference to mddev->raid_disk mddev->raid_disks can be changed and any time by a request from user-space. It is a suggestion as to what number of raid_disks is desired. conf->raid_disks can only be changed by the raid5 module with suitable locks in place. It is a statement as to the current number of raid_disks. There are two places where the latter should be used, but the former is used. This can lead to a crash when reshaping an array. This patch changes to mddev-> to conf-> Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-09 14:30:31 +10:00
Jens Axboe	9df1bb9b51	Revert "block: Fix bounce limit setting in DM" This reverts commit `a05c0205ba`. DM doesn't need to access the bounce_pfn directly. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-06-09 06:22:57 +02:00
Martin K. Petersen	a05c0205ba	block: Fix bounce limit setting in DM blk_queue_bounce_limit() is more than a wrapper about the request queue limits.bounce_pfn variable. Introduce blk_queue_bounce_pfn() which can be called by stacking drivers that wish to set the bounce limit explicitly. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-06-03 09:33:18 +02:00
NeilBrown	ed37d83e6a	md: raid5: change incorrect usage of 'min' macro to 'min_t' A recent patch to raid5.c use min on an int and a sector_t. This isn't allowed. So change it to min_t(sector_t,x,y). Signed-off-by: NeilBrown <neilb@suse.de>	2009-05-27 21:39:05 +10:00
NeilBrown	b492b852cd	md: don't use locked_ioctl. md has no need for the BKL - it does its own locking. So md_ioctl doesn't need to be a locked_ioctl. Signed-off-by: NeilBrown <neilb@suse.de>	2009-05-26 12:57:36 +10:00
NeilBrown	7a91ee1f62	md: don't update curr_resync_completed without also updating reshape_position. In order for the metadata to always be consistent, we mustn't updated curr_resync_completed without also updating reshape_position. The reshape code updates both at the same time. However since commit `97e4f42d62` the common md_do_sync will sometimes update curr_resync_completed but is not in a position to update reshape_position. So if MD_RECOVERY_RESHAPE is set (indicating that a reshape is happening, so reshape_position might change), don't update curr_resync_completed in md_do_sync, leave it to the per-personality reshape code. Signed-off-by: NeilBrown <neilb@suse.de>	2009-05-26 12:57:21 +10:00
NeilBrown	848b318236	md: raid5: avoid sector values going negative when testing reshape progress. As sector_t in unsigned, we cannot afford to let 'safepos' etc go negative. So replace a -= b; by a -= min(b,a); Signed-off-by: NeilBrown <neilb@suse.de>	2009-05-26 12:41:08 +10:00
NeilBrown	b6a9ce688f	md: export 'frozen' resync state through sysfs The md resync engine has a 'frozen' state which ensures that no resync/recovery. This is used to avoid races. Export this state through the 'sync_action' sysfs attribute so that user-space can benefit and also avoid some races. Signed-off-by: NeilBrown <neilb@suse.de>	2009-05-26 09:41:17 +10:00
NeilBrown	be51269103	md: bitmap: improve bitmap maintenance code. The code for checking which bits in the bitmap can be cleared has 2 problems: 1/ it repeatedly takes and drops a spinlock, where it would make more sense to just hold on to it most of the time. 2/ it doesn't make use of some opportunities to skip large sections of the bitmap This patch fixes those. It will only affect CPU consumption, not correctness. Signed-off-by: NeilBrown <neilb@suse.de>	2009-05-26 09:41:17 +10:00
NeilBrown	2b69c83924	md: improve errno return when setting array_size Instead of always returns EINVAL if anything goes wrong when setting the array size, add the option of E2BIG if the size requested is too large. This makes it easier for user-space to be sure what went wrong. Signed-off-by: NeilBrown <neilb@suse.de>	2009-05-26 09:41:17 +10:00
NeilBrown	62e1e389f8	md: always update level / chunk_size / layout when writing v1.x metadata. We previously didn't update these fields when writing the metadata because they could never change. They can now, so we better write them. v0.90 metadata always updated these fields. Signed-off-by: NeilBrown <neilb@suse.de>	2009-05-26 09:40:59 +10:00
Martin K. Petersen	ae03bf639a	block: Use accessor functions for queue limits Convert all external users of queue limits to using wrapper functions instead of poking the request queue variables directly. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-22 23:22:54 +02:00
Martin K. Petersen	e1defc4ff0	block: Do away with the notion of hardsect_size Until now we have had a 1:1 mapping between storage device physical block size and the logical block sized used when addressing the device. With SATA 4KB drives coming out that will no longer be the case. The sector size will be 4KB but the logical block size will remain 512-bytes. Hence we need to distinguish between the physical block size and the logical ditto. This patch renames hardsect_size to logical_block_size. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-22 23:22:54 +02:00
Ingo Molnar	1079cac0f4	Merge commit 'v2.6.30-rc6' into tracing/core Merge reason: we were on an -rc4 base, sync up to -rc6 Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-18 10:15:35 +02:00
Ingo Molnar	44347d947f	Merge branch 'linus' into tracing/core Merge reason: tracing/core was on a .30-rc1 base and was missing out on on a handful of tracing fixes present in .30-rc5-almost. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-07 11:17:34 +02:00
NeilBrown	c4647292fd	md: remove rd%d links immediately after stopping an array. md maintains link in sys/mdXX/md/ to identify which device has which role in the array. e.g. rd2 -> dev-sda indicates that the device with role '2' in the array is sda. These links are only present when the array is active. They are created immediately after ->run is called, and so should be removed immediately after ->stop is called. However they are currently removed a little bit later, and it is possible for ->run to be called again, thus adding these links, before they are removed. So move the removal earlier so they are consistently only present when the array is active. Signed-off-by: NeilBrown <neilb@suse.de>	2009-05-07 12:51:06 +10:00
NeilBrown	5bf2959754	md: remove ability to explicit set an inactive array to 'clean'. Being able to write 'clean' to an 'array_state' of an inactive array to activate it in 'clean' mode is both unnecessary and inconvenient. It is unnecessary because the same can be achieved by writing 'active'. This activates and array, but it still remains 'clean' until the first write. It is inconvenient because writing 'clean' is more often used to cause an 'active' array to revert to 'clean' mode (thus blocking any writes until a 'write-pending' is promoted to 'active'). Allowing 'clean' to both activate an array and mark an active array as clean can lead to races: One program writes 'clean' to mark the active array as clean at the same time as another program writes 'inactive' to deactivate (stop) and active array. Depending on which writes first, the array could be deactivated and immediately reactivated which isn't what was desired. So just disable the use of 'clean' to activate an array. This avoids a race that can be triggered with mdadm-3.0 and external metadata, so it suitable for -stable. Reported-by: Rafal Marszewski <rafal.marszewski@intel.com> Acked-by: Dan Williams <dan.j.williams@intel.com> Cc: <stable@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-05-07 12:50:57 +10:00
Jan Engelhardt	110518bccf	md: constify VFTs Signed-off-by: Jan Engelhardt <jengelh@medozas.de> Signed-off-by: NeilBrown <neilb@suse.de>	2009-05-07 12:49:37 +10:00
NeilBrown	dd71cf6b27	md: tidy up status_resync to handle large arrays. Two problems in status_resync. 1/ It still used Kilobytes as the basic block unit, while most code now uses sectors uniformly. 2/ It doesn't allow for the possibility that max_sectors exceeds the range of "unsigned long". So - change "max_blocks" to "max_sectors", and store sector numbers in there and in 'resync' - Make 'rt' a 'sector_t' so it can temporarily hold the number of remaining sectors. - use sector_div rather than normal division. - change the magic '100' used to preserve precision to '32'. + making it a power of 2 makes division easier + it doesn't need to be as large as it was chosen when we averaged speed over the entire run. Now we average speed over the last 30 seconds or so. Reported-by: "Mario 'BitKoenig' Holbe" <Mario.Holbe@TU-Ilmenau.DE> Signed-off-by: NeilBrown <neilb@suse.de>	2009-05-07 12:49:35 +10:00
NeilBrown	db305e507d	md: fix some (more) errors with bitmaps on devices larger than 2TB. If a write intent bitmap covers more than 2TB, we sometimes work with values beyond 32bit, so these need to be sector_t. This patches add the required casts to some unsigned longs that are being shifted up. This will affect any raid10 larger than 2TB, or any raid1/4/5/6 with member devices that are larger than 2TB. Signed-off-by: NeilBrown <neilb@suse.de> Reported-by: "Mario 'BitKoenig' Holbe" <Mario.Holbe@TU-Ilmenau.DE> Cc: stable@kernel.org	2009-05-07 12:49:06 +10:00
NeilBrown	1805556912	md/raid10: don't clear bitmap during recovery if array will still be degraded. If we have a raid10 with multiple missing devices, and we recover just one of these to a spare, then we risk (depending on the bitmap and array chunk size) clearing bits of the bitmap for which recovery isn't complete (because a device is still missing). This can lead to a subsequent "re-add" being recovered without any IO happening, which would result in loss of data. This patch takes the safe approach of not clearing bitmap bits if the array will still be degraded. This patch is suitable for all active -stable kernels. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2009-05-07 12:48:10 +10:00
NeilBrown	b74fd2826c	md: fix loading of out-of-date bitmap. When md is loading a bitmap which it knows is out of date, it fills each page with 1s and writes it back out again. However the write_page call makes used of bitmap->file_pages and bitmap->last_page_size which haven't been set correctly yet. So this can sometimes fail. Move the setting of file_pages and last_page_size to before the call to write_page. This bug can cause the assembly on an array to fail, thus making the data inaccessible. Hence I think it is a suitable candidate for -stable. Cc: stable@kernel.org Reported-by: Vojtech Pavlik <vojtech@suse.cz> Signed-off-by: NeilBrown <neilb@suse.de>	2009-05-07 12:47:19 +10:00
Alan D. Brunelle	22a7c31a96	blktrace: from-sector redundant in trace_block_remap Remove redundant from-sector parameter: it's /always/ the bio's sector passed in. [ Impact: cleanup ] Signed-off-by: Alan D. Brunelle <alan.brunelle@hp.com> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> LKML-Reference: <49FF517C.7000503@hp.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-06 14:13:01 +02:00
Linus Torvalds	2edbdd1266	Merge branch 'for-linus' of git://neil.brown.name/md * 'for-linus' of git://neil.brown.name/md: md: support bitmaps on RAID10 arrays larger then 2 terabytes md: update sync_completed and reshape_position even more often. md: improve usefulness and accuracy of sysfs file md/sync_completed. md: allow setting newly added device to 'in_sync' via sysfs. md: tiny md.h cleanups	2009-04-20 08:37:37 -07:00
NeilBrown	1f59390339	md: support bitmaps on RAID10 arrays larger then 2 terabytes .. and other arrays with components larger than 2 terabytes. We use a "long" rather than a "sector_t" in part of the bitmap size calculations, which is sad. Reported-by: "Mario 'BitKoenig' Holbe" <Mario.Holbe@TU-Ilmenau.DE> Signed-off-by: NeilBrown <neilb@suse.de>	2009-04-20 11:50:24 +10:00
NeilBrown	c03f6a1969	md: update sync_completed and reshape_position even more often. There are circumstances when a user-space process might need to "oversee" a resync/reshape process. For example when doing an in-place reshape of a raid5, it is prudent to take a backup of each section before reshaping it as this is the only way to provide safety against an unplanned shutdown (i.e. crash/power failure). The sync_max sysfs value can be used to stop the resync from advancing beyond a particular point. So user-space can: suspend IO to the first section and back it up set 'sync_max' to the end of the section wait for 'sync_completed' to reach that point resume IO on the first section and move on to the next section. However this process requires the kernel and user-space to run in lock-step which could introduce unnecessary delays. It would be better if a 'double buffered' approach could be used with userspace and kernel space working on different sections with the 'next' section always ready when the 'current' section is finished. One problem with implementing this is that sync_completed is only guaranteed to be updated when the sync process reaches sync_max. (it is updated on a time basis at other times, but it is hard to rely on that). This defeats some of the double buffering. With this patch, sync_completed (and reshape_position) get updated as the current position approaches sync_max, so there is room for userspace to advance sync_max early without losing updates. To be precise, sync_completed is updated when the current sync position reaches half way between the current value of sync_completed and the value of sync_max. This will usually be a good time for user space to update sync_max. If sync_max does not get updated, the updates to sync_completed (together with associated metadata updates) will occur at an exponentially increasing frequency which will get unreasonably fast (one update every page) immediately before the process hits sync_max and stops. So the update rate will be unreasonably fast only for an insignificant period of time. Signed-off-by: NeilBrown <neilb@suse.de>	2009-04-17 11:06:30 +10:00
Christoph Hellwig	8f3d8ba20e	block: move bio list helpers into bio.h It's used by DM and MD and generally useful, so move the bio list helpers into bio.h. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-04-15 08:28:09 +02:00
NeilBrown	acb180b0e3	md: improve usefulness and accuracy of sysfs file md/sync_completed. The sync_completed file reports how much of a resync (or recovery or reshape) has been completed. However due to the possibility of out-of-order completion of writes, it is not certain to be accurate. We have an internal value - mddev->curr_resync_completed - which is an accurate value (though it might not always be quite so uptodate). So: - make curr_resync_completed be uptodate a little more often, particularly when raid5 reshape updates status in the metadata - report curr_resync_completed in the sysfs file - allow poll/select to report all updates to md/sync_completed. This makes sync_completed completed usable by any external metadata handler that wants to record this status information in its metadata. Signed-off-by: NeilBrown <neilb@suse.de>	2009-04-14 16:28:34 +10:00

1 2 3 4 5 ...

1354 Commits