When we are accessing an mddev via sysfs we know that the
mddev cannot disappear because it has an embedded kobj which
is refcounted by sysfs.
And we also take the mddev_lock.
However this is not enough.
The final mddev_put could have been called and the
mddev_delayed_delete is waiting for sysfs to let go so it can destroy
the kobj and mddev.
In this state there are a lot of changes that should not be attempted.
To to guard against this we:
- initialise mddev->all_mddevs in on last put so the state can be
easily detected.
- in md_attr_show and md_attr_store, check ->all_mddevs under
all_mddevs_lock and mddev_get the mddev if it still appears to
be active.
This means that if we get to sysfs as the mddev is being deleted we
will get -EBUSY.
rdev_attr_store and rdev_attr_show are similar but already have
sufficient protection. They check that rdev->mddev still points to
mddev after taking mddev_lock. As this is cleared before delayed
removal which can only be requested under the mddev_lock, this
ensure the rdev and mddev are still alive.
Signed-off-by: NeilBrown <neilb@suse.de>
We like md devices to disappear when they really are not needed.
However it is not possible to tell from the current state whether it
is needed or not. We can only tell from recent history of changes.
In particular immediately after we create an md device it looks very
similar to immediately after we have finished with it.
So we always preserve a newly created md device until something
significant happens. This state is stored in 'hold_active'.
The normal case is to keep it until an ioctl happens, as that will
normally either activate it, or explicitly de-activate it. If it
doesn't then it was probably created by mistake and it is now time to
get rid of it.
We can also modify an array via sysfs (instead of via ioctl) and we
currently treat any change via sysfs like an ioctl as a sign that if
it now isn't more active, it should be destroyed.
However this is not appropriate as changes made via sysfs are more
gradual so we should look for a more definitive change.
So this patch only clears 'hold_active' from UNTIL_IOCTL to clear when
the array_state is changed via sysfs. Other changes via sysfs
are ignored.
Signed-off-by: NeilBrown <neilb@suse.de>
* 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
Revert "tracing: Include module.h in define_trace.h"
irq: don't put module.h into irq.h for tracking irqgen modules.
bluetooth: macroize two small inlines to avoid module.h
ip_vs.h: fix implicit use of module_get/module_put from module.h
nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
include: replace linux/module.h with "struct module" wherever possible
include: convert various register fcns to macros to avoid include chaining
crypto.h: remove unused crypto_tfm_alg_modname() inline
uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
pm_runtime.h: explicitly requires notifier.h
linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
miscdevice.h: fix up implicit use of lists and types
stop_machine.h: fix implicit use of smp.h for smp_processor_id
of: fix implicit use of errno.h in include/linux/of.h
of_platform.h: delete needless include <linux/module.h>
acpi: remove module.h include from platform/aclinux.h
miscdevice.h: delete unnecessary inclusion of module.h
device_cgroup.h: delete needless include <linux/module.h>
net: sch_generic remove redundant use of <linux/module.h>
net: inet_timewait_sock doesnt need <linux/module.h>
...
Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
- drivers/media/dvb/frontends/dibx000_common.c
- drivers/media/video/{mt9m111.c,ov6650.c}
- drivers/mfd/ab3550-core.c
- include/linux/dmaengine.h
* 'for-3.2/core' of git://git.kernel.dk/linux-block: (29 commits)
block: don't call blk_drain_queue() if elevator is not up
blk-throttle: use queue_is_locked() instead of lockdep_is_held()
blk-throttle: Take blkcg->lock while traversing blkcg->policy_list
blk-throttle: Free up policy node associated with deleted rule
block: warn if tag is greater than real_max_depth.
block: make gendisk hold a reference to its queue
blk-flush: move the queue kick into
blk-flush: fix invalid BUG_ON in blk_insert_flush
block: Remove the control of complete cpu from bio.
block: fix a typo in the blk-cgroup.h file
block: initialize the bounce pool if high memory may be added later
block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown
block: drop @tsk from attempt_plug_merge() and explain sync rules
block: make get_request[_wait]() fail if queue is dead
block: reorganize throtl_get_tg() and blk_throtl_bio()
block: reorganize queue draining
block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg()
block: pass around REQ_* flags instead of broken down booleans during request alloc/free
block: move blk_throtl prototypes to block/blk.h
block: fix genhd refcounting in blkio_policy_parse_and_set()
...
Fix up trivial conflicts due to "mddev_t" -> "struct mddev" conversion
and making the request functions be of type "void" instead of "int" in
- drivers/md/{faulty.c,linear.c,md.c,md.h,multipath.c,raid0.c,raid1.c,raid10.c,raid5.c}
- drivers/staging/zram/zram_drv.c
A pending cleanup will mean that module.h won't be implicitly
everywhere anymore. Make sure the modular drivers in md dir
are actually calling out for <module.h> explicitly in advance.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
If an incremental recovery was interrupted, a subsequent
re-add will result in a full recovery, even though an
incremental should be possible (seen with raid1).
Solve this problem by not updating the superblock on the
recovering device until array is not degraded any longer.
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrei Warkentin <andreiw@vmware.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When we add a device to an active array it can be meaningful to set
the 'insync' flag. This indicates that the device is in-sync with the
array except for locations recorded in the bitmap.
A bitmap-based recovery can then bring it completely in-sync.
Internally we move that flag to 'saved_raid_disk' but forgot to clear
In_sync like we do in add_new_disk.
So clear In_sync after moving its value to saved_raid_disk.
Reported-by: Andrei Warkentin <andreiw@vmware.com>
Signed-off-by: NeilBrown <neilb@suse.de>
The typedefs are just annoying. 'mdk' probably refers to 'md_k.h'
which used to be an include file that defined this thing.
Signed-off-by: NeilBrown <neilb@suse.de>
The md_notify_reboot() method includes a call to mdelay(1000),
to deal with "exotic SCSI devices" which are too volatile on
reboot. The delay is unconditional. Even if the machine does
not have any block devices, let alone MD devices, the kernel
shutdown sequence is slowed down.
1 second does not matter much with physical hardware, but with
certain virtualization use cases any wasted time in the bootup
& shutdown sequence counts for alot.
* drivers/md/md.c: md_notify_reboot() - only impose a delay if
there was at least one MD device to be stopped during reboot
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Two related problems:
1/ some error paths call "md_unregister_thread(mddev->thread)"
without subsequently clearing ->thread. A subsequent call
to mddev_unlock will try to wake the thread, and crash.
2/ Most calls to md_wakeup_thread are protected against the thread
disappeared either by:
- holding the ->mutex
- having an active request, so something else must be keeping
the array active.
However mddev_unlock calls md_wakeup_thread after dropping the
mutex and without any certainty of an active request, so the
->thread could theoretically disappear.
So we need a spinlock to provide some protections.
So change md_unregister_thread to take a pointer to the thread
pointer, and ensure that it always does the required locking, and
clears the pointer properly.
Reported-by: "Moshe Melnikov" <moshe@zadarastorage.com>
Signed-off-by: NeilBrown <neilb@suse.de>
cc: stable@kernel.org
There is very little benefit in allowing to let a ->make_request
instance update the bios device and sector and loop around it in
__generic_make_request when we can archive the same through calling
generic_make_request from the driver and letting the loop in
generic_make_request handle it.
Note that various drivers got the return value from ->make_request and
returned non-zero values for errors.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
0.90 metadata uses an unsigned 32bit number to count the number of
kilobytes used from each device.
This should allow up to 4TB per device.
However we multiply this by 2 (to get sectors) before casting to a
larger type, so sizes above 2TB get truncated.
Also we allow rdev->sectors to be larger than 4TB, so it is possible
for the array to be resized larger than the metadata can handle.
So make sure rdev->sectors never exceeds 4TB when 0.90 metadata is in
used.
Also the sanity check at the end of super_90_load should include level
1 as it used ->size too. (RAID0 and Linear don't use ->size at all).
Reported-by: Pim Zandbergen <P.Zandbergen@macroscoop.nl>
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
When the 'blocked' flag on a device is cleared while there are
unacknowledged bad blocks we must fail the device. This is needed for
backwards compatability of the interface.
The code currently uses the wrong test for "unacknowledged bad blocks
exist". Change it to the right test.
Signed-off-by: NeilBrown <neilb@suse.de>
Queue idling is used for the anticipation of immediate
sequencial I/O's but md_super_write() is a kind of one-
shot operation, coupled with md_super_wait(), so the
idling in this case will be just a waste of time.
Specifying REQ_NOIDLE prevents it. Instead of adding
the flag to submit_bio() directly, use pre-defined
macro WRITE_FLUSH_FUA.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
The 'write-mostly' flag can be changed through sysfs.
With 0.90 metadata, those changes are reflected in the metadata.
For 1.x metadata, they aren't.
So fix super_1_sync to record 'write-mostly' status.
Signed-off-by: NeilBrown <neilb@suse.de>
Sometimes a device will refuse to be set faulty. e.g. RAID1 will
never let the last working device become faulty.
So check if "md_error()" did manage to set the faulty flag and fail
with EBUSY if it didn't.
Resolves-Debian-Bug: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=601198
Reported-by: Mike Hommey <mh+reportbug@glandium.org>
Signed-off-by: NeilBrown <neilb@suse.de>
* 'for-linus' of git://neil.brown.name/md: (75 commits)
md/raid10: handle further errors during fix_read_error better.
md/raid10: Handle read errors during recovery better.
md/raid10: simplify read error handling during recovery.
md/raid10: record bad blocks due to write errors during resync/recovery.
md/raid10: attempt to fix read errors during resync/check
md/raid10: Handle write errors by updating badblock log.
md/raid10: clear bad-block record when write succeeds.
md/raid10: avoid writing to known bad blocks on known bad drives.
md/raid10 record bad blocks as needed during recovery.
md/raid10: avoid reading known bad blocks during resync/recovery.
md/raid10 - avoid reading from known bad blocks - part 3
md/raid10: avoid reading from known bad blocks - part 2
md/raid10: avoid reading from known bad blocks - part 1
md/raid10: Split handle_read_error out from raid10d.
md/raid10: simplify/reindent some loops.
md/raid5: Clear bad blocks on successful write.
md/raid5. Don't write to known bad block on doubtful devices.
md/raid5: write errors should be recorded as bad blocks if possible.
md/raid5: use bad-block log to improve handling of uncorrectable read errors.
md/raid5: avoid reading from known bad blocks.
...
When recovering one or more devices, if all the good devices have
bad blocks we should record a bad block on the device being rebuilt.
If this fails, we need to abort the recovery.
To ensure we don't think that we aborted later than we actually did,
we need to move the check for MD_RECOVERY_INTR earlier in md_do_sync,
in particular before mddev->curr_resync is updated.
Signed-off-by: NeilBrown <neilb@suse.de>
It is only safe to choose not to write to a bad block if that bad
block is safely recorded in metadata - i.e. if it has been
'acknowledged'.
If it hasn't we need to wait for the acknowledgement.
We support that using rdev->blocked wait and
md_wait_for_blocked_rdev by introducing a new device flag
'BlockedBadBlock'.
This flag is only advisory.
It is cleared whenever we acknowledge a bad block, so that a waiter
can re-check the particular bad blocks that it is interested it.
It should be set by a caller when they find they need to wait.
This (set after test) is inherently racy, but as
md_wait_for_blocked_rdev already has a timeout, losing the race will
have minimal impact.
When we clear "Blocked" was also clear "BlockedBadBlocks" incase it
was set incorrectly (see above race).
We also modify the way we manage 'Blocked' to fit better with the new
handling of 'BlockedBadBlocks' and to make it consistent between
externally managed and internally managed metadata. This requires
that each raidXd loop checks if the metadata needs to be written and
triggers a write (md_check_recovery) if needed. Otherwise a queued
write request might cause raidXd to wait for the metadata to write,
and only that thread can write it.
Before writing metadata, we set FaultRecorded for all devices that
are Faulty, then after writing the metadata we clear Blocked for any
device for which the Fault was certainly Recorded.
The 'faulty' device flag now appears in sysfs if the device is faulty
*or* it has unacknowledged bad blocks. So user-space which does not
understand bad blocks can continue to function correctly.
User space which does, should not assume a device is faulty until it
sees the 'faulty' flag, and then sees the list of unacknowledged bad
blocks is empty.
Signed-off-by: NeilBrown <neilb@suse.de>
If a device has ever seen a write error, we will want to handle
known-bad-blocks differently.
So create an appropriate state flag and export it via sysfs.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
Now that we have a bad block list, we should not read from those
blocks.
There are several main parts to this:
1/ read_balance needs to check for bad blocks, and return not only
the chosen device, but also how many good blocks are available
there.
2/ fix_read_error needs to avoid trying to read from bad blocks.
3/ read submission must be ready to issue multiple reads to
different devices as different bad blocks on different devices
could mean that a single large read cannot be served by any one
device, but can still be served by the array.
This requires keeping count of the number of outstanding requests
per bio. This count is stored in 'bi_phys_segments'
4/ retrying a read needs to also be ready to submit a smaller read
and queue another request for the rest.
This does not yet handle bad blocks when reading to perform resync,
recovery, or check.
'md_trim_bio' will also be used for RAID10, so put it in md.c and
export it.
Signed-off-by: NeilBrown <neilb@suse.de>
Space must have been allocated when array was created.
A feature flag is set when the badblock list is non-empty, to
ensure old kernels don't load and trust the whole device.
We only update the on-disk badblocklist when it has changed.
If the badblocklist (or other metadata) is stored on a bad block, we
don't cope very well.
If metadata has no room for bad block, flag bad-blocks as disabled,
and do the same for 0.90 metadata.
Signed-off-by: NeilBrown <neilb@suse.de>
This can show the log (providing it fits in one page) and
allows bad blocks to be 'acknowledged' meaning that they
have safely been recorded in metadata.
Clearing bad blocks is not allowed via sysfs (except for
code testing). A bad block can only be cleared when
a write to the block succeeds.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
This the first step in allowing md to track bad-blocks per-device so
that we can fail individual blocks rather than the whole device.
This patch just adds a data structure for recording bad blocks, with
routines to add, remove, search the list.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
When calling bioset_create we pass the size of the front_pad as
sizeof(mddev)
which looks suspicious as mddev is a pointer and so it looks like a
common mistake where
sizeof(*mddev)
was intended.
The size is actually correct as we want to store a pointer in the
front padding of the bios created by the bioset, so make the intent
more explicit by using
sizeof(mddev_t *)
Reported-by: Zdenek Kabelac <zdenek.kabelac@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
This patch causes MD to generate an event (for device-mapper) when the
synchronization thread is reaped. This is expected behavior for device-mapper.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
page_address() returns void pointer, so the casts can be removed.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If we hit a read error while recovering a mirror, we want to abort the
recovery without necessarily failing the disk - as having a disk this
a read error is better than not having an array at all.
Currently this is managed with a per-array flag "recovery_disabled"
and is only implemented for RAID1. For RAID10 we will need finer
grained control as we might want to disable recovery for individual
devices separately.
So push more of the decision making into the personality.
'recovery_disabled' is now a 'cookie' which is copied when the
personality want to disable recovery and is changed when a device is
added to the array as this is used as a trigger to 'try recovery
again'.
This will allow RAID10 to get the control that it needs.
Signed-off-by: NeilBrown <neilb@suse.de>
Commit c89a8eee61 ("Allow faulty devices to be removed from a
readonly array.") added some work on ro array in the function,
but it couldn't be done since we didn't allow the ro array to be
handled from the beginning. Fix it.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
There are places where sysfs links to rdev are handled
in a same way. Add the helper functions to consolidate
them.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Moving the event counter into the dynamically allocated 'struc seq_file'
allows poll() support without the need to allocate its own tracking
structure.
All current users are switched over to use the new counter.
Requested-by: Andrew Morton akpm@linux-foundation.org
Acked-by: NeilBrown <neilb@suse.de>
Tested-by: Lucas De Marchi lucas.demarchi@profusion.mobi
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If a device fails in a way that causes pending request to take a while
to complete, md will not be able to immediately remove it from the
array in remove_and_add_spares.
It will then incorrectly look like a spare device and md will try to
recover it even though it is failed.
This leads to a recovery process starting and instantly aborting over
and over again.
We should check if the device is faulty before considering it to be a
spare. This will avoid trying to start a recovery that cannot
proceed.
This bug was introduced in 2.6.26 so that patch is suitable for any
kernel since then.
Cc: stable@kernel.org
Reported-by: Jim Paradis <james.paradis@stratus.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Add the 'sync_super' function pointer to MD array structure (struct mddev_s)
If device-mapper (dm-raid.c) is to define its own on-disk superblock and be
able to load it, there must still be a way for MD to initiate superblock
updates. The simplest way to make this happen is to provide a pointer in
the MD array structure that can be set by device-mapper (or other module)
with a function to do this. If the function has been set, it will be used;
otherwise, the method with be looked up via 'super_types' as usual.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Move personality and sync/recovery thread starting outside md_run.
Moving the wakeup's of the personality and sync/recovery threads out of
md_run and into do_md_run and mddev_resume solves two issues:
1) It allows bitmap_load to be called before the sync_thread is run and
2) when MD personalities are used by device-mapper (dm-raid.c), the start-up
of the array is better alligned with device-mapper primatives
(CTR/resume/suspend/DTR). I/O - in this case, recovery operations - should
not happen until after a resume has taken place.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Make message a bit clearer by s/blocks/k/
I chose 'k' vs 'kiB' or 'kB' because it is what is used earlier in the
message. 'k' may be a bit ambigous, but I think it's better than "blocks"
which normally means 512, but means 1024 in MD.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Disallow resync I/O while the RAID array is suspended.
Recovery, resync, and metadata I/O should not be allowed while a device is
suspended.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Don't attempt md_integrity_register if there is no gendisk struct available.
When MD arrays are built via device-mapper, the gendisk structure is not
available via mddev.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
The sysfs attribute 'resync_start' (known internally as recovery_cp),
records where a resync is up to. A value of 0 means the array is
not known to be in-sync at all. A value of MaxSector means the array
is believed to be fully in-sync.
When the size of member devices of an array (RAID1,RAID4/5/6) is
increased, the array can be increased to match. This process sets
resync_start to the old end-of-device offset so that the new part of
the array gets resynced.
However with RAID1 (and RAID6) a resync is not technically necessary
and may be undesirable. So it would be good if the implied resync
after the array is resized could be avoided.
So: change 'resync_start' so the value can be changed while the array
is active, and as a precaution only allow it to be changed while
resync/recovery is 'frozen'. Changing it once resync has started is
not going to be useful anyway.
This allows the array to be resized without a resync by:
write 'frozen' to 'sync_action'
write new size to 'component_size' (this will set resync_start)
write 'none' to 'resync_start'
write 'idle' to 'sync_action'.
Also slightly improve some tests on recovery_cp when resizing
raid1/raid5. Now that an arbitrary value could be set we should be
more careful in our tests.
Signed-off-by: NeilBrown <neilb@suse.de>
The 'add_new_disk' ioctl can be used to add a device either as a
spare, or as an active disk that just needs to be resynced based on
write-intent-bitmap information (re-add)
Currently if a re-add is requested but fails we add as a spare
instead. This makes it impossible for user-space to check for
failure.
So change to require that a re-add attempt will either succeed or
completely fail. User-space can then decide what to do next.
Signed-off-by: NeilBrown <neilb@suse.de>
There is a race when creating an md device by opening /dev/mdXX.
If two processes do this at much the same time they will follow the
call path
__blkdev_get -> get_gendisk -> kobj_lookup
The first will call
-> md_probe -> md_alloc -> add_disk -> blk_register_region
and the race happens when the second gets to kobj_lookup after
add_disk has called blk_register_region but before it returns to
md_alloc.
In the case the second will not call md_probe (as the probe is already
done) but will get a handle on the gendisk, return to __blkdev_get
which will then call md_open (via the ->open) pointer.
As mddev->gendisk hasn't been set yet, md_open will think something is
wrong an return with ERESTARTSYS.
This can loop endlessly while the first thread makes no progress
through add_disk. Nothing is blocking it, but due to scheduler
behaviour it doesn't get a turn.
So this is essentially a live-lock.
We fix this by simply moving the assignment to mddev->gendisk before
the call the add_disk() so md_open doesn't get confused.
Also move blk_queue_flush earlier because add_disk should be as late
as possible.
To make sure that md_open doesn't complete until md_alloc has done all
that is needed, we take mddev->open_mutex during the last part of
md_alloc. md_open will wait for this.
This can cause a lock-up on boot so Cc:ing for stable.
For 2.6.36 and earlier a different patch will be needed as the
'blk_queue_flush' call isn't there.
Signed-off-by: NeilBrown <neilb@suse.de>
Reported-by: Thomas Jarosch <thomas.jarosch@intra2net.com>
Tested-by: Thomas Jarosch <thomas.jarosch@intra2net.com>
Cc: stable@kernel.org
Problem:
After raid4->raid0 takeover operation, another takeover operation
(e.g raid0->raid10) results "kernel oops".
Root cause:
Variables 'degraded' in mddev structure is not cleared
on raid45->raid0 takeover.
This patch reset this variable.
Signed-off-by: Krzysztof Wojcik <krzysztof.wojcik@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When an md device adds a request to a queue, it can call
mddev_check_plugged.
If this succeeds then we know that the md thread will be woken up
shortly, and ->plug_cnt will be non-zero until then, so some
processing can be delayed.
If it fails, then no unplug callback is expected and the make_request
function needs to do whatever is required to make the request happen.
Signed-off-by: NeilBrown <neilb@suse.de>
md has some plugging infrastructure for RAID5 to use because the
normal plugging infrastructure required a 'request_queue', and when
called from dm, RAID5 doesn't have one of those available.
This relied on the ->unplug_fn callback which doesn't exist any more.
So remove all of that code, both in md and raid5. Subsequent patches
with restore the plugging functionality.
Signed-off-by: NeilBrown <neilb@suse.de>
We incorrectly returned -EINVAL when none of the devices in the array
had an integrity profile. This in turn prevented mdadm from starting
the metadevice. Fix this so we only return errors on mismatched
profiles and memory allocation failures.
Reported-by: Giacomo Catenazzi <cate@cateee.net>
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
Documentation/iostats.txt: bit-size reference etc.
cfq-iosched: removing unnecessary think time checking
cfq-iosched: Don't clear queue stats when preempt.
blk-throttle: Reset group slice when limits are changed
blk-cgroup: Only give unaccounted_time under debug
cfq-iosched: Don't set active queue in preempt
block: fix non-atomic access to genhd inflight structures
block: attempt to merge with existing requests on plug flush
block: NULL dereference on error path in __blkdev_get()
cfq-iosched: Don't update group weights when on service tree
fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
block: Require subsystems to explicitly allocate bio_set integrity mempool
jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
fs: make fsync_buffers_list() plug
mm: make generic_writepages() use plugging
blk-cgroup: Add unaccounted time to timeslice_used.
block: fixup plugging stubs for !CONFIG_BLOCK
block: remove obsolete comments for blkdev_issue_zeroout.
blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
...
Fix up conflicts in fs/{aio.c,super.c}
MD and DM create a new bio_set for every metadevice. Each bio_set has an
integrity mempool attached regardless of whether the metadevice is
capable of passing integrity metadata. This is a waste of memory.
Instead we defer the allocation decision to MD and DM since we know at
metadevice creation time whether integrity passthrough is needed or not.
Automatic integrity mempool allocation can then be removed from
bioset_create() and we make an explicit integrity allocation for the
fs_bio_set.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Acked-by: Mike Snitzer <snizer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* 'for-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: fix build failure introduced by s/freezeable/freezable/
workqueue: add system_freezeable_wq
rds/ib: use system_wq instead of rds_ib_fmr_wq
net/9p: replace p9_poll_task with a work
net/9p: use system_wq instead of p9_mux_wq
xfs: convert to alloc_workqueue()
reiserfs: make commit_wq use the default concurrency level
ocfs2: use system_wq instead of ocfs2_quota_wq
ext4: convert to alloc_workqueue()
scsi/scsi_tgt_lib: scsi_tgtd isn't used in memory reclaim path
scsi/be2iscsi,qla2xxx: convert to alloc_workqueue()
misc/iwmc3200top: use system_wq instead of dedicated workqueues
i2o: use alloc_workqueue() instead of create_workqueue()
acpi: kacpi*_wq don't need WQ_MEM_RECLAIM
fs/aio: aio_wq isn't used in memory reclaim path
input/tps6507x-ts: use system_wq instead of dedicated workqueue
cpufreq: use system_wq instead of dedicated workqueues
wireless/ipw2x00: use system_wq instead of dedicated workqueues
arm/omap: use system_wq in mailbox
workqueue: use WQ_MEM_RECLAIM instead of WQ_RESCUER
With the plugging now being explicitly controlled by the
submitter, callers need not pass down unplugging hints
to the block layer. If they want to unplug, it's because they
manually plugged on their own - in which case, they should just
unplug at will.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Code has been converted over to the new explicit on-stack plugging,
and delay users have been converted to use the new API for that.
So lets kill off the old plugging along with aops->sync_page().
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Revert
b821eaa572
and
f3b99be19d
When I wrote the first of these I had a wrong idea about the
lifetime of 'struct block_device'. It can disappear at any time that
the block device is not open if it falls out of the inode cache.
So relying on the 'size' recorded with it to detect when the
device size has changed and so we need to revalidate, is wrong.
Rather, we really do need the 'changed' attribute stored directly in
the mddev and set/tested as appropriate.
Without this patch, a sequence of:
mknod / open / close / unlink
(which can cause a block_device to be created and then destroyed)
will result in a rescan of the partition table and consequence removal
and addition of partitions.
Several of these in a row can get udev racing to create and unlink and
other code can get confused.
With the patch, the rescan is only performed when needed and so there
are no races.
This is suitable for any stable kernel from 2.6.35.
Reported-by: "Wojcik, Krzysztof" <krzysztof.wojcik@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org
'mdp' devices are md devices with preallocated device numbers
for partitions. As such it is possible to mknod and open a partition
before opening the whole device.
this causes md_probe() to be called with a device number of a
partition, which in-turn calls mddev_find with such a number.
However mddev_find expects the number of a 'whole device' and
does the wrong thing with partition numbers.
So add code to mddev_find to remove the 'partition' part of
a device number and just work with the 'whole device'.
This patch addresses https://bugzilla.kernel.org/show_bug.cgi?id=28652
Reported-by: hkmaly@bigfoot.com
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: <stable@kernel.org>
If the desired size of an array is set (via sysfs) before the array is
active (which is the normal sequence), we currrently call set_capacity
immediately.
This means that a subsequent 'open' (as can be caused by some
udev-triggers program) will notice the new size and try to probe for
partitions. However as the array isn't quite ready yet the read will
fail. Then when the array is read, as the size doesn't change again
we don't try to re-probe.
So when setting array size via sysfs, only call set_capacity if the
array is already active.
Signed-off-by: NeilBrown <neilb@suse.de>
md_make_request was calling bio_sectors() for part_stat_add
after it was calling the make_request function. This is
bad because the make_request function can free the bio and
because the bi_size field can change around.
The fix here was suggested by Jens Axboe. It saves the
sector count before the make_request call. I hit this
with CONFIG_DEBUG_PAGEALLOC turned on while trying to break
his pretty fusionio card.
Cc: <stable@kernel.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Activating a spare in an array while resync/recovery is already
happening can lead the that spare being marked in-sync when it isn't
really.
So don't allow the 'slot' to be set (this activating the device)
while resync/recovery is happening.
Signed-off-by: NeilBrown <neilb@suse.de>
There is no need to set this to zero at this point. It will be
set to zero by remove_and_add_spares or at the start of
md_do_sync at the latest.
And setting it to zero before MD_RECOVERY_RUNNING is cleared can
make a 'zero' appear briefly in the 'sync_completed' sysfs attribute
just as resync is finishing.
So simply remove this setting to zero.
Signed-off-by: NeilBrown <neilb@suse.de>
remove_and_add_spares is called in two places where the needs really
are very different.
remove_and_add_spares should not be called on an array which is about
to be reshaped as some extra devices might have been manually added
and that would remove them. However if the array is 'read-auto',
that will currently happen, which is bad.
So in the 'ro != 0' case don't call remove_and_add_spares but simply
remove the failed devices as the comment suggests is needed.
Signed-off-by: NeilBrown <neilb@suse.de>
This flag is not needed and is used badly.
Devices that are included in a native-metadata array are reserved
exclusively for that array - and currently have AllReserved set.
They all are bd_claimed for the rdev and so cannot be shared.
Devices that are included in external-metadata arrays can be shared
among multiple arrays - providing there is no overlap.
These are bd_claimed for md in general - not for a particular rdev.
When changing the amount of a device that is used in an array we need
to check for overlap. This currently includes a check on AllReserved
So even without overlap, sharing with an AllReserved device is not
allowed.
However the bd_claim usage already precludes sharing with these
devices, so the test on AllReserved is not needed. And in fact it is
wrong.
As this is the only use of AllReserved, simply remove all usage and
definition of AllReserved.
Signed-off-by: NeilBrown <neilb@suse.de>
If we try to update_raid_disks and it fails, we should put
'delta_disks' back to zero. This is important because some code,
such as slot_store, assumes that delta_disks has been validated.
Signed-off-by: NeilBrown <neilb@suse.de>
WQ_RESCUER is now an internal flag and should only be used in the
workqueue implementation proper. Use WQ_MEM_RECLAIM instead.
This doesn't introduce any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: dm-devel@redhat.com
Cc: Neil Brown <neilb@suse.de>
Commit e09b457b (block: simplify holder symlink handling) incorrectly
assumed that there is only one link at maximum. dm may use multiple
links and expects block layer to track reference count for each link,
which is different from and unrelated to the exclusive device holder
identified by @holder when the device is opened.
Remove the single holder assumption and automatic removal of the link
and revive the per-link reference count tracking. The code
essentially behaves the same as before commit e09b457b sans the
unnecessary kobject reference count dancing.
While at it, note that this facility should not be used by anyone else
than the current ones. Sysfs symlinks shouldn't be abused like this
and the whole thing doesn't belong in the block layer at all.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Milan Broz <mbroz@redhat.com>
Cc: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Neil Brown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* 'for-linus' of git://neil.brown.name/md:
md: Fix removal of extra drives when converting RAID6 to RAID5
md: range check slot number when manually adding a spare.
md/raid5: handle manually-added spares in start_reshape.
md: fix sync_completed reporting for very large drives (>2TB)
md: allow suspend_lo and suspend_hi to decrease as well as increase.
md: Don't let implementation detail of curr_resync leak out through sysfs.
md: separate meta and data devs
md-new-param-to_sync_page_io
md-new-param-to-calc_dev_sboffset
md: Be more careful about clearing flags bit in ->recovery
md: md_stop_writes requires mddev_lock.
md/raid5: use sysfs_notify_dirent_safe to avoid NULL pointer
md: Ensure no IO request to get md device before it is properly initialised.
md: Fix single printks with multiple KERN_<level>s
md: fix regression resulting in delays in clearing bits in a bitmap
md: fix regression with re-adding devices to arrays with no metadata
When a RAID6 is converted to a RAID5, the extra drive should
be discarded. However it isn't due to a typo in a comparison.
This bug was introduced in commit e93f68a1fc in 2.6.35-rc4
and is suitable for any -stable since than.
As the extra drive is not removed, the 'degraded' counter is wrong and
so the RAID5 will not respond correctly to a subsequent failure.
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
When adding a spare to an active array, we should check the slot
number, but allow it to be larger than raid_disks if a reshape
is being prepared.
Apply the same test when adding a device to an
array-under-construction. It already had most of the test in place,
but not quite all.
Signed-off-by: NeilBrown <neilb@suse.de>
The values exported in the sync_completed file are unsigned long, which
overflows with very large drives, resulting in wrong values reported.
Since sync_completed uses sectors as unit, we'll start getting wrong
values with components larger than 2TB.
This patch simply replaces the use of unsigned long by unsigned long long.
Signed-off-by: Rémi Rérolle <rrerolle@lacie.com>
Signed-off-by: NeilBrown <neilb@suse.de>
The sysfs attributes 'suspend_lo' and 'suspend_hi' describe a region
to which read/writes are suspended so that the under lying data can be
manipulated without user-space noticing.
Currently the window they describe can only move forwards along the
device. However this is an unnecessary restriction which will cause
problems with planned developments.
So relax this restriction and allow these endpoints to move
arbitrarily.
Signed-off-by: NeilBrown <neilb@suse.de>
mddev->curr_resync has artificial values of '1' and '2' which are used
by the code which ensures only one resync is happening at a time on
any given device.
These values are internal and should never be exposed to user-space
(except when translated appropriately as in the 'pending' status in
/proc/mdstat).
Unfortunately they are as ->curr_resync is assigned to
->curr_resync_completed and that value is directly visible through
sysfs.
So change the assignments to ->curr_resync_completed to get the same
valued from elsewhere in a form that doesn't have the magic '1' or '2'
values.
Signed-off-by: NeilBrown <neilb@suse.de>
Allow the metadata to be on a separate device from the
data.
This doesn't mean the data and metadata will by on separate
physical devices - it simply gives device-mapper and userspace
tools more flexibility.
Signed-off-by: NeilBrown <neilb@suse.de>
Add new parameter to 'sync_page_io'.
The new parameter allows us to distinguish between metadata and data
operations. This becomes important later when we add the ability to
use separate devices for data and metadata.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
When we allow for separate devices for data and metadata
in a later patch, we will need to be able to calculate
the superblock offset based on more than the bdev.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Setting ->recovery to 0 is generally not a good idea as it could clear
bits that shouldn't be cleared. In particular, MD_RECOVERY_FROZEN
should only be cleared on explicit request from user-space.
So when we need to clear things, just clear the bits that need
clearing.
As there are a few different places which reap a resync process - and
some do an incomplte job - factor out the code for doing the from
md_check_recovery and call that function instead of open coding part
of it.
Signed-off-by: NeilBrown <neilb@suse.de>
Reported-by: Jonathan Brassow <jbrassow@redhat.com>
As md_stop_writes manipulates the sync_thread and calls md_update_sb,
it need to be called with mddev_lock held.
In all internal cases it is, but the symbol is exported for dm-raid to
call and in that case the lock won't be help.
Do make an exported version which takes the lock, and an internal
version which does not.
Signed-off-by: NeilBrown <neilb@suse.de>
When an md device is in the process of coming on line it is possible
for an IO request (typically a partition table probe) to get through
before the array is fully initialised, which can cause unexpected
behaviour (e.g. a crash).
So explicitly record when the array is ready for IO and don't allow IO
through until then.
There is no possibility for a similar problem when the array is going
off-line as there must only be one 'open' at that time, and it is busy
off-lining the array and so cannot send IO requests. So no memory
barrier is needed in md_stop()
This has been a bug since commit 409c57f380 in 2.6.30 which
introduced md_make_request. Before then, each personality would
register its own make_request_fn when it was ready.
This is suitable for any stable kernel from 2.6.30.y onwards.
Cc: <stable@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Reported-by: "Hawrylewicz Czarnowski, Przemyslaw" <przemyslaw.hawrylewicz.czarnowski@intel.com>
commit 589a594be1 (2.6.37-rc4) fixed a problem were md_thread would
sometimes call the ->run function at a bad time.
If an error is detected during array start up after the md_thread has
been started, the md_thread is killed. This resulted in the ->run
function being called once. However the array may not be in a state
that it is safe to call ->run.
However the fix imposed meant that ->run was not called on a timeout.
This means that when an array goes idle, bitmap bits do not get
cleared promptly. While the array is busy the bits will still be
cleared when appropriate so this is not very serious. There is no
risk to data.
Change the test so that we only avoid calling ->run when the thread
is being stopped. This more explicitly addresses the problem situation.
This is suitable for 2.6.37-stable and any -stable kernel to which
589a594be1 was applied.
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
* 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits)
block: ensure that completion error gets properly traced
blktrace: add missing probe argument to block_bio_complete
block cfq: don't use atomic_t for cfq_group
block cfq: don't use atomic_t for cfq_queue
block: trace event block fix unassigned field
block: add internal hd part table references
block: fix accounting bug on cross partition merges
kref: add kref_test_and_get
bio-integrity: mark kintegrityd_wq highpri and CPU intensive
block: make kblockd_workqueue smarter
Revert "sd: implement sd_check_events()"
block: Clean up exit_io_context() source code.
Fix compile warnings due to missing removal of a 'ret' variable
fs/block: type signature of major_to_index(int) to major_to_index(unsigned)
block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p)
cfq-iosched: don't check cfqg in choose_service_tree()
fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors
cdrom: export cdrom_check_events()
sd: implement sd_check_events()
sr: implement sr_check_events()
...
Commit 1a855a0606 (2.6.37-rc4) fixed a problem where devices were
re-added when they shouldn't be but caused a regression in a less
common case that means sometimes devices cannot be re-added when they
should be.
In particular, when re-adding a device to an array without metadata
we should always access the device, but after the above commit we
didn't.
This patch sets the In_sync flag in that case so that the re-add
succeeds.
This patch is suitable for any -stable kernel to which 1a855a0606 was
applied.
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
cciss: fix cciss_revalidate panic
block: max hardware sectors limit wrapper
block: Deprecate QUEUE_FLAG_CLUSTER and use queue_limits instead
blk-throttle: Correct the placement of smp_rmb()
blk-throttle: Trim/adjust slice_end once a bio has been dispatched
block: check for proper length of iov entries earlier in blk_rq_map_user_iov()
drbd: fix for spin_lock_irqsave in endio callback
drbd: don't recvmsg with zero length
When stacking devices, a request_queue is not always available. This
forced us to have a no_cluster flag in the queue_limits that could be
used as a carrier until the request_queue had been set up for a
metadevice.
There were several problems with that approach. First of all it was up
to the stacking device to remember to set queue flag after stacking had
completed. Also, the queue flag and the queue limits had to be kept in
sync at all times. We got that wrong, which could lead to us issuing
commands that went beyond the max scatterlist limit set by the driver.
The proper fix is to avoid having two flags for tracking the same thing.
We deprecate QUEUE_FLAG_CLUSTER and use the queue limit directly in the
block layer merging functions. The queue_limit 'no_cluster' is turned
into 'cluster' to avoid double negatives and to ease stacking.
Clustering defaults to being enabled as before. The queue flag logic is
removed from the stacking function, and explicitly setting the cluster
flag is no longer necessary in DM and MD.
Reported-by: Ed Lin <ed.lin@promise.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
When we fail to start a raid10 for some reason, we call
md_unregister_thread to kill the thread that was created.
Unfortunately md_thread() will then make one call into the handler
(raid10d) even though md_wakeup_thread has not been called. This is
not safe and as md_unregister_thread is called after mddev->private
has been set to NULL, it will definitely cause a NULL dereference.
So fix this at both ends:
- md_thread should only call the handler if THREAD_WAKEUP has been
set.
- raid10 should call md_unregister_thread before setting things
to NULL just like all the other raid modules do.
This is applicable to 2.6.35 and later.
Cc: stable@kernel.org
Reported-by: "Citizen" <citizen_lee@thecus.com>
Signed-off-by: NeilBrown <neilb@suse.de>
With v0.90 metadata, a hot-spare does not become a full member of the
array until recovery is complete. So if we re-add such a device to
the array, we know that all of it is as up-to-date as the event count
would suggest, and so it a bitmap-based recovery is possible.
However with v1.x metadata, the hot-spare immediately becomes a full
member of the array, but it record how much of the device has been
recovered. If the array is stopped and re-assembled recovery starts
from this point.
When such a device is hot-added to an array we currently lose the 'how
much is recovered' information and incorrectly included it as a full
in-sync member (after bitmap-based fixup).
This is wrong and unsafe and could corrupt data.
So be more careful about setting saved_raid_disk - which is what
guides the re-adding of devices back into an array.
The new code matches the code in slot_store which does a similar
thing, which is encouraging.
This is suitable for any -stable kernel.
Reported-by: "Dailey, Nate" <Nate.Dailey@stratus.com>
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
As recorded in
https://bugzilla.kernel.org/show_bug.cgi?id=24012
it is possible for a flush request through md to hang. This is due to
an interaction between the recursion avoidance in
generic_make_request, the insistence in md of only having one flush
active at a time, and the possibility of dm (or md) submitting two
flush requests to a device from the one generic_make_request.
If a generic_make_request call into dm causes two flush requests to be
queued (as happens if the dm table has two targets - they get one
each), these two will be queued inside generic_make_request.
Assume they are for the same md device.
The first is processed and causes 1 or more flush requests to be sent
to lower devices. These get queued within generic_make_request too.
Then the second flush to the md device gets handled and it blocks
waiting for the first flush to complete. But it won't complete until
the two lower-device requests complete, and they haven't even been
submitted yet as they are on the generic_make_request queue.
The deadlock can be broken by using a separate thread to submit the
requests to lower devices. md has such a thread readily available:
md_wq.
So use it to submit these requests.
Reported-by: Giacomo Catenazzi <cate@cateee.net>
Tested-by: Giacomo Catenazzi <cate@cateee.net>
Signed-off-by: NeilBrown <neilb@suse.de>
submit_flushes is called from exactly one place.
Move the code that is before and after that call into
submit_flushes.
This has not functional change, but will make the next patch
smaller and easier to follow.
Signed-off-by: NeilBrown <neilb@suse.de>
None of the functions called between setting flush_pending to 1, and
atomic_dec_and_test can change flush_pending, or will anything
running in any other thread (as ->flush_bio is not NULL). So the
atomic_dec_and_test will always succeed.
So remove the atomic_sec and the atomic_dec_and_test.
Signed-off-by: NeilBrown <neilb@suse.de>
Before 2.6.37, the md layer had a mechanism for catching I/Os with the
barrier flag set, and translating the barrier into barriers for all
the underlying devices. With 2.6.37, I/O barriers have become plain
old flushes, and the md code was updated to reflect this. However,
one piece was left out -- the md layer does not tell the block layer
that it supports flushes or FUA access at all, which results in md
silently dropping flush requests.
Since the support already seems there, just add this one piece of
bookkeeping.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When trying to grow an array by enlarging component devices,
rdev_size_store() expects the return value of rdev_size_change() to be
in sectors, but the actual value is returned in KBs.
This functionality was broken by commit
dd8ac336c1
so this patch is suitable for any kernel since 2.6.30.
Cc: stable@kernel.org
Signed-off-by: Justin Maggard <jmaggard10@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
After recent blkdev_get() modifications, open_by_devnum() and
open_bdev_exclusive() are simple wrappers around blkdev_get().
Replace them with blkdev_get_by_dev() and blkdev_get_by_path().
blkdev_get_by_dev() is identical to open_by_devnum().
blkdev_get_by_path() is slightly different in that it doesn't
automatically add %FMODE_EXCL to @mode.
All users are converted. Most conversions are mechanical and don't
introduce any behavior difference. There are several exceptions.
* btrfs now sets FMODE_EXCL in btrfs_device->mode, so there's no
reason to OR it explicitly on blkdev_put().
* gfs2, nilfs2 and the generic mount_bdev() now set FMODE_EXCL in
sb->s_mode.
* With the above changes, sb->s_mode now always should contain
FMODE_EXCL. WARN_ON_ONCE() added to kill_block_super() to detect
errors.
The new blkdev_get_*() functions are with proper docbook comments.
While at it, add function description to blkdev_get() too.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Joern Engel <joern@lazybastard.org>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Jan Kara <jack@suse.cz>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: reiserfs-devel@vger.kernel.org
Cc: xfs-masters@oss.sgi.com
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Over time, block layer has accumulated a set of APIs dealing with bdev
open, close, claim and release.
* blkdev_get/put() are the primary open and close functions.
* bd_claim/release() deal with exclusive open.
* open/close_bdev_exclusive() are combination of open and claim and
the other way around, respectively.
* bd_link/unlink_disk_holder() to create and remove holder/slave
symlinks.
* open_by_devnum() wraps bdget() + blkdev_get().
The interface is a bit confusing and the decoupling of open and claim
makes it impossible to properly guarantee exclusive access as
in-kernel open + claim sequence can disturb the existing exclusive
open even before the block layer knows the current open if for another
exclusive access. Reorganize the interface such that,
* blkdev_get() is extended to include exclusive access management.
@holder argument is added and, if is @FMODE_EXCL specified, it will
gain exclusive access atomically w.r.t. other exclusive accesses.
* blkdev_put() is similarly extended. It now takes @mode argument and
if @FMODE_EXCL is set, it releases an exclusive access. Also, when
the last exclusive claim is released, the holder/slave symlinks are
removed automatically.
* bd_claim/release() and close_bdev_exclusive() are no longer
necessary and either made static or removed.
* bd_link_disk_holder() remains the same but bd_unlink_disk_holder()
is no longer necessary and removed.
* open_bdev_exclusive() becomes a simple wrapper around lookup_bdev()
and blkdev_get(). It also has an unexpected extra bdev_read_only()
test which probably should be moved into blkdev_get().
* open_by_devnum() is modified to take @holder argument and pass it to
blkdev_get().
Most of bdev open/close operations are unified into blkdev_get/put()
and most exclusive accesses are tested atomically at the open time (as
it should). This cleans up code and removes some, both valid and
invalid, but unnecessary all the same, corner cases.
open_bdev_exclusive() and open_by_devnum() can use further cleanup -
rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop
special features. Well, let's leave them for another day.
Most conversions are straight-forward. drbd conversion is a bit more
involved as there was some reordering, but the logic should stay the
same.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Brown <neilb@suse.de>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Peter Osterlund <petero2@telia.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Alex Elder <aelder@sgi.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: dm-devel@redhat.com
Cc: drbd-dev@lists.linbit.com
Cc: Leo Chen <leochen@broadcom.com>
Cc: Scott Branden <sbranden@broadcom.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: Joern Engel <joern@logfs.org>
Cc: reiserfs-devel@vger.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Code to manage symlinks in /sys/block/*/{holders|slaves} are overly
complex with multiple holder considerations, redundant extra
references to all involved kobjects, unused generic kobject holder
support and unnecessary mixup with bd_claim/release functionalities.
Strip it down to what's necessary (single gendisk holder) and make it
use a separate interface. This is a step for cleaning up
bd_claim/release. This patch makes dm-table slightly more complex but
it will be simplified again with further changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Brown <neilb@suse.de>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Convert direct reads of an inode's i_size to using i_size_read().
i_size_{read,write} use a seqcount to protect reads from accessing
incomple writes. Concurrent i_size_write()s require mutual exclussion
to protect the seqcount that is used by i_size_{read,write}. But
i_size_read() callers do not need to use additional locking.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: NeilBrown <neilb@suse.de>
Acked-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
bio_clone and bio_alloc allocate from a common bio pool.
If an md device is stacked with other devices that use this pool, or under
something like swap which uses the pool, then the multiple calls on
the pool can cause deadlocks.
So allocate a local bio pool for each md array and use that rather
than the common pool.
This pool is used both for regular IO and metadata updates.
Signed-off-by: NeilBrown <neilb@suse.de>
Currently sync_page_io takes a 'bdev'.
Every caller passes 'rdev->bdev'.
We will soon want another field out of the rdev in sync_page_io,
So just pass the rdev instead of the bdev out of it.
Signed-off-by: NeilBrown <neilb@suse.de>
Workqueue usage in md has two problems.
* Flush can be used during or depended upon by memory reclaim, but md
uses the system workqueue for flush_work which may lead to deadlock.
* md depends on flush_scheduled_work() to achieve exclusion against
completion of removal of previous instances. flush_scheduled_work()
may incur unexpected amount of delay and is scheduled to be removed.
This patch adds two workqueues to md - md_wq and md_misc_wq. The
former is guaranteed to make forward progress under memory pressure
and serves flush_work. The latter serves as the flush domain for
other works.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
lock_kernel calls were recently pushed down into open/release
functions.
md doesn't need that protection.
Then the BKL calls were change to md_mutex. We don't need those
either.
So remove it all.
Signed-off-by: NeilBrown <neilb@suse.de>
A RAID1 which has no persistent metadata, whether internal or
external, will hang on the first write.
This is caused by commit 070dc6dd71
In that case, MD_CHANGE_PENDING never gets cleared.
So during md_update_sb, is neither persistent or external,
clear MD_CHANGE_PENDING.
This is suitable for 2.6.36-stable.
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org
* 'for-2.6.37/barrier' of git://git.kernel.dk/linux-2.6-block: (46 commits)
xen-blkfront: disable barrier/flush write support
Added blk-lib.c and blk-barrier.c was renamed to blk-flush.c
block: remove BLKDEV_IFL_WAIT
aic7xxx_old: removed unused 'req' variable
block: remove the BH_Eopnotsupp flag
block: remove the BLKDEV_IFL_BARRIER flag
block: remove the WRITE_BARRIER flag
swap: do not send discards as barriers
fat: do not send discards as barriers
ext4: do not send discards as barriers
jbd2: replace barriers with explicit flush / FUA usage
jbd2: Modify ASYNC_COMMIT code to not rely on queue draining on barrier
jbd: replace barriers with explicit flush / FUA usage
nilfs2: replace barriers with explicit flush / FUA usage
reiserfs: replace barriers with explicit flush / FUA usage
gfs2: replace barriers with explicit flush / FUA usage
btrfs: replace barriers with explicit flush / FUA usage
xfs: replace barriers with explicit flush / FUA usage
block: pass gfp_mask and flags to sb_issue_discard
dm: convey that all flushes are processed as empty
...
* 'trivial' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl:
block: autoconvert trivial BKL users to private mutex
drivers: autoconvert trivial BKL users to private mutex
ipmi: autoconvert trivial BKL users to private mutex
mac: autoconvert trivial BKL users to private mutex
mtd: autoconvert trivial BKL users to private mutex
scsi: autoconvert trivial BKL users to private mutex
Fix up trivial conflicts (due to addition of private mutex right next to
deletion of a version string) in drivers/char/pcmcia/cm40[04]0_cs.c
The block device drivers have all gained new lock_kernel
calls from a recent pushdown, and some of the drivers
were already using the BKL before.
This turns the BKL into a set of per-driver mutexes.
Still need to check whether this is safe to do.
file=$1
name=$2
if grep -q lock_kernel ${file} ; then
if grep -q 'include.*linux.mutex.h' ${file} ; then
sed -i '/include.*<linux\/smp_lock.h>/d' ${file}
else
sed -i 's/include.*<linux\/smp_lock.h>.*$/include <linux\/mutex.h>/g' ${file}
fi
sed -i ${file} \
-e "/^#include.*linux.mutex.h/,$ {
1,/^\(static\|int\|long\)/ {
/^\(static\|int\|long\)/istatic DEFINE_MUTEX(${name}_mutex);
} }" \
-e "s/\(un\)*lock_kernel\>[ ]*()/mutex_\1lock(\&${name}_mutex)/g" \
-e '/[ ]*cycle_kernel_lock();/d'
else
sed -i -e '/include.*\<smp_lock.h\>/d' ${file} \
-e '/cycle_kernel_lock()/d'
fi
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
If an array with 1.x metadata is assembled with the last disk missing,
md doesn't properly record the fact that the disk was missing.
This is unlikely to cause a real problem as the event count will be
different to the count on the missing disk so it won't be included in
the array. However it could still cause confusion.
So make sure we clear all the relevant slots, not just the early ones.
Signed-off-by: NeilBrown <neilb@suse.de>
Now that we depend on md_update_sb to clear variable bits in
mddev->flags (rather than trying not to set them) it is important to
always call md_update_sb when appropriate.
md_check_recovery has this job but explicitly avoids it for ->external
metadata arrays. This is not longer appropraite, or needed.
However we do want to avoid taking the mddev lock if only
MD_CHANGE_PENDING is set as that is not cleared by md_update_sb for
external-metadata arrays.
Reported-by: "Kwolek, Adam" <adam.kwolek@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
This patch converts md to support REQ_FLUSH/FUA instead of now
deprecated REQ_HARDBARRIER. In the core part (md.c), the following
changes are notable.
* Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA don't interfere with
processing of other requests and thus there is no reason to mark the
queue congested while FLUSH/FUA is in progress.
* REQ_FLUSH/FUA failures are final and its users don't need retry
logic. Retry logic is removed.
* Preflush needs to be issued to all member devices but FUA writes can
be handled the same way as other writes - their processing can be
deferred to request_queue of member devices. md_barrier_request()
is renamed to md_flush_request() and simplified accordingly.
For linear, raid0 and multipath, the core changes are enough. raid1,
5 and 10 need the following conversions.
* raid1: Handling of FLUSH/FUA bio's can simply be deferred to
request_queues of member devices. Barrier related logic removed.
* raid5: Queue draining logic dropped. FUA bit is propagated through
biodrain and stripe resconstruction such that all the updated parts
of the stripe are written out with FUA writes if any of the dirtying
writes was FUA. preread_active_stripes handling in make_request()
is updated as suggested by Neil Brown.
* raid10: FUA bit needs to be propagated to write clones.
linear, raid0, 1, 5 and 10 tested.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
MD_CHANGE_CLEAN is used for two different purposes and this leads to
confusion.
One of the purposes is largely mirrored by MD_CHANGE_PENDING which is
not used for anything else, so have MD_CHANGE_PENDING take over that
purpose fully.
The two purposes are:
1/ tell md_update_sb that an update is needed and that it is just a
clean/dirty transition.
2/ tell user-space that an transition from clean to dirty is pending
(something wants to write), and tell te kernel (by clearin the
flag) that the transition is OK.
The first purpose remains wit MD_CHANGE_CLEAN, the second is moved
fully to MD_CHANGE_PENDING.
This means that various places which conditionally set or cleared
MD_CHANGE_CLEAN no longer need to be conditional.
Signed-off-by: NeilBrown <neilb@suse.de>
If this bit is cleared in md_update_sb() the kernel will allow writes to the
array if userspace triggers md_allow_write(), e.g. through stripe_cache_size,
when mdmon is not active. When mdmon is active the array transitions to
active-idle bypassing write-pending, setting up a race for mdmon to set the
array clean before a write arrives.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
The update of ->recovery_offset in sync_sbs is appropriate even then external
metadata is in use. However sync_sbs is only called when native
metadata is used.
So move that update in to the top of md_update_sb (which is the only
caller of sync_sbs) before the test on ->external.
This moves the update out of ->write_lock protection, but those fields
only need ->reconfig_mutex protection which they still have.
Also move the test on ->persistent up to where ->external is set as
for metadata update purposes they are the same.
Clear MD_CHANGE_DEVS and MD_CHANGE_CLEAN as they can only be confusing
if ->external is set or ->persistent isn't.
Finally move the update of ->utime down as it is only relevent (like
the ->events update) for native metadata.
Signed-off-by: NeilBrown <neilb@suse.de>
Reported-by: "Kwolek, Adam" <adam.kwolek@intel.com>
* 'for-linus' of git://neil.brown.name/md: (24 commits)
md: clean up do_md_stop
md: fix another deadlock with removing sysfs attributes.
md: move revalidate_disk() back outside open_mutex
md/raid10: fix deadlock with unaligned read during resync
md/bitmap: separate out loading a bitmap from initialising the structures.
md/bitmap: prepare for storing write-intent-bitmap via dm-dirty-log.
md/bitmap: optimise scanning of empty bitmaps.
md/bitmap: clean up plugging calls.
md/bitmap: reduce dependence on sysfs.
md/bitmap: white space clean up and similar.
md/raid5: export raid5 unplugging interface.
md/plug: optionally use plugger to unplug an array during resync/recovery.
md/raid5: add simple plugging infrastructure.
md/raid5: export is_congested test
raid5: Don't set read-ahead when there is no queue
md: add support for raising dm events.
md: export various start/stop interfaces
md: split out md_rdev_init
md: be more careful setting MD_CHANGE_CLEAN
md/raid5: ensure we create a unique name for kmem_cache when mddev has no gendisk
...
There is only one error exit from do_md_stop, so make that more
explicit and discard the 'err' variable.
Also drop the 'revalidate' variable by moving the unlock calls around.
Signed-off-by: NeilBrown <neilb@suse.de>
Move the deletion of sysfs attributes from reconfig_mutex to
open_mutex didn't really help as a process can try to take
open_mutex while holding reconfig_mutex, so the same deadlock can
happen, just requiring one more process to be involved in the chain.
I looks like I cannot easily use locking to wait for the sysfs
deletion to complete, so don't.
The only things that we cannot do while the deletions are still
pending is other things which can change the sysfs namespace: run,
takeover, stop. Each of these can fail with -EBUSY.
So set a flag while doing a sysfs deletion, and fail run, takeover,
stop if that flag is set.
This is suitable for 2.6.35.x
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
Commit b821eaa5 "md: remove ->changed and related code" moved
revalidate_disk() under open_mutex, and lockdep noticed.
[ INFO: possible circular locking dependency detected ]
2.6.32-mdadm-locking #1
-------------------------------------------------------
mdadm/3640 is trying to acquire lock:
(&bdev->bd_mutex){+.+.+.}, at: [<ffffffff811acecb>] revalidate_disk+0x5b/0x90
but task is already holding lock:
(&mddev->open_mutex){+.+...}, at: [<ffffffffa055e07a>] do_md_stop+0x4a/0x4d0 [md_mod]
which lock already depends on the new lock.
It is suitable for 2.6.35.x
Cc: <stable@kernel.org>
Reported-by: Przemyslaw Czarnowski <przemyslaw.hawrylewicz.czarnowski@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
The open and release block_device_operations are currently
called with the BKL held. In order to change that, we must
first make sure that all drivers that currently rely
on this have no regressions.
This blindly pushes the BKL into all .open and .release
operations for all block drivers to prepare for the
next step. The drivers can subsequently replace the BKL
with their own locks or remove it completely when it can
be shown that it is not needed.
The functions blkdev_get and blkdev_put are the only
remaining users of the big kernel lock in the block
layer, besides a few uses in the ioctl code, none
of which need to serialize with blkdev_{get,put}.
Most of these two functions is also under the protection
of bdev->bd_mutex, including the actual calls to
->open and ->release, and the common code does not
access any global data structures that need the BKL.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Remove the current bio flags and reuse the request flags for the bio, too.
This allows to more easily trace the type of I/O from the filesystem
down to the block driver. There were two flags in the bio that were
missing in the requests: BIO_RW_UNPLUG and BIO_RW_AHEAD. Also I've
renamed two request flags that had a superflous RW in them.
Note that the flags are in bio.h despite having the REQ_ name - as
blkdev.h includes bio.h that is the only way to go for now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
dm makes this distinction between ->ctr and ->resume, so we need to
too.
Also get the new bitmap_load to clear out the bitmap first, as this is
most consistent with the dm suspend/resume approach
Signed-off-by: NeilBrown <neilb@suse.de>
1/ use md_unplug in bitmap.c as we will soon be using bitmaps under
arrays with no queue attached.
2/ Don't bother plugging the queue when we set a bit in the bitmap.
The reason for this was to encourage as many bits as possible to
get set before we unplug and write stuff out.
However every personality already plugs the queue after
bitmap_startwrite either directly (raid1/raid10) or be setting
STRIPE_BIT_DELAY which causes the queue to be plugged later
(raid5).
Signed-off-by: NeilBrown <neilb@suse.de>
If an array doesn't have a 'queue' then md_do_sync cannot
unplug it.
In that case it will have a 'plugger', so make that available
to the mddev, and use it to unplug the array if needed.
Signed-off-by: NeilBrown <neilb@suse.de>
md/raid5 uses the plugging infrastructure provided by the block layer
and 'struct request_queue'. However when we plug raid5 under dm there
is no request queue so we cannot use that.
So create a similar infrastructure that is much lighter weight and use
it for raid5.
Signed-off-by: NeilBrown <neilb@suse.de>
dm uses scheduled work to raise events to user-space.
So allow md device to have work_structs and schedule them on an error.
Signed-off-by: NeilBrown <neilb@suse.de>
export entry points for starting and stopping md arrays.
This will be used by a module to make md/raid5 work under
dm.
Also stop calling md_stop_writes from md_stop, as that won't
work well with dm - it will want to call the two separately.
Signed-off-by: NeilBrown <neilb@suse.de>
This functionality will be needed separately in a subsequent patch, so
split it into it's own exported function.
Signed-off-by: NeilBrown <neilb@suse.de>
When MD_CHANGE_CLEAN is set we might block in md_write_start.
So we should only set it when fairly sure that something will clear
it.
There are two places where it is set so as to encourage a metadata
update to record the progress of resync/recovery. This should only
be done if the internal metadata update mechanisms are in use, which
can be tested by by inspecting '->persistent'.
Signed-off-by: NeilBrown <neilb@suse.de>
We will want md devices to live as dm targets where sysfs is not
visible. So allow md to not connect to sysfs.
Signed-off-by: NeilBrown <neilb@suse.de>
When an array is reshaped to have fewer devices, the reshape proceeds
from the end of the devices to the beginning.
If a device happens to be non-In_sync (which is possible but rare)
we would normally update the ->recovery_offset as the reshape
progresses. However that would be wrong as the recover_offset records
that the early part of the device is in_sync, while in fact it would
only be the later part that is in_sync, and in any case the offset
number would be measured from the wrong end of the device.
Relatedly, if after a reshape a spare is discovered to not be
recoverred all the way to the end, not allow spare_active
to incorporate it in the array.
This becomes relevant in the following sample scenario:
A 4 drive RAID5 is converted to a 6 drive RAID6 in a combined
operation.
The RAID5->RAID6 conversion will cause a 5 drive to be included as a
spare, then the 5drive -> 6drive reshape will effectively rebuild that
spare as it progresses. The 6th drive is treated as in_sync the whole
time as there is never any case that we might consider reading from
it, but must not because there is no valid data.
If we interrupt this reshape part-way through and reverse it to return
to a 5-drive RAID6 (or event a 4-drive RAID5), we don't want to update
the recovery_offset - as that would be wrong - and we don't want to
include that spare as active in the 5-drive RAID6 when the reversed
reshape completed and it will be mostly out-of-sync still.
Signed-off-by: NeilBrown <neilb@suse.de>
Most array level changes leave the list of devices largely unchanged,
possibly causing one at the end to become redundant.
However conversions between RAID0 and RAID10 need to renumber
all devices (except 0).
This renumbering is currently being done in the ->run method when the
new personality takes over. However this is too late as the common
code in md.c might already have invalidated some of the devices if
they had a ->raid_disk number that appeared to high.
Moving it into the ->takeover method is too early as the array is
still active at that time and wrong ->raid_disk numbers could cause
confusion.
So add a ->new_raid_disk field to mdk_rdev_s and use it to communicate
the new raid_disk number.
Now the common code knows exactly which devices need to be renumbered,
and which can be invalidated, and can do it all at a convenient time
when the array is suspend.
It can also update some symlinks in sysfs which previously were not be
updated correctly.
Reported-by: Maciej Trela <maciej.trela@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Commit b821eaa572 broke partition
detection for md arrays.
The logic was almost right. However if revalidate_disk is called
when the device is not yet open, bdev->bd_disk won't be set, so the
flush_disk() Call will not set bd_invalidated.
So when md_open is called we still need to ensure that
->bd_invalidated gets set. This is easily done with a call to
check_disk_size_change in the place where the offending commit removed
check_disk_change. At the important times, the size will have changed
from 0 to non-zero, so check_disk_size_change will set bd_invalidated.
Tested-by: Duncan <1i5t5.duncan@cox.net>
Reported-by: Duncan <1i5t5.duncan@cox.net>
Signed-off-by: NeilBrown <neilb@suse.de>
Conflicts:
drivers/md/md.c
- Resolved conflict in md_update_sb
- Added extra 'NULL' arg to new instance of sysfs_get_dirent.
Signed-off-by: NeilBrown <neilb@suse.de>
The problem. When implementing a network namespace I need to be able
to have multiple network devices with the same name. Currently this
is a problem for /sys/class/net/*, /sys/devices/virtual/net/*, and
potentially a few other directories of the form /sys/ ... /net/*.
What this patch does is to add an additional tag field to the
sysfs dirent structure. For directories that should show different
contents depending on the context such as /sys/class/net/, and
/sys/devices/virtual/net/ this tag field is used to specify the
context in which those directories should be visible. Effectively
this is the same as creating multiple distinct directories with
the same name but internally to sysfs the result is nicer.
I am calling the concept of a single directory that looks like multiple
directories all at the same path in the filesystem tagged directories.
For the networking namespace the set of directories whose contents I need
to filter with tags can depend on the presence or absence of hotplug
hardware or which modules are currently loaded. Which means I need
a simple race free way to setup those directories as tagged.
To achieve a reace free design all tagged directories are created
and managed by sysfs itself.
Users of this interface:
- define a type in the sysfs_tag_type enumeration.
- call sysfs_register_ns_types with the type and it's operations
- sysfs_exit_ns when an individual tag is no longer valid
- Implement mount_ns() which returns the ns of the calling process
so we can attach it to a sysfs superblock.
- Implement ktype.namespace() which returns the ns of a syfs kobject.
Everything else is left up to sysfs and the driver layer.
For the network namespace mount_ns and namespace() are essentially
one line functions, and look to remain that.
Tags are currently represented a const void * pointers as that is
both generic, prevides enough information for equality comparisons,
and is trivial to create for current users, as it is just the
existing namespace pointer.
The work needed in sysfs is more extensive. At each directory
or symlink creating I need to check if the directory it is being
created in is a tagged directory and if so generate the appropriate
tag to place on the sysfs_dirent. Likewise at each symlink or
directory removal I need to check if the sysfs directory it is
being removed from is a tagged directory and if so figure out
which tag goes along with the name I am deleting.
Currently only directories which hold kobjects, and
symlinks are supported. There is not enough information
in the current file attribute interfaces to give us anything
to discriminate on which makes it useless, and there are
no potential users which makes it an uninteresting problem
to solve.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Devices which know that they are spares do not really need to have
an event count that matches the rest of the array, so there are no
data-in-sync issues. It is enough that the uuid matches.
So remove the requirement that the event count is up-to-date.
We currently still write out and event count on spares, but this
allows us in a year or 3 to stop doing that completely.
Signed-off-by: NeilBrown <neilb@suse.de>
When updating the event count for a simple clean <-> dirty transition,
we try to avoid updating the spares so they can safely spin-down.
As the event_counts across an array must be +/- 1, this means
decrementing the event_count on a dirty->clean transition.
This is not always safe and we have to avoid the unsafe time.
We current do this with a misguided idea about it being safe or
not depending on whether the event_count is odd or even. This
approach only works reliably in a few common instances, but easily
falls down.
So instead, simply keep internal state concerning whether it is safe
or not, and always assume it is not safe when an array is first
assembled.
Signed-off-by: NeilBrown <neilb@suse.de>
Some time ago we stopped the clean/active metadata updates
from being written to a 'spare' device in most cases so that
it could spin down and say spun down. Device failure/removal
etc are still recorded on spares.
However commit 51d5668cb2 broke this 50% of the time,
depending on whether the event count is even or odd.
The change log entry said:
This means that the alignment between 'odd/even' and
'clean/dirty' might take a little longer to attain,
how ever the code makes no attempt to create that alignment, so it
could take arbitrarily long.
So when we find that clean/dirty is not aligned with odd/even,
force a second metadata-update immediately. There are already cases
where a second metadata-update is needed immediately (e.g. when a
device fails during the metadata update). We just piggy-back on that.
Reported-by: Joe Bryant <tenminjoe@yahoo.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org
Level modifications change the output of mdstat. The mdmon manager
thread is interested in these events for external metadata management.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
This is
- unnecessary because mddev_suspend is always followed by a call to
->stop, and each ->stop unregisters the thread, and
- a problem as it makes it awkwards to suspend and then resume a
device as we will want later.
Signed-off-by: NeilBrown <neilb@suse.de>
We used to pass the personality make_request function direct
to the block layer so the first argument had to be a queue.
But now we have the intermediary md_make_request so it makes
at lot more sense to pass a struct mddev_s.
It makes it possible to have an mddev without its own queue too.
Signed-off-by: NeilBrown <neilb@suse.de>
This moves the call to the other side of set_readonly, but that should
not be an issue.
This encapsulates in 'md_stop' all of the functionality for internally
stopping the array, leaving all the interactions with externalities
(sysfs, request_queue, gendisk) in do_md_stop.
Signed-off-by: NeilBrown <neilb@suse.de>
Using do_md_stop to set an array to read-only is a little confusing.
Now most of the common code has been factored out, split
md_set_readonly off in to a separate function.
Signed-off-by: NeilBrown <neilb@suse.de>
Further refactoring of do_md_stop.
This one requires some explanation as it takes code from different
places in do_md_stop, so some re-ordering happens.
We only get into this part of do_md_stop if there are no active opens
of the device, so no writes can be happening and the device must have
been flushed. In md_stop_writes we want to stop any internal sources
of writes - i.e. resync - and flush out the metadata.
The only code that was previously before some of this code is
code to clean up the queue, the mddev, the gendisk, or sysfs, all
of which is probably better after code that makes active changes (i.e.
triggers writes).
Signed-off-by: NeilBrown <neilb@suse.de>
do_md_stop is large and clunky, so hard to understand.
This is a first step of refactoring, pulling two simple
sub-functions out.
Signed-off-by: NeilBrown <neilb@suse.de>
As part of relaxing the binding between an mddev and gendisk,
we separate do_md_run into two functions.
md_run does all the work internal to md
do_md_run calls md_run and makes and changes to gendisk
that are required.
Signed-off-by: NeilBrown <neilb@suse.de>
We set ->changed to 1 and call check_disk_change at the end
of md_open so that bd_invalidated would be set and thus
partition rescan would happen appropriately.
Now that we call revalidate_disk directly, which sets bd_invalidates,
that indirection is no longer needed and can be removed.
Signed-off-by: NeilBrown <neilb@suse.de>
Using ->array_sectors rather than get_capacity() is more
direct and is a step towards relaxing the tight connection
between mddev and gendisk.
Signed-off-by: NeilBrown <neilb@suse.de>
While I generally prefer letting personalities do as much as possible,
given that we have a central md_make_request anyway we may as well use
it to simplify code.
Also this centralises knowledge of ->gendisk which will help later.
Signed-off-by: NeilBrown <neilb@suse.de>
Level changes can be very significant, so make sure
to notify them via sysfs.
Signed-off-by: Maciej Trela <maciej.trela@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When metadata is being managed by user-space, md doesn't know
what the maximum number of devices allowed in an array is
so ->max_disks is 0. In this case we should allow any (+ve)
number of disks.
Signed-off-by: NeilBrown <neilb@suse.de>
Writing "none" to "../md/dev-xx/slot" removes that device
from being an active part of the array, but it didn't
set ->raid_disk to -1 to record this fact.
Signed-off-by: Maciej Trela <Maciej.Trela@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
This was needed when sysfs files could only be 'notified'
from process context. Now that we have sys_notify_direct,
we can call it directly from an interrupt.
Signed-off-by: NeilBrown <neilb@suse.de>
Some levels expect the 'redundancy group' to be present,
others don't.
So when we change level of an array we might need to
add or remove this group.
This requires fixing up the current practice of overloading ->private
to indicate (when ->pers == NULL) that something needs to be removed.
So create a new ->to_remove to fill that role.
When changing levels, we may need to add or remove attributes. When
changing RAID5 -> RAID6, we both add and remove the same thing. It is
important to catch this and optimise it out as the removal is delayed
until a lock is released, so trying to add immediately would cause
problems.
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
When an array is stopped we need to remove some
sysfs files which are dependent on the type of array.
We need to delay that deletion as deleting them while holding
reconfig_mutex can lead to deadlocks.
We currently delay them until the array is completely destroyed.
However it is possible to deactivate and then reactivate the array.
It is also possible to need to remove sysfs files when changing level,
which can potentially happen several times before an array is
destroyed.
So we need to delete these files more promptly: as soon as
reconfig_mutex is dropped.
We need to ensure this happens before do_md_run can restart the array,
so we use open_mutex for some extra locking. This is not deadlock
prone.
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
When the user sets the block device to readwrite then the mddev should
follow suit. Otherwise, the BUG_ON in md_write_start() will be set to
trigger.
The reverse direction, setting mddev->ro to match a set readonly
request, can be ignored because the blkdev level readonly flag precludes
the need to have mddev->ro set correctly. Nevermind the fact that
setting mddev->ro to 1 may fail if the array is in use.
Cc: <stable@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Some time ago we stopped the clean/active metadata updates
from being written to a 'spare' device in most cases so that
it could spin down and say spun down. Device failure/removal
etc are still recorded on spares.
However commit 51d5668cb2 broke this 50% of the time,
depending on whether the event count is even or odd.
The change log entry said:
This means that the alignment between 'odd/even' and
'clean/dirty' might take a little longer to attain,
how ever the code makes no attempt to create that alignment, so it
could take arbitrarily long.
So when we find that clean/dirty is not aligned with odd/even,
force a second metadata-update immediately. There are already cases
where a second metadata-update is needed immediately (e.g. when a
device fails during the metadata update). We just piggy-back on that.
Reported-by: Joe Bryant <tenminjoe@yahoo.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Constify struct sysfs_ops.
This is part of the ops structure constification
effort started by Arjan van de Ven et al.
Benefits of this constification:
* prevents modification of data that is shared
(referenced) by many other structure instances
at runtime
* detects/prevents accidental (but not intentional)
modification attempts on archs that enforce
read-only kernel data at runtime
* potentially better optimized code as the compiler
can assume that the const data cannot be changed
* the compiler/linker move const data into .rodata
and therefore exclude them from false sharing
Signed-off-by: Emese Revfy <re.emese@gmail.com>
Acked-by: David Teigland <teigland@redhat.com>
Acked-by: Matt Domsch <Matt_Domsch@dell.com>
Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Acked-by: Hans J. Koch <hjk@linutronix.de>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
======
This fix is related to
http://bugzilla.kernel.org/show_bug.cgi?id=15142
but does not address that exact issue.
======
sysfs does like attributes being removed while they are being accessed
(i.e. read or written) and waits for the access to complete.
As accessing some md attributes takes the same lock that is held while
removing those attributes a deadlock can occur.
This patch addresses 3 issues in md that could lead to this deadlock.
Two relate to calling flush_scheduled_work while the lock is held.
This is probably a bad idea in general and as we use schedule_work to
delete various sysfs objects it is particularly bad.
In one case flush_scheduled_work is called from md_alloc (called by
md_probe) called from do_md_run which holds the lock. This call is
only present to ensure that ->gendisk is set. However we can be sure
that gendisk is always set (though possibly we couldn't when that code
was originally written. This is because do_md_run is called in three
different contexts:
1/ from md_ioctl. This requires that md_open has succeeded, and it
fails if ->gendisk is not set.
2/ from writing a sysfs attribute. This can only happen if the
mddev has been registered in sysfs which happens in md_alloc
after ->gendisk has been set.
3/ from autorun_array which is only called by autorun_devices, which
checks for ->gendisk to be set before calling autorun_array.
So the call to md_probe in do_md_run can be removed, and the check on
->gendisk can also go.
In the other case flush_scheduled_work is being called in do_md_stop,
purportedly to wait for all md_delayed_delete calls (which delete the
component rdevs) to complete. However there really isn't any need to
wait for them - they have already been disconnected in all important
ways.
The third issue is that raid5->stop() removes some attribute names
while the lock is held. There is already some infrastructure in place
to delay attribute removal until after the lock is released (using
schedule_work). So extend that infrastructure to remove the
raid5_attrs_group.
This does not address all lockdep issues related to the sysfs
"s_active" lock. The rest can be address by splitting that lockdep
context between symlinks and non-symlinks which hopefully will happen.
Signed-off-by: NeilBrown <neilb@suse.de>
If two arrays share a device, then they will not both resync at the
same time. One will wait for the other to complete.
While waiting, the MD_RECOVERY_INTR flag is not checked so a device
failure, which would make the resync pointless, does not cause the
resync to abort, so the failed device cannot be removed (as it cannot
be remove while a resync is happening).
So add a test for MD_RECOVERY_INTR.
Reported-by: Brett Russ <bruss@netezza.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Since commit dfc7064500,
->hot_remove_disks has not removed non-failed devices from
an array until recovery is no longer possible.
So the code in do_md_run to get around the fact that
md_check_recovery (which calls ->hot_remove_disks) would
remove partially-in-sync devices is no longer needed.
So remove it.
Signed-off-by: NeilBrown <neilb@suse.de>
By default md_do_sync() will perform recovery if no other actions are
specified. However, action_show() relies on MD_RECOVERY_RECOVER to be
set otherwise it returns 'idle'. So, add a missing set
MD_RECOVERY_RECOVER when starting recovery.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
The start_ro modules parameter can be used to force arrays to be
started in 'auto-readonly' in which they are read-only until the first
write. This ensures that no resync/recovery happens until something
else writes to the device. This is important for resume-from-disk
off an md array.
However if an array is started 'readonly' (by writing 'readonly' to
the 'array_state' sysfs attribute) we want it to be really 'readonly',
not 'auto-readonly'.
So strengthen the condition to only set auto-readonly if the
array is not already read-only.
Signed-off-by: NeilBrown <neilb@suse.de>
evms configures md arrays by:
open device
send ioctl
close device
for each different ioctl needed.
Since 2.6.29, the device can disappear after the 'close'
unless a significant configuration has happened to the device.
The change made by "SET_ARRAY_INFO" can too minor to stop the device
from disappearing, but important enough that losing the change is bad.
So: make sure SET_ARRAY_INFO sets mddev->ctime, and keep the device
active as long as ctime is non-zero (it gets zeroed with lots of other
things when the array is stopped).
This is suitable for -stable kernels since 2.6.29.
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Makes use of skip_spaces() defined in lib/string.c for removing leading
spaces from strings all over the tree.
It decreases lib.a code size by 47 bytes and reuses the function tree-wide:
text data bss dec hex filename
64688 584 592 65864 10148 (TOTALS-BEFORE)
64641 584 592 65817 10119 (TOTALS-AFTER)
Also, while at it, if we see (*str && isspace(*str)), we can be sure to
remove the first condition (*str) as the second one (isspace(*str)) also
evaluates to 0 whenever *str == 0, making it redundant. In other words,
"a char equals zero is never a space".
Julia Lawall tried the semantic patch (http://coccinelle.lip6.fr) below,
and found occurrences of this pattern on 3 more files:
drivers/leds/led-class.c
drivers/leds/ledtrig-timer.c
drivers/video/output.c
@@
expression str;
@@
( // ignore skip_spaces cases
while (*str && isspace(*str)) { \(str++;\|++str;\) }
|
- *str &&
isspace(*str)
)
Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com>
Cc: Julia Lawall <julia@diku.dk>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Richard Purdie <rpurdie@rpsys.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Cc: David Howells <dhowells@redhat.com>
Cc: <linux-ext4@vger.kernel.org>
Cc: Samuel Ortiz <samuel@sortiz.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Enable external metadata arrays to manage rebuild checkpointing via a
md/dev-XXX/recovery_start attribute which reflects rdev->recovery_offset
Also update resync_start_store to allow 'none' to be written, for
consistency.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Other walks of this list are either under rcu_read_lock() or the list
mutation lock (mddev_lock()). This protects against the improbable case of a
disk being removed from the array at the start of md_do_sync().
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
As v1.x metadata can record that a member of the array is
not completely recovered, it make sense to record that a
spare has become a regular member of the array at the earliest
opportunity.
So remove the tests on "recovery_offset > 0" in super_1_sync
as they really aren't needed, and schedule a metadata update
immediately after adding spares to a degraded array.
This means that if a crash happens immediately after a recovery
starts, the new device will be included in the array and recovery will
continue from wherever it was up to. Previously this didn't happen
unless recovery was at least 1/16 of the way through.
Signed-off-by: NeilBrown <neilb@suse.de>
The RAID ioctls are only implemented in md.c, so the
handling for them should also be moved there from
fs/compat_ioctl.c.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Neil Brown <neilb@suse.de>
Cc: Andre Noll <maan@systemlinux.org>
Cc: linux-raid@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
We've noticed severe lasting performance degradation of our raid
arrays when we have drives that yield large amounts of media errors.
The raid10 module will queue each failed read for retry, and also
will attempt call fix_read_error() to perform the read recovery.
Read recovery is performed while the array is frozen, so repeated
recovery attempts can degrade the performance of the array for
extended periods of time.
With this patch I propose adding a per md device max number of
corrected read attempts. Each rdev will maintain a count of
read correction attempts in the rdev->read_errors field (not
used currently for raid10). When we enter fix_read_error()
we'll check to see when the last read error occurred, and
divide the read error count by 2 for every hour since the
last read error. If at that point our read error count
exceeds the read error threshold, we'll fail the raid device.
In addition in this patch I add sysfs nodes (get/set) for
the per md max_read_errors attribute, the rdev->read_errors
attribute, and added some printk's to indicate when
fix_read_error fails to repair an rdev.
For testing I used debugfs->fail_make_request to inject
IO errors to the rdev while doing IO to the raid array.
Signed-off-by: Robert Becker <Rob.Becker@riverbed.com>
Signed-off-by: NeilBrown <neilb@suse.de>
A new attribute directory 'bitmap' in 'md' is created which
contains files for configuring the bitmap.
'location' identifies where the bitmap is, either 'none',
or 'file' or 'sector offset from metadata'.
Writing 'location' can create or remove a bitmap.
Adding a 'file' bitmap this way is not yet supported.
'chunksize' and 'time_base' must be set before 'location'
can be set.
'chunksize' can be set before creating a bitmap, but is
currently always over-ridden by the bitmap superblock.
'time_base' and 'backlog' can be updated at any time.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Andre Noll <maan@systemlinux.org>
safe_delay_store can parse fixed point numbers (for fractions
of a second). We will want to do that for another sysfs
file soon, so factor out the code.
Signed-off-by: NeilBrown <neilb@suse.de>
... and into bitmap_info. These are all configuration parameters
that need to be set before the bitmap is created.
Signed-off-by: NeilBrown <neilb@suse.de>
In preparation for making bitmap fields configurable via sysfs,
start tidying up by making a single structure to contain the
configuration fields.
Signed-off-by: NeilBrown <neilb@suse.de>
Previously barriers were only supported on RAID1. This is because
other levels requires synchronisation across all devices and so needed
a different approach.
Here is that approach.
When a barrier arrives, we send a zero-length barrier to every active
device. When that completes - and if the original request was not
empty - we submit the barrier request itself (with the barrier flag
cleared) and then submit a fresh load of zero length barriers.
The barrier request itself is asynchronous, but any subsequent
request will block until the barrier completes.
The reason for clearing the barrier flag is that a barrier request is
allowed to fail. If we pass a non-empty barrier through a striping
raid level it is conceivable that part of it could succeed and part
could fail. That would be way too hard to deal with.
So if the first run of zero length barriers succeed, we assume all is
sufficiently well that we send the request and ignore errors in the
second run of barriers.
RAID5 needs extra care as write requests may not have been submitted
to the underlying devices yet. So we flush the stripe cache before
proceeding with the barrier.
Note that the second set of zero-length barriers are submitted
immediately after the original request is submitted. Thus when
a personality finds mddev->barrier to be set during make_request,
it should not return from make_request until the corresponding
per-device request(s) have been queued.
That will be done in later patches.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Andre Noll <maan@systemlinux.org>
If a resync/recovery/check/repair is interrupted for some reason, it
can be useful to know exactly where it got up to.
So in that case, do not clear curr_resync_completed.
Initialise it when starting a resync/recovery/... instead.
Signed-off-by: NeilBrown <neilb@suse.de>
When a 'check' or 'repair' finished we should clear resync_min
so that a future check/repair will cover the whole array (by default).
However if it is interrupted, we should update resync_min to
where we got up to, so that when the check/repair continues it
just does the remainder of the array.
Signed-off-by: NeilBrown <neilb@suse.de>
A write intent bitmap can be removed from an array while the
array is active.
When this happens, all IO is suspended and flushed before the
bitmap is removed.
However it is possible that bitmap_daemon_work is still running to
clear old bits from the bitmap. If it is, it can dereference the
bitmap after it has been freed.
So introduce a new mutex to protect bitmap_daemon_work and get it
before destroying a bitmap.
This is suitable for any current -stable kernel.
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org
* git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/sysctl-2.6: (43 commits)
security/tomoyo: Remove now unnecessary handling of security_sysctl.
security/tomoyo: Add a special case to handle accesses through the internal proc mount.
sysctl: Drop & in front of every proc_handler.
sysctl: Remove CTL_NONE and CTL_UNNUMBERED
sysctl: kill dead ctl_handler definitions.
sysctl: Remove the last of the generic binary sysctl support
sysctl net: Remove unused binary sysctl code
sysctl security/tomoyo: Don't look at ctl_name
sysctl arm: Remove binary sysctl support
sysctl x86: Remove dead binary sysctl support
sysctl sh: Remove dead binary sysctl support
sysctl powerpc: Remove dead binary sysctl support
sysctl ia64: Remove dead binary sysctl support
sysctl s390: Remove dead sysctl binary support
sysctl frv: Remove dead binary sysctl support
sysctl mips/lasat: Remove dead binary sysctl support
sysctl drivers: Remove dead binary sysctl support
sysctl crypto: Remove dead binary sysctl support
sysctl security/keys: Remove dead binary sysctl support
sysctl kernel: Remove binary sysctl logic
...
For consistency drop & in front of every proc_handler. Explicity
taking the address is unnecessary and it prevents optimizations
like stubbing the proc_handlers to NULL.
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Resolve the conflict between v2.6.32-rc7 where dn_def_dev_handler
gets a small bug fix and the sysctl tree where I am removing all
sysctl strategy routines.
This is a combination that didn't really make sense before.
However when a reshape is converting e.g. raid5 -> raid6, the extra
device is not fully in-sync, but is certainly active and contains
important data.
So allow that start to be meaningful and in particular get
the 'recovery_offset' value (which is needed for any non-in-sync
active device) from the reshape_position.
Signed-off-by: NeilBrown <neilb@suse.de>
Now that sys_sysctl is a wrapper around /proc/sys all of
the binary sysctl support elsewhere in the tree is
dead code.
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Corey Minyard <minyard@acm.org>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Neil Brown <neilb@suse.de>
Cc: "James E.J. Bottomley" <James.Bottomley@suse.de>
Acked-by: Clemens Ladisch <clemens@ladisch.de> for drivers/char/hpet.c
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Each device has its own 'recovery_offset' showing how far
recovery has progressed on the device.
As the only real significance of this is that fact that it can
be stored in the metadata and recovered at restart, and as
only 1.x metadata can do this, we were only updating
'recovery_offset' to 'curr_resync_completed' when updating
v1.x metadata.
But this is wrong, and we will shortly make limited use of this
field in v0.90 metadata.
So move the update into common code.
Signed-off-by: NeilBrown <neilb@suse.de>
If a 'sync_max' has been set (via sysfs), it is wrong to clear it
until a resync (or reshape or recovery ...) actually reached that
point.
So if a resync is interrupted (e.g. by device failure),
leave 'resync_max' unchanged.
This is particularly important for 'reshape' operations that do not
change the size of the array. For such operations mdadm needs to
monitor the reshape taking rolling backups of the section being
reshaped. If resync_max gets cleared, the reshape can get ahead of
mdadm and then the backups that mdadm creates are useless.
This is suitable for 2.6.31.y stable kernels.
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
When a raid5 (or raid6) array is being reshaped to have fewer devices,
conf->raid_disks is the latter and hence smaller number of devices.
However sometimes we want to use a number which is the total number of
currently required devices - the larger of the 'old' and 'new' sizes.
Before we implemented reducing the number of devices, this was always
'new' i.e. ->raid_disks.
Now we need max(raid_disks, previous_raid_disks) in those places.
This particularly affects assembling an array that was shutdown while
in the middle of a reshape to fewer devices.
md.c needs a similar fix when interpreting the md metadata.
Signed-off-by: NeilBrown <neilb@suse.de>
The management thread for raid4,5,6 arrays are all called
mdX_raid5, independent of the actual raid level, which is wrong and
can be confusion.
So change md_register_thread to use the name from the personality
unless no alternate name (like 'resync' or 'reshape') is given.
This is simpler and more correct.
Cc: Jinzc <zhenchengjin@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Rename some variable and remove some duplicate definitions
to avoid there warnings. None of them are actual errors.
Signed-off-by: NeilBrown <neilb@suse.de>
Recent commit c8c00a6915
changed the exit paths in do_md_stop and was not quite
careful enough. There is one path were 'err' now needs
to be cleared but it isn't.
So setting an array to readonly (with mdadm --readonly) will
work, but will incorrectly report and error: ENXIO.
Signed-off-by: NeilBrown <neilb@suse.de>
Normally we only allow the upper limit for a reshape to be decreased
when the array not performing a sync/recovery/reshape, otherwise there
could be races. But if an array is part-way through a reshape when it
is assembled the reshape is started immediately leaving no window
to set an upper bound.
If the array is started read-only, the reshape will be suspended until
the array becomes writable, so that provides a window during which it
is perfectly safe to reduce the upper limit of a reshape.
So: allow the upper limit (sync_max) to be reduced even if the reshape
thread is running, as long as the array is still read-only.
Signed-off-by: NeilBrown <neilb@suse.de>
When assembling arrays, md allows two devices to have different event
counts as long as the difference is only '1'. This is to cope with
a system failure between updating the metadata on two difference
devices.
However there are currently times when we update the event count by
2. This was done to keep the event count even when the array is clean
and odd when it is dirty, which allows us to avoid writing common
update to spare devices and so allow those spares to go to sleep.
This is bad for the above reason. So change it to never increase by
two. This means that the alignment between 'odd/even' and
'clean/dirty' might take a little longer to attain, but that is only a
small cost. The spares will get a few more updates but that will
still be spared (;-) most updates and can still go to sleep.
Prior to this patch there was a small chance that after a crash an
array would fail to assemble due to the overly large event count
mismatch.
Signed-off-by: NeilBrown <neilb@suse.de>
A recent commit:
commit 449aad3e25
introduced the possibility of an A-B/B-A deadlock between
bd_mutex and reconfig_mutex.
__blkdev_get holds bd_mutex while calling md_open which takes
reconfig_mutex,
do_md_run is always called with reconfig_mutex held, and it now
takes bd_mutex in the call the revalidate_disk.
This potential deadlock was not caught by lockdep due to the
use of mutex_lock_interruptible_nexted which was introduced
by
commit d63a5a74de
do avoid a warning of an impossible deadlock.
It is quite possible to split reconfig_mutex in to two locks.
One protects the array data structures while it is being
reconfigured, the other ensures that an array is never even partially
open while it is being deactivated.
In particular, the second lock prevents an open from completing
between the time when do_md_stop checks if there are any active opens,
and the time when the array is either set read-only, or when ->pers is
set to NULL. So we can be certain that no IO is in flight as the
array is being destroyed.
So create a new lock, open_mutex, just to ensure exclusion between
'open' and 'stop'.
This avoids the deadlock and also avoids the lockdep warning mentioned
in commit d63a5a74d
Reported-by: "Mike Snitzer" <snitzer@gmail.com>
Reported-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: NeilBrown <neilb@suse.de>
As revalidate_disk calls check_disk_size_change, it will cause
any capacity change of a gendisk to be propagated to the blockdev
inode. So use that instead of mucking about with locks and
i_size_write.
Also add a call to revalidate_disk in do_md_run and a few other places
where the gendisk capacity is changed.
Signed-off-by: NeilBrown <neilb@suse.de>
The v1.x metadata does not have a fixed size and can grow
when devices are added.
If it grows enough to require an extra sector of storage,
we need to update the 'sb_size' to match.
Without this, md can write out an incomplete superblock with a
bad checksum, which will be rejected when trying to re-assemble
the array.
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
We trust the 'desc_nr' field in v1.x metadata enough to use it
as an index in an array. This isn't really safe.
So range-check the value first.
Signed-off-by: NeilBrown <neilb@suse.de>
When an array is changed from RAID6 to RAID5, fewer drives are
needed. So any device that is made superfluous by the level
conversion must be marked as not-active.
For the RAID6->RAID5 conversion, this will be a drive which only
has 'Q' blocks on it.
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
This patch replaces md_integrity_check() by two new public functions:
md_integrity_register() and md_integrity_add_rdev() which are both
personality-independent.
md_integrity_register() is called from the ->run and ->hot_remove
methods of all personalities that support data integrity. The
function iterates over the component devices of the array and
determines if all active devices are integrity capable and if their
profiles match. If this is the case, the common profile is registered
for the mddev via blk_integrity_register().
The second new function, md_integrity_add_rdev() is called from the
->hot_add_disk methods, i.e. whenever a new device is being added
to a raid array. If the new device does not support data integrity,
or has a profile different from the one already registered, data
integrity for the mddev is disabled.
For raid0 and linear, only the call to md_integrity_register() from
the ->run method is necessary.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Commit 5fd29d6ccb ("printk: clean up
handling of log-levels and newlines") changed printk semantics. printk
lines with multiple KERN_<level> prefixes are no longer emitted as
before the patch.
<level> is now included in the output on each additional use.
Remove all uses of multiple KERN_<level>s in formats.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
User space can set various limits on an md array so that resync waits
when it gets to a certain point, or so that I/O is blocked for a short
while.
When md is waiting against one of these limit, it should use an
interruptible wait so as not to add to the load average, and so are
not to trigger a warning if the wait goes on for too long.
Signed-off-by: NeilBrown <neilb@suse.de>
As the recent bug in md_alloc showed, having a single exit path for
unlocking and putting is a good idea. So restructure md_alloc to have
a single mutex_unlock and mddev_put, and use gotos where necessary.
Found-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When an md device is created by name (rather than number) we need to
check that the name is not already in use. If this check finds a
duplicate, we return an error without dropping the lock or freeing
the newly create mddev.
This patch fixes that.
Cc: stable@kernel.org
Found-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If we try to modify one of the md/ sysfs files
suspend_lo or suspend_hi
when the array is not active, we dereference a NULL.
Protect against that.
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
If the superblock of a component device indicates the presence of a
bitmap but the corresponding raid personality does not support bitmaps
(raid0, linear, multipath, faulty), then something is seriously wrong
and we'd better refuse to run such an array.
Currently, this check is performed while the superblocks are examined,
i.e. before entering personality code. Therefore the generic md layer
must know which raid levels support bitmaps and which do not.
This patch avoids this layer violation without adding identical code
to various personalities. This is accomplished by introducing a new
public function to md.c, md_check_no_bitmap(), which replaces the
hard-coded checks in the superblock loading functions.
A call to md_check_no_bitmap() is added to the ->run method of each
personality which does not support bitmaps and assembly is aborted
if at least one component device contains a bitmap.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
It is easiest to round sizes to multiples of chunk size in
the personality code for those personalities which care.
Those personalities now do the rounding, so we can
remove that function from common code.
Also remove the upper bound on the size of a chunk, and the lower
bound on the size of a device (1 chunk), neither of which really buy
us anything.
Signed-off-by: NeilBrown <neilb@suse.de>
Currently the assignment to utime gets skipped for 'external'
metadata. So move it to the top of the function so that it
always gets effected.
This is of largely cosmetic interest. Nothing actually depends
on ->utime being right for external arrays.
"mdadm --monitor" does use it for 0.90 and 1.x arrays, but with
mdadm-3.0, this is not important for external metadata.
Signed-off-by: NeilBrown <neilb@suse.de>
Currently, the md layer checks in analyze_sbs() if the raid level
supports reconstruction (mddev->level >= 1) and if reconstruction is
in progress (mddev->recovery_cp != MaxSector).
Move that printk into the personality code of those raid levels that
care (levels 1, 4, 5, 6, 10).
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
The difference between these two methods is artificial.
Both check that a pending reshape is valid, and perform any
aspect of it that can be done immediately.
'reconfig' handles chunk size and layout.
'check_reshape' handles raid_disks.
So make them just one method.
Signed-off-by: NeilBrown <neilb@suse.de>
Passing the new layout and chunksize as args is not necessary as
the mddev has fields for new_check and new_layout.
This is preparation for combining the check_reshape and reconfig
methods
Signed-off-by: NeilBrown <neilb@suse.de>
A straight-forward conversion which gets rid of some
multiplications/divisions/shifts. The patch also introduces a couple
of new ones, most of which are due to conf->chunk_size still being
represented in bytes. This will be cleaned up in subsequent patches.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
This patch renames the chunk_size field to chunk_sectors with the
implied change of semantics. Since
is_power_of_2(chunk_size) = is_power_of_2(chunk_sectors << 9)
= is_power_of_2(chunk_sectors)
these bits don't need an adjustment for the shift.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Remove chunk size check from md as this is now performed in the run
function in each personality.
Replace chunk size power 2 code calculations by a regular division.
Signed-off-by: raziebe@gmail.com
Signed-off-by: NeilBrown <neilb@suse.de>
* 'for-2.6.31' of git://git.kernel.dk/linux-2.6-block: (153 commits)
block: add request clone interface (v2)
floppy: fix hibernation
ramdisk: remove long-deprecated "ramdisk=" boot-time parameter
fs/bio.c: add missing __user annotation
block: prevent possible io_context->refcount overflow
Add serial number support for virtio_blk, V4a
block: Add missing bounce_pfn stacking and fix comments
Revert "block: Fix bounce limit setting in DM"
cciss: decode unit attention in SCSI error handling code
cciss: Remove no longer needed sendcmd reject processing code
cciss: change SCSI error handling routines to work with interrupts enabled.
cciss: separate error processing and command retrying code in sendcmd_withirq_core()
cciss: factor out fix target status processing code from sendcmd functions
cciss: simplify interface of sendcmd() and sendcmd_withirq()
cciss: factor out core of sendcmd_withirq() for use by SCSI error handling code
cciss: Use schedule_timeout_uninterruptible in SCSI error handling code
block: needs to set the residual length of a bidi request
Revert "block: implement blkdev_readpages"
block: Fix bounce limit setting in DM
Removed reference to non-existing file Documentation/PCI/PCI-DMA-mapping.txt
...
Manually fix conflicts with tracing updates in:
block/blk-sysfs.c
drivers/ide/ide-atapi.c
drivers/ide/ide-cd.c
drivers/ide/ide-floppy.c
drivers/ide/ide-tape.c
include/trace/events/block.h
kernel/trace/blktrace.c
In order for the metadata to always be consistent, we mustn't updated
curr_resync_completed without also updating reshape_position.
The reshape code updates both at the same time. However since
commit 97e4f42d62
the common md_do_sync will sometimes update curr_resync_completed
but is not in a position to update reshape_position.
So if MD_RECOVERY_RESHAPE is set (indicating that a reshape is
happening, so reshape_position might change), don't update
curr_resync_completed in md_do_sync, leave it to the per-personality
reshape code.
Signed-off-by: NeilBrown <neilb@suse.de>
The md resync engine has a 'frozen' state which ensures that
no resync/recovery. This is used to avoid races.
Export this state through the 'sync_action' sysfs attribute
so that user-space can benefit and also avoid some races.
Signed-off-by: NeilBrown <neilb@suse.de>
Instead of always returns EINVAL if anything goes wrong
when setting the array size, add the option of
E2BIG
if the size requested is too large. This makes it easier
for user-space to be sure what went wrong.
Signed-off-by: NeilBrown <neilb@suse.de>
We previously didn't update these fields when writing the metadata
because they could never change. They can now, so we better write
them.
v0.90 metadata always updated these fields.
Signed-off-by: NeilBrown <neilb@suse.de>
Until now we have had a 1:1 mapping between storage device physical
block size and the logical block sized used when addressing the device.
With SATA 4KB drives coming out that will no longer be the case. The
sector size will be 4KB but the logical block size will remain
512-bytes. Hence we need to distinguish between the physical block size
and the logical ditto.
This patch renames hardsect_size to logical_block_size.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
md maintains link in sys/mdXX/md/ to identify which device has
which role in the array. e.g.
rd2 -> dev-sda
indicates that the device with role '2' in the array is sda.
These links are only present when the array is active. They are
created immediately after ->run is called, and so should be removed
immediately after ->stop is called.
However they are currently removed a little bit later, and it is
possible for ->run to be called again, thus adding these links, before
they are removed.
So move the removal earlier so they are consistently only present when
the array is active.
Signed-off-by: NeilBrown <neilb@suse.de>
Being able to write 'clean' to an 'array_state' of an inactive array
to activate it in 'clean' mode is both unnecessary and inconvenient.
It is unnecessary because the same can be achieved by writing
'active'. This activates and array, but it still remains 'clean'
until the first write.
It is inconvenient because writing 'clean' is more often used to
cause an 'active' array to revert to 'clean' mode (thus blocking
any writes until a 'write-pending' is promoted to 'active').
Allowing 'clean' to both activate an array and mark an active array as
clean can lead to races: One program writes 'clean' to mark the
active array as clean at the same time as another program writes
'inactive' to deactivate (stop) and active array. Depending on which
writes first, the array could be deactivated and immediately
reactivated which isn't what was desired.
So just disable the use of 'clean' to activate an array.
This avoids a race that can be triggered with mdadm-3.0 and external
metadata, so it suitable for -stable.
Reported-by: Rafal Marszewski <rafal.marszewski@intel.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Two problems in status_resync.
1/ It still used Kilobytes as the basic block unit, while most code
now uses sectors uniformly.
2/ It doesn't allow for the possibility that max_sectors exceeds
the range of "unsigned long".
So
- change "max_blocks" to "max_sectors", and store sector numbers
in there and in 'resync'
- Make 'rt' a 'sector_t' so it can temporarily hold the number of
remaining sectors.
- use sector_div rather than normal division.
- change the magic '100' used to preserve precision to '32'.
+ making it a power of 2 makes division easier
+ it doesn't need to be as large as it was chosen when we averaged
speed over the entire run. Now we average speed over the last 30
seconds or so.
Reported-by: "Mario 'BitKoenig' Holbe" <Mario.Holbe@TU-Ilmenau.DE>
Signed-off-by: NeilBrown <neilb@suse.de>
There are circumstances when a user-space process might need to
"oversee" a resync/reshape process. For example when doing an
in-place reshape of a raid5, it is prudent to take a backup of each
section before reshaping it as this is the only way to provide
safety against an unplanned shutdown (i.e. crash/power failure).
The sync_max sysfs value can be used to stop the resync from
advancing beyond a particular point.
So user-space can:
suspend IO to the first section and back it up
set 'sync_max' to the end of the section
wait for 'sync_completed' to reach that point
resume IO on the first section and move on to the next section.
However this process requires the kernel and user-space to run in
lock-step which could introduce unnecessary delays.
It would be better if a 'double buffered' approach could be used with
userspace and kernel space working on different sections with the
'next' section always ready when the 'current' section is finished.
One problem with implementing this is that sync_completed is only
guaranteed to be updated when the sync process reaches sync_max.
(it is updated on a time basis at other times, but it is hard to rely
on that). This defeats some of the double buffering.
With this patch, sync_completed (and reshape_position) get updated as
the current position approaches sync_max, so there is room for
userspace to advance sync_max early without losing updates.
To be precise, sync_completed is updated when the current sync
position reaches half way between the current value of sync_completed
and the value of sync_max. This will usually be a good time for user
space to update sync_max.
If sync_max does not get updated, the updates to sync_completed
(together with associated metadata updates) will occur at an
exponentially increasing frequency which will get unreasonably fast
(one update every page) immediately before the process hits sync_max
and stops. So the update rate will be unreasonably fast only for an
insignificant period of time.
Signed-off-by: NeilBrown <neilb@suse.de>
The sync_completed file reports how much of a resync (or recovery or
reshape) has been completed.
However due to the possibility of out-of-order completion of writes,
it is not certain to be accurate.
We have an internal value - mddev->curr_resync_completed - which is an
accurate value (though it might not always be quite so uptodate).
So:
- make curr_resync_completed be uptodate a little more often,
particularly when raid5 reshape updates status in the metadata
- report curr_resync_completed in the sysfs file
- allow poll/select to report all updates to md/sync_completed.
This makes sync_completed completed usable by any external metadata
handler that wants to record this status information in its metadata.
Signed-off-by: NeilBrown <neilb@suse.de>
When adding devices to an active array via sysfs, there is currently
no way to mark a device as 'in-sync' which is useful when
incrementally assembling an array.
So add that option.
Signed-off-by: NeilBrown <neilb@suse.de>
* 'for-linus' of git://neil.brown.name/md: (53 commits)
md/raid5 revise rules for when to update metadata during reshape
md/raid5: minor code cleanups in make_request.
md: remove CONFIG_MD_RAID_RESHAPE config option.
md/raid5: be more careful about write ordering when reshaping.
md: don't display meaningless values in sysfs files resync_start and sync_speed
md/raid5: allow layout and chunksize to be changed on active array.
md/raid5: reshape using largest of old and new chunk size
md/raid5: prepare for allowing reshape to change layout
md/raid5: prepare for allowing reshape to change chunksize.
md/raid5: clearly differentiate 'before' and 'after' stripes during reshape.
Documentation/md.txt update
md: allow number of drives in raid5 to be reduced
md/raid5: change reshape-progress measurement to cope with reshaping backwards.
md: add explicit method to signal the end of a reshape.
md/raid5: enhance raid5_size to work correctly with negative delta_disks
md/raid5: drop qd_idx from r6_state
md/raid6: move raid6 data processing to raid6_pq.ko
md: raid5 run(): Fix max_degraded for raid level 4.
md: 'array_size' sysfs attribute
md: centralize ->array_sectors modifications
...
When no resync if happening, both of these files currently have
meaningless values (is slightly different ways).
Change them to "none" in that case.
Signed-off-by: NeilBrown <neilb@suse.de>
Currently raid5 (the only module that supports restriping)
notices that the reshape has finished be sync_request being
given a large value, and handles any cleanup them.
This patch changes it so md_check_recovery calls into an
explicit finish_reshape method as well.
The clean-up from sync_request can do things that need to be
done promptly, typically things local to the raid5_conf_t
structure.
The "finish_reshape" method is called under the mddev_lock
so it can do things involving reconfiguring the device.
This allows us to get rid of md_set_array_sectors_locked, which
would have caused a deadlock if you tried to stop and array
while a reshape was happening.
Signed-off-by: NeilBrown <neilb@suse.de>
Allow userspace to set the size of the array according to the following
semantics:
1/ size must be <= to the size returned by mddev->pers->size(mddev, 0, 0)
a) If size is set before the array is running, do_md_run will fail
if size is greater than the default size
b) A reshape attempt that reduces the default size to less than the set
array size should be blocked
2/ once userspace sets the size the kernel will not change it
3/ writing 'default' to this attribute returns control of the size to the
kernel and reverts to the size reported by the personality
Also, convert locations that need to know the default size from directly
reading ->array_sectors to <pers>_size. Resync/reshape operations
always follow the default size.
Finally, fixup other locations that read a number of 1k-blocks from
userspace to use strict_blocks_to_sectors() which checks for unsigned
long long to sector_t overflow and blocks to sectors overflow.
Reviewed-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Get personalities out of the business of directly modifying
->array_sectors. Lays groundwork to introduce policy on when
->array_sectors can be modified.
Reviewed-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2-drive raid5's aren't very interesting. But if you are converting
a raid1 into a raid5, you will at least temporarily have one. And
that it a good time to set the layout/chunksize for the new RAID5
if you aren't happy with the defaults.
layout and chunksize don't actually affect the placement of data
on a 2-drive raid5, so we just do some internal book-keeping.
Signed-off-by: NeilBrown <neilb@suse.de>
Implement this for RAID6 to be able to 'takeover' a RAID5 array. The
new RAID6 will use a layout which places Q on the last device, and
that device will be missing.
If there are any available spares, one will immediately have Q
recovered onto it.
Signed-off-by: NeilBrown <neilb@suse.de>
To be able to change the 'level' of an md/raid array, we need to
suspend the device so that no requests are active - then move some
pointers around etc.
The code already keeps counts of active requests and the ->quiesce
function can be used to wait until those counts hit zero.
However the quiesce function blocks new requests once they are all
ready 'inside' the personality module, and that is too late if we want
to replace the personality modules.
So make all md requests come in through a common md_make_request
function that keeps track of how many requests have entered the
modules but may not yet be on the internal reference counts.
Allow md_make_request to be blocked when we want to suspend the
device, and make it possible to wait for all those in-transit requests
to be added to internal lists so that ->quiesce can wait for them.
There is still a problem that when a request completes, we drop the
ref count inside the personality code so there is a short time between
when the refcount hits zero, and when the personality code is no
longer being used.
The personality code never blocks (schedule or spinlock) between
dropping the refcount and exiting the routine, so this should be safe
(as put_module calls synchronize_sched() before unmapping the module
code).
Signed-off-by: NeilBrown <neilb@suse.de>
Mostly md_unregister_thread is only called when we know that the
thread is NULL, but sometimes we need to check first. It is safer
to put the check inside md_unregister_thread itself.
Signed-off-by: NeilBrown <neilb@suse.de>
When an md array is undergoing a change, we have new_* fields that
show the new values.
When no change is happening, it is least confusing if these have
the same value as the normal fields.
This is true in most cases, but not when the values are set via sysfs.
So fix this up.
A subsequent patch will BUG_ON if these things aren't consistent.
Signed-off-by: NeilBrown <neilb@suse.de>
This patch renames the "size" field of struct mdk_rdev_s to
"sectors" and changes this field to store sectors instead of
blocks.
All users of this field, linear.c, raid0.c and md.c, are fixed up
accordingly which gets rid of many multiplications and divisions.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
This patch renames the "size" field of struct mddev_s to "dev_sectors"
and stores the number of 512-byte sectors instead of the number of
1K-blocks in it.
All users of that field, including raid levels 1,4-6,10, are adjusted
accordingly. This simplifies the code a bit because it allows to get
rid of a couple of divisions/multiplications by two.
In order to make checkpatch happy, some minor coding style issues
have also been addressed. In particular, size_store() now uses
strict_strtoull() instead of simple_strtoull().
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
When a drive is added to an array using ADD_NEW_DISK, there are two
places we can get certain flags from: the metadata on the disk or the
flags passed through the IOCTL.
For the WriteMostly flag (aka MD_DISK_WRITEMOSTLY) we take the value
from either of those sources depending on if it is set (i.e. we
effectively 'or' the two sources together).
This makes it awkward to clear, and is at best inconsistent.
As documented code (in mdadm) requires that setting
MD_DISK_WRITEMOSTLY in the ioctl will be effective, we resolve the
inconsistency by always using the value for this flag from the ioctl,
and ignoring the value on disk.
Signed-off-by: NeilBrown <neilb@suse.de>
Version 1.x metadata has the ability to record the status of a
partially completed drive recovery.
However we only update that record on a clean shutdown.
It would be nice to update it on unclean shutdowns too, particularly
when using a bitmap that removes much to the 'sync' effort after an
unclean shutdown.
One complication with checkpointing recovery is that we only know
where we are up to in terms of IO requests started, not which ones
have completed. And we need to know what has completed to record
how much is recovered. So occasionally pause the recovery until all
submitted requests are completed, then update the record of where
we are up to.
When we have a bitmap, we already do that pause occasionally to keep
the bitmap up-to-date. So enhance that code to record the recovery
offset and schedule a superblock update.
And when there is no bitmap, just pause 16 times during the resync to
do a checkpoint.
'16' is a fairly arbitrary number. But we don't really have any good
way to judge how often is acceptable, and it seems like a reasonable
number for now.
Signed-off-by: NeilBrown <neilb@suse.de>
This makes the includes more explicit, and is preparation for moving
md_k.h to drivers/md/md.h
Remove include/raid/md.h as its only remaining use was to #include
other files.
Signed-off-by: NeilBrown <neilb@suse.de>
.. as they are part of the user-space interface.
Also move MdpMinorShift into there so we can remove duplication.
Lastly move mdp_major in. It is less obviously part of the user-space
interface, but do_mounts_md.c uses it, and it is acting a bit like
user-space.
Signed-off-by: NeilBrown <neilb@suse.de>
Move the headers with the local structures for the disciplines and
bitmap.h into drivers/md/ so that they are more easily grepable for
hacking and not far away. md.h is left where it is for now as there
are some uses from the outside.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.de>
MAJOR_NR was only required for magic in linux/blk.h in 2.4 or earlier
kernels, so no need to keep it around.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.de>
md: Add support for data integrity to MD
If all subdevices support the same protection format the MD device is
flagged as integrity capable.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: NeilBrown <neilb@suse.de>
There are two problems with is_mddev_idle.
1/ sync_io is 'atomic_t' and hence 'int'. curr_events and all the
rest are 'long'.
So if sync_io were to wrap on a 64bit host, the value of
curr_events would go very negative suddenly, and take a very
long time to return to positive.
So do all calculations as 'int'. That gives us plenty of precision
for what we need.
2/ To initialise rdev->last_events we simply call is_mddev_idle, on
the assumption that it will make sure that last_events is in a
suitable range. It used to do this, but now it does not.
So now we need to be more explicit about initialisation.
Signed-off-by: NeilBrown <neilb@suse.de>
We can't OR shift values, so get rid of BIO_RW_SYNC and use BIO_RW_SYNCIO
and BIO_RW_UNPLUG explicitly. This brings back the behaviour from before
213d9417fe.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Each different metadata format supported by md supports a
different maximum number of devices.
We really should be enforcing this maximum in the kernel, but
we aren't quite doing that properly.
We currently only enforce it at the 'hot_add' point, which is an
older interface which is not used by current userspace.
We need to also enforce it at 'add_new_disk' time for active arrays
and at 'do_md_run' time when starting a new array.
So move the test from 'hot_add' into 'bind_rdev_to_array' which is
called from both 'hot_add' and 'add_new_disk, and add a new
test in 'analyse_sbs' which is called from 'do_md_run'.
This bug (or missing feature) has been around "forever" and so
the patch is suitable for any -stable that is currently maintained.
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
If a raid1 has only one working drive and it has a sector which
gives an error on read, then an attempt to recover onto a spare will
fail, but as the single remaining drive is not removed from the
array, the recovery will be immediately re-attempted, resulting
in an infinite recovery loop.
So detect this situation and don't retry recovery once an error
on the lone remaining drive is detected.
Allow recovery to be retried once every time a spare is added
in case the problem wasn't actually a media error.
Signed-off-by: NeilBrown <neilb@suse.de>
Using sequential numbers to identify md devices is somewhat artificial.
Using names can be a lot more user-friendly.
Also, creating md devices by opening the device special file is a bit
awkward.
So this patch provides a new option for creating and naming devices.
Writing a name such as "md_home" to
/sys/modules/md_mod/parameters/new_array
will cause an array with that name to be created. It will appear in
/sys/block/ /proc/partitions and /proc/mdstat as 'md_home'.
It will have an arbitrary minor number allocated.
md devices that a created by an open are destroyed on the last
close when the device is inactive.
For named md devices, they will not be destroyed until the array
is explicitly stopped, either with the STOP_ARRAY ioctl or by
writing 'clear' to /sys/block/md_XXXX/md/array_state.
The name of the array must start 'md_' to avoid conflict with
other devices.
Signed-off-by: NeilBrown <neilb@suse.de>
Currently md devices, once created, never disappear until the module
is unloaded. This is essentially because the gendisk holds a
reference to the mddev, and the mddev holds a reference to the
gendisk, this a circular reference.
If we drop the reference from mddev to gendisk, then we need to ensure
that the mddev is destroyed when the gendisk is destroyed. However it
is not possible to hook into the gendisk destruction process to enable
this.
So we drop the reference from the gendisk to the mddev and destroy the
gendisk when the mddev gets destroyed. However this has a
complication.
Between the call
__blkdev_get->get_gendisk->kobj_lookup->md_probe
and the call
__blkdev_get->md_open
there is no obvious way to hold a reference on the mddev any more, so
unless something is done, it will disappear and gendisk will be
destroyed prematurely.
Also, once we decide to destroy the mddev, there will be an unlockable
moment before the gendisk is unlinked (blk_unregister_region) during
which a new reference to the gendisk can be created. We need to
ensure that this reference can not be used. i.e. the ->open must
fail.
So:
1/ in md_probe we set a flag in the mddev (hold_active) which
indicates that the array should be treated as active, even
though there are no references, and no appearance of activity.
This is cleared by md_release when the device is closed if it
is no longer needed.
This ensures that the gendisk will survive between md_probe and
md_open.
2/ In md_open we check if the mddev we expect to open matches
the gendisk that we did open.
If there is a mismatch we return -ERESTARTSYS and modify
__blkdev_get to retry from the top in that case.
In the -ERESTARTSYS sys case we make sure to wait until
the old gendisk (that we succeeded in opening) is really gone so
we loop at most once.
Some udev configurations will always open an md device when it first
appears. If we allow an md device that was just created by an open
to disappear on an immediate close, then this can race with such udev
configurations and result in an infinite loop the device being opened
and closed, then re-open due to the 'ADD' even from the first open,
and then close and so on.
So we make sure an md device, once created by an open, remains active
at least until some md 'ioctl' has been made on it. This means that
all normal usage of md devices will allow them to disappear promptly
when not needed, but the worst that an incorrect usage will do it
cause an inactive md device to be left in existence (it can easily be
removed).
As an array can be stopped by writing to a sysfs attribute
echo clear > /sys/block/mdXXX/md/array_state
we need to use scheduled work for deleting the gendisk and other
kobjects. This allows us to wait for any pending gendisk deletion to
complete by simply calling flush_scheduled_work().
Signed-off-by: NeilBrown <neilb@suse.de>
md_free is the .release handler for the md kobj_type.
So it makes sense to release all the objects referenced by
the mddev in there, rather than just prior to calling kobject_put
for what we think is the last time.
Signed-off-by: NeilBrown <neilb@suse.de>
It is more balanced to just do simple initialisation in mddev_find,
which allocates and links a new md device, and leave all the
more sophisticated allocation to md_probe (which calls mddev_find).
md_probe already allocated the gendisk. It should allocate the
queue too.
Signed-off-by: NeilBrown <neilb@suse.de>
The rdev_for_each macro defined in <linux/raid/md_k.h> is identical to
list_for_each_entry_safe, from <linux/list.h>, it should be defined to
use list_for_each_entry_safe, instead of reinventing the wheel.
But some calls to each_entry_safe don't really need a safe version,
just a direct list_for_each_entry is enough, this could save a temp
variable (tmp) in every function that used rdev_for_each.
In this patch, most rdev_for_each loops are replaced by list_for_each_entry,
totally save many tmp vars; and only in the other situations that will call
list_del to delete an entry, the safe version is used.
Signed-off-by: Cheng Renquan <crquan@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
There is no compelling need for this, but sysfs_notify_dirent is a
nicer interface and the change is good for consistency.
Signed-off-by: NeilBrown <neilb@suse.de>
It turns out that it is only safe to call blkdev_ioctl when the device
is actually open (as ->bd_disk is set to NULL on last close). And it
is quite possible for do_md_stop to be called when the device is not
open. So discard the call to blkdev_ioctl(BLKRRPART) which was
added in
commit 934d9c23b4
It is just as easy to call this ioctl from userspace when needed (on
mdadm -S) so leave it out of the kernel
Signed-off-by: NeilBrown <neilb@suse.de>
md arrays are not currently destroyed when they are stopped - they
remain in /sys/block. Last time I tried this I tripped over locking
too much.
A consequence of this is that udev doesn't remove anything from /dev.
This is rather ugly.
As an interim measure until proper device removal can be achieved,
make sure all partitions are removed using the BLKRRPART ioctl, and
send a KOBJ_CHANGE when an md array is stopped.
Signed-off-by: NeilBrown <neilb@suse.de>
* 'for-linus' of git://neil.brown.name/md:
md: allow extended partitions on md devices.
md: use sysfs_notify_dirent to notify changes to md/dev-xxx/state
md: use sysfs_notify_dirent to notify changes to md/array_state
To keep the size of changesets sane we split the switch by drivers;
to keep the damn thing bisectable we do the following:
1) rename the affected methods, add ones with correct
prototypes, make (few) callers handle both. That's this changeset.
2) for each driver convert to new methods. *ALL* drivers
are converted in this series.
3) kill the old (renamed) methods.
Note that it _is_ a flagday; all in-tree drivers are converted and by the
end of this series no trace of old methods remain. The only reason why
we do that this way is to keep the damn thing bisectable and allow per-driver
debugging if anything goes wrong.
New methods:
open(bdev, mode)
release(disk, mode)
ioctl(bdev, mode, cmd, arg) /* Called without BKL */
compat_ioctl(bdev, mode, cmd, arg)
locked_ioctl(bdev, mode, cmd, arg) /* Called with BKL, legacy */
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The new extended partition support provides a much nicer was
to have partitions on md devices that the 'mdp' alternate major.
We cannot really get rid of 'mdp' at this time, but we can
enable extended partitions as that will probably make life
easier for sysadmins.
Signed-off-by: NeilBrown <neilb@suse.de>
The 'state' file for a device reports, for example, when the device
has failed. Changes should be reported to userspace ASAP without
the possibility of blocking on low-memory. sysfs_notify does
have that possibility (as it takes a mutex which can be held
across a kmalloc) so use sysfs_notify_dirent instead.
Signed-off-by: NeilBrown <neilb@suse.de>
Now that we have sysfs_notify_dirent, use it to notify changes
to md/array_state.
As sysfs_notify_dirent can be called in atomic context, we can
remove the delayed notify and the MD_NOTIFY_ARRAY_STATE flag.
Signed-off-by: NeilBrown <neilb@suse.de>
safe_delay_store() currently truncates the last character of input since
it tells strlcpy that the buffer can only hold 'len' characters, off by
one. sysfs already null terminates the buffer, so just increase the
last argument to strlcpy.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Today's linux-next build (powerpc ppc64_defconfig) failed like this:
drivers/md/raid1.c: In function 'sync_request':
drivers/md/raid1.c:1759: error: implicit declaration of function 'msleep_interruptible'
make[3]: *** [drivers/md/raid1.o] Error 1
make[3]: *** Waiting for unfinished jobs....
drivers/md/raid10.c: In function 'sync_request':
drivers/md/raid10.c:1749: error: implicit declaration of function 'msleep_interruptible'
make[3]: *** [drivers/md/raid10.o] Error 1
drivers/md/md.c: In function 'md_do_sync':
drivers/md/md.c:5915: error: implicit declaration of function 'msleep'
Caused by commit 6caa3b0bbdb474647f6bdd8a958ffc46f78d8d58 ("md: Remove
unnecessary #includes, #defines, and function declarations"). I added
the following patch.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: NeilBrown <neilb@suse.de>
Currently, the 'chunk_size' of an array must be at-least PAGE_SIZE.
This makes moving an array to a machine with a larger PAGE_SIZE, or
changing the kernel to use a larger PAGE_SIZE, can stop an array from
working.
For RAID10 and RAID4/5/6, this is non-trivial to fix as the resync
process works on whole pages at a time, and assumes them to be wholly
within a stripe. For other raid personalities, this restriction is
not needed at all and can be dropped.
So remove the test on chunk_size from common can, and add it in just
the places where it is needed: raid10 and raid4/5/6.
Signed-off-by: NeilBrown <neilb@suse.de>
Having
function (args)
instead of
function(args)
make is harder to search for calls of particular functions.
So remove all those spaces.
Signed-off-by: NeilBrown <neilb@suse.de>
'read-auto' is a variant of 'readonly' which will switch to writable
on the first write attempt.
Calling do_md_stop to set the array readonly when it is already readonly
returns an error. So make sure not to do that.
Signed-off-by: NeilBrown <neilb@suse.de>
For externally managed metadata, the 'metadata_version' sysfs
attribute is really just a channel for user-space programs to
communicate about how the array is being managed.
It can be useful for this to be changed while the array is active.
Normally changes to metadata_version are not permitted while the array
is active. Change that so that if the metadata is externally managed,
the metadata_version can be changed to a different flavour of external
management.
Signed-off-by: NeilBrown <neilb@suse.de>
Fix rdev_size_store with size == 0.
size == 0 means to use the largest size allowed by the
underlying device and is used when modifying an active array.
This fixes a regression introduced by
commit d7027458d6
Cc: <stable@kernel.org>
Signed-off-by: Chris Webb <chris@arachsys.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Move stats related fields - stamp, in_flight, dkstats - from disk to
part0 and unify stat handling such that...
* part_stat_*() now updates part0 together if the specified partition
is not part0. ie. part_stat_*() are now essentially all_stat_*().
* {disk|all}_stat_*() are gone.
* part_round_stats() is updated similary. It handles part0 stats
automatically and disk_round_stats() is killed.
* part_{inc|dec}_in_fligh() is implemented which automatically updates
part0 stats for parts other than part0.
* disk_map_sector_rcu() is updated to return part0 if no part matches.
Combined with the above changes, this makes NULL special case
handling in callers unnecessary.
* Separate stats show code paths for disk are collapsed into part
stats show code paths.
* Rename disk_stat_lock/unlock() to part_stat_lock/unlock()
While at it, reposition stat handling macros a bit and add missing
parentheses around macro parameters.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Till now, bdev->bd_part is set only if the bdev was for parts other
than part0. This patch makes bdev->bd_part always set so that code
paths don't have to differenciate common handling.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Implement {disk|part}_to_dev() and use them to access generic device
instead of directly dereferencing {disk|part}->dev. To make sure no
user is left behind, rename generic devices fields to __dev.
This is in preparation of unifying partition 0 handling with other
partitions.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
When two md arrays share some block device (e.g each uses different
partitions on the one device), a resync of one array will wait for
the resync on the other to finish.
This can be a long time and as it currently waits TASK_UNINTERRUPTIBLE,
the softlockup code notices and complains.
So use TASK_INTERRUPTIBLE instead and make sure to flush signals
before calling schedule.
Signed-off-by: NeilBrown <neilb@suse.de>
When stopping an md array, or just switching to read-only, we
currently call invalidate_partition while holding the mddev lock.
The main reason for this is probably to ensure all dirty buffers
are flushed (invalidate_partition calls fsync_bdev).
However if any dirty buffers are found, it will almost certainly cause
a deadlock as starting writeout will require an update to the
superblock, and performing that updates requires taking the mddev
lock - which is already held.
This deadlock can be demonstrated by running "reboot -f -n" with
a root filesystem on md/raid, and some dirty buffers in memory.
All other calls to stop an array should already happen after a flush.
The normal sequence is to stop using the array (e.g. umount) which
will cause __blkdev_put to call sync_blockdev. Then open the
array and issue the STOP_ARRAY ioctl while the buffers are all still
clean.
So this invalidate_partition is normally a no-op, except for one case
where it will cause a deadlock.
So remove it.
This patch possibly addresses the regression recored in
http://bugzilla.kernel.org/show_bug.cgi?id=11460
and
http://bugzilla.kernel.org/show_bug.cgi?id=11452
though it isn't yet clear how it ever worked.
Signed-off-by: NeilBrown <neilb@suse.de>
If a 'repair' is requested when an array is in a position to 'recover' raid1
will perform the repair while md believes a recovery is happening. Address
this at both ends, i.e. cancel check/repair requests upon detecting a
recover condition and do not call ->spare_active after completing a
check/repair.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Removing faulty devices from an array is a two stage process.
First the device is moved from being a part of the active array
to being similar to a spare device. Then it can be removed
by a request from user space.
The first step is currently not performed for read-only arrays,
so the second step can never succeed.
So allow readonly arrays to remove failed devices (which aren't
blocked).
Signed-off-by: NeilBrown <neilb@suse.de>
We cannot currently change the size of a write-intent bitmap.
So if we change the size of an array which has such a bitmap, it
tries to set bits beyond the end of the bitmap.
For now, simply reject any request to change the size of an array
which has a bitmap. mdadm can remove the bitmap and add a new one
after the array has changed size.
Signed-off-by: NeilBrown <neilb@suse.de>
A recent patch allowed do_md_stop to know whether it was being called
via an ioctl or not, and thus where to allow for an extra open file
descriptor when checking if it is in use.
This broke then switch to readonly performed by the shutdown notifier,
which needs to work even when the array is still (apparently) active
(as md doesn't get told when the filesystem becomes readonly).
So restore this feature by pretending that there can be lots of
file descriptors open, but we still want do_md_stop to switch to
readonly.
Signed-off-by: NeilBrown <neilb@suse.de>
If we reduce the 'safe_mode_delay', it could still wait for the old
delay to completely expire before doing anything about safe_mode.
Thus the effect if the change is delayed.
To make the effect more immediate, run the timeout function
immediately if the delay was reduced. This may cause it to run
slightly earlier that required, but that is the safer option.
Signed-off-by: NeilBrown <neilb@suse.de>
remove_and_add_spares() assumes that failed devices have been hot-removed
from the array. Removal is skipped in the 'blocked' case so do not count a
device in this state as 'spare'.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
All modifications and most access to the mddev->disks list are made
under the reconfig_mutex lock. However there are three places where
the list is walked without any locking. If a reconfig happens at this
time, havoc (and oops) can ensue.
So use RCU to protect these accesses:
- wrap them in rcu_read_{,un}lock()
- use list_for_each_entry_rcu
- add to the list with list_add_rcu
- delete from the list with list_del_rcu
- delay the 'free' with call_rcu rather than schedule_work
Note that export_rdev did a list_del_init on this list. In almost all
cases the entry was not in the list anymore so it was a no-op and so
safe. It is no longer safe as after list_del_rcu we may not touch
the list_head.
An audit shows that export_rdev is called:
- after unbind_rdev_from_array, in which case the delete has
already been done,
- after bind_rdev_to_array fails, in which case the delete isn't needed.
- before the device has been put on a list at all (e.g. in
add_new_disk where reading the superblock fails).
- and in autorun devices after a failure when the device is on a
different list.
So remove the list_del_init call from export_rdev, and add it back
immediately before the called to export_rdev for that last case.
Note also that ->same_set is sometimes used for lists other than
mddev->list (e.g. candidates). In these cases rcu is not needed.
Signed-off-by: NeilBrown <neilb@suse.de>
Open isn't the only thing that increments ->active. e.g. reading
/proc/mdstat will increment it briefly. So to avoid false positives
in testing for concurrent access, introduce a new counter that counts
just the number of times the md device it open.
Signed-off-by: NeilBrown <neilb@suse.de>
This patch renames the array_size field of struct mddev_s to array_sectors
and converts all instances to use units of 512 byte sectors instead of 1k
blocks.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Also, change the type of the size parameter from unsigned long long to
sector_t and rename it to num_sectors.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
The checks in overlaps() expect all parameters either in block-based
or sector-based quantities. However, its single caller passes two
rdev->data_offset arguments as well as two rdev->size arguments, the
former being sector counts while the latter are measured in 1K blocks.
This could cause rdev_size_store() to accept an invalid size from user
space. Fix it by passing only sector-based quantities to overlaps().
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
- used strict_strtoull in place of simple_strtoull
- use my_mddev in place of rdev->mddev (they have the same value)
and more significantly,
- don't adjust mddev->size to fit, rather reject changes which make
rdev->size smaller than mddev->size
Adjusting mddev->size is a hangover from bind_rdev_to_array which
does a similar thing. But it really is a better design to insist that
mddev->size is set as required, then the rdev->sizes are set to allow
for that. The previous way invites confusion.
Signed-off-by: NeilBrown <neilb@suse.de>
Rename it to sb_start to make sure all users have been converted.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
As BLOCK_SIZE_BITS is 10 and
MD_NEW_SIZE_SECTORS(2 * x) = 2 * NEW_SIZE_BLOCKS(x),
the return value of calc_dev_sboffset() doubles. Fix up all three
callers accordingly.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
Number of sectors is the preferred unit for sizes of raid devices,
so change calc_dev_size() so that it returns this unit instead of
the number of 1K blocks.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
Changing the internal representations of sizes of raid devices
from 1K blocks to sector counts (512B units) is desirable because
it allows to get rid of many divisions/multiplications and unnecessary
casts that are present in the current code.
This patch is a first step in this direction. It replaces the old
1K-based "size" argument of update_size() by "num_sectors" and
fixes up its two callers.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
do_md_stop check the number of active users before allowing the array
to be stopped.
Two problems:
1/ it assumes the request is coming through an open file descriptor
(via ioctl) so it allows for that. This is not always the case.
2/ it doesn't do the check it the array hasn't been activated.
This is not good for cases when we use an inactive array to hold
some devices in a container.
Signed-off-by: Neil Brown <neilb@suse.de>
The current code copies a signed int from user space, converts it to
unsigned and passes the unsigned value to find_rdev_nr() which expects
a signed value. Simply pass the signed value from user space directly.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
If alloc_page() fails, ENOMEM is a more suitable error value
than EINVAL.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
The only caller of sb_equal() tests the return value against
zero, so it's OK to return the negated return value of memcmp().
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
- Remove superfluous parentheses.
- Make format string match the type of the variable that is printed.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
In case pers->run() succeeds but creating the bitmap fails, we
print an error message stating that pers->run() has failed.
Print this message only if pers->run() really failed.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
md_allow_write() marks the metadata dirty while holding mddev->lock and then
waits for the write to complete. For externally managed metadata this causes a
deadlock as userspace needs to take the lock to communicate that the metadata
update has completed.
Change md_allow_write() in the 'external' case to start the 'mark active'
operation and then return -EAGAIN. The expected side effects while waiting for
userspace to write 'active' to 'array_state' are holding off reshape (code
currently handles -ENOMEM), cause some 'stripe_cache_size' change requests to
fail, cause some GET_BITMAP_FILE ioctl requests to fall back to GFP_NOIO, and
cause updates to 'raid_disks' to fail. Except for 'stripe_cache_size' changes
these failures can be mitigated by coordinating with mdmon.
md_write_start() still prevents writes from occurring until the metadata
handler has had a chance to take action as it unconditionally waits for
MD_CHANGE_CLEAN to be cleared.
[neilb@suse.de: return -EAGAIN, try GFP_NOIO]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
From: Chris Webb <chris@arachsys.com>
Allow /sys/block/mdX/md/rdY/size to change on running arrays, moving the
superblock if necessary for this metadata version. We prevent the available
space from shrinking to less than the used size, and allow it to be set to zero
to fill all the available space on the underlying device.
Signed-off-by: Chris Webb <chris@arachsys.com>
Signed-off-by: Neil Brown <neilb@suse.de>
The important state change happens during an interrupt
in md_error. So just set a flag there and call sysfs_notify
later in process context.
Signed-off-by: Neil Brown <neilb@suse.de>
When a device fails, when a spare is activated, when
an array is reshaped, or when an array is started,
the extent to which the array is degraded can change.
Signed-off-by: Neil Brown <neilb@suse.de>
When the 'resync' thread starts or stops, when we explicitly
set sync_action, or when we determine that there is definitely nothing
to do, we notify sync_action.
To stop "sync_action" from occasionally showing the wrong value,
we introduce a new flags - MD_RECOVERY_RECOVER - to say that a
recovery is probably needed or happening, and we make sure
that we set MD_RECOVERY_RUNNING before clearing MD_RECOVERY_NEEDED.
Signed-off-by: Neil Brown <neilb@suse.de>
Changes in md/array_state could be of interest to a monitoring
program. So make sure all changes trigger a notification.
Exceptions:
changing active_idle to active is not reported because it
is frequent and not interesting.
changing active to active_idle is only reported on arrays
with externally managed metadata, as it is not interesting
otherwise.
Signed-off-by: Neil Brown <neilb@suse.de>
There is really no need for this test here, and there are valid
cases for selectively removing devices from an array that
it not actually active.
Signed-off-by: Neil Brown <neilb@suse.de>
For all array types but linear, ->hot_add_disk returns 1 on
success, 0 on failure.
For linear, it returns 0 on success and -errno on failure.
This doesn't cause a functional problem because the ->hot_add_disk
function of linear is used quite differently to the others.
However it is confusing.
So convert all to return 0 for success or -errno on failure
and fix call sites to match.
Signed-off-by: Neil Brown <neilb@suse.de>
i.e. extend the 'md/dev-XXX/slot' attribute so that you can
tell a device to fill an vacant slot in an and md array.
Signed-off-by: Neil Brown <neilb@suse.de>
offset_store and rdev_size_store allow control of the region of a
device which is to be using in an md/raid array.
They only allow these values to be set when an array is being assembled,
as changing them on an active array could be dangerous.
However when adding a spare device to an array, we might need to
set the offset and size before starting recovery. So allow
these values to be set also if "->raid_disk < 0" which indicates that
the device is still a spare.
Signed-off-by: Neil Brown <neilb@suse.de>
Arrays personalities such as 'raid0' and 'linear' have no redundancy,
and so marking them as 'clean' or 'dirty' is not meaningful.
So always allow write requests without requiring a superblock update.
Such arrays types are detected by ->sync_request being NULL. If it is
not possible to send a sync request we don't need a 'dirty' flag because
all a dirty flag does is trigger some sync_requests.
Signed-off-by: Neil Brown <neilb@suse.de>
There is a possible race in md_probe. If two threads call md_probe
for the same device, then one could exit (having checked that
->gendisk exists) before the other has called kobject_init_and_add,
thus returning an incomplete kobj which will cause problems when
we try to add children to it.
So extend the range of protection of disks_mutex slightly to
avoid this possibility.
Signed-off-by: Neil Brown <neilb@suse.de>
This makes it possible to just resync a small part of an array.
e.g. if a drive reports that it has questionable sectors,
a 'repair' of just the region covering those sectors will
cause them to be read and, if there is an error, re-written
with correct data.
Signed-off-by: Neil Brown <neilb@suse.de>
md_probe can fail (e.g. alloc_disk could fail) without
returning an error (as it alway returns NULL).
So when we call mddev_find immediately afterwards, we need
to check that md_probe actually succeeded. This means checking
that mdev->gendisk is non-NULL.
cc: <stable@kernel.org>
Cc: Dave Jones <davej@redhat.com>
Signed-off-by: Neil Brown <neilb@suse.de>
If an array was created with --assume-clean we will oops when trying to
set ->resync_max.
Fix this by initializing ->recovery_wait in mddev_find.
Cc: <stable@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When we get any IO error during a recovery (rebuilding a spare), we abort
the recovery and restart it.
For RAID6 (and multi-drive RAID1) it may not be best to restart at the
beginning: when multiple failures can be tolerated, the recovery may be
able to continue and re-doing all that has already been done doesn't make
sense.
We already have the infrastructure to record where a recovery is up to
and restart from there, but it is not being used properly.
This is because:
- We sometimes abort with MD_RECOVERY_ERR rather than just MD_RECOVERY_INTR,
which causes the recovery not be be checkpointed.
- We remove spares and then re-added them which loses important state
information.
The distinction between MD_RECOVERY_ERR and MD_RECOVERY_INTR really isn't
needed. If there is an error, the relevant drive will be marked as
Faulty, and that is enough to ensure correct handling of the error. So we
first remove MD_RECOVERY_ERR, changing some of the uses of it to
MD_RECOVERY_INTR.
Then we cause the attempt to remove a non-faulty device from an array to
fail (unless recovery is impossible as the array is too degraded). Then
when remove_and_add_spares attempts to remove the devices on which
recovery can continue, it will fail, they will remain in place, and
recovery will continue on them as desired.
Issue: If we are halfway through rebuilding a spare and another drive
fails, and a new spare is immediately available, do we want to:
1/ complete the current rebuild, then go back and rebuild the new spare or
2/ restart the rebuild from the start and rebuild both devices in
parallel.
Both options can be argued for. The code currently takes option 2 as
a/ this requires least code change
b/ this results in a minimally-degraded array in minimal time.
Cc: "Eivind Sarto" <ivan@kasenna.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In some configurations, a raid6 resync can be limited by CPU speed
(Calculating P and Q and moving data) rather than by device speed. In
these cases there is nothing to be gained byt serialising resync of arrays
that share a device, and doing the resync in parallel can provide benefit.
So add a sysfs tunable to flag an array as being allowed to resync in
parallel with other arrays that use (a different part of) the same device.
Signed-off-by: Bernd Schubert <bs@q-leap.de>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This additional notification to 'array_state' is needed to allow the
monitor application to learn about stop events via sysfs. The
sysfs_notify("sync_action") call that comes at the end of do_md_stop()
(via md_new_event) is insufficient since the 'sync_action' attribute has
been removed by this point.
(Seems like a sysfs-notify-on-removal patch is a better fix. Currently
removal updates the event count but does not wake up waiters)
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When an array enters write pending, 'array_state' changes, so we must be
sure to sysfs_notify.
Also, when waiting for user-space to acknowledge 'write-pending' by
marking the metadata as dirty, we don't want to wait for MD_CHANGE_DEVS to
be cleared as that might not happen. So explicity test for the bits that
we are really interested in.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kill the trivial and rather pointless file_path wrapper around d_path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Allows a userspace metadata handler to take action upon detecting a device
failure.
Based on an original patch by Neil Brown.
Changes:
-added blocked_wait waitqueue to rdev
-don't qualify Blocked with Faulty always let userspace block writes
-added md_wait_for_blocked_rdev to wait for the block device to be clear, if
userspace misses the notification another one is sent every 5 seconds
-set MD_RECOVERY_NEEDED after clearing "blocked"
-kill DoBlock flag, just test mddev->external
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Found when trying to reassemble an active externally managed array. Without
this check we hit the more noisy "sysfs duplicate" warning in the later call
to kobject_add.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When setting an array to 'readonly' or to 'active' via sysfs, we must make the
appropriate set_disk_ro call too.
Also when switching to "read_auto" (which is like readonly, but blocks on the
first write so that metadata can be marked 'dirty') we need to be more careful
about what state we are changing from.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
'safemode' relates to marking an array as 'clean' if there has been no write
traffic for a while (a couple of seconds), to reduce the chance of the array
being found dirty on reboot.
->safemode is set to '1' when there have been no write for a while, and it
gets set to '0' when the superblock is updates with the 'clean' flag set.
This requires a few fixes for 'external' metadata:
- When an array is set to 'clean' via sysfs, 'safemode' must be cleared.
- when we write to an array that has 'safemode' set (there must have been
some delay in updating the metadata), we need to clear safemode.
- Don't try to update external metadata in md_check_recovery for safemode
transitions - it won't work.
Also, don't try to support "immediate safe mode" (safemode==2) for external
metadata, it cannot really work (the safemode timeout can be set very low if
this is really needed).
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I keep finding problems where an mddev gets reused and some fields has a value
from a previous usage that confuses the new usage. So clear all fields that
could possible need clearing when calling do_md_stop.
Also initialise the 'level' of a new array to LEVEL_NONE (which isn't 0).
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All the metadata update processing for external metadata is on in user-space
or through the sysfs interfaces, so make "md_update_sb" a no-op in that case.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
rdev->mddev is no longer valid upon return from entry->store() when the
'remove' command is given.
Cc: <stable@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
block: Skip I/O merges when disabled
block: add large command support
block: replace sizeof(rq->cmd) with BLK_MAX_CDB
ide: use blk_rq_init() to initialize the request
block: use blk_rq_init() to initialize the request
block: rename and export rq_init()
block: no need to initialize rq->cmd with blk_get_request
block: no need to initialize rq->cmd in prepare_flush_fn hook
block/blk-barrier.c:blk_ordered_cur_seq() mustn't be inline
block/elevator.c:elv_rq_merge_ok() mustn't be inline
block: make queue flags non-atomic
block: add dma alignment and padding support to blk_rq_map_kern
unexport blk_max_pfn
ps3disk: Remove superfluous cast
block: make rq_init() do a full memset()
relay: fix splice problem
Use proc_create()/proc_create_data() to make sure that ->proc_fops and ->data
be setup before gluing PDE to main tree.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Peter Osterlund <petero2@telia.com>
Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Cc: Dmitry Torokhov <dtor@mail.ru>
Cc: Neil Brown <neilb@suse.de>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We can save some atomic ops in the IO path, if we clearly define
the rules of how to modify the queue flags.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
drivers/md/md.c:734:16: warning: Using plain integer as NULL pointer
drivers/md/md.c:1115:16: warning: Using plain integer as NULL pointer
Add some braces to match the else-block as well.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
drivers/md/*.[ch] contains only one more printk line with a trailing space.
Remove it.
Signed-off-by: Nick Andrew <nick@nick-andrew.net>
Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Exposing the binary blob which is the md 'super-block' via sysfs doesn't
really fit with the whole sysfs model, and ever since commit
8118a859dc ("sysfs: fix off-by-one error
in fill_read_buffer()") it doesn't actually work at all (as the size of
the blob is often one page).
(akpm: as in, fs/sysfs/file.c:fill_read_buffer() goes BUG)
So just remove it altogether. It isn't really useful.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If an md array is "auto-read-only", then this appears in /proc/mdstat as
/dev/md0: active(auto-read-only)
whereas if it is truely readonly, it appears as
/dev/md0: active (read-only)
The difference being a space.
One program known to parse this file expects the space and gets badly
confused. It will be fixed, but it would be best if what the kernel generates
is more consistent too.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When we access attributes of an rdev (component device on an md array) through
sysfs, we really need to lock the array against concurrent changes. We
currently do that when we change an attribute, but not when we read an
attribute. We need to lock when reading as well else rdev->mddev could become
NULL while we are accessing it.
So add appropriate locking (mddev_lock) to rdev_attr_show.
rdev_size_store requires some extra care as well as it needs to unlock the
mddev while scanning other mddevs for overlapping regions. We currently
assume that rdev->mddev will still be unchanged after the scan, but that
cannot be certain. So take a copy of rdev->mddev for use at the end of the
function.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A resync/reshape/recovery thread will refuse to progress when the array is
marked read-only. So whenever it mark it not read-only, it is important to
wake up thread resync thread. There is one place we didn't do this.
The problem manifests if the start_ro module parameters is set, and a raid5
array that is in the middle of a reshape (restripe) is started. The array
will initially be semi-read-only (meaning it acts like it is readonly until
the first write). So the reshape will not proceed.
On the first write, the array will become read-write, but the reshape will not
be started, and there is no event which will ever restart that thread.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a raid1 array is stopped, all components currently get added to the list
for auto-detection. However we should really only add components that were
found by autodetection in the first place. So add a flag to record that
information, and use it.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make sure the data doesn't start before the end of the superblock when the
superblock is at the start of the device.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
seq_path() is always called with a dentry and a vfsmount from a struct path.
Make seq_path() take it directly as an argument.
Signed-off-by: Jan Blunck <jblunck@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Finish ITERATE_ to for_each conversion.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As this is more in line with common practice in the kernel. Also swap the
args around to be more like list_for_each.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As this is more consistent with kernel style.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As suggested by Andrew Morton.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Due to possible deadlock issues we need to use a schedule work to kobject_del
an 'rdev' object from a different thread.
A recent change means that kobject_add no longer gets a refernce, and
kobject_del doesn't put a reference. Consequently, we need to explicitly hold
a reference to ensure that the last reference isn't dropped before the
scheduled work get a chance to call kobject_del.
Also, rename delayed_delete to md_delayed_delete to that it is more obvious in
a stack trace which code is to blame.
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, a given device is "claimed" by a particular array so that it cannot
be used by other arrays.
This is not ideal for DDF and other metadata schemes which have their own
partitioning concept.
So for externally managed metadata, just claim the device for md in general,
require that "offset" and "size" are set properly for each device, and make
sure that if a device is included in different arrays then the active sections
do not overlap.
This involves adding another flag to the rdev which makes it awkward to set
"->flags = 0" to clear certain flags. So now clear flags explicitly by name
when we want to clear things.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If you try to start an array for which the number of raid disks is listed as
zero, md will currently try to read metadata off any devices that have been
given. This was done because the value of raid_disks is used to signal
whether array details have been provided by userspace (raid_disks > 0) or must
be read from the devices (raid_disks == 0).
However for an array without persistent metadata (or with externally managed
metadata) this is the wrong thing to do. So we add a test in do_md_run to
give an error if raid_disks is zero for non-persistent arrays.
This requires that mddev->persistent is set corrently at this point, which it
currently isn't for in-kernel autodetected arrays.
So set ->persistent for autodetect arrays, and remove the settign in
super_*_validate which is now redundant.
Also clear ->persistent when stopping an array so it is consistently zero when
starting an array.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This allows userspace to control resync/reshape progress and synchronise it
with other activities, such as shared access in a SAN, or backing up critical
sections during a tricky reshape.
Writing a number of sectors (which must be a multiple of the chunk size if
such is meaningful) causes a resync to pause when it gets to that point.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a device fails, we must not allow an further writes to the array until
the device failure has been recorded in array metadata. When metadata is
managed externally, this requires some synchronisation...
Allow/require userspace to explicitly remove failed devices from active
service in the array by writing 'none' to the 'slot' attribute. If this
reduces the number of failed devices to 0, the write block will automatically
be lowered.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Add a state flag 'external' to indicate that the metadata is managed
externally (by user-space) so important changes need to be
left of user-space to handle.
Alternates are non-persistant ('none') where there is no stable metadata -
after the array is stopped there is no record of it's status - and
internal which can be version 0.90 or version 1.x
These are selected by writing to the 'metadata' attribute.
- move the updating of superblocks (sync_sbs) to after we have checked if
there are any superblocks or not.
- New array state 'write_pending'. This means that the metadata records
the array as 'clean', but a write has been requested, so the metadata has
to be updated to record a 'dirty' array before the write can continue.
This change is reported to md by writing 'active' to the array_state
attribute.
- tidy up marking of sb_dirty:
- don't set sb_dirty when resync finishes as md_check_recovery
calls md_update_sb when the sync thread finishes anyway.
- Don't set sb_dirty in multipath_run as the array might not be dirty.
- don't mark superblock dirty when switching to 'clean' if there
is no internal superblock (if external, userspace can choose to
update the superblock whenever it chooses to).
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is no need for kobject_unregister() anymore, thanks to Kay's
kobject cleanup changes, so replace all instances of it with
kobject_put().
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Now that the old kobject_init() function is gone, rename
kobject_init_ng() to kobject_init() to clean up the namespace.
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Now that the old kobject_add() function is gone, rename kobject_add_ng()
to kobject_add() to clean up the namespace.
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This converts the code to use the new kobject functions, cleaning up the
logic in doing so.
Cc: Neil Brown <neilb@suse.de>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This moves the block devices to /sys/class/block. It will create a
flat list of all block devices, with the disks and partitions in one
directory. For compatibility /sys/block is created and contains symlinks
to the disks.
/sys/class/block
|-- sda -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda
|-- sda1 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda1
|-- sda10 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda10
|-- sda5 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda5
|-- sda6 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda6
|-- sda7 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda7
|-- sda8 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda8
|-- sda9 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda9
`-- sr0 -> ../../devices/pci0000:00/0000:00:1f.2/host1/target1:0:0/1:0:0:0/block/sr0
/sys/block/
|-- sda -> ../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda
`-- sr0 -> ../devices/pci0000:00/0000:00:1f.2/host1/target1:0:0/1:0:0:0/block/sr0
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Stop using kobject_register, as this way we can control the sending of
the uevent properly, after everything is properly initialized.
Cc: Neil Brown <neilb@suse.de>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Added blk_unplug interface, allowing all invocations of unplugs to result
in a generated blktrace UNPLUG.
Signed-off-by: Alan D. Brunelle <Alan.Brunelle@hp.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The task_struct->pid member is going to be deprecated, so start
using the helpers (task_pid_nr/task_pid_vnr/task_pid_nr_ns) in
the kernel.
The first thing to start with is the pid, printed to dmesg - in
this case we may safely use task_pid_nr(). Besides, printks produce
more (much more) than a half of all the explicit pid usage.
[akpm@linux-foundation.org: git-drm went and changed lots of stuff]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Dave Airlie <airlied@linux.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The 'degraded' attribute is useful to quickly determine if the array is
degraded, instead of parsing 'mdadm -D' output or relying on the other
techniques (number of working devices against number of defined devices,
etc.). The md code already keeps track of this attribute, so it's useful to
export it.
Signed-off-by: Iustin Pop <iusty@k1024.org>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When an array is started read-only, MD_RECOVERY_NEEDED can be set but no
recovery will be running. This causes 'sync_action' to report the wrong
value.
We could remove the test for MD_RECOVERY_NEEDED, but doing so would leave a
small gap after requesting a sync action, where 'sync_action' would still
report the old value.
So make sure that for a read-only array, 'sync_action' always returns 'idle'.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In current release kernels the md module (Software RAID) uses a static
array (dev_t[128]) to store partition/device info temporarily for
autostart.
I discovered this (and that the devices are added as disks/partitions are
discovered at boot) while I was debugging why only one of my MD arrays would
come up whole, while all the others were short a disk.
I eventually discovered that it was enumerating through all of 9 of my 11 hds
(2 had only 4 partitions apiece) while the other 9 have 15 partitions (I
wanted 64 per drive...). The last partition of the 8th drive in my 9 drive
raid 5 sets wasn't added, thus making the final md array short both a parity
and data disk, and it was started later, elsewhere.
This patch replaces that static array with a list.
[akpm@linux-foundation.org: removed unused var]
Signed-off-by: Michael J. Evans <mjevans1983@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A number of different drivers incorrect access the kobject name field
directly. This is not correct as the name might not be in the array.
Use the proper accessor function instead.
As bi_end_io is only called once when the reqeust is complete,
the 'size' argument is now redundant. Remove it.
Now there is no need for bio_endio to subtract the size completed
from bi_size. So don't do that either.
While we are at it, change bi_end_io to return void.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Some of the code has been gradually transitioned to using the proper
struct request_queue, but there's lots left. So do a full sweet of
the kernel and get rid of this typedef and replace its uses with
the proper type.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
bitmap_unplug only ever returns 0, so it may as well be void. Two callers try
to print a message if it returns non-zero, but that message is already printed
by bitmap_file_kick.
write_page returns an error which is not consistently checked. It always
causes BITMAP_WRITE_ERROR to be set on an error, and that can more
conveniently be checked.
When the return of write_page is checked, an error causes bitmap_file_kick to
be called - so move that call into write_page - and protect against recursive
calls into bitmap_file_kick.
bitmap_update_sb returns an error that is never checked.
So make these 'void' and be consistent about checking the bit.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We current completely trust user-space to set up metadata describing an
consistant array. In particlar, that the metadata, data, and bitmap do not
overlap.
But userspace can be buggy, and it is better to report an error than corrupt
data. So put in some appropriate checks.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Don't use 'unsigned' variable to track sync vs non-sync IO, as the only thing
we want to do with them is a signed comparison, and fix up the comment which
had become quite wrong.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
People try to use raid auto-detect with version-1 superblocks (which is not
supported) and get confused when they are told they have an invalid
superblock.
So be more explicit, and say it it is not a valid v0.90 superblock.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, the freezer treats all tasks as freezable, except for the kernel
threads that explicitly set the PF_NOFREEZE flag for themselves. This
approach is problematic, since it requires every kernel thread to either
set PF_NOFREEZE explicitly, or call try_to_freeze(), even if it doesn't
care for the freezing of tasks at all.
It seems better to only require the kernel threads that want to or need to
be frozen to use some freezer-related code and to remove any
freezer-related code from the other (nonfreezable) kernel threads, which is
done in this patch.
The patch causes all kernel threads to be nonfreezable by default (ie. to
have PF_NOFREEZE set by default) and introduces the set_freezable()
function that should be called by the freezable kernel threads in order to
unset PF_NOFREEZE. It also makes all of the currently freezable kernel
threads call set_freezable(), so it shouldn't cause any (intentional)
change of behaviour to appear. Additionally, it updates documentation to
describe the freezing of tasks more accurately.
[akpm@linux-foundation.org: build fixes]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Nigel Cunningham <nigel@nigel.suspend2.net>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The async_tx api tries to use a dma engine for an operation, but will fall
back to an optimized software routine otherwise. Xor support is
implemented using the raid5 xor routines. For organizational purposes this
routine is moved to a common area.
The following fixes are also made:
* rename xor_block => xor_blocks, suggested by Adrian Bunk
* ensure that xor.o initializes before md.o in the built-in case
* checkpatch.pl fixes
* mark calibrate_xor_blocks __init, Adrian Bunk
Cc: Adrian Bunk <bunk@stusta.de>
Cc: NeilBrown <neilb@suse.de>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Adding a drive to a linear array seems to have stopped working, due to changes
elsewhere in md, and insufficient ongoing testing...
So the patch to make linear hot-add work in the first place introduced a
subtle bug elsewhere that interracts poorly with older version of mdadm.
This fixes it all up.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During a 'resync' or similar activity, md checks if the devices in the
array are otherwise active and winds back resync activity when they are.
This test in done in is_mddev_idle, and it is somewhat fragile - it
sometimes thinks there is non-sync io when there isn't.
The test compares the total sectors of io (disk_stat_read) with the sectors
of resync io (disk->sync_io). This has problems because total sectors gets
updated when a request completes, while resync io gets updated when the
request is submitted. The time difference can cause large differenced
between the two which do not actually imply non-resync activity. The test
currently allows for some fuzz (+/- 4096) but there are some cases when it
is not enough.
The test currently looks for any (non-fuzz) difference, either positive or
negative. This clearly is not needed. Any non-sync activity will cause
the total sectors to grow faster than the sync_io count (never slower) so
we only need to look for a positive differences.
If we do this then the amount of in-flight sync io will never cause the
appearance of non-sync IO. Once enough non-sync IO to worry about starts
happening, resync will be slowed down and the measurements will thus be
more precise (as there is less in-flight) and control of resync will still
be suitably responsive.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit 5b479c91da.
Quoth Neil Brown:
"It causes an oops when auto-detecting raid arrays, and it doesn't
seem easy to fix.
The array may not be 'open' when do_md_run is called, so
bdev->bd_disk might be NULL, so bd_set_size can oops.
This whole approach of opening an md device before it has been
assembled just seems to get more and more painful. I think I'm going
to have to come up with something clever to provide both backward
comparability with usage expectation, and sane integration into the
rest of the kernel."
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
md currently uses ->media_changed to make sure rescan_partitions
is call on md array after they are assembled.
However that doesn't happen until the array is opened, which is later
than some people would like.
So use blkdev_ioctl to do the rescan immediately that the
array has been assembled.
This means we can remove all the ->change infrastructure as it was only used
to trigger a partition rescan.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
"reshape_position" records how much progress has been made on a "reshape"
(adding drives, changing layout or chunksize).
When it is set, the number of drives, layout and chunksize can have
two possible values, an old an a new.
So allow these different values to be visible, and allow both old and new to
be set: Set the old ones first, then the reshape_position, then the new
values.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If CONFIG_NET is not selected, csum_partial is not exported, so md.ko cannot
use it. We shouldn't really be using csum_partial anyway as it is an
internal-to-networking interface.
So replace it with C code to do the same thing. Speed is not crucial here, so
something simple and correct is best.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We need to check for internal-consistency of superblock in load_super.
validate_super is for inter-device consistency.
With the test in the wrong place, a badly created array will confuse md rather
an produce sensible errors.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We can save some lines of code by using seq_release_private().
Signed-off-by: Martin Peschke <mp3@de.ibm.com>
Acked-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use ARRAY_SIZE macro already defined in kernel.h
Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com>
Acked-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the destroy_dirty_buffers argument from invalidate_bdev(), it hasn't
been used in 6 years (so akpm says).
find * -name \*.[ch] | xargs grep -l invalidate_bdev |
while read file; do
quilt add $file;
sed -ie 's/invalidate_bdev(\([^,]*\),[^)]*)/invalidate_bdev(\1)/g' $file;
done
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A device can be removed from an md array via e.g.
echo remove > /sys/block/md3/md/dev-sde/state
This will try to remove the 'dev-sde' subtree which will deadlock
since
commit e7b0d26a86
With this patch we run the kobject_del via schedule_work so as to
avoid the deadlock.
Cc: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
... still not sure why we need this ....
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If this mddev and queue got reused for another array that doesn't register a
congested_fn, this function would get called incorretly.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
An error always aborts any resync/recovery/reshape on the understanding that
it will immediately be restarted if that still makes sense. However a reshape
currently doesn't get restarted. With this patch it does.
To avoid restarting when it is not possible to do work, we call into the
personality to check that a reshape is ok, and strengthen raid5_check_reshape
to fail if there are too many failed devices.
We also break some code out into a separate function: remove_and_add_spares as
the indent level for that code was getting crazy.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The mddev and queue might be used for another array which does not set these,
so they need to be cleared.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
md tries to warn the user if they e.g. create a raid1 using two partitions of
the same device, as this does not provide true redundancy.
However it also warns if a raid0 is created like this, and there is nothing
wrong with that.
At the place where the warning is currently printer, we don't necessarily know
what level the array will be, so move the warning from the point where the
device is added to the point where the array is started.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The semantic effect of insert_at_head is that it would allow new registered
sysctl entries to override existing sysctl entries of the same name. Which is
pain for caching and the proc interface never implemented.
I have done an audit and discovered that none of the current users of
register_sysctl care as (excpet for directories) they do not register
duplicate sysctl entries.
So this patch simply removes the support for overriding existing entries in
the sys_sysctl interface since no one uses it or cares and it makes future
enhancments harder.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Corey Minyard <minyard@acm.org>
Cc: Neil Brown <neilb@suse.de>
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Jan Kara <jack@ucw.cz>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: David Chinner <dgc@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The sysctls used by the md driver are have unique binary numbers so remove the
insert_at_head flag as it serves no useful purpose.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Many struct file_operations in the kernel can be "const". Marking them const
moves these to the .rodata section, which avoids false sharing with potential
dirty data. In addition it'll catch accidental writes at compile time to
these shared resources.
[akpm@sdl.org: dvb fix]
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If a GFP_KERNEL allocation is attempted in md while the mddev_lock is held,
it is possible for a deadlock to eventuate.
This happens if the array was marked 'clean', and the memalloc triggers a
write-out to the md device.
For the writeout to succeed, the array must be marked 'dirty', and that
requires getting the mddev_lock.
So, before attempting a GFP_KERNEL allocation while holding the lock, make
sure the array is marked 'dirty' (unless it is currently read-only).
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that we sometimes step the array events count backwards (when
transitioning dirty->clean where nothing else interesting has happened - so
that we don't need to write to spares all the time), it is possible for the
event count to return to zero, which is potentially confusing and triggers and
MD_BUG.
We could possibly remove the MD_BUG, but is just as easy, and probably safer,
to make sure we never return to zero.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While developing more functionality in mdadm I found some bugs in md...
- When we remove a device from an inactive array (write 'remove' to
the 'state' sysfs file - see 'state_store') would should not
update the superblock information - as we may not have
read and processed it all properly yet.
- initialise all raid_disk entries to '-1' else the 'slot sysfs file
will claim '0' for all devices in an array before the array is
started.
- all '\n' not to be present at the end of words written to
sysfs files
- when we use SET_ARRAY_INFO to set the md metadata version,
set the flag to say that there is persistant metadata.
- allow GET_BITMAP_FILE to be called on an array that hasn't
been started yet.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix few bugs that meant that:
- superblocks weren't alway written at exactly the right time (this
could show up if the array was not written to - writting to the array
causes lots of superblock updates and so hides these errors).
- restarting device recovery after a clean shutdown (version-1 metadata
only) didn't work as intended (or at all).
1/ Ensure superblock is updated when a new device is added.
2/ Remove an inappropriate test on MD_RECOVERY_SYNC in md_do_sync.
The body of this if takes one of two branches depending on whether
MD_RECOVERY_SYNC is set, so testing it in the clause of the if
is wrong.
3/ Flag superblock for updating after a resync/recovery finishes.
4/ If we find the neeed to restart a recovery in the middle (version-1
metadata only) make sure a full recovery (not just as guided by
bitmaps) does get done.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The autorun code is only used if this module is built into the static
kernel image. Adjust #ifdefs accordingly.
Signed-off-by: Jeff Garzik <jeff@garzik.org>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
An md array can be stopped leaving all the setting still in place, or it can
torn down and destroyed. set_capacity and other change notifications only
happen in the latter case, but should happen in both.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
md_open takes ->reconfig_mutex which causes lockdep to complain. This
(normally) doesn't have deadlock potential as the possible conflict is with a
reconfig_mutex in a different device.
I say "normally" because if a loop were created in the array->member hierarchy
a deadlock could happen. However that causes bigger problems than a deadlock
and should be fixed independently.
So we flag the lock in md_open as a nested lock. This requires defining
mutex_lock_interruptible_nested.
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove the old complex and crufty bd_mutex annotation.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Jason Baron <jbaron@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Move process freezing functions from include/linux/sched.h to freezer.h, so
that modifications to the freezer or the kernel configuration don't require
recompiling just about everything.
[akpm@osdl.org: fix ueagle driver]
Signed-off-by: Nigel Cunningham <nigel@suspend2.net>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If there's a swap file on a software RAID, it should be possible to use this
file for saving the swsusp's suspend image. Also, this file should be
available to the memory management subsystem when memory is being freed before
the suspend image is created.
For the above reasons it seems that md_threads should not be frozen during the
suspend and the appended patch makes this happen, but then there is the
question if they don't cause any data to be written to disks after the suspend
image has been created, provided that all filesystems are frozen at that time.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It turns out that CHANGE is preferred to ONLINE/OFFLINE for various reasons
(not least of which being that udev understands it already).
So remove the recently added KOBJ_OFFLINE (no-one is likely to care anyway)
and change the ONLINE to a CHANGE event
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This allows udev to do something intelligent when an array becomes
available.
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When "mdadm --grow --size=xxx" is used to resize an array (use more or less of
each device), we check the new siza against the available space in each
device.
We already have that number recorded in rdev->size, so calculating it is
pointless (and wrong in one obscure case).
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If save_raid_disk is >= 0, then the device could be a device that is already
in sync that is being re-added. So we need to default this value to -1.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Includes a couple of bugfixes found by sparse.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I have seen mdadm oops after successfully unloading md module.
This patch privents from unloading md module while
mdadm is polling /proc/mdstat.
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Akinbou Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This fixes a bug introduced in 2.6.18.
If a drive is added to a raid1 using older tools (mdadm-1.x or raidtools)
then it will be included in the array without any resync happening.
It has been submitted for 2.6.18.1.
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This changes two if() BUG(); usages to BUG_ON(); so people
can disable it safely.
Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Once upon a time we needed to fixed limit to the number of md devices,
probably because we preallocated some array. This need no longer exists, but
we still have an arbitrary limit.
So remove MAX_MD_DEVS and allow as many devices as we can fit into the 'minor'
part of a device number.
Also remove some useless noise at init time (which reports MAX_MD_DEVS) and
remove MD_THREAD_NAME_MAX which hasn't been used for a while.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It is possible to request a 'check' of an md/raid array where the whole array
is read and consistancies are reported.
This uses the same mechanisms as 'resync' and so reports in the kernel logs
that a resync is being started. This understandably confuses/worries people.
Also the text in /proc/mdstat suggests a 'resync' is happen when it is just a
check.
This patch changes those messages to be more specific about what is happening.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add a new sysfs interface that allows the bitmap of an array to be dirtied.
The interface is write-only, and is used as follows:
echo "1000" > /sys/block/md2/md/bitmap
(dirty the bit for chunk 1000 [offset 0] in the in-memory and on-disk
bitmaps of array md2)
echo "1000-2000" > /sys/block/md1/md/bitmap
(dirty the bits for chunks 1000-2000 in md1's bitmap)
This is useful, for example, in cluster environments where you may need to
combine two disjoint bitmaps into one (following a server failure, after a
secondary server has taken over the array). By combining the bitmaps on
the two servers, a full resync can be avoided (This was discussed on the
list back on March 18, 2005, "[PATCH 1/2] md bitmap bug fixes" thread).
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Instead of magic numbers (0,1,2,3) in sb_dirty, we have
some flags instead:
MD_CHANGE_DEVS
Some device state has changed requiring superblock update
on all devices.
MD_CHANGE_CLEAN
The array has transitions from 'clean' to 'dirty' or back,
requiring a superblock update on active devices, but possibly
not on spares
MD_CHANGE_PENDING
A superblock update is underway.
We wait for an update to complete by waiting for all flags to be clear. A
flag can be set at any time, even during an update, without risk that the
change will be lost.
Stop exporting md_update_sb - isn't needed.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch contains the scheduled removal of the START_ARRAY ioctl for md.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If we
- shut down a clean array,
- restart with one (or more) drive(s) missing
- make some changes
- pause, so that they array gets marked 'clean',
the event count on the superblock of included drives
will be the same as that of the removed drives.
So adding the removed drive back in will cause it
to be included with no resync.
To avoid this, we only update the eventcount backwards when the array
is not degraded. In this case there can (should) be no non-connected
drives that we can get confused with, and this is the particular case
where updating-backwards is valuable.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
During early MD setup (superblock reading), we don't have a personality yet.
But the error-handling code tries to dereference mddev->pers. Fix.
Acked-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The ioctl requires CAP_SYS_ADMIN, so sysfs should too. Note that we don't
require CAP_SYS_ADMIN for reading attributes even though the ioctl does.
There is no reason to limit the read access, and much of the information is
already available via /proc/mdstat
Cc: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Some places we use number (0660) someplaces names (S_IRUGO). Change all
numbers to be names, and change 0655 to be what it should be.
Also make some formatting more consistent.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We introduced 'io_sectors' recently so we could count the sectors that causes
io during resync separate from sectors which didn't cause IO - there can be a
difference if a bitmap is being used to accelerate resync.
However when a speed is reported, we find the number of sectors processed
recently by subtracting an oldish io_sectors count from a current
'curr_resync' count. This is wrong because curr_resync counts all sectors,
not just io sectors.
So, add a field to mddev to store the curren io_sectors separately from
curr_resync, and use that in the calculations.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When an array is started we start one or two threads (two if there is a
reshape or recovery that needs to be completed).
We currently start these *before* the array is completely set up and in
particular before queue->queuedata is set. If the thread actually starts
very quickly on another CPU, we can end up dereferencing queue->queuedata
and oops.
This patch also makes sure we don't try to start a recovery if a reshape is
being restarted.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This has to be done in ->load_super, not ->validate_super
Without this, hot-adding devices to an array doesn't always
work right - though there is a work around in mdadm-2.5.2 to
make this less of an issue.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Teach special (recursive) locking code to the lock validator.
Effects on non-lockdep kernels:
- the introduction of the following function variants:
extern struct block_device *open_partition_by_devnum(dev_t, unsigned);
extern int blkdev_put_partition(struct block_device *);
static int
blkdev_get_whole(struct block_device *bdev, mode_t mode, unsigned flags);
which on non-lockdep are the same as open_by_devnum(), blkdev_put()
and blkdev_get().
- a subclass parameter to do_open(). [unused on non-lockdep]
- a subclass parameter to __blkdev_put(), which is a new internal
function for the main blkdev_put*() functions. [parameter unused
on non-lockdep kernels, except for two sanity check WARN_ON()s]
these functions carry no semantical difference - they only express
object dependencies towards the lockdep subsystem.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make needlessly global code static.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It appears in /sys/mdX/md/dev-YYY/state
and can be set or cleared by writing 'writemostly' or '-writemostly'
respectively.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The md/dev-XXX/state file can now be written:
"faulty" simulates an error on the device
"remove" removes the device from the array (if it is not busy)
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This allows the state of an md/array to be directly controlled via sysfs and
adds the ability to stop and array without tearing it down.
Array states/settings:
clear
No devices, no size, no level
Equivalent to STOP_ARRAY ioctl
inactive
May have some settings, but array is not active
all IO results in error
When written, doesn't tear down array, but just stops it
suspended (not supported yet)
All IO requests will block. The array can be reconfigured.
Writing this, if accepted, will block until array is quiescent
readonly
no resync can happen. no superblocks get written.
write requests fail
read-auto
like readonly, but behaves like 'clean' on a write request.
clean - no pending writes, but otherwise active.
When written to inactive array, starts without resync
If a write request arrives then
if metadata is known, mark 'dirty' and switch to 'active'.
if not known, block and switch to write-pending
If written to an active array that has pending writes, then fails.
active
fully active: IO and resync can be happening.
When written to inactive array, starts with resync
write-pending (not supported yet)
clean, but writes are blocked waiting for 'active' to be written.
active-idle
like active, but no writes have been seen for a while (100msec).
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- record the 'event' count on each individual device (they
might sometimes be slightly different now)
- add a new value for 'sb_dirty': '3' means that the super
block only needs to be updated to record a clean<->dirty
transition.
- Prefer odd event numbers for dirty states and even numbers
for clean states
- Using all the above, don't update the superblock on
a spare device if the update is just doing a clean-dirty
transition. To accomodate this, a transition from
dirty back to clean might now decrement the events counter
if nothing else has changed.
The net effect of this is that spare drives will not see any IO requests
during normal running of the array, so they can go to sleep if that is what
they want to do.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When an array has a bitmap, a device can be removed and re-added and only
blocks changes since the removal (as recorded in the bitmap) will be resynced.
It should be possible to do a similar thing to arrays without bitmaps. i.e.
if a device is removed and re-added and *no* changes have been made in the
interim, then the add should not require a resync.
This patch allows that option. This means that when assembling an array one
device at a time (e.g. during device discovery) the array can be enabled
read-only as soon as enough devices are available, but extra devices can still
be added without causing a resync.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
md/bitmap modifies i_writecount of a bitmap file to make sure that no-one else
writes to it. The reverting of the change is sometimes done twice, and there
is one error path where it is omitted.
This patch tidies that up.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When "mdadm --grow /dev/mdX --bitmap=none" is used to remove a filebacked
bitmap, the bitmap was disconnected from the array, but the file wasn't closed
(until the array was stopped).
The file also wasn't closed if adding the bitmap file failed.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch makes the needlessly global md_print_devices() static.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
For a while we have had checkpointing of resync. The version-1 superblock
allows recovery to be checkpointed as well, and this patch implements that.
Due to early carelessness we need to add a feature flag to signal that the
recovery_offset field is in use, otherwise older kernels would assume that a
partially recovered array is in fact fully recovered.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
At shutdown, we switch all arrays to read-only, which creates a message for
every instantiated array, even those which aren't actually active.
So remove the message for non-active arrays.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When a md array has been idle (no writes) for 20msecs it is marked as 'clean'.
This delay turns out to be too short for some real workloads. So increase it
to 200msec (the time to update the metadata should be a tiny fraction of that)
and make it sysfs-configurable.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This warning was slightly useful back in 2.2 days, but is more an annoyance
now. It makes it awkward to add new ioctls (that we we are likely to do that
in the current climate, but it is possible).
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
From: NeilBrown <neilb@suse.de>
If an error is reported by a drive in a RAID array (which is done via
bi_end_io - in interrupt context), we call md_error and md_new_event which
calls sysfs_notify. However sysfs_notify grabs a mutex and so cannot be
called in interrupt context.
This patch just creates a variant of md_new_event which avoids the sysfs
call, and uses that. A better fix for later is to arrange for the event to
be called from user-context.
Note: avoiding the sysfs call isn't a problem as an error will not, by
itself, modify the sync_action attribute. (We do still need to
wake_up(&md_event_waiters) as an error by itself will modify /proc/mdstat).
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
otherwise we get nasty messages about locks not being released.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We should be able to write 'repair' to /sys/block/mdX/md/sync_action,
however due to and inverted test, that always given EINVAL.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- fix mddev_lock() usage bugs in md_attr_show() and md_attr_store().
[they did not anticipate the possibility of getting a signal]
- remove mddev_lock_uninterruptible() [unused]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It works like this:
Open the file
Read all the contents.
Call poll requesting POLLERR or POLLPRI (so select/exceptfds works)
When poll returns,
close the file and go to top of loop.
or lseek to start of file and go back to the 'read'.
Events are signaled by an object manager calling
sysfs_notify(kobj, dir, attr);
If the dir is non-NULL, it is used to find a subdirectory which
contains the attribute (presumably created by sysfs_create_group).
This has a cost of one int per attribute, one wait_queuehead per kobject,
one int per open file.
The name "sysfs_notify" may be confused with the inotify
functionality. Maybe it would be nice to support inotify for sysfs
attributes as well?
This patch also uses sysfs_notify to allow /sys/block/md*/md/sync_action
to be pollable
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
And remove the comments that were put in inplace of a fix too....
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
... being careful that mutex_trylock is inverted wrt down_trylock
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Semaphore to mutex conversion.
The conversion was generated via scripts, and the result was validated
automatically via a script as well.
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
An md array can be asked to change the amount of each device that it is using,
and in particular can be asked to use the maximum available space. This
currently only works if the first device is not larger than the rest. As
'size' gets changed and so 'fit' becomes wrong. So check if a 'fit' is
required early and don't corrupt it.
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This allows user-space to access data safely. This is needed for raid5
reshape as user-space needs to take a backup of the first few stripes before
allowing reshape to commence.
It will also be useful in cluster-aware raid1 configurations so that all
cluster members can leave a section of the array untouched while a
resync/recovery happens.
A 'start' and 'end' of the suspended range are written to 2 sysfs attributes.
Note that only one range can be suspended at a time.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This allows reshape to be triggerred via sysfs (which is the only way to start
it happening).
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
check_reshape checks validity and does things that can be done instantly -
like adding devices to raid1. start_reshape initiates a restriping process to
convert the whole array.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We allow the superblock to record an 'old' and a 'new' geometry, and a
position where any conversion is up to. The geometry allows for changing
chunksize, layout and level as well as number of devices.
When using verion-0.90 superblock, we convert the version to 0.91 while the
conversion is happening so that an old kernel will refuse the assemble the
array. For version-1, we use a feature bit for the same effect.
When starting an array we check for an incomplete reshape and restart the
reshape process if needed. If the reshape stopped at an awkward time (like
when updating the first stripe) we refuse to assemble the array, and let
user-space worry about it.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch adds raid5_reshape and end_reshape which will start and finish the
reshape processes.
raid5_reshape is only enabled in CONFIG_MD_RAID5_RESHAPE is set, to discourage
accidental use.
Read the 'help' for the CONFIG_MD_RAID5_RESHAPE entry.
and Make sure that you have backups, just in case.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch provides the core of the resize/expand process.
sync_request notices if a 'reshape' is happening and acts accordingly.
It allocated new stripe_heads for the next chunk-wide-stripe in the target
geometry, marking them STRIPE_EXPANDING.
Then it finds which stripe heads in the old geometry can provide data needed
by these and marks them STRIPE_EXPAND_SOURCE. This causes stripe_handle to
read all blocks on those stripes.
Once all blocks on a STRIPE_EXPAND_SOURCE stripe_head are read, any that are
needed are copied into the corresponding STRIPE_EXPANDING stripe_head. Once a
STRIPE_EXPANDING stripe_head is full, it is marks STRIPE_EXPAND_READY and then
is written out and released.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Before a RAID-5 can be expanded, we need to be able to expand the stripe-cache
data structure.
This requires allocating new stripes in a new kmem_cache. If this succeeds,
we copy cache pages over and release the old stripes and kmem_cache.
We then allocate new pages. If that fails, we leave the stripe cache at it's
new size. It isn't worth the effort to shrink it back again.
Unfortuanately this means we need two kmem_cache names as we, for a short
period of time, we have two kmem_caches. So they are raid5/%s and
raid5/%s-alt
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
status_resync - used by /proc/mdstat to report the status of a resync, assumes
that device sizes will always fit into an 'unsigned long' This is no longer
the case...
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We are counting failed devices twice, once of the device that is failed, and
once for the hole that has been left in the array. Remove the former so
'failed' matches 'missing'. Storing these counts in the superblock is a bit
silly anyway....
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I really should make this a function of the personality....
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This flag should be set for a virtual device iff it is set for all underlying
devices.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Use bd_claim_by_disk.
Following symlinks are created if md0 is built from sda and sdb
/sys/block/md0/slaves/sda --> /sys/block/sda
/sys/block/md0/slaves/sdb --> /sys/block/sdb
/sys/block/sda/holders/md0 --> /sys/block/md0
/sys/block/sdb/holders/md0 --> /sys/block/md0
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Sometimes it doesn't so make the code more like the version-0 code which
works.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- version-1 superblock
+ The default_bitmap_offset is in sectors, not bytes.
+ the 'size' field in the superblock is in sectors, not KB
- raid0_run should return a negative number on error, not '1'
- raid10_read_balance should not return a valid 'disk' number if
->rdev turned out to be NULL
- kmem_cache_destroy doesn't like being passed a NULL.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
mdu_array_info_t->size is 'int', which isn't big enough for the size (in KB of
each component in) some arrays.
So rather than a random overflow, set size to -1 when it cannot be set
correctly.
To update aspect on an array, userspace will sometimes:
get_array_info
change one field
set_array_info
in this case, we don't want the '-1' in 'size' to change to size, or look like
a size change at all. So test for that in update_array_info.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
While a read-only array doesn't not really need a bitmap, we should
not remove the bitmap when switching an array to read-only because
a/ There is no code to re-add the bitmap which switching to read-write,
b/ There is insufficient locking - the bitmap could be accessed while it is
being removed.
Cc: Reuben Farrelly <reuben-lkml@reub.net>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
super_1_sync only updates fields in the superblock that might have changed.
'raid_disks' and 'size' could have changed, but this information doesn't get
updated.... until this patch.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
As 'array_size' is a 'sector_t', it may overflow inappropriately when shifted
10 bits. So We should cast it to a loff_t first.
There are two places with this problem, but the second (in update_raid_disks)
isn't needed so just remove it:
The only personality that handles ->reshape currently is raid1,
and it doesn't change the size of the array.
When added for raid5/6, reshape again won't change the size of the array,
at least not straight away.
This code might be need for reshaping 'linear' but linear->shape,
if implemented, should probably do the i_size_write itself.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The 'level' of an md array can be set as either a number of a string. When
one is set, the other must be marked 'undefined'. This wasn't being done
in one place: where new arrays are created.
Result: if md1 is a raid1, it is stopped and a raid5 is created there, it
might still appear to be a raid1.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
e.g. The sx8 driver uses names like sx8/0.
This would make a md component dev name like
/sys/block/md0/md/dev-sx8/0
which is not allowed. So we change the '/' to '!' just like
fs/partitions/check.c(register_disk) does.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch converts the inode semaphore to a mutex. I have tested it on
XFS and compiled as much as one can consider on an ia64. Anyway your
luck with it might be different.
Modified-by: Ingo Molnar <mingo@elte.hu>
(finished the conversion)
Signed-off-by: Jes Sorensen <jes@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
HDIO_GETGEO is implemented in most block drivers, and all of them have to
duplicate the code to copy the structure to userspace, as well as getting
the start sector. This patch moves that to common code [1] and adds a
->getgeo method to fill out the raw kernel hd_geometry structure. For many
drivers this means ->ioctl can go away now.
[1] the s390 block drivers are odd in this respect. xpram sets ->start
to 4 always which seems more than odd, and the dasd driver shifts
the start offset around, probably because of it's non-standard
sector size.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@suse.de>
Cc: <mike.miller@hp.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo Giarrusso <blaisorblade@yahoo.it>
Cc: Bartlomiej Zolnierkiewicz <B.Zolnierkiewicz@elka.pw.edu.pl>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Cc: Markus Lidel <Markus.Lidel@shadowconnect.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Also export current (average) speed and status in sysfs.
Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Writing major:minor to md/new_dev will bind that device to the array.
Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
drivers/md/md.c: In function `offset_show':
drivers/md/md.c:1670: warning: long long unsigned int format, different type arg (arg 3)
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This the role that a device has in an array can be viewed and set.
Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Move the checks - that dev size is never less than array size - into
bind_rdev_to_array to make sure it always happens properly (there is one place
where currently it doesn't).
Also reject any superblock which claims an array size smaller than the device
in question can hold.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If array is active, try to reshape, else just set the value.
Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Store this total in superblock (As appropriate), and make it available to
userspace via sysfs.
Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Allow it to be set to a particular version, or 'none'.
Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
... only before array is started of course.
Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When we do a user-requested check/repair, we lose count of the outstanding
requests...
Also make sure that when anything is written to md/sync_action, the
RECOVERY_NEEDED flag is set and the thread is woken up so any changes take
effect.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make the needlessly global function md_new_event() static.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
.. because they aren't used outside md.c
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Commands written to sysfs files may, or my not, be \n terminated. We want to
accept with case. For this we use cmd_match.
Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
md supports multiple different RAID level, each being implemented by a
'personality' (which is often in a separate module).
These personalities have fairly artificial 'numbers'. The numbers
are use to:
1- provide an index into an array where the various personalities
are recorded
2- identify the module (via an alias) which implements are particular
personality.
Neither of these uses really justify the existence of personality numbers.
The array can be replaced by a linked list which is searched (array lookup
only happens very rarely). Module identification can be done using an alias
based on level rather than 'personality' number.
The current 'raid5' modules support two level (4 and 5) but only one
personality. This slight awkwardness (which was handled in the mapping from
level to personality) can be better handled by allowing raid5 to register 2
personalities.
With this change in place, the core md module does not need to have an
exhaustive list of all possible personalities, so other personalities can be
added independently.
This patch also moves the check for chunksize being non-zero into the ->run
routines for the personalities that need it, rather than having it in core-md.
This has a side effect of allowing 'faulty' and 'linear' not to have a
chunk-size set.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
...because that seems to be the preferred practice these days.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Replace multiple kmalloc/memset pairs with kzalloc calls.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Substitute:
page_cache_get -> get_page
page_cache_release -> put_page
PAGE_CACHE_SHIFT -> PAGE_SHIFT
PAGE_CACHE_SIZE -> PAGE_SIZE
PAGE_CACHE_MASK -> PAGE_MASK
__free_page -> put_page
because we aren't using the page cache, we are just using pages.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
With this patch it is possible to poll /proc/mdstat to detect arrays appearing
or disappearing, to detect failures, recovery starting, recovery completing,
and devices being added and removed.
It is similar to the poll-ability of /proc/mounts, though different in that:
We always report that the file is readable (because face it, it is, even if
only for EOF).
We report POLLPRI when there is a change so that select() can detect
it as an exceptional event. Not only are these exceptional events, but
that is the mechanism that the current 'mdadm' uses to watch for events
(It also polls after a timeout).
(We also report POLLERR like /proc/mounts).
Finally, we only reset the per-file event counter when the start of the file
is read, rather than when poll() returns an event. This is more robust as it
means that an fd will continue to report activity to poll/select until the
program clearly responds to that activity.
md_new_event takes an 'mddev' which isn't currently used, but it will be soon.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
On a read-error we suspend the array, then synchronously read the block from
other arrays until we find one where we can read it. Then we try writing the
good data back everywhere and make sure it works. If any write or subsequent
read fails, only then do we fail the device out of the array.
To be able to suspend the array, we need to also keep track of how many
requests are queued for handling by raid1d.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is important because bitmap_create uses
mddev->resync_max_sectors
and that doesn't have a valid value until after the array
has been initialised (with pers->run()).
[It doesn't make a difference for current personalities that
support bitmaps, but will make a difference for raid10]
This has the added advantage of meaning with can move the thread->timeout
manipulation inside the bitmap.c code instead of sprinkling identical code
throughout all personalities.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
See patch to md.txt for more details
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I had thought that keeping the reported tail level clearly different
from the module name was a good idea, but I've changed my mind.
'raid5' is better and probably less confusing than 'RAID-5'.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If an array is created using set_array_info, default_bitmap_offset isn't set
properly meaning that an internal bitmap cannot be hot-added until the array
is stopped and re-assembled.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
md needs to monitor the rate of requests to its devices when doing
resync/recovery so that it can back-off when there is non-resync IO. It
does this by comparing resync IO, which it counts, with total IO which is
taken from disk_stats.
disk_stats were recently changed to account sectors when a request
completes instead of when it is queued. This upsets md's calculations.
We could do the sync_io accounting at the end of requests too, but that has
problems. If an underlying device is an md array, the accounting will
still be done when the request is submitted. This could be changed for
some raid levels, but it cannot be changed for raid0 or linear without
substantial code changes.
So instead, we increase the error that is_mddev_idle allows, up to the
maximum amount of resync IO that can be in flight at any time. The
calculation is current fragile as each personality as different limits for
in-flight resync. This should be fixed up.
For now, this simple patch fixes the problem.
Increasing the error margin decreases the sensitivity to non-resync IO. To
partially compensate for this, the time to wait when non-resync IO is
detected is increased so that less steady IO is required to keep the resync
at bay.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Despite the fact that md threads don't need to be signalled, and won't
respond to signals anyway, we need to have an 'interruptible' wait, else
they stay in 'D' state and add to the load average.
(akpm: the signal_pending() test is unneeded - we'll fix that up in the next
round. For now, leave it there because that's how the code used to be).
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This was marked deprecated "after 2.6" back in the 2.5 days. But now it
seems there isn't going to be any "after 2.6", and we deprecate by date
now. So set a date.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Document in Documentation/md.txt the files that now appear in sysfs, and make
a couple of small refinements to exactly when 'level' and 'raid_disks' are
empty, to make it match the documentation.
Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The current sync_action for an array can be one of
idle - nothing happening
resync - reduncancy being recalcualted
recover - missing device being recoverred to spare
check - user initiated check of redundancy
repair - like resync but user-initiated and ignores
bitmap optimisation.
Each of these strings can also be written to the 'sync_action' file to cause
that action to happen (if appropriate).
While 'sync' is not technically correct, as a recovery is *not* a 'sync', I
think it is the most servicable word here. Also 'action' is a strong word
than 'mode'.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There are a few loose ends following the conversion of md to use kthreads:
- Some fields in mdk_thread_t that aren't needed (kthreads does it's own
completion and manages it's own name).
- thread->run is now never NULL, so no need to check
- Some tests for signal_pending that aren't needed (As we don't use signals
to stop threads any more)
- Some flush_signals are not needed
- Some waits are interruptible and don't need to be.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The 'auto-readonly' flag (which suppresses resync and superblock updates until
the first write) is not meaningful for personalities that don't support resync
or superblock writes (raid0, linear, etc).
So clear the setting early to avoid it confusing anything - e.g. appearing in
/proc/mdstat
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The introduction of 'resync=PENDING' (for read-only devices) caused that
message to appear for non-syncable arrays like raid0 and linear. Simplest
thing is to not try to print any resync info unless the personality clearly
supports it.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Some, but not all, md array support data redundancy and hence support checking
and restoring that redundancy (resync, rebuild).
Some attributes apply specifically to functions involving this redundancy, and
so should only appear for md arrays for which they are meaningful. i.e. they
should not appear for raid0, linear, multpath, faulty.
This patch separates these into a distinct group and creates the group only if
the personality supports sync_request.
Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
1/ I really should be using the __ATTR macros for defining attributes, so
that the .owner field get set properly, otherwise modules can be removed
while sysfs files are open. This also involves some name changes of _show
routines.
2/ Always lock the mddev (against reconfiguration) for all sysfs attribute
access. This easily avoid certain races and is completely consistant with
other interfaces (ioctl and /proc/mdstat both always lock against
reconfiguration).
3/ raid5 attributes must check that the 'conf' structure actually exists
(the array could have been stopped while an attribute file was open).
4/ A missing 'kfree' from when the raid5_conf_t was converted to have a
kobject embedded, and then converted back again.
Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If a block_device is a partition, then it's kobject is
bdev->bd_part->kobj
otherwise (if it is a full device), the kobject is
bdev->bd_disk->kobj
As md wants back-links to the correct object (whether partition or not), we
need to respect this difference... (Thus current code shows a link to the
whole device, whether we are using a partition or not, which is wrong).
Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When an md array is started, the superblock will be written, and resync may
commense. This is not good if you want to be completely read-only as, for
example, when preparing to resume from a suspend-to-disk image.
So introduce a module parameter "start_ro" which can be set
to '1' at boot, at module load, or via
/sys/module/md_mod/parameters/start_ro
When this is set, new arrays get an 'auto-ro' mode, which disables all
internal io (superblock updates, resync, recovery) and is automatically
switched to 'rw' when the first write request arrives.
The array can be set to true 'ro' mode using 'mdadm -r' before the first
write request, or resync can be started without a write using 'mdadm -w'.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
With version-0.90 superblock, component devices on an md device to not have
any stable name related to the array -(version-1 assigns a fixed index when
a device is added to an array, and this remains despit any hot-swap).
The intial code for making these devices appear in sysfs used dynamic
names, which would change whenever a hot-spare was swapped for a failed or
missing device. This turns out not to be practical in sysfs for a number
of reasons.
This patch changes then naming of component devices to be based on the
result of 'bdevname'. This is stable and should be unique.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We can only accept BARRIER requests if all slaves handle
barriers, and that can, of course, change with time....
So we keep track of whether the whole array seems safe for barriers,
and also whether each individual rdev handles barriers.
We initially assumes barriers are OK.
When writing the superblock we try a barrier, and if that fails, we flag
things for no-barriers. This will usually clear the flags fairly quickly.
If writing the superblock finds that BIO_RW_BARRIER is -ENOTSUPP, we need to
resubmit, so introduce function "md_super_wait" which waits for requests to
finish, and retries ENOTSUPP requests without the barrier flag.
When writing the real raid1, write requests which were BIO_RW_BARRIER but
which aresn't supported need to be retried. So raid1d is enhanced to do this,
and when any bio write completes (i.e. no retry needed) we remove it from the
r1bio, so that devices needing retry are easy to find.
We should hardly ever get -ENOTSUPP errors when writing data to the raid.
It should only happen if:
1/ the device used to support BARRIER, but now doesn't. Few devices
change like this, though raid1 can!
or
2/ the array has no persistent superblock, so there was no opportunity to
pre-test for barriers when writing the superblock.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Current bitmaps use set_bit et.al and so are host-endian, which means
not-portable. Oops.
Define a new version number (4) for which bitmaps are little-endian.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This has the advantage of removing the confusion caused by 'rdev_t' and
'mddev_t' both having 'in_sync' fields.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Two refinements to the 'attempt-overwrite-on-read-error' mechanism.
1/ If the array is read-only, don't attempt an over-write.
2/ If there are more than max_nr_stripes read errors on a device with
no success, fail the drive. This will make sure a dead
drive will be eventually kicked even when we aren't trying
to rewrite (which would normally kick a dead drive more quickly.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There isn't really a need for raid5 attributes to be an a subdirectory,
so this patch moves them from
/sys/block/mdX/md/raid5/attribute
to
/sys/block/mdX/md/attribute
This suggests that all md personalities should co-operate about
namespace usage, but that shouldn't be a problem.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
1/ Use reduce stack usage, because 'gcc' apparently doesn't overlay
different variables that are in separate scopes...
2/ Use test_bit instead of ( .. & 1<< ..) which in this case is buggy.
Thanks to Andrew Morton
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
With this, raid5 can be asked to check parity without repairing it. It also
keeps a count of the number of incorrect parity blocks found (mismatches) and
reports them through sysfs.
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
You can trigger a 'check' with
echo check > /sys/block/mdX/md/scan_mode
or a check-and-repair errors with
echo repair > /sys/block/mdX/md/scan_mode
and read the current state from the same file.
Note: personalities need to know the different between 'check' and 'repair',
but don't yet. Until they do, 'check' will be the same as 'repair' and will
just do a normal resync pass.
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Each device in an md array how has a corresponding
/sys/block/mdX/md/devNN/
directory which can contain attributes. Currently there is only 'state' which
summarises the state, nd 'super' which has a copy of the superblock, and
'block' which is a symlink to the block device.
Also, /sys/block/mdX/md/rdNN represents slot 'NN' in the array, and is a
symlink to the relevant 'devNN'. Obviously spare devices do not have a slot
in the array, and so don't have such a symlink.
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Start using kobjects in mddevs, and provide a couple of simple attributes
(level and disks). Attributes live in
/sys/block/mdX/md/attr-name
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Instead of having ->read_sectors and ->write_sectors, combine the two
into ->sectors[2] and similar for the other fields. This saves a branch
several places in the io path, since we don't have to care for what the
actual io direction is. On my x86-64 box, that's 200 bytes less text in
just the core (not counting the various drivers).
Signed-off-by: Jens Axboe <axboe@suse.de>
There are still a couple of cases where md threads (the resync/recovery
thread) is not interruptible since the change to use kthreads. All places
there it tests "signal_pending", it should also test kthread_should_stop,
as with this patch.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The main problem fixes is that in certain situations stopping md arrays may
take longer than you expect, or may require multiple attempts. This would
only happen when resync/recovery is happening.
This patch fixes three vaguely related bugs.
1/ The recent change to use kthreads got the setting of the
process name wrong. This fixes it.
2/ The recent change to use kthreads lost the ability for
md threads to be signalled with SIG_KILL. This restores that.
3/ There is a long standing bug in that if:
- An array needs recovery (onto a hot-spare) and
- The recovery is being blocked because some other array being
recovered shares a physical device and
- The recovery thread is killed with SIG_KILL
Then the recovery will appear to have completed with no IO being
done, which can cause data corruption.
This patch makes sure that incomplete recovery will be treated as
incomplete.
Note that any kernel affected by bug 2 will not suffer the problem of bug
3, as the signal can never be delivered. Thus the current 2.6.14-rc
kernels are not susceptible to data corruption. Note also that if arrays
are shutdown (with "mdadm -S" or "raidstop") then the problem doesn't
occur. It only happens if a SIGKILL is independently delivered as done by
'init' when shutting down.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch contains the most trivial from Rusty's trivial patches:
- spelling fixes
- remove duplicate includes
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There was another case where sb_size wasn't being set, so instead do the
sensible thing and set if when filling in the content of a superblock. That
ensures that whenever we write a superblock, the sb_size MUST be set.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There are two ways to add devices to an md/raid array.
It can have superblock written to it, and then given to the md driver,
which will read the superblock (the new way)
or
md can be told (through SET_ARRAY_INFO) the shape of the array, and
the told about individual drives, and md will create the required
superblock (the old way).
The newly introduced sb_size was only set for drives being added the
new way, not the old ways. Oops :-(
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Just like failed drives have (F), so spare drives now have (S).
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Leave it unchanged if the original (0.90) is used, incase it might be a
compatability problem.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Doh. I want the physical hard-sector-size, not the current block size...
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
On reflection, a better default location for hot-adding bitmaps with version-1
superblocks is immediately after the superblock. There might not be much room
there, but there is usually atleast 3k, and that is a good start.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Switch MD to use the kthread infrastructure, to simplify the code and get rid
of tasklist_lock abuse in md_unregister_thread.
Also don't flush signals in md_thread, as the called thread will always do
that.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is a direct port of the raid5 patch.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Most awkward part of this is delaying write requests until bitmap updates have
been flushed.
To achieve this, we have a sequence number (seq_flush) which is incremented
each time the raid5 is unplugged.
If the raid thread notices that this has changed, it flushes bitmap changes,
and assigned the value of seq_flush to seq_write.
When a write request arrives, it is given the number from seq_write, and that
write request may not complete until seq_flush is larger than the saved seq
number.
We have a new queue for storing stripes which are waiting for a bitmap flush
and an extra flag for stripes to record if the write was 'degraded' and so
should not clear the a bit in the bitmap.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
version-1 superblocks are not (normally) 4K long, and can be of variable size.
Writing the full 4K can cause corruption (but only in non-default
configurations).
With this patch the super-block-flavour can choose a size to read, and set a
size to write based on what it finds.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
As this is used to flag an internal bitmap.
Also, introduce symbolic names for feature bits.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It is possibly (and occasionally useful) to have a raid1 without persistent
superblocks. The code in add_new_disk for adding a device to such an array
always tries to read a superblock.
This will obviously fail.
So do the appropriate test and call md_import_device with
appropriate args.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This allows a device in a raid1 to be marked as "write mostly". Read requests
will only be sent if there is no other option.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Both file-bitmaps and superblock bitmaps are supported.
If you add a bitmap file on the array device, you lose.
This introduces a 'default_bitmap_offset' field in mddev, as the ioctl used
for adding a superblock bitmap doesn't have room for giving an offset. Later,
this value will be setable via sysfs.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
... otherwise we loose a reference and can never free the file.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It's possible for this to still have flags in it and a previous instance
has been stopped, and that confused the new array using the same mddev.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I just discovered this is needed for module auto-loading.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We weren't actually waking up the md thread after setting
MD_RECOVERY_NEEDED when assembling an array, so it is possible to lose a
race and not actually start resync.
So add a call to md_wakeup_thread, and while we are at it, remove all the
"if (mddev->thread)" guards as md_wake_thread does its own checking.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
... otherwise we might try to load a bitmap from an array which hasn't one.
The bug is that if you create an array with an internal bitmap, shut it down,
and then create an array with the same md device, the md drive will assume it
should have a bitmap too. As the array can be created with a different md
device, it is mostly an inconvenience. I'm pretty sure there is no risk of
data corruption.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The recent change to never ignore the bitmap, revealed that the bitmap isn't
begin flushed properly when an array is stopped.
We call bitmap_daemon_work three times as there is a three-stage pipeline for
flushing updates to the bitmap file.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Until the bitmap code was added,
modprobe md
would load the md module. But now the md module is called 'md-mod', so we
really need an alias for backwards comparability.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
`gcc -W' likes to complain if the static keyword is not at the beginning of
the declaration. This patch fixes all remaining occurrences of "inline
static" up with "static inline" in the entire kernel tree (140 occurrences in
47 files).
While making this change I came across a few lines with trailing whitespace
that I also fixed up, I have also added or removed a blank line or two here
and there, but there are no functional changes in the patch.
Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
insert a missing bio_put when writting the md superblock.
Without this we have a steady growth in the "bio" slab.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
1. Establish a simple API for process freezing defined in linux/include/sched.h:
frozen(process) Check for frozen process
freezing(process) Check if a process is being frozen
freeze(process) Tell a process to freeze (go to refrigerator)
thaw_process(process) Restart process
frozen_process(process) Process is frozen now
2. Remove all references to PF_FREEZE and PF_FROZEN from all
kernel sources except sched.h
3. Fix numerous locations where try_to_freeze is manually done by a driver
4. Remove the argument that is no longer necessary from two function calls.
5. Some whitespace cleanup
6. Clear potential race in refrigerator (provides an open window of PF_FREEZE
cleared before setting PF_FROZEN, recalc_sigpending does not check
PF_FROZEN).
This patch does not address the problem of freeze_processes() violating the rule
that a task may only modify its own flags by setting PF_FREEZE. This is not clean
in an SMP environment. freeze(process) is therefore not SMP safe!
Signed-off-by: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch removes some unneeded checks of pointers being NULL before
calling kfree() on them. kfree() handles NULL pointers just fine, checking
first is pointless.
Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
1/ Must typecast int to (sector_t) before inverting or we
might not invert enough bits.
2/ When "bitmap_offset" was added to mdp_superblock_1, we didn't increase
the count of words-used (96 to 100).
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
currently, md updates all superblocks (one on each device) in series. It
waits for one write to complete before starting the next. This isn't a big
problem as superblock updates don't happen that often.
However it is neater to do it in parallel, and if the drives in the array have
gone to "sleep" after a period of idleness, then waking them is parallel is
faster (and someone else should be worrying about power drain).
Futher, we will need parallel superblock updates for a future patch which
keeps the intent-logging bitmap near the superblock.
Also remove the silly code that retired superblock updates 100 times. This
simply never made sense.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This provides an alternate to storing the bitmap in a separate file. The
bitmap can be stored at a given offset from the superblock. Obviously the
creator of the array must make sure this doesn't intersect with data....
After is good for version-0.90 superblocks.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Before completing a 'write' the md superblock might need to be updated.
This is best done by the md_thread.
The current code schedules this up and queues the write request for later
handling by the md_thread.
However some personalities (Raid5/raid6) will deadlock if the md_thread
tries to submit requests to its own array.
So this patch changes things so the processes submitting the request waits
for the superblock to be written and then submits the request itself.
This fixes a recently-created deadlock in raid5/raid6
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When an array is degraded, bit in the intent-bitmap are never cleared. So if
a recently failed drive is re-added, we only need to reconstruct the block
that are still reflected in the bitmap.
This patch adds support for this re-adding.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
bitmap_daemon_work clears bits in the bitmap for blocks that haven't been
written to for a while. It needs to be called regularly to make sure the
bitmap doesn't endup full of ones .... but it wasn't.
So call it from the increasingly-inaptly-named md_check_recovery
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
1/ When init from disk, it is a BUG if there is nowhere
to init from,
2/ use seq_path to print path in /proc/mdstat
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
With this patch, the intent to write to some block in the array can be logged
to a bitmap file. Each bit represents some number of sectors and is set
before any update happens, and only cleared when all writes relating to all
sectors are complete.
After an unclean shutdown, information in this bitmap can be used to optimise
resync - only sectors which could be out-of-sync need to be updated.
Also if a drive is removed and then added back into an array, the recovery can
make use of the bitmap to optimise reconstruction. This is not implemented in
this patch.
Currently the bitmap is stored in a file which must (obviously) be stored on a
separate device.
The patch only provided infrastructure. It does not update any personalities
to bitmap intent logging.
Md arrays can still be used with no bitmap file. This patch has minimal
impact on such arrays.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
1/ change the return value (which is number-of-sectors synced)
from 'int' to 'sector_t'.
The number of sectors is usually easily small enough to fit
in an int, but if resync needs to abort, it may want to return
the total number of remaining sectors, which could be large.
Also errors cannot be returned as negative numbers now, so use
0 instead
2/ Add a 'skipped' return parameter to allow the array to report
that it skipped the sectors. This allows md to take this into account
in the speed calculations.
Currently there is no important skipping, but the bitmap-based-resync
that is coming will use this.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When md marks the superblock dirty before a write, it calls
generic_make_request (to write the superblock) from within
generic_make_request (to write the first dirty block), which could cause
problems later.
With this patch, the superblock write is always done by the helper thread, and
write request are delayed until that write completes.
Also, the locking around marking the array dirty and writing the superblock is
improved to avoid possible races.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
md_enter_safemode checks if it is time to mark the md superblock as 'clean'.
i.e. if all writes have completed and a suitable delay has passed.
This is currently called from md_handle_safemode which in-turn is called
(almost) every time md_check_recovery is called, and from the end of
md_do_sync which causes the mddev->thread to run, which will always call
md_check_recovery as well.
So it doesn't need to be a separate function and fits quite well into
md_check_recovery.
The "almost" is because multipathd calls md_check_recovery but not
md_handle_safemode. This is OK because the code from md_enter_safemode is a
no-op if mddev->safemode == 0, which it always is for a multipathd (providing
we don't allow it to be set to 2 on a signal...)
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently if add_new_disk is used to hot-add a drive to a degraded array,
recovery doesn't start ... because we didn't tell it to.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch makes some needlessly global identifiers static.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Arjan van de Ven <arjanv@infradead.org>
Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The conditions that cause these calls to MD_BUG are not kernel bugs, just
oddities in what userspace is asking for.
Also convert analyze_sbs to return void, and the value it returned was
always 0.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There is a tiny race when de-registering an MD thread, in that the thread
could disappear before it is set a SIGKILL, causing send_sig to have
problems.
This is most easily closed by holding tasklist_lock between enabling the
thread to exit (setting ->run to NULL) and telling it to exit.
(akpm: ick. Needs to use kthread API and stop using signals)
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
..as sync_page_io can be called on the write-out path.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.
Let it rip!