Maintain two flows, one for pow2 chunk sizes (which uses masks and
shift), and a flow for the general case (which uses sector_div).
This is for the sake of performance.
- introduce map_sector and is_io_in_chunk_boundary to encapsulate
those two flows better for raid0_make_request
- fix blk_mergeable to support the two flows.
Signed-off-by: raziebe@gmail.com
Signed-off-by: NeilBrown <neilb@suse.de>
have raid0 check chunk size in run method instead of in md.
This is part of a series moving the checks from common code to
the personalities where they belong.
hardsect is short and chunksize is an int, so it is safe to use %.
Signed-off-by: raziebe@gmail.com
Signed-off-by: NeilBrown <neilb@suse.de>
Having a macro just to cast a void* isn't really helpful.
I would must rather see that we are simply de-referencing ->private,
than have to know what the macro does.
So open code the macro everywhere and remove the pointless cast.
Signed-off-by: NeilBrown <neilb@suse.de>
This setting doesn't seem to make sense (half the chunk size??) and
shouldn't be needed.
The segment boundary exported by raid0 should simply be the minimum
of the segment boundary of all component devices. And we already
get that right.
Signed-off-by: NeilBrown <neilb@suse.de>
If we treat conf->devlist more like a 2 dimensional array,
we can get the devlist for a particular zone simply by indexing
that array, so we don't need to store the pointers to subarrays
in strip_zone. This makes strip_zone smaller and so (hopefully)
searches faster.
Signed-of-by: NeilBrown <neilb@suse.de>
storing ->sectors is redundant as is can be computed from the
difference z->zone_end - (z-1)->zone_end
The one place where it is used, it is just as efficient to use
a zone_end value instead.
And removing it makes strip_zone smaller, so they array of these that
is searched on every request has a better chance to say in cache.
So discard the field and get the value from elsewhere.
Signed-off-by: NeilBrown <neilb@suse.de>
raid0_stop() removes all references to the raid0 configuration but
misses to free the ->devlist buffer.
This patch closes this leak, removes a pointless initialization and
fixes a coding style issue in raid0_stop().
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Currently the raid0 configuration is allocated in raid0_run() while
the buffers for the strip_zone and the dev_list arrays are allocated
in create_strip_zones(). On errors, all three buffers are freed
in raid0_run().
It's easier and more readable to do the allocation and cleanup within
a single function. So move that code into create_strip_zones().
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Currently raid0_run() always returns -ENOMEM on errors. This is
incorrect as running the array might fail for other reasons, for
example because not all component devices were available.
This patch changes create_strip_zones() so that it returns a proper
error code (either -ENOMEM or -EINVAL) rather than 1 on errors and
makes raid0_run(), its single caller, return that value instead
of -ENOMEM.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
The "sector_shift" and "spacing" fields of struct raid0_private_data
were only used for the hash table lookups. So the removal of the
hash table allows get rid of these fields as well which simplifies
create_strip_zones() and raid0_run() quite a bit.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
The raid0 hash table has become unused due to the changes in the
previous patch. This patch removes the hash table allocation and
setup code and kills the hash_table field of struct raid0_private_data.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
1/ remove current_start. The same value is available in
zone->dev_start and storing it separately doesn't gain anything.
2/ rename curr_zone_start to curr_zone_end as we are now more
focused on the 'end' of each zone. We end up storing the
same number though - the old name was a little confusing
(and what does 'current' mean in this context anyway).
Signed-off-by: NeilBrown <neilb@suse.de>
The number of strip_zones of a raid0 array is bounded by the number of
drives in the array and is in fact much smaller for typical setups. For
example, any raid0 array containing identical disks will have only
a single strip_zone.
Therefore, the hash tables which are used for quickly finding the
strip_zone that holds a particular sector are of questionable value
and add quite a bit of unnecessary complexity.
This patch replaces the hash table lookup by equivalent code which
simply loops over all strip zones to find the zone that holds the
given sector.
In order to make this loop as fast as possible, the zone->start field
of struct strip_zone has been renamed to zone_end, and it now stores
the beginning of the next zone in sectors. This allows to save one
addition in the loop.
Subsequent cleanup patches will remove the hash table structure.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Convert all external users of queue limits to using wrapper functions
instead of poking the request queue variables directly.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Allow userspace to set the size of the array according to the following
semantics:
1/ size must be <= to the size returned by mddev->pers->size(mddev, 0, 0)
a) If size is set before the array is running, do_md_run will fail
if size is greater than the default size
b) A reshape attempt that reduces the default size to less than the set
array size should be blocked
2/ once userspace sets the size the kernel will not change it
3/ writing 'default' to this attribute returns control of the size to the
kernel and reverts to the size reported by the personality
Also, convert locations that need to know the default size from directly
reading ->array_sectors to <pers>_size. Resync/reshape operations
always follow the default size.
Finally, fixup other locations that read a number of 1k-blocks from
userspace to use strict_blocks_to_sectors() which checks for unsigned
long long to sector_t overflow and blocks to sectors overflow.
Reviewed-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Get personalities out of the business of directly modifying
->array_sectors. Lays groundwork to introduce policy on when
->array_sectors can be modified.
Reviewed-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
In preparation for giving userspace control over ->array_sectors we need
to be able to retrieve the 'default' size, and the 'anticipated' size
when a reshape is requested. For personalities that do not reshape emit
a warning if anything but the default size is requested.
In the raid5 case we need to update ->previous_raid_disks to make the
new 'default' size available.
Reviewed-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
This patch renames the "size" field of struct mdk_rdev_s to
"sectors" and changes this field to store sectors instead of
blocks.
All users of this field, linear.c, raid0.c and md.c, are fixed up
accordingly which gets rid of many multiplications and divisions.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
This makes the includes more explicit, and is preparation for moving
md_k.h to drivers/md/md.h
Remove include/raid/md.h as its only remaining use was to #include
other files.
Signed-off-by: NeilBrown <neilb@suse.de>
Move the headers with the local structures for the disciplines and
bitmap.h into drivers/md/ so that they are more easily grepable for
hacking and not far away. md.h is left where it is for now as there
are some uses from the outside.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.de>
The rdev_for_each macro defined in <linux/raid/md_k.h> is identical to
list_for_each_entry_safe, from <linux/list.h>, it should be defined to
use list_for_each_entry_safe, instead of reinventing the wheel.
But some calls to each_entry_safe don't really need a safe version,
just a direct list_for_each_entry is enough, this could save a temp
variable (tmp) in every function that used rdev_for_each.
In this patch, most rdev_for_each loops are replaced by list_for_each_entry,
totally save many tmp vars; and only in the other situations that will call
list_del to delete an entry, the safe version is used.
Signed-off-by: Cheng Renquan <crquan@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
This patch renames the hash_spacing and preshift members of struct
raid0_private_data to spacing and sector_shift respectively and
changes the semantics as follows:
We always have spacing = 2 * hash_spacing. In case
sizeof(sector_t) > sizeof(u32) we also have sector_shift = preshift + 1
while sector_shift = preshift = 0 otherwise.
Note that the values of nb_zone and zone are unaffected by these changes
because in the sector_div() preceeding the assignement of these two
variables both arguments double.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
This completes the block -> sector conversion of struct strip_zone.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
current_offset and curr_zone_offset stored the corresponding offsets
as 1K quantities. Rename them to current_start and curr_zone_start
to match the naming of struct strip_zone and store the offsets as
sector counts.
Also, add KERN_INFO to the printk() affected by this change to make
checkpatch happy.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
For the same reason as in the previous patch, rename it from zone_offset
to zone_start.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Rename zone->dev_offset to zone->dev_start to make sure all users
have been converted.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Since all bio_split calls refer the same single bio_split_pool, the bio_split
function can use bio_split_pool directly instead of the mempool_t parameter;
then the mempool_t parameter can be removed from bio_split param list, and
bio_split_pool is only referred in fs/bio.c file, can be marked static.
Signed-off-by: Denis ChengRq <crquan@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Move stats related fields - stamp, in_flight, dkstats - from disk to
part0 and unify stat handling such that...
* part_stat_*() now updates part0 together if the specified partition
is not part0. ie. part_stat_*() are now essentially all_stat_*().
* {disk|all}_stat_*() are gone.
* part_round_stats() is updated similary. It handles part0 stats
automatically and disk_round_stats() is killed.
* part_{inc|dec}_in_fligh() is implemented which automatically updates
part0 stats for parts other than part0.
* disk_map_sector_rcu() is updated to return part0 if no part matches.
Combined with the above changes, this makes NULL special case
handling in callers unnecessary.
* Separate stats show code paths for disk are collapsed into part
stats show code paths.
* Rename disk_stat_lock/unlock() to part_stat_lock/unlock()
While at it, reposition stat handling macros a bit and add missing
parentheses around macro parameters.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
There are two variants of stat functions - ones prefixed with double
underbars which don't care about preemption and ones without which
disable preemption before manipulating per-cpu counters. It's unclear
whether the underbarred ones assume that preemtion is disabled on
entry as some callers don't do that.
This patch unifies diskstats access by implementing disk_stat_lock()
and disk_stat_unlock() which take care of both RCU (for partition
access) and preemption (for per-cpu counter access). diskstats access
should always be enclosed between the two functions. As such, there's
no need for the versions which disables preemption. They're removed
and double underbars ones are renamed to drop the underbars. As an
extra argument is added, there's no danger of using the old version
unconverted.
disk_stat_lock() uses get_cpu() and returns the cpu index and all
diskstat functions which access per-cpu counters now has @cpu
argument to help RT.
This change adds RCU or preemption operations at some places but also
collapses several preemption ops into one at others. Overall, the
performance difference should be negligible as all involved ops are
very lightweight per-cpu ones.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* 'for-linus' of git://neil.brown.name/md: (52 commits)
md: Protect access to mddev->disks list using RCU
md: only count actual openers as access which prevent a 'stop'
md: linear: Make array_size sector-based and rename it to array_sectors.
md: Make mddev->array_size sector-based.
md: Make super_type->rdev_size_change() take sector-based sizes.
md: Fix check for overlapping devices.
md: Tidy up rdev_size_store a bit:
md: Remove some unused macros.
md: Turn rdev->sb_offset into a sector-based quantity.
md: Make calc_dev_sboffset() return a sector count.
md: Replace calc_dev_size() by calc_num_sectors().
md: Make update_size() take the number of sectors.
md: Better control of when do_md_stop is allowed to stop the array.
md: get_disk_info(): Don't convert between signed and unsigned and back.
md: Simplify restart_array().
md: alloc_disk_sb(): Return proper error value.
md: Simplify sb_equal().
md: Simplify uuid_equal().
md: sb_equal(): Fix misleading printk.
md: Fix a typo in the comment to cmd_match().
...
This patch renames the array_size field of struct mddev_s to array_sectors
and converts all instances to use units of 512 byte sectors instead of 1k
blocks.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
When devices are stacked, one device's merge_bvec_fn may need to perform
the mapping and then call one or more functions for its underlying devices.
The following bio fields are used:
bio->bi_sector
bio->bi_bdev
bio->bi_size
bio->bi_rw using bio_data_dir()
This patch creates a new struct bvec_merge_data holding a copy of those
fields to avoid having to change them directly in the struct bio when
going down the stack only to have to change them back again on the way
back up. (And then when the bio gets mapped for real, the whole
exercise gets repeated, but that's a problem for another day...)
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Milan Broz <mbroz@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
As setting and clearing queue flags now requires that we hold a spinlock
on the queue, and as blk_queue_stack_limits is called without that lock,
get the lock inside blk_queue_stack_limits.
For blk_queue_stack_limits to be able to find the right lock, each md
personality needs to set q->queue_lock to point to the appropriate lock.
Those personalities which didn't previously use a spin_lock, us
q->__queue_lock. So always initialise that lock when allocated.
With this in place, setting/clearing of the QUEUE_FLAG_PLUGGED bit will no
longer cause warnings as it will be clear that the proper lock is held.
Thanks to Dan Williams for review and fixing the silly bugs.
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Alistair John Strachan <alistair@devzero.co.uk>
Cc: Nick Piggin <npiggin@suse.de>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Jacek Luczak <difrost.kernel@gmail.com>
Cc: Prakash Punnoor <prakash@punnoor.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As this is more in line with common practice in the kernel. Also swap the
args around to be more like list_for_each.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Added blk_unplug interface, allowing all invocations of unplugs to result
in a generated blktrace UNPLUG.
Signed-off-by: Alan D. Brunelle <Alan.Brunelle@hp.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
http://bugzilla.kernel.org/show_bug.cgi?id=3277
There is a seq_printf here that isn't being passed a 'seq'. Howeve as the
code is inside #ifdef MD_DEBUG, nobody noticed.
Also remove some extra spaces.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As bi_end_io is only called once when the reqeust is complete,
the 'size' argument is now redundant. Remove it.
Now there is no need for bio_endio to subtract the size completed
from bi_size. So don't do that either.
While we are at it, change bi_end_io to return void.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Some of the code has been gradually transitioned to using the proper
struct request_queue, but there's lots left. So do a full sweet of
the kernel and get rid of this typedef and replace its uses with
the proper type.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
If a raid0 has a component device larger than 4TB, and is accessed on a 32bit
machines, then as 'chunk' is unsigned long,
chunk << chunksize_bits
can overflow (this can be as high as the size of the device in KB). chunk
itself will not overflow (without triggering a BUG).
So change 'chunk' to be 'sector_t, and get rid of the 'BUG' as it becomes
impossible to hit.
Cc: "Jeff Zheng" <Jeff.Zheng@endace.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Each backing_dev needs to be able to report whether it is congested, either by
modulating BDI_*_congested in ->state, or by defining a ->congested_fn.
md/raid did neither of these. This patch add a congested_fn which simply
checks all component devices to see if they are congested.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This loop that sets up the hash_table has problems.
Careful examination will show that the last time through, everything but
the first line is pointless. This is because all it does is change 'cur'
and 'size' and neither of these are used after the loop. This should ring
warning bells... That last time through the loop,
size += conf->strip_zone[cur].size
can index off the end of the strip_zone array. Depending on what it finds
there, it might exit the loop cleanly, or it might spin going further and
further beyond the array until it hits an unmapped address.
This patch rearranges the code so that the last, pointless, iteration of
the loop never happens. i.e. the one statement of the last loop that is
needed is moved the the end of the previous loop - or to before the loop
starts - and the loop counter starts from 1 instead of 0.
Cc: "Don Dupuis" <dondster@gmail.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- version-1 superblock
+ The default_bitmap_offset is in sectors, not bytes.
+ the 'size' field in the superblock is in sectors, not KB
- raid0_run should return a negative number on error, not '1'
- raid10_read_balance should not return a valid 'disk' number if
->rdev turned out to be NULL
- kmem_cache_destroy doesn't like being passed a NULL.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove various things which were checking for gcc-1.x and gcc-2.x compilers.
From: Adrian Bunk <bunk@stusta.de>
Some documentation updates and removes some code paths for gcc < 3.2.
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
md supports multiple different RAID level, each being implemented by a
'personality' (which is often in a separate module).
These personalities have fairly artificial 'numbers'. The numbers
are use to:
1- provide an index into an array where the various personalities
are recorded
2- identify the module (via an alias) which implements are particular
personality.
Neither of these uses really justify the existence of personality numbers.
The array can be replaced by a linked list which is searched (array lookup
only happens very rarely). Module identification can be done using an alias
based on level rather than 'personality' number.
The current 'raid5' modules support two level (4 and 5) but only one
personality. This slight awkwardness (which was handled in the mapping from
level to personality) can be better handled by allowing raid5 to register 2
personalities.
With this change in place, the core md module does not need to have an
exhaustive list of all possible personalities, so other personalities can be
added independently.
This patch also moves the check for chunksize being non-zero into the ->run
routines for the personalities that need it, rather than having it in core-md.
This has a side effect of allowing 'faulty' and 'linear' not to have a
chunk-size set.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Replace multiple kmalloc/memset pairs with kzalloc calls.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Substitute:
page_cache_get -> get_page
page_cache_release -> put_page
PAGE_CACHE_SHIFT -> PAGE_SHIFT
PAGE_CACHE_SIZE -> PAGE_SIZE
PAGE_CACHE_MASK -> PAGE_MASK
__free_page -> put_page
because we aren't using the page cache, we are just using pages.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Instead of having ->read_sectors and ->write_sectors, combine the two
into ->sectors[2] and similar for the other fields. This saves a branch
several places in the io path, since we don't have to care for what the
actual io direction is. On my x86-64 box, that's 200 bytes less text in
just the core (not counting the various drivers).
Signed-off-by: Jens Axboe <axboe@suse.de>
md does not yet support BIO_RW_BARRIER, so be honest about it and fail
(-EOPNOTSUPP) any such requests.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Apparently sector_div is only guaranteed to work with a 32bit divisor, even
on 64bit architectures. So allow for this in raid0.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch removes some unneeded checks of pointers being NULL before
calling kfree() on them. kfree() handles NULL pointers just fine, checking
first is pointless.
Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.
Let it rip!