2005-04-16 22:20:36 +00:00
|
|
|
#ifndef _RAID5_H
|
|
|
|
#define _RAID5_H
|
|
|
|
|
|
|
|
#include <linux/raid/xor.h>
|
2009-08-30 02:09:26 +00:00
|
|
|
#include <linux/dmaengine.h>
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
*
|
2011-07-26 01:34:20 +00:00
|
|
|
* Each stripe contains one buffer per device. Each buffer can be in
|
2005-04-16 22:20:36 +00:00
|
|
|
* one of a number of states stored in "flags". Changes between
|
2011-07-26 01:34:20 +00:00
|
|
|
* these states happen *almost* exclusively under the protection of the
|
|
|
|
* STRIPE_ACTIVE flag. Some very specific changes can happen in bi_end_io, and
|
|
|
|
* these are not protected by STRIPE_ACTIVE.
|
2005-04-16 22:20:36 +00:00
|
|
|
*
|
|
|
|
* The flag bits that are used to represent these states are:
|
|
|
|
* R5_UPTODATE and R5_LOCKED
|
|
|
|
*
|
|
|
|
* State Empty == !UPTODATE, !LOCK
|
|
|
|
* We have no data, and there is no active request
|
|
|
|
* State Want == !UPTODATE, LOCK
|
|
|
|
* A read request is being submitted for this block
|
|
|
|
* State Dirty == UPTODATE, LOCK
|
|
|
|
* Some new data is in this buffer, and it is being written out
|
|
|
|
* State Clean == UPTODATE, !LOCK
|
|
|
|
* We have valid data which is the same as on disc
|
|
|
|
*
|
|
|
|
* The possible state transitions are:
|
|
|
|
*
|
|
|
|
* Empty -> Want - on read or write to get old data for parity calc
|
2011-12-22 23:17:52 +00:00
|
|
|
* Empty -> Dirty - on compute_parity to satisfy write/sync request.
|
2005-04-16 22:20:36 +00:00
|
|
|
* Empty -> Clean - on compute_block when computing a block for failed drive
|
|
|
|
* Want -> Empty - on failed read
|
|
|
|
* Want -> Clean - on successful completion of read request
|
|
|
|
* Dirty -> Clean - on successful completion of write request
|
|
|
|
* Dirty -> Clean - on failed write
|
|
|
|
* Clean -> Dirty - on compute_parity to satisfy write/sync (RECONSTRUCT or RMW)
|
|
|
|
*
|
|
|
|
* The Want->Empty, Want->Clean and Dirty->Clean transitions
|
|
|
|
* all happen in b_end_io at interrupt time.
|
|
|
|
* Each sets the Uptodate bit before releasing the Lock bit.
|
|
|
|
* This leaves one multi-stage transition:
|
|
|
|
* Want->Dirty->Clean
|
|
|
|
* This is safe because thinking that a Clean buffer is actually dirty
|
|
|
|
* will at worst delay some action, and the stripe will be scheduled
|
|
|
|
* for attention after the transition is complete.
|
|
|
|
*
|
|
|
|
* There is one possibility that is not covered by these states. That
|
|
|
|
* is if one drive has failed and there is a spare being rebuilt. We
|
|
|
|
* can't distinguish between a clean block that has been generated
|
|
|
|
* from parity calculations, and a clean block that has been
|
|
|
|
* successfully written to the spare (or to parity when resyncing).
|
2013-09-18 04:00:43 +00:00
|
|
|
* To distinguish these states we have a stripe bit STRIPE_INSYNC that
|
2005-04-16 22:20:36 +00:00
|
|
|
* is set whenever a write is scheduled to the spare, or to the parity
|
|
|
|
* disc if there is no spare. A sync request clears this bit, and
|
|
|
|
* when we find it set with no buffers locked, we know the sync is
|
|
|
|
* complete.
|
|
|
|
*
|
|
|
|
* Buffers for the md device that arrive via make_request are attached
|
|
|
|
* to the appropriate stripe in one of two lists linked on b_reqnext.
|
|
|
|
* One list (bh_read) for read requests, one (bh_write) for write.
|
|
|
|
* There should never be more than one buffer on the two lists
|
|
|
|
* together, but we are not guaranteed of that so we allow for more.
|
|
|
|
*
|
|
|
|
* If a buffer is on the read list when the associated cache buffer is
|
|
|
|
* Uptodate, the data is copied into the read buffer and its b_end_io
|
|
|
|
* routine is called. This may happen in the end_request routine only
|
|
|
|
* if the buffer has just successfully been read. end_request should
|
|
|
|
* remove the buffers from the list and then set the Uptodate bit on
|
|
|
|
* the buffer. Other threads may do this only if they first check
|
|
|
|
* that the Uptodate bit is set. Once they have checked that they may
|
|
|
|
* take buffers off the read queue.
|
|
|
|
*
|
|
|
|
* When a buffer on the write list is committed for write it is copied
|
|
|
|
* into the cache buffer, which is then marked dirty, and moved onto a
|
|
|
|
* third list, the written list (bh_written). Once both the parity
|
|
|
|
* block and the cached buffer are successfully written, any buffer on
|
|
|
|
* a written list can be returned with b_end_io.
|
|
|
|
*
|
2011-07-26 01:34:20 +00:00
|
|
|
* The write list and read list both act as fifos. The read list,
|
|
|
|
* write list and written list are protected by the device_lock.
|
|
|
|
* The device_lock is only for list manipulations and will only be
|
|
|
|
* held for a very short time. It can be claimed from interrupts.
|
2005-04-16 22:20:36 +00:00
|
|
|
*
|
|
|
|
*
|
|
|
|
* Stripes in the stripe cache can be on one of two lists (or on
|
|
|
|
* neither). The "inactive_list" contains stripes which are not
|
|
|
|
* currently being used for any request. They can freely be reused
|
|
|
|
* for another stripe. The "handle_list" contains stripes that need
|
|
|
|
* to be handled in some way. Both of these are fifo queues. Each
|
|
|
|
* stripe is also (potentially) linked to a hash bucket in the hash
|
|
|
|
* table so that it can be found by sector number. Stripes that are
|
|
|
|
* not hashed must be on the inactive_list, and will normally be at
|
|
|
|
* the front. All stripes start life this way.
|
|
|
|
*
|
|
|
|
* The inactive_list, handle_list and hash bucket lists are all protected by the
|
|
|
|
* device_lock.
|
|
|
|
* - stripes have a reference counter. If count==0, they are on a list.
|
|
|
|
* - If a stripe might need handling, STRIPE_HANDLE is set.
|
|
|
|
* - When refcount reaches zero, then if STRIPE_HANDLE it is put on
|
|
|
|
* handle_list else inactive_list
|
|
|
|
*
|
|
|
|
* This, combined with the fact that STRIPE_HANDLE is only ever
|
|
|
|
* cleared while a stripe has a non-zero count means that if the
|
|
|
|
* refcount is 0 and STRIPE_HANDLE is set, then it is on the
|
|
|
|
* handle_list and if refcount is 0 and STRIPE_HANDLE is not set, then
|
|
|
|
* the stripe is on inactive_list.
|
|
|
|
*
|
|
|
|
* The possible transitions are:
|
|
|
|
* activate an unhashed/inactive stripe (get_active_stripe())
|
|
|
|
* lockdev check-hash unlink-stripe cnt++ clean-stripe hash-stripe unlockdev
|
|
|
|
* activate a hashed, possibly active stripe (get_active_stripe())
|
|
|
|
* lockdev check-hash if(!cnt++)unlink-stripe unlockdev
|
|
|
|
* attach a request to an active stripe (add_stripe_bh())
|
|
|
|
* lockdev attach-buffer unlockdev
|
|
|
|
* handle a stripe (handle_stripe())
|
2011-07-26 01:34:20 +00:00
|
|
|
* setSTRIPE_ACTIVE, clrSTRIPE_HANDLE ...
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type)
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-02 20:52:30 +00:00
|
|
|
* (lockdev check-buffers unlockdev) ..
|
|
|
|
* change-state ..
|
2011-07-26 01:34:20 +00:00
|
|
|
* record io/ops needed clearSTRIPE_ACTIVE schedule io/ops
|
2005-04-16 22:20:36 +00:00
|
|
|
* release an active stripe (release_stripe())
|
|
|
|
* lockdev if (!--cnt) { if STRIPE_HANDLE, add to handle_list else add to inactive-list } unlockdev
|
|
|
|
*
|
|
|
|
* The refcount counts each thread that has activated the stripe,
|
|
|
|
* plus raid5d if it is handling it, plus one for each active request
|
2007-01-02 20:52:30 +00:00
|
|
|
* on a cached buffer, and plus one if the stripe is undergoing stripe
|
|
|
|
* operations.
|
|
|
|
*
|
2011-07-26 01:34:20 +00:00
|
|
|
* The stripe operations are:
|
2007-01-02 20:52:30 +00:00
|
|
|
* -copying data between the stripe cache and user application buffers
|
|
|
|
* -computing blocks to save a disk access, or to recover a missing block
|
|
|
|
* -updating the parity on a write operation (reconstruct write and
|
|
|
|
* read-modify-write)
|
|
|
|
* -checking parity correctness
|
|
|
|
* -running i/o to disk
|
|
|
|
* These operations are carried out by raid5_run_ops which uses the async_tx
|
|
|
|
* api to (optionally) offload operations to dedicated hardware engines.
|
|
|
|
* When requesting an operation handle_stripe sets the pending bit for the
|
|
|
|
* operation and increments the count. raid5_run_ops is then run whenever
|
|
|
|
* the count is non-zero.
|
|
|
|
* There are some critical dependencies between the operations that prevent some
|
|
|
|
* from being requested while another is in flight.
|
|
|
|
* 1/ Parity check operations destroy the in cache version of the parity block,
|
|
|
|
* so we prevent parity dependent operations like writes and compute_blocks
|
|
|
|
* from starting while a check is in progress. Some dma engines can perform
|
|
|
|
* the check without damaging the parity block, in these cases the parity
|
|
|
|
* block is re-marked up to date (assuming the check was successful) and is
|
|
|
|
* not re-read from disk.
|
|
|
|
* 2/ When a write operation is requested we immediately lock the affected
|
|
|
|
* blocks, and mark them as not up to date. This causes new read requests
|
|
|
|
* to be held off, as well as parity checks and compute block operations.
|
|
|
|
* 3/ Once a compute block operation has been requested handle_stripe treats
|
|
|
|
* that block as if it is up to date. raid5_run_ops guarantees that any
|
|
|
|
* operation that is dependent on the compute block result is initiated after
|
|
|
|
* the compute block completes.
|
2005-04-16 22:20:36 +00:00
|
|
|
*/
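The four buffer states described in the comment above can be sketched as a small classifier over the two flag bits. This is a minimal userspace illustration, not the kernel's code: the `BUF_UPTODATE` and `BUF_LOCKED` masks, the `buf_state` enum and both helper functions are hypothetical names chosen for this example, and the transition check covers only the single-step transitions the comment lists.

```c
#include <stdbool.h>

/* Hypothetical bit masks for illustration; the kernel numbers its flags itself. */
#define BUF_UPTODATE (1u << 0)
#define BUF_LOCKED   (1u << 1)

/* The four states named in the comment. */
enum buf_state { BUF_EMPTY, BUF_WANT, BUF_DIRTY, BUF_CLEAN };

/* Map the two flag bits onto Empty / Want / Dirty / Clean. */
static enum buf_state buf_classify(unsigned int flags)
{
    bool uptodate = flags & BUF_UPTODATE;
    bool locked = flags & BUF_LOCKED;

    if (!uptodate)
        return locked ? BUF_WANT : BUF_EMPTY;
    return locked ? BUF_DIRTY : BUF_CLEAN;
}

/* True if from -> to is one of the single-step transitions listed above. */
static bool buf_transition_allowed(enum buf_state from, enum buf_state to)
{
    switch (from) {
    case BUF_EMPTY:
        return to == BUF_WANT || to == BUF_DIRTY || to == BUF_CLEAN;
    case BUF_WANT:
        return to == BUF_EMPTY || to == BUF_CLEAN;
    case BUF_DIRTY:
        return to == BUF_CLEAN;
    case BUF_CLEAN:
        return to == BUF_DIRTY;
    }
    return false;
}
```

For instance, `buf_classify(BUF_UPTODATE | BUF_LOCKED)` yields `BUF_DIRTY`, and `buf_transition_allowed(BUF_WANT, BUF_DIRTY)` is false because the list has no direct Want -> Dirty entry.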
|
|
|
|
|
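The reference-count rule from the comment (a stripe with count==0 is on a list, and STRIPE_HANDLE decides which one) can be modelled in a few lines. This is a simplified sketch under stated assumptions: `toy_stripe`, `toy_get` and `toy_release` are hypothetical names, the lists are reduced to an enum tag, and the device_lock that the comment says protects these steps is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Which list the stripe is linked on; a stand-in for real list membership. */
enum stripe_list { LIST_NONE, LIST_INACTIVE, LIST_HANDLE };

struct toy_stripe {
    int count;             /* reference count */
    bool need_handle;      /* stands in for the STRIPE_HANDLE bit */
    enum stripe_list list; /* list the stripe is currently on */
};

/* Take a reference; the first reference unlinks the stripe from its list. */
static void toy_get(struct toy_stripe *sh)
{
    if (sh->count++ == 0)
        sh->list = LIST_NONE;
}

/* Drop a reference; the last one puts the stripe back on a list. */
static void toy_release(struct toy_stripe *sh)
{
    assert(sh->count > 0);
    if (--sh->count == 0)
        sh->list = sh->need_handle ? LIST_HANDLE : LIST_INACTIVE;
}
```

Because the list is chosen only when the count drops to zero, a stripe whose `need_handle` is set while it is still referenced stays off both lists until the final `toy_release`.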
2008-06-27 22:31:57 +00:00
|
|
|
/*
|
2014-09-30 04:23:59 +00:00
|
|
|
* Operations state - intermediate states that are visible outside of
|
2011-07-26 01:34:20 +00:00
|
|
|
* STRIPE_ACTIVE.
|
2008-06-27 22:31:57 +00:00
|
|
|
* In general _idle indicates nothing is running, _run indicates a data
|
|
|
|
* processing operation is active, and _result means the data processing result
|
|
|
|
* is stable and can be acted upon. For simple operations like biofill and
|
|
|
|
* compute that only have an _idle and _run state they are indicated with
|
|
|
|
* sh->state flags (STRIPE_BIOFILL_RUN and STRIPE_COMPUTE_RUN)
|
|
|
|
*/
|
|
|
|
/**
|
|
|
|
* enum check_states - handles syncing / repairing a stripe
|
|
|
|
* @check_state_idle - check operations are quiesced
|
|
|
|
* @check_state_run - check operation is running
|
|
|
|
* @check_state_check_result - set outside lock when check result is valid
|
|
|
|
* @check_state_compute_run - check failed and we are repairing
|
|
|
|
* @check_state_compute_result - set outside lock when compute result is valid
|
|
|
|
*/
|
|
|
|
enum check_states {
|
|
|
|
check_state_idle = 0,
|
2009-07-14 20:40:19 +00:00
|
|
|
check_state_run, /* xor parity check */
|
|
|
|
check_state_run_q, /* q-parity check */
|
|
|
|
check_state_run_pq, /* pq dual parity check */
|
2008-06-27 22:31:57 +00:00
|
|
|
check_state_check_result,
|
|
|
|
check_state_compute_run, /* parity repair */
|
|
|
|
check_state_compute_result,
|
|
|
|
};
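To make the pairing of run and result states concrete, the sketch below maps each running check state to the result state that would follow it. The enum is copied from the declaration above so the example compiles on its own; `check_state_after_completion` is a hypothetical helper written for this illustration, and the real driver may advance the state differently.

```c
/* Copied from the declaration above so this example is self-contained. */
enum check_states {
    check_state_idle = 0,
    check_state_run,            /* xor parity check */
    check_state_run_q,          /* q-parity check */
    check_state_run_pq,         /* pq dual parity check */
    check_state_check_result,
    check_state_compute_run,    /* parity repair */
    check_state_compute_result,
};

/*
 * Hypothetical helper: the state a stripe would move to once the
 * operation implied by a _run state has finished. States that are
 * not running are returned unchanged.
 */
static enum check_states check_state_after_completion(enum check_states s)
{
    switch (s) {
    case check_state_run:
    case check_state_run_q:
    case check_state_run_pq:
        return check_state_check_result;
    case check_state_compute_run:
        return check_state_compute_result;
    default:
        return s;
    }
}
```

All three check variants funnel into the single `check_state_check_result`, which matches the enum having one check result state but three ways of running the check.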
|
|
|
|
|
|
|
|
/**
|
|
|
|
* enum reconstruct_states - handles writing or expanding a stripe
|
|
|
|
*/
|
|
|
|
enum reconstruct_states {
|
|
|
|
reconstruct_state_idle = 0,
|
2008-06-27 22:32:06 +00:00
|
|
|
reconstruct_state_prexor_drain_run, /* prexor-write */
|
2008-06-27 22:31:57 +00:00
|
|
|
reconstruct_state_drain_run, /* write */
|
|
|
|
reconstruct_state_run, /* expand */
|
2008-06-27 22:32:06 +00:00
|
|
|
reconstruct_state_prexor_drain_result,
|
2008-06-27 22:31:57 +00:00
|
|
|
reconstruct_state_drain_result,
|
|
|
|
reconstruct_state_result,
|
|
|
|
};
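The _idle / _run / _result naming convention described earlier can be expressed as a phase classifier over this enum. The sketch is an assumption-laden illustration: `op_phase` and `reconstruct_phase` are hypothetical names that do not appear in the header, and the enum is repeated only so the example is self-contained.

```c
/* Copied from the declaration above so this example is self-contained. */
enum reconstruct_states {
    reconstruct_state_idle = 0,
    reconstruct_state_prexor_drain_run,     /* prexor-write */
    reconstruct_state_drain_run,            /* write */
    reconstruct_state_run,                  /* expand */
    reconstruct_state_prexor_drain_result,
    reconstruct_state_drain_result,
    reconstruct_state_result,
};

/* Hypothetical coarse phases following the naming convention. */
enum op_phase { OP_IDLE, OP_RUNNING, OP_RESULT };

/* Classify a reconstruct state as idle, running, or holding a stable result. */
static enum op_phase reconstruct_phase(enum reconstruct_states s)
{
    switch (s) {
    case reconstruct_state_idle:
        return OP_IDLE;
    case reconstruct_state_prexor_drain_run:
    case reconstruct_state_drain_run:
    case reconstruct_state_run:
        return OP_RUNNING;
    case reconstruct_state_prexor_drain_result:
    case reconstruct_state_drain_result:
    case reconstruct_state_result:
        return OP_RESULT;
    }
    return OP_IDLE; /* unreachable for valid enum values */
}
```

A caller could use such a predicate to decide whether a result is ready to be acted upon, for example `reconstruct_phase(reconstruct_state_drain_result) == OP_RESULT`.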
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
struct stripe_head {
|
2006-01-06 08:20:33 +00:00
|
|
|
struct hlist_node hash;
|
2009-03-31 03:39:38 +00:00
|
|
|
struct list_head lru; /* inactive_list or handle_list */
|
2013-08-27 09:50:39 +00:00
|
|
|
struct llist_node release_list;
|
2011-10-11 05:49:52 +00:00
|
|
|
struct r5conf *raid_conf;
|
2009-03-31 04:19:03 +00:00
|
|
|
short generation; /* increments with every
|
|
|
|
* reshape */
|
2009-03-31 03:39:38 +00:00
|
|
|
sector_t sector; /* sector of this row */
|
|
|
|
short pd_idx; /* parity disk index */
|
|
|
|
short qd_idx; /* 'Q' disk index for raid6 */
|
2009-03-31 03:39:38 +00:00
|
|
|
short ddf_layout;/* use DDF ordering to calculate Q */
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by its lock_hash too. The lock_hash is derived from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 04:16:17 +00:00
|
|
|
short hash_lock_index;
|
2009-03-31 03:39:38 +00:00
|
|
|
unsigned long state; /* state flags */
|
|
|
|
atomic_t count; /* nr of active thread/requests */
|
2005-09-09 23:23:54 +00:00
|
|
|
int bm_seq; /* sequence number for bitmap flushes */
|
2009-03-31 03:39:38 +00:00
|
|
|
int disks; /* disks in stripe */
|
2014-12-15 01:57:03 +00:00
|
|
|
int overwrite_disks; /* total overwrite disks in stripe,
|
|
|
|
* this is only checked when stripe
|
|
|
|
* has STRIPE_BATCH_READY
|
|
|
|
*/
|
2008-06-27 22:31:57 +00:00
|
|
|
enum check_states check_state;
|
md: replace STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} with 'reconstruct_states'
From: Dan Williams <dan.j.williams@intel.com>
Track the state of reconstruct operations (recalculating the parity block
usually due to incoming writes, or as part of array expansion) Reduces the
scope of the STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} flags to only tracking whether
a reconstruct operation has been requested via the ops_request field of struct
stripe_head_state.
This is the final step in the removal of ops.{pending,ack,complete,count}, i.e.
the STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} flags only request an operation and do
not track the state of the operation.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-27 22:32:05 +00:00
|
|
|
enum reconstruct_states reconstruct_state;
|
raid5: add a per-stripe lock
Add a per-stripe lock to protect stripe specific data. The purpose is to reduce
lock contention of conf->device_lock.
stripe ->toread, ->towrite are protected by per-stripe lock. Accessing bio
list of the stripe is always serialized by this lock, so adding bio to the
lists (add_stripe_bio()) and removing bio from the lists (like
ops_run_biofill()) not race.
If bio in ->read, ->written ... list are not shared by multiple stripes, we
don't need any lock to protect ->read, ->written, because STRIPE_ACTIVE will
protect them. If the bio are shared, there are two protections:
1. bi_phys_segments acts as a reference count
2. traverse the list uses r5_next_bio, which makes traverse never access bio
not belonging to the stripe
Let's have an example:
| stripe1 | stripe2 | stripe3 |
...bio1......|bio2|bio3|....bio4.....
stripe2 has 4 bios, when it's finished, it will decrement bi_phys_segments for
all bios, but only end_bio for bio2 and bio3. bio1->bi_next still points to
bio2, but this doesn't matter. When stripe1 is finished, it will not touch bio2
because of r5_next_bio check. Next time stripe1 will end_bio for bio1 and
stripe3 will end_bio bio4.
Before add_stripe_bio() adds a bio to a stripe, we already increment the bio
bi_phys_segments, so don't worry other stripes release the bio.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-19 06:01:31 +00:00
|
|
|
spinlock_t stripe_lock;
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurability. Different setup of raid5 array might need different
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 06:30:16 +00:00
|
|
|
int cpu;
|
2013-08-29 07:40:32 +00:00
|
|
|
struct r5worker_group *group;
|
2014-12-15 01:57:03 +00:00
|
|
|
|
|
|
|
struct stripe_head *batch_head; /* protected by stripe lock */
|
|
|
|
spinlock_t batch_lock; /* only header's lock is useful */
|
|
|
|
struct list_head batch_list; /* protected by head's batch lock*/
|
raid5: add basic stripe log
This introduces a simple log for raid5. Data/parity writing to raid
array first writes to the log, then write to raid array disks. If
crash happens, we can recovery data from the log. This can speed up
raid resync and fix write hole issue.
The log structure is pretty simple. Data/meta data is stored in block
unit, which is 4k generally. It has only one type of meta data block.
The meta data block can track 3 types of data, stripe data, stripe
parity and flush block. MD superblock will point to the last valid
meta data block. Each meta data block has checksum/seq number, so
recovery can scan the log correctly. We store a checksum of stripe
data/parity to the metadata block, so meta data and stripe data/parity
can be written to log disk together. otherwise, meta data write must
wait till stripe data/parity is finished.
For stripe data, meta data block will record stripe data sector and
size. Currently the size is always 4k. This meta data record can be made
simpler if we just fix write hole (eg, we can record data of a stripe's
different disks together), but this format can be extended to support
caching in the future, which must record data address/size.
For stripe parity, meta data block will record stripe sector. It's
size should be 4k (for raid5) or 8k (for raid6). We always store p
parity first. This format should work for caching too.
flush block indicates a stripe is in raid array disks. Fixing write
hole doesn't need this type of meta data, it's for caching extension.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-13 21:31:59 +00:00
|
|
|
|
|
|
|
struct r5l_io_unit *log_io;
|
|
|
|
struct list_head log_list;
|
md/r5cache: write-out phase and reclaim support
There are two limited resources, stripe cache and journal disk space.
For better performance, we prioritize reclaim of full stripe writes.
To free up more journal space, we free earliest data on the journal.
In current implementation, reclaim happens when:
1. Periodically (every R5C_RECLAIM_WAKEUP_INTERVAL, 30 seconds) reclaim
if there is no reclaim in the past 5 seconds.
2. when there are R5C_FULL_STRIPE_FLUSH_BATCH (256) cached full stripes,
or the cached stripes are enough for a full stripe (chunk size / 4k)
(r5c_check_cached_full_stripe)
3. when there is pressure on stripe cache (r5c_check_stripe_cache_usage)
4. when there is pressure on journal space (r5l_write_stripe, r5c_cache_data)
r5c_do_reclaim() contains new logic of reclaim.
For stripe cache:
When stripe cache pressure is high (more than 3/4 of stripes are cached,
or there are empty inactive lists), flush all full stripes. If fewer
than R5C_RECLAIM_STRIPE_GROUP (NR_STRIPE_HASH_LOCKS * 2) full stripes
are flushed, flush some partial stripes. When stripe cache pressure
is moderate (1/2 to 3/4 of stripes are cached), flush all full stripes.
For log space:
To avoid deadlock due to log space, we need to reserve enough space
to flush cached data. The size of required log space depends on total
number of cached stripes (stripe_in_journal_count). In the current
implementation, the writing-out phase automatically includes pending
data writes with parity writes (similar to the write-through case).
Therefore, we need up to (conf->raid_disks + 1) pages for each cached
stripe (1 page for meta data, raid_disks pages for all data and
parity). r5c_log_required_to_flush_cache() calculates log space
required to flush cache. In the following, we refer to the space
calculated by r5c_log_required_to_flush_cache() as
reclaim_required_space.
Two flags are added to r5conf->cache_state: R5C_LOG_TIGHT and
R5C_LOG_CRITICAL. R5C_LOG_TIGHT is set when free space on the log
device is less than 3x of reclaim_required_space. R5C_LOG_CRITICAL
is set when free space on the log device is less than 2x of
reclaim_required_space.
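The space calculation and the two thresholds described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the kernel's code: the names `reclaim_required_pages`, `classify_log_space` and `enum log_pressure` are hypothetical, space is counted in pages for simplicity, and only the most severe level is reported, whereas the commit describes two separate flag bits.

```c
#include <stdint.h>

/* Hypothetical names for illustration; space is counted in pages. */
enum log_pressure { LOG_OK, LOG_TIGHT, LOG_CRITICAL };

/*
 * Up to (raid_disks + 1) pages per cached stripe: 1 page for meta data
 * plus raid_disks pages for all data and parity.
 */
uint64_t reclaim_required_pages(uint64_t cached_stripes, unsigned int raid_disks)
{
	return cached_stripes * ((uint64_t)raid_disks + 1);
}

/*
 * Free space below 2x the required space is "critical";
 * below 3x it is "tight"; otherwise there is no pressure.
 */
enum log_pressure classify_log_space(uint64_t free_pages, uint64_t required)
{
	if (free_pages < 2 * required)
		return LOG_CRITICAL;
	if (free_pages < 3 * required)
		return LOG_TIGHT;
	return LOG_OK;
}
```

With 10 cached stripes on a 4-disk array, this sketch reserves 50 pages, so the log would count as tight below 150 free pages and critical below 100.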
r5c_cache keeps all data in cache (not fully committed to RAID) in
a list (stripe_in_journal_list). These stripes are in the order of their
first appearance on the journal. So the log tail (last_checkpoint)
should point to the journal_start of the first item in the list.
When R5C_LOG_TIGHT is set, r5l_reclaim_thread starts flushing out
stripes at the head of stripe_in_journal. When R5C_LOG_CRITICAL is
set, the state machine only writes data that are already in the
log device (in stripe_in_journal_list).
This patch includes a fix to improve performance by
Shaohua Li <shli@fb.com>.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-17 23:24:40 +00:00
|
|
|
sector_t log_start; /* first meta block on the journal */
|
|
|
|
struct list_head r5c; /* for r5c_cache->stripe_in_journal */
|
2009-10-16 05:25:22 +00:00
|
|
|
/**
|
|
|
|
* struct stripe_operations
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type)
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
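The four-step flow above can be sketched with three plain bit masks. This is a single-threaded illustration only: the `pending`, `ack` and `complete` fields and the STRIPE_OP_* names come from the commit message, while `struct ops_sketch` and the helper functions are hypothetical, and the real code relies on atomic bit operations and the async_tx api rather than direct calls.

```c
#include <stdbool.h>

/* Operation types named in the commit message; the bit numbers are illustrative. */
enum {
	STRIPE_OP_BIOFILL, STRIPE_OP_COMPUTE_BLK, STRIPE_OP_PREXOR,
	STRIPE_OP_BIODRAIN, STRIPE_OP_POSTXOR, STRIPE_OP_CHECK, STRIPE_OP_IO
};

/* Hypothetical stand-in for the request-tracking part of sh->ops. */
struct ops_sketch {
	unsigned long pending, ack, complete;
};

/* 1/ handle_stripe requests an operation. */
void request_op(struct ops_sketch *ops, int op)
{
	ops->pending |= 1UL << op;
}

/* 2/ the run-ops step picks up a pending operation exactly once. */
bool ack_op(struct ops_sketch *ops, int op)
{
	unsigned long bit = 1UL << op;

	if (!(ops->pending & bit) || (ops->ack & bit))
		return false;
	ops->ack |= bit;
	return true;
}

/* 3/ the completion callback marks the operation as done. */
void complete_op(struct ops_sketch *ops, int op)
{
	ops->complete |= 1UL << op;
}

/* 4/ handle_stripe finishes a completed operation and clears its bits. */
bool finish_op(struct ops_sketch *ops, int op)
{
	unsigned long bit = 1UL << op;

	if (!(ops->complete & bit))
		return false;
	ops->pending &= ~bit;
	ops->ack &= ~bit;
	ops->complete &= ~bit;
	return true;
}
```

Keeping the three masks separate lets a second pass over the stripe tell a new request from one that is already in flight.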
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-02 20:52:30 +00:00
|
|
|
* @target - STRIPE_OP_COMPUTE_BLK target
|
2009-10-16 05:25:22 +00:00
|
|
|
* @target2 - 2nd compute target in the raid6 case
|
|
|
|
* @zero_sum_result - P and Q verification flags
|
|
|
|
* @request - async service request flags for raid_run_ops
|
|
|
|
*/
|
|
|
|
struct stripe_operations {
|
2009-07-14 20:40:19 +00:00
|
|
|
int target, target2;
|
2009-08-30 02:09:26 +00:00
|
|
|
enum sum_check_flags zero_sum_result;
|
|
|
|
} ops;
|
2005-04-16 22:20:36 +00:00
|
|
|
struct r5dev {
|
2011-12-22 23:17:52 +00:00
|
|
|
/* rreq and rvec are used for the replacement device when
|
|
|
|
* writing data to both devices.
|
|
|
|
*/
|
|
|
|
struct bio req, rreq;
|
|
|
|
struct bio_vec vec, rvec;
|
raid5: add an option to avoid copy data from bio to stripe cache
The stripe cache has two goals:
1. cache data, so next time if data can be found in stripe cache, disk access
can be avoided.
2. stable data. data is copied from the bio to the stripe cache and parity is
calculated from it. data written to disk comes from the stripe cache, so if the
upper layer changes the bio data, the data written to disk isn't impacted.
In my environment, I can guarantee 2 will not happen. And BDI_CAP_STABLE_WRITES
can guarantee 2 too. For 1, it's not common either. The block plug mechanism will
dispatch a bunch of sequential small requests together. And since I'm using
SSD, I'm using a small chunk size. It's rare that the stripe cache is really useful.
So I'd like to avoid the copy from bio to stripe cache, which is very helpful
for performance. In my 1M randwrite tests, avoiding the copy can increase
performance by more than 30%.
Of course, this shouldn't be enabled by default. It has been reported that enabling
BDI_CAP_STABLE_WRITES can harm some workloads, so I added an option to
control it.
Neilb:
changed BUG_ON to WARN_ON
Removed some assignments from raid5_build_block which are now not needed.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-05-21 09:57:44 +00:00
|
|
|
struct page *page, *orig_page;
|
|
|
|
struct bio *toread, *read, *towrite, *written;
|
2005-04-16 22:20:36 +00:00
|
|
|
sector_t sector; /* sector of this page */
|
|
|
|
unsigned long flags;
|
|
|
|
u32 log_checksum;
|
2005-04-16 22:20:36 +00:00
|
|
|
} dev[1]; /* allocated with extra space depending on RAID geometry */
|
|
|
|
};
|
2007-07-09 18:56:43 +00:00
|
|
|
|
|
|
|
/* stripe_head_state - collects and tracks the dynamic state of a stripe_head
|
2011-07-26 01:34:20 +00:00
|
|
|
* for handle_stripe.
|
2007-07-09 18:56:43 +00:00
|
|
|
*/
|
|
|
|
struct stripe_head_state {
|
2011-12-22 23:17:53 +00:00
|
|
|
/* 'syncing' means that we need to read all devices, either
|
|
|
|
* to check/correct parity, or to reconstruct a missing device.
|
|
|
|
* 'replacing' means we are replacing one or more drives and
|
|
|
|
* the source is valid at this point so we don't need to
|
|
|
|
* read all devices, just the replacement targets.
|
|
|
|
*/
|
|
|
|
int syncing, expanding, expanded, replacing;
|
2007-07-09 18:56:43 +00:00
|
|
|
int locked, uptodate, to_read, to_write, failed, written;
|
2007-01-02 20:52:31 +00:00
|
|
|
int to_fill, compute, req_compute, non_overwrite;
|
md/r5cache: caching phase of r5cache
As described in previous patch, write back cache operates in two
phases: caching and writing-out. The caching phase works as:
1. write data to journal
(r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
(r5c_handle_data_cached, r5c_return_dev_pending_writes).
Then the writing-out phase is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calculate parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks
This patch implements caching phase. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.
Writing-out phase of the cache is implemented in the next patch.
With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multiple IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.
This patch adds 2 flags to stripe_head.state:
- STRIPE_R5C_PARTIAL_STRIPE,
- STRIPE_R5C_FULL_STRIPE,
Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".
For RMW, the code allocates an extra page for each data block
being updated. This is stored in r5dev->orig_page and the old data
is read into it. Then the prexor calculation subtracts ->orig_page
from the parity block, and the reconstruct calculation adds the
->page data back into the parity block.
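For XOR (P) parity, "subtract" and "add" are the same byte-wise XOR, so the RMW update described above can be sketched as below. This is an illustration only: `xor_into` and `rmw_update_parity` are hypothetical helpers, the buffers stand in for the parity page, `->orig_page` and `->page`, and raid6 Q parity is not covered.

```c
#include <stddef.h>
#include <stdint.h>

/* XOR src into parity, byte by byte. */
void xor_into(uint8_t *parity, const uint8_t *src, size_t len)
{
	for (size_t i = 0; i < len; i++)
		parity[i] ^= src[i];
}

/*
 * Read-modify-write parity update for one data block:
 * prexor removes the old data (orig_page) from the parity,
 * then the new data (page) is added back in.
 */
void rmw_update_parity(uint8_t *parity, const uint8_t *orig_page,
		       const uint8_t *page, size_t len)
{
	xor_into(parity, orig_page, len);	/* prexor: subtract old data */
	xor_into(parity, page, len);		/* add new data */
}
```

Only the block being updated and the parity block are touched; the other data blocks of the stripe do not need to be read.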
r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.
There are some known limitations of the cache implementation:
1. Write cache only covers full page writes (R5_OVERWRITE). Writes
of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at any time. Later
writes for the same stripe have to wait. This can be improved by
moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
the log, so recovery code has to replay more than necessary data
(sometimes all the log from last_checkpoint). This reduces
availability of the array.
This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-17 23:24:39 +00:00
|
|
|
int injournal, just_cached;
|
2011-07-26 01:35:19 +00:00
|
|
|
int failed_num[2];
|
|
|
|
int p_failed, q_failed;
|
2011-07-26 01:35:20 +00:00
|
|
|
int dec_preread_active;
|
|
|
|
unsigned long ops_request;
|
|
|
|
|
2015-08-14 02:07:57 +00:00
|
|
|
struct bio_list return_bi;
|
2011-10-11 05:45:26 +00:00
|
|
|
struct md_rdev *blocked_rdev;
|
2011-07-28 01:39:22 +00:00
|
|
|
int handle_bad_blocks;
|
2015-10-09 04:54:08 +00:00
|
|
|
int log_failed;
|
2016-11-24 06:50:39 +00:00
|
|
|
int waiting_extra_page;
|
2007-07-09 18:56:43 +00:00
|
|
|
};
|
|
|
|
|
2011-12-22 23:17:52 +00:00
|
|
|
/* Flags for struct r5dev.flags */
|
|
|
|
enum r5dev_flags {
|
|
|
|
R5_UPTODATE, /* page contains current data */
|
|
|
|
R5_LOCKED, /* IO has been submitted on "req" */
|
2011-12-22 23:17:53 +00:00
|
|
|
R5_DOUBLE_LOCKED,/* Cannot clear R5_LOCKED until 2 writes complete */
|
2011-12-22 23:17:52 +00:00
|
|
|
R5_OVERWRITE, /* towrite covers whole page */
|
2005-04-16 22:20:36 +00:00
|
|
|
/* and some that are internal to handle_stripe */
|
2011-12-22 23:17:52 +00:00
|
|
|
R5_Insync, /* rdev && rdev->in_sync at start */
|
|
|
|
R5_Wantread, /* want to schedule a read */
|
|
|
|
R5_Wantwrite,
|
|
|
|
R5_Overlap, /* There is a pending overlapping request
|
|
|
|
* on this block */
|
2012-07-31 00:04:21 +00:00
|
|
|
R5_ReadNoMerge, /* prevent bio from merging in block-layer */
|
2011-12-22 23:17:52 +00:00
|
|
|
R5_ReadError, /* seen a read error here recently */
|
|
|
|
R5_ReWrite, /* have tried to over-write the readerror */
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2011-12-22 23:17:52 +00:00
|
|
|
R5_Expanded, /* This block now has post-expand data */
|
|
|
|
R5_Wantcompute, /* compute_block in progress treat as
|
|
|
|
* uptodate
|
|
|
|
*/
|
|
|
|
R5_Wantfill, /* dev->toread contains a bio that needs
|
|
|
|
* filling
|
|
|
|
*/
|
|
|
|
R5_Wantdrain, /* dev->towrite needs to be drained */
|
|
|
|
R5_WantFUA, /* Write should be FUA */
|
2012-05-22 03:55:05 +00:00
|
|
|
R5_SyncIO, /* The IO is sync */
|
2011-12-22 23:17:52 +00:00
|
|
|
R5_WriteError, /* got a write error - need to record it */
|
|
|
|
R5_MadeGood, /* A bad block has been fixed by writing to it */
|
|
|
|
R5_ReadRepl, /* Will/did read from replacement rather than orig */
|
|
|
|
R5_MadeGoodRepl,/* A bad block on the replacement device has been
|
|
|
|
* fixed by writing to it */
|
2011-12-22 23:17:53 +00:00
|
|
|
R5_NeedReplace, /* This device has a replacement which is not
|
|
|
|
* up-to-date at this stripe. */
|
|
|
|
R5_WantReplace, /* We need to update the replacement, we have read
|
|
|
|
* data in, and now is a good time to write it out.
|
|
|
|
*/
|
MD: raid5 trim support
Discard for raid4/5/6 has limitations. If the discard request size is
small, we discard on one disk, but we still need to calculate parity and
write the parity disk. To calculate parity correctly, zero_after_discard
must be guaranteed. Even if it is, we discard on one disk but write to
other disks, which makes the parity disks wear out fast. This doesn't
make sense. So an efficient discard for raid4/5/6 should discard all data
disks and parity disks, which requires the write pattern to be
(A, A+chunk_size, A+chunk_size*2...). If A's size is smaller than
chunk_size, such a pattern is almost impossible in practice. So in this
patch, I only handle the case where A's size equals chunk_size. That is,
a discard request should be aligned to the stripe size and its size
should be a multiple of the stripe size.
Since we can only handle requests with specific alignment and size (or
the part of the request fitting stripes), we can't guarantee
zero_after_discard even if zero_after_discard is true in the low level
drives.
The block layer doesn't send down correctly aligned requests even when
the correct discard alignment is set, so I must filter them out.
For raid4/5/6 parity calculation, if data is 0, parity is 0. So if
zero_after_discard is true for all disks, data is consistent after
discard. Otherwise, data might be lost. Let's consider a scenario:
discard a stripe, write data to one disk and write the parity disk. The
stripe could still be inconsistent at that point, depending on whether
data from the other data disks or from the parity disk is used to
calculate the new parity. If a disk is broken, we can't restore it. So in
this patch, we only enable discard support if all disks have
zero_after_discard.
If discard fails on one disk, we face a similar inconsistency issue to
the one above. The patch makes discard follow the same path as a normal
write request. If discard fails, a resync will be scheduled to make
the data consistent. The extra writes aren't good, but data
consistency is important.
If a subsequent read/write request hits the raid5 cache of a discarded
stripe, the discarded dev page should be zero filled, so the data is
consistent. This patch will always zero the dev page for a discarded
stripe. This isn't optimal because a discard request doesn't need such a
payload. The next patch will avoid it.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 02:49:05 +00:00
|
|
|
R5_Discard, /* Discard the stripe */
|
|
|
|
R5_SkipCopy, /* Don't copy data from bio to stripe cache */
|
md/r5cache: State machine for raid5-cache write back mode
This patch adds state machine for raid5-cache. With log device, the
raid456 array could operate in two different modes (r5c_journal_mode):
- write-back (R5C_MODE_WRITE_BACK)
- write-through (R5C_MODE_WRITE_THROUGH)
Existing code of raid5-cache only has write-through mode. For write-back
cache, it is necessary to extend the state machine.
With write-back cache, every stripe could operate in two different
phases:
- caching
- writing-out
In caching phase, the stripe handles writes as:
- write to journal
- return IO
In writing-out phase, the stripe behaves as a stripe in write-through
mode (R5C_MODE_WRITE_THROUGH).
STRIPE_R5C_CACHING is added to sh->state to differentiate caching and
writing-out phase.
Please note: this is a "no-op" patch for raid5-cache write-through
mode.
The following detailed explanation is copied from the raid5-cache.c:
/*
* raid5 cache state machine
*
* With the RAID cache, each stripe works in two phases:
* - caching phase
* - writing-out phase
*
* These two phases are controlled by bit STRIPE_R5C_CACHING:
* if STRIPE_R5C_CACHING == 0, the stripe is in writing-out phase
* if STRIPE_R5C_CACHING == 1, the stripe is in caching phase
*
* When there is no journal, or the journal is in write-through mode,
* the stripe is always in writing-out phase.
*
* For write-back journal, the stripe is sent to caching phase on write
* (r5c_handle_stripe_dirtying). r5c_make_stripe_write_out() kicks off
* the write-out phase by clearing STRIPE_R5C_CACHING.
*
* Stripes in caching phase do not write the raid disks. Instead, all
* writes are committed from the log device. Therefore, a stripe in
* caching phase handles writes as:
* - write to log device
* - return IO
*
* Stripes in writing-out phase handle writes as:
* - calculate parity
* - write pending data and parity to journal
* - write data and parity to raid disks
* - return IO for pending writes
*/
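The phase switch described in the comment above can be sketched as a single state bit. This is an illustration under stated assumptions: the bit number, `struct stripe_sketch`, `enum journal_mode_sketch` and the helper names are hypothetical, and the real code works on sh->state with the kernel's bit operations.

```c
#include <stdbool.h>

/* Bit number chosen for illustration only. */
#define STRIPE_R5C_CACHING_BIT 0

/* Hypothetical journal modes mirroring the three cases in the comment. */
enum journal_mode_sketch { NO_JOURNAL, WRITE_THROUGH, WRITE_BACK };

struct stripe_sketch {
	unsigned long state;
};

/* Bit set: caching phase. Bit clear: writing-out phase. */
bool in_caching_phase(const struct stripe_sketch *sh)
{
	return (sh->state >> STRIPE_R5C_CACHING_BIT) & 1UL;
}

/*
 * A write sends the stripe to the caching phase only with a write-back
 * journal; otherwise it stays in the writing-out phase.
 */
void stripe_on_write(struct stripe_sketch *sh, enum journal_mode_sketch mode)
{
	if (mode == WRITE_BACK)
		sh->state |= 1UL << STRIPE_R5C_CACHING_BIT;
}

/* Kick off the write-out phase by clearing the caching bit. */
void stripe_make_write_out(struct stripe_sketch *sh)
{
	sh->state &= ~(1UL << STRIPE_R5C_CACHING_BIT);
}
```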
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-17 23:24:38 +00:00
|
|
|
R5_InJournal, /* data being written is in the journal device.
|
|
|
|
* if R5_InJournal is set for parity pd_idx, all the
|
|
|
|
* data and parity being written are in the journal
|
|
|
|
* device
|
|
|
|
*/
|
2017-01-13 01:22:41 +00:00
|
|
|
R5_OrigPageUPTDODATE, /* with write back cache, we read old data into
|
|
|
|
* dev->orig_page for prexor. When this flag is
|
|
|
|
* set, orig_page contains latest data in the
|
|
|
|
* raid disk.
|
|
|
|
*/
|
2011-12-22 23:17:52 +00:00
|
|
|
};
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Stripe state
|
|
|
|
*/
|
2011-07-26 01:19:49 +00:00
|
|
|
enum {
|
2011-07-26 01:34:20 +00:00
|
|
|
STRIPE_ACTIVE,
|
2011-07-26 01:19:49 +00:00
|
|
|
STRIPE_HANDLE,
|
|
|
|
STRIPE_SYNC_REQUESTED,
|
|
|
|
STRIPE_SYNCING,
|
|
|
|
STRIPE_INSYNC,
|
2013-07-22 02:57:21 +00:00
|
|
|
STRIPE_REPLACED,
|
2011-07-26 01:19:49 +00:00
|
|
|
STRIPE_PREREAD_ACTIVE,
|
|
|
|
STRIPE_DELAYED,
|
|
|
|
STRIPE_DEGRADED,
|
|
|
|
STRIPE_BIT_DELAY,
|
|
|
|
STRIPE_EXPANDING,
|
|
|
|
STRIPE_EXPAND_SOURCE,
|
|
|
|
STRIPE_EXPAND_READY,
|
|
|
|
STRIPE_IO_STARTED, /* do not count towards 'bypass_count' */
|
|
|
|
STRIPE_FULL_WRITE, /* all blocks are set to be overwritten */
|
|
|
|
STRIPE_BIOFILL_RUN,
|
|
|
|
STRIPE_COMPUTE_RUN,
|
|
|
|
STRIPE_OPS_REQ_PENDING,
|
2012-08-01 22:33:00 +00:00
|
|
|
STRIPE_ON_UNPLUG_LIST,
|
2013-03-12 01:18:06 +00:00
|
|
|
STRIPE_DISCARD,
|
2013-08-27 09:50:39 +00:00
|
|
|
STRIPE_ON_RELEASE_LIST,
|
2014-12-15 01:57:03 +00:00
|
|
|
STRIPE_BATCH_READY,
|
2014-12-15 01:57:03 +00:00
|
|
|
STRIPE_BATCH_ERR,
|
2015-05-26 22:43:45 +00:00
|
|
|
STRIPE_BITMAP_PENDING, /* Being added to bitmap, don't add
|
|
|
|
* to batch yet.
|
|
|
|
*/
|
md/r5cache: State machine for raid5-cache write back mode
This patch adds state machine for raid5-cache. With log device, the
raid456 array could operate in two different modes (r5c_journal_mode):
- write-back (R5C_MODE_WRITE_BACK)
- write-through (R5C_MODE_WRITE_THROUGH)
Existing code of raid5-cache only has write-through mode. For write-back
cache, it is necessary to extend the state machine.
With write-back cache, every stripe could operate in two different
phases:
- caching
- writing-out
In caching phase, the stripe handles writes as:
- write to journal
- return IO
In writing-out phase, the stripe behaves like a stripe in write-through
mode (R5C_MODE_WRITE_THROUGH).
STRIPE_R5C_CACHING is added to sh->state to differentiate caching and
writing-out phase.
Please note: this is a "no-op" patch for raid5-cache write-through
mode.
The following detailed explanation is copied from the raid5-cache.c:
/*
* raid5 cache state machine
*
* With the RAID cache, each stripe works in two phases:
* - caching phase
* - writing-out phase
*
* These two phases are controlled by bit STRIPE_R5C_CACHING:
* if STRIPE_R5C_CACHING == 0, the stripe is in writing-out phase
* if STRIPE_R5C_CACHING == 1, the stripe is in caching phase
*
* When there is no journal, or the journal is in write-through mode,
* the stripe is always in writing-out phase.
*
* For write-back journal, the stripe is sent to caching phase on write
* (r5c_handle_stripe_dirtying). r5c_make_stripe_write_out() kicks off
* the write-out phase by clearing STRIPE_R5C_CACHING.
*
* Stripes in caching phase do not write the raid disks. Instead, all
* writes are committed from the log device. Therefore, a stripe in
* caching phase handles writes as:
* - write to log device
* - return IO
*
* Stripes in writing-out phase handle writes as:
* - calculate parity
* - write pending data and parity to journal
* - write data and parity to raid disks
* - return IO for pending writes
*/
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-17 23:24:38 +00:00
	STRIPE_LOG_TRAPPED,	/* trapped into log (see raid5-cache.c)
				 * this bit is used in two scenarios:
				 *
				 * 1. write-out phase
				 *  set in first entry of r5l_write_stripe
				 *  clear in second entry of r5l_write_stripe
				 *  used to bypass logic in handle_stripe
				 *
				 * 2. caching phase
				 *  set in r5c_try_caching_write()
				 *  clear when journal write is done
				 *  used to initiate r5c_cache_data()
				 *  also used to bypass logic in handle_stripe
				 */
	STRIPE_R5C_CACHING,	/* the stripe is in caching phase
				 * see more detail in the raid5-cache.c
				 */
md/r5cache: caching phase of r5cache
As described in previous patch, write back cache operates in two
phases: caching and writing-out. The caching phase works as:
1. write data to journal
(r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
(r5c_handle_data_cached, r5c_return_dev_pending_writes).
Then the writing-out phase is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calculate parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks
This patch implements caching phase. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.
Writing-out phase of the cache is implemented in the next patch.
With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multiple IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.
This patch adds 2 flags to stripe_head.state:
- STRIPE_R5C_PARTIAL_STRIPE,
- STRIPE_R5C_FULL_STRIPE,
Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".
For RMW, the code allocates an extra page for each data block
being updated. This is stored in r5dev->orig_page and the old data
is read into it. Then the prexor calculation subtracts ->orig_page
from the parity block, and the reconstruct calculation adds the
->page data back into the parity block.
r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.
There are some known limitations of the cache implementation:
1. Write cache only covers full page writes (R5_OVERWRITE). Writes
of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at anytime. Later
writes for the same stripe have to wait. This can be improved by
moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
the log, so recovery code has to replay more than necessary data
(sometimes all the log from last_checkpoint). This reduces
availability of the array.
This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-17 23:24:39 +00:00
	STRIPE_R5C_PARTIAL_STRIPE,	/* in r5c cache (to-be/being handled or
					 * in conf->r5c_partial_stripe_list)
					 */
	STRIPE_R5C_FULL_STRIPE,	/* in r5c cache (to-be/being handled or
					 * in conf->r5c_full_stripe_list)
					 */
2016-11-19 00:46:50 +00:00
	STRIPE_R5C_PREFLUSH,	/* need to flush journal device */
2011-07-26 01:19:49 +00:00
};
2009-10-16 05:25:22 +00:00

2015-05-21 02:40:26 +00:00
#define STRIPE_EXPAND_SYNC_FLAGS \
2014-12-15 01:57:04 +00:00
	((1 << STRIPE_EXPAND_SOURCE) |\
	(1 << STRIPE_EXPAND_READY) |\
	(1 << STRIPE_EXPANDING) |\
(1 << STRIPE_SYNC_REQUESTED))
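
The stripe state enumerators are bit positions rather than bit masks, which is why each one is shifted into place before being OR-ed into STRIPE_EXPAND_SYNC_FLAGS. The following user-space sketch illustrates that pattern only. The enum is abbreviated, every `DEMO_`/`demo_` name is hypothetical, and the bit numbers do not match the real enum.

```c
#include <assert.h>

/* Abbreviated, hypothetical stand-in for the stripe state enum:
 * each enumerator is a bit position, not a bit mask. */
enum {
	DEMO_STRIPE_ACTIVE,
	DEMO_STRIPE_HANDLE,
	DEMO_STRIPE_SYNC_REQUESTED,
	DEMO_STRIPE_EXPANDING,
	DEMO_STRIPE_EXPAND_SOURCE,
	DEMO_STRIPE_EXPAND_READY,
};

/* Same shape as STRIPE_EXPAND_SYNC_FLAGS, built from the demo enum. */
#define DEMO_EXPAND_SYNC_FLAGS \
	((1UL << DEMO_STRIPE_EXPAND_SOURCE) |\
	(1UL << DEMO_STRIPE_EXPAND_READY) |\
	(1UL << DEMO_STRIPE_EXPANDING) |\
	(1UL << DEMO_STRIPE_SYNC_REQUESTED))

/* Invented helper: keep only the expand/sync bits of a state word. */
static unsigned long demo_keep_flags(unsigned long state)
{
	return state & DEMO_EXPAND_SYNC_FLAGS;
}
```

A mask of this kind lets a caller test or copy a whole group of state bits with one AND instead of several single-bit tests.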
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type)
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-02 20:52:30 +00:00
/*
2008-06-27 22:31:57 +00:00
 * Operation request flags
2007-01-02 20:52:30 +00:00
 */
2011-12-22 23:17:52 +00:00
enum {
	STRIPE_OP_BIOFILL,
	STRIPE_OP_COMPUTE_BLK,
	STRIPE_OP_PREXOR,
	STRIPE_OP_BIODRAIN,
	STRIPE_OP_RECONSTRUCT,
	STRIPE_OP_CHECK,
};
md/raid5: activate raid6 rmw feature
Glue it all together. The raid6 rmw path should work the same as the
already existing raid5 logic. So emulate the prexor handling/flags
and split functions as needed.
1) Enable xor_syndrome() in the async layer.
2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome
at the start of a rmw run as we did it before for the single parity.
3) Take care of rmw run in ops_run_reconstruct6(). Again process only
the changed pages to get syndrome back into sync.
4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw
run. The lower layers will calculate start & end pages from that and
call the xor_syndrome() correspondingly.
5) Adapt the several places where we ignored Q handling up to now.
Performance numbers for a single E5630 system with a mix of 10 7200k
desktop/server disks. 300 seconds random write with 8 threads onto a
3,2TB (10*400GB) RAID6 64K chunk without spare (group_thread_cnt=4)
bsize   rmw_level=1   rmw_level=0   rmw_level=1   rmw_level=0
        skip_copy=1   skip_copy=1   skip_copy=0   skip_copy=0
4K         115 KB/s      141 KB/s      165 KB/s      140 KB/s
8K         225 KB/s      275 KB/s      324 KB/s      274 KB/s
16K        434 KB/s      536 KB/s      640 KB/s      534 KB/s
32K        751 KB/s    1,051 KB/s    1,234 KB/s    1,045 KB/s
64K      1,339 KB/s    1,958 KB/s    2,282 KB/s    1,962 KB/s
128K     2,673 KB/s    3,862 KB/s    4,113 KB/s    3,898 KB/s
256K     7,685 KB/s    7,539 KB/s    7,557 KB/s    7,638 KB/s
512K    19,556 KB/s   19,558 KB/s   19,652 KB/s   19,688 KB/s
Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-12-15 01:57:05 +00:00

/*
 * RAID parity calculation preferences
 */
enum {
	PARITY_DISABLE_RMW = 0,
	PARITY_ENABLE_RMW,
2014-12-15 01:57:05 +00:00
	PARITY_PREFER_RMW,
2014-12-15 01:57:05 +00:00
};

/*
 * Pages requested from set_syndrome_sources()
 */
enum {
	SYNDROME_SRC_ALL,
	SYNDROME_SRC_WANT_DRAIN,
	SYNDROME_SRC_WRITTEN,
};
2005-04-16 22:20:36 +00:00
/*
 * Plugging:
 *
 * To improve write throughput, we need to delay the handling of some
 * stripes until there has been a chance that several write requests
 * for the one stripe have all been collected.
 * In particular, any write request that would require pre-reading
 * is put on a "delayed" queue until there are no stripes currently
 * in a pre-read phase.  Further, if the "delayed" queue is empty when
 * a stripe is put on it then we "plug" the queue and do not process it
 * until an unplug call is made. (the unplug_io_fn() is called).
 *
 * When preread is initiated on a stripe, we set PREREAD_ACTIVE and add
 * it to the count of prereading stripes.
 * When write is initiated, or the stripe refcnt == 0 (just in case) we
 * clear the PREREAD_ACTIVE flag and decrement the count
2006-10-03 08:15:45 +00:00
 * Whenever the 'handle' queue is empty and the device is not plugged, we
 * move any stripes from delayed to handle and clear the DELAYED flag and set
 * PREREAD_ACTIVE.
2005-04-16 22:20:36 +00:00
 * In stripe_handle, if we find pre-reading is necessary, we do it if
 * PREREAD_ACTIVE is set, else we set DELAYED which will send it to the delayed queue.
2011-07-26 01:34:20 +00:00
 * HANDLE gets cleared if stripe_handle leaves nothing locked.
2005-04-16 22:20:36 +00:00
*/
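
The pre-read rule in the plugging comment can be modelled in a few lines. This is a user-space sketch under stated assumptions, not the driver's code: the flag positions, `struct demo_stripe`, and both helper functions are hypothetical, and the queues themselves are left out.

```c
#include <assert.h>

/* Hypothetical bit positions standing in for the real stripe flags. */
enum { DEMO_PREREAD_ACTIVE, DEMO_DELAYED };

struct demo_stripe {
	unsigned long state;
};

/* Pre-reading is needed: go ahead only if PREREAD_ACTIVE is already set,
 * otherwise mark the stripe DELAYED so it waits on the delayed queue.
 * Returns 1 if the pre-read may start now, 0 if the stripe was deferred. */
static int demo_try_preread(struct demo_stripe *sh)
{
	if (sh->state & (1UL << DEMO_PREREAD_ACTIVE))
		return 1;
	sh->state |= 1UL << DEMO_DELAYED;
	return 0;
}

/* Models moving a stripe from the delayed queue to the handle queue:
 * clear DELAYED and set PREREAD_ACTIVE. */
static void demo_activate_delayed(struct demo_stripe *sh)
{
	sh->state &= ~(1UL << DEMO_DELAYED);
	sh->state |= 1UL << DEMO_PREREAD_ACTIVE;
}
```

In this model a stripe is deferred on its first attempt, activated once the handle queue drains, and allowed to pre-read on the second attempt.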
2009-03-31 03:27:03 +00:00

2005-04-16 22:20:36 +00:00
struct disk_info {
2011-12-22 23:17:52 +00:00
	struct md_rdev *rdev, *replacement;
2016-11-24 06:50:39 +00:00
	struct page *extra_page; /* extra page to use in prexor */
2005-04-16 22:20:36 +00:00
};

2016-11-17 23:24:37 +00:00
/*
 * Stripe cache
 */

#define NR_STRIPES		256
#define STRIPE_SIZE		PAGE_SIZE
#define STRIPE_SHIFT		(PAGE_SHIFT - 9)
#define STRIPE_SECTORS		(STRIPE_SIZE>>9)
#define	IO_THRESHOLD		1
#define BYPASS_THRESHOLD	1
#define NR_HASH			(PAGE_SIZE / sizeof(struct hlist_head))
#define HASH_MASK		(NR_HASH - 1)
#define MAX_STRIPE_BATCH	8

/* bio's attached to a stripe+device for I/O are linked together in bi_sector
 * order without overlap.  There may be several bio's per stripe+device, and
 * a bio could span several devices.
 * When walking this list for a particular stripe+device, we must never proceed
 * beyond a bio that extends past this device, as the next bio might no longer
 * be valid.
 * This function is used to determine the 'next' bio in the list, given the
 * sector of the current stripe+device
 */
static inline struct bio *r5_next_bio(struct bio *bio, sector_t sector)
{
	int sectors = bio_sectors(bio);

	if (bio->bi_iter.bi_sector + sectors < sector + STRIPE_SECTORS)
		return bio->bi_next;
	else
		return NULL;
}
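
The boundary test in r5_next_bio() is easier to see with concrete numbers. The sketch below is a user-space model, not kernel code: `struct demo_bio`, `demo_sector_t`, and `DEMO_STRIPE_SECTORS` are hypothetical stand-ins, and the value 8 assumes 4 KiB pages (4096 >> 9).

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long long demo_sector_t;

#define DEMO_STRIPE_SECTORS 8	/* assumption: 4 KiB pages, 512-byte sectors */

/* Hypothetical, stripped-down bio: just a sector range and a link. */
struct demo_bio {
	demo_sector_t start;	/* first sector covered by this bio */
	unsigned int sectors;	/* length in sectors */
	struct demo_bio *next;
};

/* Same comparison as r5_next_bio(): follow the link only when this bio
 * ends strictly before the end of the stripe unit starting at 'sector'. */
static struct demo_bio *demo_next_bio(struct demo_bio *bio,
				      demo_sector_t sector)
{
	if (bio->start + bio->sectors < sector + DEMO_STRIPE_SECTORS)
		return bio->next;
	return NULL;
}
```

With a stripe unit covering sectors 0 to 7, a bio covering sectors 0 to 3 yields its successor, while a bio that reaches sector 8 or beyond ends the walk.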

/*
 * We maintain a biased count of active stripes in the bottom 16 bits of
 * bi_phys_segments, and a count of processed stripes in the upper 16 bits
 */
static inline int raid5_bi_processed_stripes(struct bio *bio)
{
	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;

	return (atomic_read(segments) >> 16) & 0xffff;
}

static inline int raid5_dec_bi_active_stripes(struct bio *bio)
{
	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;

	return atomic_sub_return(1, segments) & 0xffff;
}

static inline void raid5_inc_bi_active_stripes(struct bio *bio)
{
	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;

	atomic_inc(segments);
}

static inline void raid5_set_bi_processed_stripes(struct bio *bio,
	unsigned int cnt)
{
	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
	int old, new;

	do {
		old = atomic_read(segments);
		new = (old & 0xffff) | (cnt << 16);
	} while (atomic_cmpxchg(segments, old, new) != old);
}

static inline void raid5_set_bi_stripes(struct bio *bio, unsigned int cnt)
{
	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;

	atomic_set(segments, cnt);
}
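
The helpers above pack two 16-bit counters into one 32-bit atomic word. The same idea can be tried in user space with C11 `<stdatomic.h>`. This is a sketch only: the `demo_` functions are hypothetical and use C11 atomics in place of the kernel's `atomic_t` API.

```c
#include <assert.h>
#include <stdatomic.h>

/* Low 16 bits: active stripe count. High 16 bits: processed stripe count. */

static void demo_set_stripes(atomic_uint *seg, unsigned int cnt)
{
	atomic_store(seg, cnt);
}

static void demo_inc_active(atomic_uint *seg)
{
	atomic_fetch_add(seg, 1);
}

static unsigned int demo_dec_active(atomic_uint *seg)
{
	/* atomic_fetch_sub() returns the old value, so subtract once more
	 * to get the new one, then mask off the processed count. */
	return (atomic_fetch_sub(seg, 1) - 1) & 0xffff;
}

static unsigned int demo_processed(atomic_uint *seg)
{
	return (atomic_load(seg) >> 16) & 0xffff;
}

static void demo_set_processed(atomic_uint *seg, unsigned int cnt)
{
	unsigned int old = atomic_load(seg);
	unsigned int val;

	/* Retry until the word is swapped without interference; on failure
	 * the compare-exchange reloads 'old' with the current value. */
	do {
		val = (old & 0xffff) | (cnt << 16);
	} while (!atomic_compare_exchange_weak(seg, &old, val));
}
```

The compare-and-swap loop is needed only where the upper half is rewritten while the lower half must be preserved. A plain add or subtract of 1 touches the lower half alone, provided the active count never underflows into the upper bits.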
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by the stripe's lock_hash. Note that even a stripe with no sector assigned has a
lock_hash assigned. A stripe's inactive list is protected by a hash lock, which
is determined by its lock_hash too. The lock_hash is derived from the current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 04:16:17 +00:00
/* NOTE NR_STRIPE_HASH_LOCKS must remain below 64.
 * This is because we sometimes take all the spinlocks
 * and creating that much locking depth can cause
 * problems.
 */
#define NR_STRIPE_HASH_LOCKS 8
#define STRIPE_HASH_LOCKS_MASK (NR_STRIPE_HASH_LOCKS - 1)
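
Because the lock count is a power of two, a lock index can be obtained with a single AND against the mask. The sketch below shows one plausible way to derive such an index from a sector number. It is an illustration only: the `DEMO_`/`demo_` names are hypothetical, 4 KiB pages are assumed, and the driver's actual hash function may differ.

```c
#include <assert.h>

typedef unsigned long long demo_sector_t;

#define DEMO_PAGE_SHIFT		12	/* assumption: 4 KiB pages */
#define DEMO_STRIPE_SHIFT	(DEMO_PAGE_SHIFT - 9)
#define DEMO_NR_HASH_LOCKS	8	/* must be a power of two for the mask */
#define DEMO_HASH_LOCKS_MASK	(DEMO_NR_HASH_LOCKS - 1)

/* Map a sector to one of the hash locks: drop the within-stripe bits,
 * then keep the low bits selected by the mask. */
static int demo_lock_hash(demo_sector_t sector)
{
	return (int)((sector >> DEMO_STRIPE_SHIFT) & DEMO_HASH_LOCKS_MASK);
}
```

Under these assumptions, consecutive stripe units rotate through the eight locks, so neighbouring stripes tend not to contend on the same lock.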
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles requests (especially writes) in stripe units. A stripe is page size
aligned/long and spans all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. A stripe corresponding to a request submitted from one cpu is better
handled by a thread on the local cpu or local node. The local cpu is preferred but
sometimes could be a bottleneck, for example when parity calculation is too heavy.
Running on the local node has wide adaptability.
b. configurability. Different setups of a raid5 array might need different
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavored.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 06:30:16 +00:00
struct r5worker {
	struct work_struct work;
	struct r5worker_group *group;
2013-11-14 04:16:17 +00:00
	struct list_head temp_inactive_list[NR_STRIPE_HASH_LOCKS];
2013-08-29 07:40:32 +00:00
	bool working;
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
workqueue usage is to run one unit in each work_struct. In the raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need a lock because of reference accounting
and because the stripe isn't in any list, queuing a work_struct for each stripe
will make the workqueue lock very heavily contended.
b. blk_start_plug()/blk_finish_plug() should surround stripe handling, as we
might dispatch requests. If each work_struct only handles one stripe, such a
block plug is meaningless.
This implementation can't do very fine grained configuration. But numa
binding is the most popular usage model and should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-threading might
decrease the size of requests dispatched down to the low level layer. The impact
depends on thread number, raid configuration and workload. So multi-thread raid5
might not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
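The batching argument in the message above (several stripes per work item, all inside one block plug) can be pictured with a small userspace sketch. This is only an illustration under stated assumptions, not the kernel code: struct sketch_group, sketch_worker_run() and the batch limit SKETCH_MAX_BATCH are hypothetical names, and the plug calls are represented by a counter.

```c
#include <assert.h>

/* Hypothetical batch limit; the commit message does not state one. */
#define SKETCH_MAX_BATCH 8

/* Toy stand-in for one per-node worker group. */
struct sketch_group {
	int pending;	/* stripes waiting on the group list */
	int handled;	/* stripes processed so far */
	int plugs;	/* number of plug/unplug pairs issued */
};

/*
 * One work item run: plug once, handle up to SKETCH_MAX_BATCH stripes,
 * unplug once. Returns how many stripes this run handled.
 */
static int sketch_worker_run(struct sketch_group *g)
{
	int n = 0;

	assert(g->pending >= 0);
	if (g->pending == 0)
		return 0;

	g->plugs++;			/* stands in for blk_start_plug() */
	while (g->pending > 0 && n < SKETCH_MAX_BATCH) {
		g->pending--;		/* take one stripe off the list */
		g->handled++;		/* stands in for handling it */
		n++;
	}
	/* blk_finish_plug() would go here, flushing the batched requests */
	return n;
}
```

With 20 pending stripes, three runs are enough to drain the list, so only three plug/unplug pairs are issued instead of twenty.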
2013-08-28 06:30:16 +00:00
|
|
|
};
|
|
|
|
|
|
|
|
struct r5worker_group {
|
|
|
|
struct list_head handle_list;
|
|
|
|
struct r5conf *conf;
|
|
|
|
struct r5worker *workers;
|
2013-08-29 07:40:32 +00:00
|
|
|
int stripes_cnt;
|
2013-08-28 06:30:16 +00:00
|
|
|
};
|
|
|
|
|
md/r5cache: write-out phase and reclaim support
There are two limited resources, stripe cache and journal disk space.
For better performance, we prioritize reclaim of full stripe writes.
To free up more journal space, we free earliest data on the journal.
In current implementation, reclaim happens when:
1. Periodically (every R5C_RECLAIM_WAKEUP_INTERVAL, 30 seconds) reclaim
if there is no reclaim in the past 5 seconds.
2. when there are R5C_FULL_STRIPE_FLUSH_BATCH (256) cached full stripes,
or the cached stripes are enough for a full stripe (chunk size / 4k)
(r5c_check_cached_full_stripe)
3. when there is pressure on stripe cache (r5c_check_stripe_cache_usage)
4. when there is pressure on journal space (r5l_write_stripe, r5c_cache_data)
r5c_do_reclaim() contains new logic of reclaim.
For stripe cache:
When stripe cache pressure is high (more than 3/4 of stripes are cached,
or there are empty inactive lists), flush all full stripes. If fewer
than R5C_RECLAIM_STRIPE_GROUP (NR_STRIPE_HASH_LOCKS * 2) full stripes
are flushed, flush some partial stripes. When stripe cache pressure
is moderate (1/2 to 3/4 of stripes are cached), flush all full stripes.
For log space:
To avoid deadlock due to log space, we need to reserve enough space
to flush cached data. The size of required log space depends on total
number of cached stripes (stripe_in_journal_count). In current
implementation, the writing-out phase automatically includes pending
data writes with parity writes (similar to write through case).
Therefore, we need up to (conf->raid_disks + 1) pages for each cached
stripe (1 page for meta data, raid_disks pages for all data and
parity). r5c_log_required_to_flush_cache() calculates log space
required to flush cache. In the following, we refer to the space
calculated by r5c_log_required_to_flush_cache() as
reclaim_required_space.
Two flags are added to r5conf->cache_state: R5C_LOG_TIGHT and
R5C_LOG_CRITICAL. R5C_LOG_TIGHT is set when free space on the log
device is less than 3x of reclaim_required_space. R5C_LOG_CRITICAL
is set when free space on the log device is less than 2x of
reclaim_required_space.
r5c_cache keeps all data in cache (not fully committed to RAID) in
a list (stripe_in_journal_list). These stripes are in the order of their
first appearance on the journal. So the log tail (last_checkpoint)
should point to the journal_start of the first item in the list.
When R5C_LOG_TIGHT is set, r5l_reclaim_thread starts flushing out
stripes at the head of stripe_in_journal_list. When R5C_LOG_CRITICAL is
set, the state machine only writes data that are already in the
log device (in stripe_in_journal_list).
This patch includes a fix to improve performance by
Shaohua Li <shli@fb.com>.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
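The log space rules in the message above reduce to a little arithmetic: up to (raid_disks + 1) pages per cached stripe, and two thresholds at 3x and 2x of that total. A minimal sketch follows, assuming 4 KiB pages counted as 8 sectors of 512 bytes; the sketch_* names and the enum are hypothetical and are not the kernel's identifiers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for illustration: one 4 KiB page is 8 sectors of 512 bytes. */
#define SKETCH_PAGE_SECTORS 8u

enum sketch_log_state { SKETCH_LOG_OK, SKETCH_LOG_TIGHT, SKETCH_LOG_CRITICAL };

/*
 * Space needed to flush the cache, in sectors: each cached stripe needs up
 * to (raid_disks + 1) pages, i.e. one metadata page plus raid_disks pages
 * for all data and parity.
 */
static uint64_t sketch_required_to_flush(unsigned int raid_disks,
					 unsigned int cached_stripes)
{
	assert(raid_disks > 0);
	return (uint64_t)SKETCH_PAGE_SECTORS * (raid_disks + 1) * cached_stripes;
}

/* Classify free log space against the 3x (tight) and 2x (critical) marks. */
static enum sketch_log_state sketch_classify(uint64_t free_sectors,
					     uint64_t required)
{
	if (free_sectors < 2 * required)
		return SKETCH_LOG_CRITICAL;
	if (free_sectors < 3 * required)
		return SKETCH_LOG_TIGHT;
	return SKETCH_LOG_OK;
}
```

For example, with 5 raid disks and 100 cached stripes the required space is 8 * 6 * 100 = 4800 sectors, so the log would count as tight below 14400 free sectors and critical below 9600.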
2016-11-17 23:24:40 +00:00
|
|
|
enum r5_cache_state {
|
|
|
|
R5_INACTIVE_BLOCKED, /* release of inactive stripes blocked,
|
|
|
|
* waiting for 25% to be free
|
|
|
|
*/
|
|
|
|
R5_ALLOC_MORE, /* It might help to allocate another
|
|
|
|
* stripe.
|
|
|
|
*/
|
|
|
|
R5_DID_ALLOC, /* A stripe was allocated, don't allocate
|
|
|
|
* more until at least one has been
|
|
|
|
* released. This avoids flooding
|
|
|
|
* the cache.
|
|
|
|
*/
|
|
|
|
R5C_LOG_TIGHT, /* log device space tight, need to
|
|
|
|
* prioritize stripes at last_checkpoint
|
|
|
|
*/
|
|
|
|
R5C_LOG_CRITICAL, /* log device is running out of space,
|
|
|
|
* only process stripes that are already
|
|
|
|
* occupying the log
|
|
|
|
*/
|
2016-11-24 06:50:39 +00:00
|
|
|
R5C_EXTRA_PAGE_IN_USE, /* a stripe is using disk_info.extra_page
|
|
|
|
* for prexor
|
|
|
|
*/
|
2016-11-17 23:24:40 +00:00
|
|
|
};
|
|
|
|
|
2011-10-11 05:49:52 +00:00
|
|
|
struct r5conf {
|
2006-01-06 08:20:33 +00:00
|
|
|
struct hlist_head *stripe_hashtbl;
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. In the first, the stripe isn't found and a new stripe is allocated; in
the second, the stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
for stripe_hashtbl and inactive_list, these fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list, which is determined by the
stripe's lock_hash. Note that even a stripe without a sector assigned has a
lock_hash assigned. A stripe's inactive list is protected by a hash lock, which
is determined by its lock_hash too. The lock_hash is derived from the current
stripe_hashtbl hash, which guarantees that any stripe_hashtbl list is assigned
to a specific lock_hash, so we can use the new hash lock to protect the
stripe_hashtbl list too. The goal of the new hash locks is that the first path
of get_active_stripe() only needs the new locks. Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields too. Since they are
changed rarely, changing them now needs to take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need to take both device_lock and a hash lock, we always take the hash
lock first. The tricky part is release_stripe and friends, which need to take
device_lock first. Neil's suggestion is that we put inactive stripes on a
temporary list and re-add them to inactive_list after device_lock is released.
In this way, we add stripes to the temporary list with device_lock held and
remove stripes from the list with the hash lock held. So we don't allow
concurrent access to the temporary list, which means we need to allocate a
temporary list for every participant of release_stripe.
One downside is that free stripes are kept in their own inactive list and can't
move between lists. By default, we have 256 stripes in total and 8 lists, so
each list will have 32 stripes. It's possible that one list has free stripes
while another has none. This should be rare because stripe allocation is
evenly distributed. And we can always allocate more stripes for the cache;
several megabytes of memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though,
because we now need to take two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contention.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
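The key property claimed above is that the lock_hash is derived from the stripe_hashtbl hash, so every hash bucket falls under exactly one hash lock. A minimal sketch of one way to get that property is below. The constants are assumptions for illustration (512 buckets, 8 locks as in the "8 lists" mentioned above, 8 sectors per stripe unit), and the sketch_* helpers are hypothetical, not the kernel's functions.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NR_HASH		512u	/* hash buckets; assumed power of two */
#define SKETCH_NR_LOCKS		8u	/* hash locks / inactive lists */
#define SKETCH_STRIPE_SHIFT	3u	/* assumed: 8 sectors per stripe unit */

/* Bucket index in the stripe hash table for a given sector. */
static unsigned int sketch_bucket(uint64_t sector)
{
	return (unsigned int)((sector >> SKETCH_STRIPE_SHIFT) &
			      (SKETCH_NR_HASH - 1));
}

/*
 * Lock index for a given sector. It masks the same shifted value with a
 * smaller power-of-two mask, so it equals bucket % SKETCH_NR_LOCKS and
 * all stripes in one bucket share one lock.
 */
static unsigned int sketch_lock_hash(uint64_t sector)
{
	return (unsigned int)((sector >> SKETCH_STRIPE_SHIFT) &
			      (SKETCH_NR_LOCKS - 1));
}

/* Even split of the stripe pool across the inactive lists. */
static unsigned int sketch_stripes_per_list(unsigned int total_stripes)
{
	assert(total_stripes % SKETCH_NR_LOCKS == 0);
	return total_stripes / SKETCH_NR_LOCKS;
}
```

Because both masks are powers of two and the lock mask is the smaller one, taking the single hash lock for a sector is enough to protect both its hash bucket and its inactive list.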
2013-11-14 04:16:17 +00:00
|
|
|
/* only protect corresponding hash list and inactive_list */
|
|
|
|
spinlock_t hash_locks[NR_STRIPE_HASH_LOCKS];
|
2011-10-11 05:47:53 +00:00
|
|
|
struct mddev *mddev;
|
2009-06-17 22:45:55 +00:00
|
|
|
int chunk_sectors;
|
md/raid5: activate raid6 rmw feature
Glue it all together. The raid6 rmw path should work the same as the
already existing raid5 logic. So emulate the prexor handling/flags
and split functions as needed.
1) Enable xor_syndrome() in the async layer.
2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome
at the start of a rmw run as we did it before for the single parity.
3) Take care of rmw run in ops_run_reconstruct6(). Again process only
the changed pages to get syndrome back into sync.
4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw
run. The lower layers will calculate start & end pages from that and
call the xor_syndrome() correspondingly.
5) Adapt the several places where we ignored Q handling up to now.
Performance numbers for a single E5630 system with a mix of 10 7200k
desktop/server disks. 300 seconds random write with 8 threads onto a
3.2TB (10*400GB) RAID6 64K chunk without spare (group_thread_cnt=4)
bsize    rmw_level=1    rmw_level=0    rmw_level=1    rmw_level=0
         skip_copy=1    skip_copy=1    skip_copy=0    skip_copy=0
4K          115 KB/s       141 KB/s       165 KB/s       140 KB/s
8K          225 KB/s       275 KB/s       324 KB/s       274 KB/s
16K         434 KB/s       536 KB/s       640 KB/s       534 KB/s
32K         751 KB/s     1,051 KB/s     1,234 KB/s     1,045 KB/s
64K       1,339 KB/s     1,958 KB/s     2,282 KB/s     1,962 KB/s
128K      2,673 KB/s     3,862 KB/s     4,113 KB/s     3,898 KB/s
256K      7,685 KB/s     7,539 KB/s     7,557 KB/s     7,638 KB/s
512K     19,556 KB/s    19,558 KB/s    19,652 KB/s    19,688 KB/s
Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-12-15 01:57:05 +00:00
|
|
|
int level, algorithm, rmw_level;
|
2006-06-26 07:27:38 +00:00
|
|
|
int max_degraded;
|
2006-10-03 08:15:47 +00:00
|
|
|
int raid_disks;
|
2005-04-16 22:20:36 +00:00
|
|
|
int max_nr_stripes;
|
2015-02-26 01:47:56 +00:00
|
|
|
int min_nr_stripes;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2009-03-31 04:16:46 +00:00
|
|
|
/* reshape_progress is the leading edge of a 'reshape'
|
|
|
|
* It has value MaxSector when no reshape is happening
|
|
|
|
* If delta_disks < 0, it is the last sector we started work on,
|
|
|
|
* else it is the next sector to work on.
|
|
|
|
*/
|
|
|
|
sector_t reshape_progress;
|
|
|
|
/* reshape_safe is the trailing edge of a reshape. We know that
|
|
|
|
* before (or after) this address, all reshape has completed.
|
|
|
|
*/
|
|
|
|
sector_t reshape_safe;
|
2006-03-27 09:18:08 +00:00
|
|
|
int previous_raid_disks;
|
2009-06-17 22:45:55 +00:00
|
|
|
int prev_chunk_sectors;
|
|
|
|
int prev_algo;
|
2009-03-31 04:19:03 +00:00
|
|
|
short generation; /* increments with every reshape */
|
2013-08-27 05:52:13 +00:00
|
|
|
seqcount_t gen_lock; /* lock against generation changes */
|
2009-03-31 04:28:40 +00:00
|
|
|
unsigned long reshape_checkpoint; /* Time we last updated
|
|
|
|
* metadata */
|
2012-05-20 23:27:01 +00:00
|
|
|
long long min_offset_diff; /* minimum difference between
|
|
|
|
* data_offset and
|
|
|
|
* new_data_offset across all
|
|
|
|
* devices. May be negative,
|
|
|
|
* but is closest to zero.
|
|
|
|
*/
|
2006-03-27 09:18:08 +00:00
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
struct list_head handle_list; /* stripes needing handling */
|
2008-04-28 09:15:53 +00:00
|
|
|
struct list_head hold_list; /* preread ready stripes */
|
2005-04-16 22:20:36 +00:00
|
|
|
struct list_head delayed_list; /* stripes that have plugged requests */
|
2005-09-09 23:23:54 +00:00
|
|
|
struct list_head bitmap_list; /* stripes delaying awaiting bitmap update */
|
2006-12-10 10:20:47 +00:00
|
|
|
struct bio *retry_read_aligned; /* currently retrying aligned bios */
|
|
|
|
struct bio *retry_read_aligned_list; /* aligned bios retry list */
|
2005-04-16 22:20:36 +00:00
|
|
|
atomic_t preread_active_stripes; /* stripes with scheduled io */
|
2006-12-10 10:20:47 +00:00
|
|
|
atomic_t active_aligned_reads;
|
2008-04-28 09:15:53 +00:00
|
|
|
atomic_t pending_full_writes; /* full write backlog */
|
|
|
|
int bypass_count; /* bypassed prereads */
|
|
|
|
int bypass_threshold; /* preread nice */
|
raid5: add an option to avoid copy data from bio to stripe cache
The stripe cache has two goals:
1. cache data, so that next time the data can be found in the stripe cache and
disk access can be avoided.
2. stable data. Data is copied from the bio to the stripe cache and parity is
calculated from it. Data written to disk comes from the stripe cache, so if the
upper layer changes the bio data, data written to disk isn't impacted.
In my environment, I can guarantee 2 will not happen. And BDI_CAP_STABLE_WRITES
can guarantee 2 too. As for 1, it's not common either. The block plug mechanism
will dispatch a bunch of sequential small requests together. And since I'm
using SSDs, I'm using a small chunk size. It's rare that the stripe cache is
really useful.
So I'd like to avoid the copy from bio to stripe cache, which is very helpful
for performance. In my 1M randwrite tests, avoiding the copy can increase
performance by more than 30%.
Of course, this shouldn't be enabled by default. It has been reported that
enabling BDI_CAP_STABLE_WRITES can harm some workloads, so I added an option to
control it.
Neilb:
changed BUG_ON to WARN_ON
Removed some assignments from raid5_build_block which are now not needed.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-05-21 09:57:44 +00:00
|
|
|
int skip_copy; /* Don't copy data from bio to stripe cache */
|
2008-04-28 09:15:53 +00:00
|
|
|
struct list_head *last_hold; /* detect hold_list promotions */
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2015-08-14 02:47:33 +00:00
|
|
|
/* bios to have bi_end_io called after metadata is synced */
|
|
|
|
struct bio_list return_bi;
|
|
|
|
|
2006-03-27 09:18:11 +00:00
|
|
|
atomic_t reshape_stripes; /* stripes with pending writes for reshape */
|
2006-03-27 09:18:07 +00:00
|
|
|
/* unfortunately we need two cache names as we temporarily have
|
|
|
|
* two caches.
|
|
|
|
*/
|
|
|
|
int active_name;
|
2010-06-01 09:37:25 +00:00
|
|
|
char cache_name[2][32];
|
2015-07-06 02:49:23 +00:00
|
|
|
struct kmem_cache *slab_cache; /* for allocating stripes */
|
|
|
|
struct mutex cache_size_mutex; /* Protect changes to cache size */
|
2005-09-09 23:23:54 +00:00
|
|
|
|
|
|
|
int seq_flush, seq_write;
|
|
|
|
int quiesce;
|
|
|
|
|
|
|
|
int fullsync; /* set to 1 if a full sync is needed,
|
|
|
|
* (fresh device added).
|
|
|
|
* Cleared when a sync completes.
|
|
|
|
*/
|
2011-07-28 01:39:22 +00:00
|
|
|
int recovery_disabled;
|
2009-07-14 18:48:22 +00:00
|
|
|
/* per cpu variables */
|
|
|
|
struct raid5_percpu {
|
|
|
|
struct page *spare_page; /* Used when checking P/Q in raid6 */
|
2014-12-15 01:57:02 +00:00
|
|
|
struct flex_array *scribble; /* space for constructing buffer
|
2009-07-14 18:50:52 +00:00
|
|
|
* lists and performing address
|
|
|
|
* conversions
|
|
|
|
*/
|
2010-02-02 05:39:15 +00:00
|
|
|
} __percpu *percpu;
|
2016-02-25 01:38:28 +00:00
|
|
|
int scribble_disks;
|
|
|
|
int scribble_sectors;
|
2016-08-18 12:57:24 +00:00
|
|
|
struct hlist_node node;
|
2006-01-06 08:20:17 +00:00
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
/*
|
|
|
|
* Free stripes pool
|
|
|
|
*/
|
|
|
|
atomic_t active_stripes;
|
2013-11-14 04:16:17 +00:00
|
|
|
struct list_head inactive_list[NR_STRIPE_HASH_LOCKS];
|
md/r5cache: caching phase of r5cache
As described in previous patch, write back cache operates in two
phases: caching and writing-out. The caching phase works as:
1. write data to journal
(r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
(r5c_handle_data_cached, r5c_return_dev_pending_writes).
Then the writing-out phase is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calculate parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks
This patch implements caching phase. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.
Writing-out phase of the cache is implemented in the next patch.
With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multiple IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.
This patch adds 2 flags to stripe_head.state:
- STRIPE_R5C_PARTIAL_STRIPE,
- STRIPE_R5C_FULL_STRIPE,
Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".
For RMW, the code allocates an extra page for each data block
being updated. This is stored in r5dev->orig_page and the old data
is read into it. Then the prexor calculation subtracts ->orig_page
from the parity block, and the reconstruct calculation adds the
->page data back into the parity block.
r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.
There are some known limitations of the cache implementation:
1. Write cache only covers full page writes (R5_OVERWRITE). Writes
of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at any time. Later
writes for the same stripe have to wait. This can be improved by
moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
the log, so recovery code has to replay more than necessary data
(sometimes all the log from last_checkpoint). This reduces
availability of the array.
This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
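The RMW description above (prexor subtracts ->orig_page from the parity block, then the new ->page data is added back) relies on XOR being its own inverse. A minimal userspace sketch for the single-parity case follows; the sketch_* functions are hypothetical helpers written for this illustration and plain byte buffers stand in for pages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* XOR src into dst. With XOR, "add" and "subtract" are the same operation. */
static void sketch_xor_into(uint8_t *dst, const uint8_t *src, size_t len)
{
	for (size_t i = 0; i < len; i++)
		dst[i] ^= src[i];
}

/*
 * Read-modify-write parity update for one changed data block:
 * first remove the old data (the prexor step, using the saved copy that
 * the text calls orig_page), then add the new data.
 */
static void sketch_rmw_parity(uint8_t *parity, const uint8_t *orig_page,
			      const uint8_t *page, size_t len)
{
	sketch_xor_into(parity, orig_page, len);	/* prexor */
	sketch_xor_into(parity, page, len);		/* add new data */
}

/* Reference: recompute parity from all data blocks (reconstruct write). */
static void sketch_full_parity(uint8_t *parity, const uint8_t *const *data,
			       size_t ndata, size_t len)
{
	for (size_t i = 0; i < len; i++)
		parity[i] = 0;
	for (size_t d = 0; d < ndata; d++)
		sketch_xor_into(parity, data[d], len);
}
```

Updating one block through sketch_rmw_parity() gives the same parity as recomputing it from every data block, which is why the old data has to be kept in a separate page until the prexor step has run.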
2016-11-17 23:24:39 +00:00
|
|
|
|
|
|
|
atomic_t r5c_cached_full_stripes;
|
|
|
|
struct list_head r5c_full_stripe_list;
|
|
|
|
atomic_t r5c_cached_partial_stripes;
|
|
|
|
struct list_head r5c_partial_stripe_list;
|
2017-02-11 00:18:09 +00:00
|
|
|
atomic_t r5c_flushing_full_stripes;
|
|
|
|
atomic_t r5c_flushing_partial_stripes;
|
2016-11-17 23:24:39 +00:00
|
|
|
|
2013-11-14 04:16:17 +00:00
|
|
|
atomic_t empty_inactive_list_nr;
|
2013-08-27 09:50:39 +00:00
|
|
|
struct llist_head released_stripes;
|
md/raid5: split wait_for_stripe and introduce wait_for_quiescent
I noticed heavy spin lock contention in get_active_stripe(), introduced
at the wake-up stage, where a bunch of processes try to re-take the
spin lock.
After giving this issue some thought, I found the lock contention could be
relieved (and even avoided) if we turn wait_for_stripe into a separate
waitqueue for each lock hash and make the wake up exclusive: wake up
one process each time, which avoids the lock contention naturally.
Before hacking on wait_for_stripe, I found it actually has 2
usages: for the array to enter or leave the quiescent state, and also
to wait for an available stripe in each of the hash lists.
So this patch splits the first usage off into a separate wait_queue,
wait_for_quiescent, and the next patch will turn the second usage into
one waitqueue for each hash value, and make it exclusive, to relieve
the lock contention.
v2: wake_up(wait_for_quiescent) when (active_stripes == 0)
Commit log refactor suggestion from Neil.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-05-08 08:19:06 +00:00
|
|
|
wait_queue_head_t wait_for_quiescent;
|
2016-02-26 00:24:42 +00:00
|
|
|
wait_queue_head_t wait_for_stripe;
|
2005-04-16 22:20:36 +00:00
|
|
|
wait_queue_head_t wait_for_overlap;
|
2015-02-26 01:21:04 +00:00
|
|
|
unsigned long cache_state;
|
2015-02-26 01:47:56 +00:00
|
|
|
struct shrinker shrinker;
|
2006-03-27 09:18:07 +00:00
|
|
|
int pool_size; /* number of disks in stripeheads in pool */
|
2005-04-16 22:20:36 +00:00
|
|
|
spinlock_t device_lock;
|
2006-03-27 09:18:06 +00:00
|
|
|
struct disk_info *disks;
|
2009-03-31 03:39:39 +00:00
|
|
|
|
|
|
|
/* When taking over an array from a different personality, we store
|
|
|
|
* the new thread here until we fully activate the array.
|
|
|
|
*/
|
2011-10-11 05:48:23 +00:00
|
|
|
struct md_thread *thread;
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place where we have lock contention. It has two
paths. In the first, the stripe isn't found and a new stripe is allocated; in the
second, the stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
for stripe_hashtbl and inactive_list, these fields change very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list, which is determined by the
stripe's lock_hash. Note that even a stripe with no sector assigned has a
lock_hash assigned. A stripe's inactive list is protected by a hash lock, which
is determined by its lock_hash too. The lock_hash is derived from the current
stripe_hashtbl hash, which guarantees that any stripe_hashtbl list will be
assigned to a specific lock_hash, so we can use the new hash lock to protect the
stripe_hashtbl list too. The goal of the new hash locks is that the first path
of get_active_stripe() only needs the new locks. Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() also accesses the other fields. Since they
change rarely, changing them now requires taking conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need both device_lock and a hash lock, we always take the hash lock first.
The tricky part is release_stripe and friends, where we need to take device_lock
first. Neil's suggestion is to put inactive stripes on a temporary list and re-add
them to inactive_list after device_lock is released. In this way, we add stripes
to the temporary list with device_lock held and remove stripes from the list with
the hash lock held. So we don't allow concurrent access to the temporary list,
which means we need to allocate a temporary list for each participant of
release_stripe.
One downside is that free stripes are maintained in their own inactive list and
can't move between the lists. By default, we have 256 stripes in total and 8
lists, so each list will have 32 stripes. It's possible that one list has a free
stripe but another list doesn't. This should be rare because stripe allocation is
evenly distributed. And we can always allocate more stripes for the cache; several
megabytes of memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit, though,
because we now need to take two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contention.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 04:16:17 +00:00
|
|
|
struct list_head temp_inactive_list[NR_STRIPE_HASH_LOCKS];
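The "256 stripes and 8 lists, so each list will have 32 stripes" arithmetic from the commit message above can be checked with a small userspace sketch. This is not the kernel code: the shift value and the `sketch_*` names are assumptions made for illustration, and only the count of 8 lists comes from the message itself.

```c
#include <stdint.h>

/* The commit message states 8 lists; the shift below is an assumption. */
#define NR_STRIPE_HASH_LOCKS 8
#define STRIPE_HASH_LOCKS_MASK (NR_STRIPE_HASH_LOCKS - 1)
#define SKETCH_STRIPE_SHIFT 3	/* hypothetical: 4k stripe unit in 512-byte sectors */

/* Map a stripe's starting sector to the index of its hash lock and inactive list. */
static inline int sketch_lock_hash(uint64_t sector)
{
	return (int)((sector >> SKETCH_STRIPE_SHIFT) & STRIPE_HASH_LOCKS_MASK);
}

/* Count how many of nr_stripes consecutive stripes land on each list. */
static inline void sketch_count_per_list(int nr_stripes,
					 int counts[NR_STRIPE_HASH_LOCKS])
{
	for (int i = 0; i < NR_STRIPE_HASH_LOCKS; i++)
		counts[i] = 0;
	for (int s = 0; s < nr_stripes; s++)
		counts[sketch_lock_hash((uint64_t)s << SKETCH_STRIPE_SHIFT)]++;
}
```

Calling `sketch_count_per_list(256, counts)` spreads consecutive stripes evenly over the lists, which is the property the message relies on when it argues that an empty list next to a non-empty one should be rare.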
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use a workqueue.
raid5 handles requests (especially writes) in stripe units. A stripe is page size
aligned/long and spans all disks. When writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine currently runs in the raid5d thread. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. A stripe corresponding to a request submitted from one cpu is better
handled by a thread on the local cpu or local node. The local cpu is preferred but
can sometimes be a bottleneck, for example when parity calculation is too heavy.
Running on the local node is more widely applicable.
b. configurability. Different setups of raid5 arrays might need different
configurations, especially the thread number. More threads don't always mean
better performance because of lock contention.
My original implementation created some kernel threads. There were
interfaces to control which cpu's stripes each thread should handle, and
userspace could set the affinity of the threads. This provides the most
flexibility and configurability. But it's hard to use, and apparently a new
thread pool implementation is disfavored.
The recent workqueue improvements are quite promising. An unbound workqueue will
be bound to a numa node. If WQ_SYSFS is set on the workqueue, there are sysfs
options to set affinity. For example, we can include only one HT sibling in the
affinity. Since work is non-reentrant by default, we can control the number of
running threads by limiting the number of dispatched work_structs.
In this patch, I created several stripe worker groups. A group is a numa node.
Stripes from cpus of one node will be added to a group list. The workqueue thread
of one node will only handle stripes of the worker group of that node. In this
way, stripe handling has numa node locality. And as I said, we can control the
thread number by limiting the number of dispatched work_structs.
The work_struct callback function handles several stripes in one run. A typical
workqueue usage is to run one unit in each work_struct. In the raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need a lock, because of reference accounting
and because the stripe isn't in any list, queuing a work_struct for each stripe
would make the workqueue lock very heavily contended.
b. blk_start_plug()/blk_finish_plug() should surround stripe handling, as we
might dispatch requests. If each work_struct only handled one stripe, such a
block plug would be meaningless.
This implementation can't do very fine grained configuration. But numa
binding is the most popular usage model and should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-threading might
decrease the size of requests dispatched to the lower layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be appropriate for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 06:30:16 +00:00
|
|
|
struct r5worker_group *worker_groups;
|
|
|
|
int group_cnt;
|
|
|
|
int worker_cnt_per_group;
|
raid5: add basic stripe log
This introduces a simple log for raid5. Data/parity writes to the raid
array are first written to the log, then written to the raid array disks.
If a crash happens, we can recover data from the log. This can speed up
raid resync and fix the write hole issue.
The log structure is pretty simple. Data/meta data is stored in block
units, which are generally 4k. There is only one type of meta data block.
The meta data block can track 3 types of data: stripe data, stripe
parity and flush block. The MD superblock will point to the last valid
meta data block. Each meta data block has a checksum/seq number, so
recovery can scan the log correctly. We store a checksum of stripe
data/parity in the meta data block, so meta data and stripe data/parity
can be written to the log disk together. Otherwise, the meta data write
would have to wait until the stripe data/parity write is finished.
For stripe data, the meta data block will record the stripe data sector and
size. Currently the size is always 4k. This meta data record could be made
simpler if we just fixed the write hole (e.g., we could record data of a
stripe's different disks together), but this format can be extended to
support caching in the future, which must record data address/size.
For stripe parity, the meta data block will record the stripe sector. Its
size should be 4k (for raid5) or 8k (for raid6). We always store the P
parity first. This format should work for caching too.
A flush block indicates that a stripe is on the raid array disks. Fixing the
write hole doesn't need this type of meta data; it's for the caching extension.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-13 21:31:59 +00:00
|
|
|
struct r5l_log *log;
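As a rough illustration of the record types and parity sizes described in the stripe log commit message, here is a minimal userspace sketch. The enum and function names are hypothetical and are not the on-disk format definitions; only the three record kinds and the 4k/8k sizes come from the message.

```c
#include <stddef.h>

#define SKETCH_BLOCK_SIZE 4096u	/* "4k generally", per the commit message */

/* Hypothetical names for the three kinds of record a meta data block tracks. */
enum sketch_payload_type {
	SKETCH_PAYLOAD_DATA,	/* stripe data: records sector and size */
	SKETCH_PAYLOAD_PARITY,	/* stripe parity: records stripe sector */
	SKETCH_PAYLOAD_FLUSH,	/* stripe is on the raid disks (caching only) */
};

/* Bytes of parity logged per stripe: P only for raid5, P then Q for raid6. */
static inline size_t sketch_parity_payload_bytes(int raid_level)
{
	return raid_level == 6 ? 2 * SKETCH_BLOCK_SIZE : SKETCH_BLOCK_SIZE;
}
```

Under these assumptions a raid5 stripe logs 4096 bytes of parity and a raid6 stripe logs 8192, with P stored first.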
|
raid5: only dispatch IO from raid5d for harddisk raid
We made raid5 stripe handling multi-threaded before. It works well for
SSDs. But for hard disks, multi-threading creates more disk seeks, so
it does not always improve performance. For raid5 arrays built from several
hard disks, multi-threading is still required, as raid5d becomes a bottleneck,
especially for sequential writes.
To overcome the disk seek issue, we only dispatch IO from raid5d if the
array is hard disk based. Other threads can still handle stripes, but
can't dispatch IO.
Ideally, we should control the IO dispatching order according to IO position
internally. Right now we still depend on the block layer, which sometimes
isn't very efficient.
My setup has 9 hard disks; each disk can do around 180M/s sequential
write. So in theory, the raid5 can do 180 * 8 = 1440M/s sequential
write. The test machine uses an ATOM CPU. I measured sequential write
bandwidth to the raid array with a large iodepth:
without patch: ~600M/s
without patch and group_thread_cnt=4: 750M/s
with patch and group_thread_cnt=4: 950M/s
with patch, group_thread_cnt=4, skip_copy=1: 1150M/s
We are pretty close to the maximum bandwidth in the large iodepth
case. The performance gap between software raid and the theoretical value
for small iodepth sequential write is still very big though, because
we don't have an efficient pipeline.
Cc: NeilBrown <neilb@suse.com>
Cc: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-04 17:33:23 +00:00
|
|
|
|
|
|
|
struct bio_list pending_bios;
|
|
|
|
spinlock_t pending_bios_lock;
|
|
|
|
bool batch_bio_dispatch;
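The "180 * 8 = 1440M/s" figure in the commit message above follows from counting only the data disks of the array. A minimal sketch of that arithmetic, using hypothetical helper names:

```c
/* Theoretical sequential-write bandwidth in MB/s: only data disks contribute. */
static inline unsigned int sketch_seq_write_bw(unsigned int nr_disks,
					       unsigned int parity_disks,
					       unsigned int per_disk_mbps)
{
	return (nr_disks - parity_disks) * per_disk_mbps;
}

/* A measured result as an integer percentage of the theoretical maximum. */
static inline unsigned int sketch_percent_of_max(unsigned int measured,
						 unsigned int max)
{
	return measured * 100u / max;
}
```

With 9 disks, 1 parity disk and 180 MB/s per disk this gives 1440 MB/s. Against that ceiling, the quoted 1150M/s result is roughly 79%, and the unpatched ~600M/s is roughly 41%.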
|
2005-04-16 22:20:36 +00:00
|
|
|
};
|
|
|
|
|
2015-02-26 01:21:04 +00:00
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
/*
|
|
|
|
* Our supported algorithms
|
|
|
|
*/
|
2009-03-31 03:39:38 +00:00
|
|
|
#define ALGORITHM_LEFT_ASYMMETRIC 0 /* Rotating Parity N with Data Restart */
|
|
|
|
#define ALGORITHM_RIGHT_ASYMMETRIC 1 /* Rotating Parity 0 with Data Restart */
|
|
|
|
#define ALGORITHM_LEFT_SYMMETRIC 2 /* Rotating Parity N with Data Continuation */
|
|
|
|
#define ALGORITHM_RIGHT_SYMMETRIC 3 /* Rotating Parity 0 with Data Continuation */
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2009-03-31 03:39:38 +00:00
|
|
|
/* Define non-rotating (raid4) algorithms. These allow
|
|
|
|
* conversion of raid4 to raid5.
|
|
|
|
*/
|
|
|
|
#define ALGORITHM_PARITY_0 4 /* P or P,Q are initial devices */
|
|
|
|
#define ALGORITHM_PARITY_N 5 /* P or P,Q are final devices. */
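To make the comments on the six RAID5 layout constants concrete, the sketch below computes which device holds parity for a given stripe. It reflects my reading of those comments ("Rotating Parity N" starts at the last device and moves down, "Rotating Parity 0" starts at the first device and moves up); the real mapping lives in raid5_compute_sector() and also decides data placement, which is not modelled here. `sketch_parity_disk` is a hypothetical name.

```c
#define ALGORITHM_LEFT_ASYMMETRIC	0
#define ALGORITHM_RIGHT_ASYMMETRIC	1
#define ALGORITHM_LEFT_SYMMETRIC	2
#define ALGORITHM_RIGHT_SYMMETRIC	3
#define ALGORITHM_PARITY_0		4
#define ALGORITHM_PARITY_N		5

/* Return the index of the parity device for a stripe, or -1 for an unknown layout. */
static inline int sketch_parity_disk(int algorithm, unsigned long stripe,
				     int raid_disks)
{
	int data_disks = raid_disks - 1;	/* raid5: one parity block per stripe */

	switch (algorithm) {
	case ALGORITHM_LEFT_ASYMMETRIC:
	case ALGORITHM_LEFT_SYMMETRIC:
		/* "Rotating Parity N": last device first, then moving down */
		return data_disks - (int)(stripe % (unsigned long)raid_disks);
	case ALGORITHM_RIGHT_ASYMMETRIC:
	case ALGORITHM_RIGHT_SYMMETRIC:
		/* "Rotating Parity 0": first device first, then moving up */
		return (int)(stripe % (unsigned long)raid_disks);
	case ALGORITHM_PARITY_0:
		return 0;		/* fixed on the initial device (raid4 style) */
	case ALGORITHM_PARITY_N:
		return data_disks;	/* fixed on the final device (raid4 style) */
	default:
		return -1;
	}
}
```

The symmetric and asymmetric variants share a parity position and differ only in how data blocks are ordered around it ("Data Restart" versus "Data Continuation").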
|
|
|
|
|
|
|
|
/* DDF RAID6 layouts differ from md/raid6 layouts in two ways.
|
|
|
|
* Firstly, the exact positioning of the parity block is slightly
|
|
|
|
* different between the 'LEFT_*' modes of md and the "_N_*" modes
|
|
|
|
* of DDF.
|
|
|
|
* Secondly, the order of datablocks over which the Q syndrome is computed
|
|
|
|
* is different.
|
|
|
|
* Consequently we have different layouts for DDF/raid6 than md/raid6.
|
|
|
|
* These layouts are from the DDFv1.2 spec.
|
|
|
|
* Interestingly DDFv1.2-Errata-A does not specify N_CONTINUE but
|
|
|
|
* leaves RLQ=3 as 'Vendor Specific'
|
|
|
|
*/
|
|
|
|
|
|
|
|
#define ALGORITHM_ROTATING_ZERO_RESTART 8 /* DDF PRL=6 RLQ=1 */
|
|
|
|
#define ALGORITHM_ROTATING_N_RESTART 9 /* DDF PRL=6 RLQ=2 */
|
|
|
|
#define ALGORITHM_ROTATING_N_CONTINUE 10 /* DDF PRL=6 RLQ=3 */
|
|
|
|
|
|
|
|
/* For every RAID5 algorithm we define a RAID6 algorithm
|
|
|
|
* with exactly the same layout for data and parity, and
|
|
|
|
* with the Q block always on the last device (N-1).
|
|
|
|
* This allows trivial conversion from RAID5 to RAID6
|
|
|
|
*/
|
|
|
|
#define ALGORITHM_LEFT_ASYMMETRIC_6 16
|
|
|
|
#define ALGORITHM_RIGHT_ASYMMETRIC_6 17
|
|
|
|
#define ALGORITHM_LEFT_SYMMETRIC_6 18
|
|
|
|
#define ALGORITHM_RIGHT_SYMMETRIC_6 19
|
|
|
|
#define ALGORITHM_PARITY_0_6 20
|
|
|
|
#define ALGORITHM_PARITY_N_6 ALGORITHM_PARITY_N
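The numbering of the `_6` constants suggests a simple mapping from a RAID5 layout to its RAID6 counterpart: add 16, except for ALGORITHM_PARITY_N, which is reused unchanged. The helper below is a hypothetical sketch of that mapping, not a function from this header.

```c
#define ALGORITHM_PARITY_N		5
#define ALGORITHM_LEFT_ASYMMETRIC_6	16
#define ALGORITHM_PARITY_N_6		ALGORITHM_PARITY_N

/* Map a RAID5 layout (0..5) to the RAID6 layout with Q on the last device. */
static inline int sketch_raid5_to_raid6_layout(int layout)
{
	if (layout < 0 || layout > ALGORITHM_PARITY_N)
		return -1;			/* not a RAID5 layout */
	if (layout == ALGORITHM_PARITY_N)
		return ALGORITHM_PARITY_N_6;	/* same value as ALGORITHM_PARITY_N */
	return layout + ALGORITHM_LEFT_ASYMMETRIC_6;
}
```

For example, ALGORITHM_LEFT_SYMMETRIC (2) maps to ALGORITHM_LEFT_SYMMETRIC_6 (18), while ALGORITHM_PARITY_N (5) stays 5.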
|
|
|
|
|
|
|
|
static inline int algorithm_valid_raid5(int layout)
|
|
|
|
{
|
|
|
|
return (layout >= 0) &&
|
|
|
|
(layout <= 5);
|
|
|
|
}
|
|
|
|
static inline int algorithm_valid_raid6(int layout)
|
|
|
|
{
|
|
|
|
return (layout >= 0 && layout <= 5)
|
|
|
|
||
|
2009-10-16 05:27:34 +00:00
|
|
|
(layout >= 8 && layout <= 10)
|
2009-03-31 03:39:38 +00:00
|
|
|
||
|
|
|
|
(layout >= 16 && layout <= 20);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int algorithm_is_DDF(int layout)
|
|
|
|
{
|
|
|
|
return layout >= 8 && layout <= 10;
|
|
|
|
}
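The three predicates above can be combined to sort a RAID6 layout number into one of the families this header defines. The sketch below repeats the predicates so it compiles on its own; the `sketch_*` enum and classifier are hypothetical additions for illustration.

```c
static inline int algorithm_valid_raid5(int layout)
{
	return (layout >= 0) && (layout <= 5);
}

static inline int algorithm_valid_raid6(int layout)
{
	return (layout >= 0 && layout <= 5) ||
	       (layout >= 8 && layout <= 10) ||
	       (layout >= 16 && layout <= 20);
}

static inline int algorithm_is_DDF(int layout)
{
	return layout >= 8 && layout <= 10;
}

/* Hypothetical classification of RAID6 layout numbers. */
enum sketch_layout_family {
	SKETCH_INVALID,
	SKETCH_MD_NATIVE,	/* 0..5 */
	SKETCH_DDF,		/* 8..10 */
	SKETCH_RAID5_PLUS_Q,	/* 16..20: RAID5 layout with Q on the last device */
};

static inline enum sketch_layout_family sketch_classify_raid6_layout(int layout)
{
	if (!algorithm_valid_raid6(layout))
		return SKETCH_INVALID;
	if (algorithm_is_DDF(layout))
		return SKETCH_DDF;
	return layout >= 16 ? SKETCH_RAID5_PLUS_Q : SKETCH_MD_NATIVE;
}
```

Note the gaps: 6, 7 and 11 through 15 are rejected for both levels, and 8 through 10 are valid only for RAID6.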
|
2010-07-26 01:57:07 +00:00
|
|
|
|
2011-10-11 05:49:52 +00:00
|
|
|
extern void md_raid5_kick_device(struct r5conf *conf);
|
2011-10-11 05:47:53 +00:00
|
|
|
extern int raid5_set_cache_size(struct mddev *mddev, int size);
|
2015-08-13 21:31:57 +00:00
|
|
|
extern sector_t raid5_compute_blocknr(struct stripe_head *sh, int i, int previous);
|
|
|
|
extern void raid5_release_stripe(struct stripe_head *sh);
|
|
|
|
extern sector_t raid5_compute_sector(struct r5conf *conf, sector_t r_sector,
|
|
|
|
int previous, int *dd_idx,
|
|
|
|
struct stripe_head *sh);
|
|
|
|
extern struct stripe_head *
|
|
|
|
raid5_get_active_stripe(struct r5conf *conf, sector_t sector,
|
|
|
|
int previous, int noblock, int noquiesce);
|
2017-01-24 18:45:30 +00:00
|
|
|
extern int raid5_calc_degraded(struct r5conf *conf);
|
2015-08-13 21:31:59 +00:00
|
|
|
extern int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev);
|
|
|
|
extern void r5l_exit_log(struct r5l_log *log);
|
|
|
|
extern int r5l_write_stripe(struct r5l_log *log, struct stripe_head *head_sh);
|
|
|
|
extern void r5l_write_stripe_run(struct r5l_log *log);
|
raid5: log reclaim support
This is the reclaim support for the raid5 log. A stripe write goes through
the following steps:
1. reconstruct the stripe, read data/calculate parity. ops_run_io
prepares to write data/parity to raid disks
2. hijack ops_run_io. stripe data/parity is appended to the log disk
3. flush log disk cache
4. ops_run_io runs again and does the normal operation. stripe data/parity is
written to the raid array disks. raid core can return io to the upper layer.
5. flush cache of all raid array disks
6. update super block
7. log disk space used by the stripe can be reused
In practice, several stripes make up an io_unit and we batch
several io_units in different steps, but the whole process doesn't
change.
It would be possible to return IO as soon as data/parity hits the log disk,
but then read IO would need to read from the log disk. For simplicity, IO
return happens at step 4, where read IO can directly read from raid disks.
Currently reclaim runs if there is enough reclaimable space (1/4 disk
size or 10G) or we are out of space. Reclaim only frees log disk
space; it doesn't impact data consistency. The size based forced reclaim
makes sure the log isn't too big, so recovery doesn't have to scan too much
of the log.
Recovery makes sure the raid disks and the log disk have the same data for a
stripe. If a crash happens before step 4, recovery might or might not recover
the stripe's data/parity, depending on whether the data/parity and its checksum
match. In either case, this doesn't change the semantics of an IO write.
After step 3, the stripe is guaranteed to be recoverable, because the stripe's
data/parity is persistent on the log disk. In some cases, the log disk content
and raid disk content of a stripe are the same, but recovery will still
copy the log disk content to the raid disks; this doesn't impact data
consistency. Space reuse happens after the superblock update and cache
flush.
There is one situation we want to avoid. A broken meta block in the middle of
a log prevents recovery from finding the meta block at the head of the log. If
an operation requires the meta block at the head to be persistent in the log,
we must make sure the meta blocks before it are persistent in the log too. The
case is that stripe data/parity is in the log and we start to write the stripe
to the raid disks (before step 4). Stripe data/parity must be persistent in the
log before we do the write to the raid disks. The solution is to strictly
maintain io_unit list order. In this case, we only write the stripes of an
io_unit to the raid disks once the io_unit is the first one whose data/parity
is in the log.
The io_unit list order is important for other cases too. For example,
some io_units are reclaimable and others are not. They can be mixed in the
list, and we shouldn't reuse the space of an unreclaimable io_unit.
Includes fixes to problems which were...
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-13 21:32:00 +00:00
|
|
|
extern void r5l_flush_stripe_to_raid(struct r5l_log *log);
|
|
|
|
extern void r5l_stripe_write_finished(struct stripe_head *sh);
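The reclaim trigger described in the log reclaim commit message ("1/4 disk size or 10G ... or we are out of space") can be sketched as below. Treating the threshold as the smaller of the two values is my assumption, since the message does not say how they combine; the names and the 512-byte sector unit are also assumptions.

```c
#include <stdint.h>

#define SKETCH_SECTOR_SIZE 512ull
/* 10G expressed in sectors. */
#define SKETCH_RECLAIM_MAX_SECTORS \
	(10ull * 1024 * 1024 * 1024 / SKETCH_SECTOR_SIZE)

/* Assumed threshold: the smaller of a quarter of the log and 10G. */
static inline uint64_t sketch_reclaim_threshold(uint64_t log_sectors)
{
	uint64_t quarter = log_sectors / 4;

	return quarter < SKETCH_RECLAIM_MAX_SECTORS ?
		quarter : SKETCH_RECLAIM_MAX_SECTORS;
}

/* Reclaim runs when enough space is reclaimable or the log is out of space. */
static inline int sketch_should_reclaim(uint64_t log_sectors,
					uint64_t reclaimable_sectors,
					int out_of_space)
{
	return out_of_space ||
	       reclaimable_sectors >= sketch_reclaim_threshold(log_sectors);
}
```

Under this reading an 8G log reclaims once 2G is reclaimable, and a 100G log is capped at 10G, which keeps the amount of log that recovery must scan bounded.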
|
2015-09-02 20:49:49 +00:00
|
|
|
extern int r5l_handle_flush_request(struct r5l_log *log, struct bio *bio);
|
2015-10-04 16:20:12 +00:00
|
|
|
extern void r5l_quiesce(struct r5l_log *log, int state);
|
2015-10-09 04:54:08 +00:00
|
|
|
extern bool r5l_log_disk_error(struct r5conf *conf);
|
md/r5cache: State machine for raid5-cache write back mode
This patch adds state machine for raid5-cache. With log device, the
raid456 array could operate in two different modes (r5c_journal_mode):
- write-back (R5C_MODE_WRITE_BACK)
- write-through (R5C_MODE_WRITE_THROUGH)
Existing code of raid5-cache only has write-through mode. For write-back
cache, it is necessary to extend the state machine.
With write-back cache, every stripe could operate in two different
phases:
- caching
- writing-out
In caching phase, the stripe handles writes as:
- write to journal
- return IO
In writing-out phase, the stripe behaves as a stripe in write-through
mode (R5C_MODE_WRITE_THROUGH).
STRIPE_R5C_CACHING is added to sh->state to differentiate caching and
writing-out phase.
Please note: this is a "no-op" patch for raid5-cache write-through
mode.
The following detailed explanation is copied from the raid5-cache.c:
/*
* raid5 cache state machine
*
* With the RAID cache, each stripe works in two phases:
* - caching phase
* - writing-out phase
*
* These two phases are controlled by bit STRIPE_R5C_CACHING:
* if STRIPE_R5C_CACHING == 0, the stripe is in writing-out phase
* if STRIPE_R5C_CACHING == 1, the stripe is in caching phase
*
* When there is no journal, or the journal is in write-through mode,
* the stripe is always in writing-out phase.
*
* For write-back journal, the stripe is sent to caching phase on write
* (r5c_handle_stripe_dirtying). r5c_make_stripe_write_out() kicks off
* the write-out phase by clearing STRIPE_R5C_CACHING.
*
* Stripes in caching phase do not write the raid disks. Instead, all
* writes are committed from the log device. Therefore, a stripe in
* caching phase handles writes as:
* - write to log device
* - return IO
*
* Stripes in writing-out phase handle writes as:
* - calculate parity
* - write pending data and parity to journal
* - write data and parity to raid disks
* - return IO for pending writes
*/
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-17 23:24:38 +00:00
|
|
|
extern bool r5c_is_writeback(struct r5l_log *log);
|
|
|
|
extern int
|
|
|
|
r5c_try_caching_write(struct r5conf *conf, struct stripe_head *sh,
|
|
|
|
struct stripe_head_state *s, int disks);
|
|
|
|
extern void
|
|
|
|
r5c_finish_stripe_write_out(struct r5conf *conf, struct stripe_head *sh,
|
|
|
|
struct stripe_head_state *s);
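The two-phase rule in the state machine comment above (a stripe is in the caching phase exactly when STRIPE_R5C_CACHING is set, and the bit is only ever set with a write-back journal) can be modelled in a few lines. The bit position and all `sketch_*` names are hypothetical; the real flag lives in sh->state.

```c
#include <stdbool.h>

#define SKETCH_STRIPE_R5C_CACHING_BIT 0	/* hypothetical bit position */

enum sketch_phase { SKETCH_WRITING_OUT, SKETCH_CACHING };

/* The phase is read straight from the caching bit. */
static inline enum sketch_phase sketch_stripe_phase(unsigned long state)
{
	return ((state >> SKETCH_STRIPE_R5C_CACHING_BIT) & 1ul) ?
		SKETCH_CACHING : SKETCH_WRITING_OUT;
}

/* On a write, only a write-back journal sends the stripe to the caching phase. */
static inline unsigned long sketch_on_write(unsigned long state,
					    bool has_journal, bool writeback)
{
	if (has_journal && writeback)
		state |= 1ul << SKETCH_STRIPE_R5C_CACHING_BIT;
	return state;
}

/* Starting write-out clears the bit, as r5c_make_stripe_write_out() is said to. */
static inline unsigned long sketch_make_write_out(unsigned long state)
{
	return state & ~(1ul << SKETCH_STRIPE_R5C_CACHING_BIT);
}
```

With no journal, or with a write-through journal, `sketch_on_write()` leaves the state untouched, so the stripe stays in the writing-out phase as the comment requires.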
|
md/r5cache: caching phase of r5cache
As described in previous patch, write back cache operates in two
phases: caching and writing-out. The caching phase works as:
1. write data to journal
(r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
(r5c_handle_data_cached, r5c_return_dev_pending_writes).
Then the writing-out phase is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calculate parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks
This patch implements caching phase. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.
Writing-out phase of the cache is implemented in the next patch.
With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multiple IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.
This patch adds 2 flags to stripe_head.state:
- STRIPE_R5C_PARTIAL_STRIPE,
- STRIPE_R5C_FULL_STRIPE,
Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".
For RMW, the code allocates an extra page for each data block
being updated. This is stored in r5dev->orig_page and the old data
is read into it. Then the prexor calculation subtracts ->orig_page
from the parity block, and the reconstruct calculation adds the
->page data back into the parity block.
r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.
There are some known limitations of the cache implementation:
1. Write cache only covers full page writes (R5_OVERWRITE). Writes
of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at any time. Later
writes for the same stripe have to wait. This can be improved by
moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
the log, so recovery code has to replay more data than necessary
(sometimes all the log from last_checkpoint). This reduces
availability of the array.
This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-17 23:24:39 +00:00
|
|
|
extern void r5c_release_extra_page(struct stripe_head *sh);
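The RMW paragraph of the caching-phase commit message (prexor subtracts ->orig_page from parity, then the new ->page data is added back) is plain XOR arithmetic. The byte-wise sketch below illustrates it in userspace; the kernel uses its async xor machinery on pages, and the `sketch_*` names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* dst ^= src over len bytes. */
static inline void sketch_xor_into(uint8_t *dst, const uint8_t *src, size_t len)
{
	for (size_t i = 0; i < len; i++)
		dst[i] ^= src[i];
}

/*
 * Read-modify-write parity update for one data block:
 *   1. prexor: remove the old data (orig_page) from the parity
 *   2. add the new data (page) into the parity
 */
static inline void sketch_rmw_parity(uint8_t *parity, const uint8_t *orig_page,
				     const uint8_t *page, size_t len)
{
	sketch_xor_into(parity, orig_page, len);
	sketch_xor_into(parity, page, len);
}
```

Because XOR is its own inverse, the result equals the parity that a full reconstruct over the new data would produce, without reading the other data blocks. That is why the old data has to be kept in a separate page while the new data sits in the cache.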
|
2016-11-24 06:50:39 +00:00
|
|
|
extern void r5c_use_extra_page(struct stripe_head *sh);
|
md/r5cache: write-out phase and reclaim support
There are two limited resources, stripe cache and journal disk space.
For better performance, we prioritize reclaim of full stripe writes.
To free up more journal space, we free earliest data on the journal.
In current implementation, reclaim happens when:
1. Periodically (every R5C_RECLAIM_WAKEUP_INTERVAL, 30 seconds) reclaim
if there is no reclaim in the past 5 seconds.
2. when there are R5C_FULL_STRIPE_FLUSH_BATCH (256) cached full stripes,
or cached stripes is enough for a full stripe (chunk size / 4k)
(r5c_check_cached_full_stripe)
3. when there is pressure on stripe cache (r5c_check_stripe_cache_usage)
4. when there is pressure on journal space (r5l_write_stripe, r5c_cache_data)
r5c_do_reclaim() contains new logic of reclaim.
For stripe cache:
When stripe cache pressure is high (more than 3/4 of stripes are cached,
or there are empty inactive lists), flush all full stripes. If fewer
than R5C_RECLAIM_STRIPE_GROUP (NR_STRIPE_HASH_LOCKS * 2) full stripes
are flushed, flush some partial stripes. When stripe cache pressure
is moderate (1/2 to 3/4 of stripes are cached), flush all full stripes.
For log space:
To avoid deadlock due to log space, we need to reserve enough space
to flush cached data. The size of required log space depends on total
number of cached stripes (stripe_in_journal_count). In current
implementation, the writing-out phase automatically includes pending
data writes with parity writes (similar to write through case).
Therefore, we need up to (conf->raid_disks + 1) pages for each cached
stripe (1 page for meta data, raid_disks pages for all data and
parity). r5c_log_required_to_flush_cache() calculates log space
required to flush cache. In the following, we refer to the space
calculated by r5c_log_required_to_flush_cache() as
reclaim_required_space.
Two flags are added to r5conf->cache_state: R5C_LOG_TIGHT and
R5C_LOG_CRITICAL. R5C_LOG_TIGHT is set when free space on the log
device is less than 3x of reclaim_required_space. R5C_LOG_CRITICAL
is set when free space on the log device is less than 2x of
reclaim_required_space.
r5c_cache keeps all data in cache (not fully committed to RAID) in
a list (stripe_in_journal_list). These stripes are in the order of their
first appearance on the journal. So the log tail (last_checkpoint)
should point to the journal_start of the first item in the list.
When R5C_LOG_TIGHT is set, r5l_reclaim_thread starts flushing out
stripes at the head of stripe_in_journal. When R5C_LOG_CRITICAL is
set, the state machine only writes data that are already in the
log device (in stripe_in_journal_list).
This patch includes a fix to improve performance by
Shaohua Li <shli@fb.com>.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-17 23:24:40 +00:00
|
|
|
extern void r5l_wake_reclaim(struct r5l_log *log, sector_t space);
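The log space rules in the write-out and reclaim commit message reduce to two small calculations: up to (raid_disks + 1) pages per cached stripe, and two thresholds at 3x and 2x of that requirement. The sketch below follows the message literally; the flag values and `sketch_*` names are hypothetical, and r5c_log_required_to_flush_cache() may compute the requirement differently.

```c
#include <stdint.h>

#define SKETCH_LOG_TIGHT	1u	/* hypothetical flag values */
#define SKETCH_LOG_CRITICAL	2u

/* Pages needed to flush the cache: 1 meta data page plus raid_disks pages per stripe. */
static inline uint64_t sketch_reclaim_required_pages(uint64_t cached_stripes,
						     unsigned int raid_disks)
{
	return cached_stripes * (raid_disks + 1);
}

/* TIGHT below 3x the requirement, CRITICAL (in addition) below 2x. */
static inline unsigned int sketch_log_state(uint64_t free_pages,
					    uint64_t required_pages)
{
	unsigned int flags = 0;

	if (free_pages < 3 * required_pages)
		flags |= SKETCH_LOG_TIGHT;
	if (free_pages < 2 * required_pages)
		flags |= SKETCH_LOG_CRITICAL;
	return flags;
}
```

For 10 cached stripes on a 4-disk array the requirement is 50 pages, so the log becomes tight below 150 free pages and critical below 100.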
|
2016-11-17 23:24:39 +00:00
|
|
|
extern void r5c_handle_cached_data_endio(struct r5conf *conf,
|
|
|
|
struct stripe_head *sh, int disks, struct bio_list *return_bi);
|
|
|
|
extern int r5c_cache_data(struct r5l_log *log, struct stripe_head *sh,
|
|
|
|
struct stripe_head_state *s);
|
md/r5cache: write-out phase and reclaim support
There are two limited resources, stripe cache and journal disk space.
For better performance, we priotize reclaim of full stripe writes.
To free up more journal space, we free earliest data on the journal.
In current implementation, reclaim happens when:
1. Periodically (every R5C_RECLAIM_WAKEUP_INTERVAL, 30 seconds) reclaim
if there is no reclaim in the past 5 seconds.
2. when there are R5C_FULL_STRIPE_FLUSH_BATCH (256) cached full stripes,
or cached stripes is enough for a full stripe (chunk size / 4k)
(r5c_check_cached_full_stripe)
3. when there is pressure on stripe cache (r5c_check_stripe_cache_usage)
4. when there is pressure on journal space (r5l_write_stripe, r5c_cache_data)
r5c_do_reclaim() contains new logic of reclaim.
For stripe cache:
When stripe cache pressure is high (more than 3/4 stripes are cached,
or there is empty inactive lists), flush all full stripe. If fewer
than R5C_RECLAIM_STRIPE_GROUP (NR_STRIPE_HASH_LOCKS * 2) full stripes
are flushed, flush some paritial stripes. When stripe cache pressure
is moderate (1/2 to 3/4 of stripes are cached), flush all full stripes.
For log space:
To avoid deadlock due to log space, we need to reserve enough space
to flush cached data. The size of required log space depends on total
number of cached stripes (stripe_in_journal_count). In current
implementation, the writing-out phase automatically include pending
data writes with parity writes (similar to write through case).
Therefore, we need up to (conf->raid_disks + 1) pages for each cached
stripe (1 page for meta data, raid_disks pages for all data and
parity). r5c_log_required_to_flush_cache() calculates log space
required to flush cache. In the following, we refer to the space
calculated by r5c_log_required_to_flush_cache() as
reclaim_required_space.
Two flags are added to r5conf->cache_state: R5C_LOG_TIGHT and
R5C_LOG_CRITICAL. R5C_LOG_TIGHT is set when free space on the log
device is less than 3x of reclaim_required_space. R5C_LOG_CRITICAL
is set when free space on the log device is less than 2x of
reclaim_required_space.
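A minimal sketch of the threshold arithmetic above, as a userspace model.
The names required_pages(), classify_log() and struct log_state are
hypothetical, and the model counts in pages purely for illustration; the
units used by the real r5c_log_required_to_flush_cache() are not stated
here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two cache_state bits. */
struct log_state {
	bool tight;    /* models R5C_LOG_TIGHT */
	bool critical; /* models R5C_LOG_CRITICAL */
};

/* Up to (raid_disks + 1) pages per cached stripe: 1 meta + raid_disks data/parity. */
static unsigned long required_pages(unsigned long cached_stripes,
				    unsigned int raid_disks)
{
	assert(raid_disks > 0);
	return cached_stripes * (raid_disks + 1UL);
}

static struct log_state classify_log(unsigned long free_pages,
				     unsigned long cached_stripes,
				     unsigned int raid_disks)
{
	unsigned long need = required_pages(cached_stripes, raid_disks);
	struct log_state st;

	st.tight = free_pages < 3 * need;    /* less than 3x required space */
	st.critical = free_pages < 2 * need; /* less than 2x required space */
	return st;
}
```

With 4 cached stripes on a 4-disk array the model needs 20 pages, so the
log is tight below 60 free pages and critical below 40. Critical always
implies tight.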
r5c_cache keeps all data in cache (not fully committed to RAID) in
a list (stripe_in_journal_list). These stripes are in the order of their
first appearance on the journal. So the log tail (last_checkpoint)
should point to the journal_start of the first item in the list.
When R5C_LOG_TIGHT is set, r5l_reclaim_thread starts flushing out
stripes at the head of stripe_in_journal_list. When R5C_LOG_CRITICAL is
set, the state machine only writes data that are already in the
log device (in stripe_in_journal_list).
This patch includes a fix to improve performance by
Shaohua Li <shli@fb.com>.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-17 23:24:40 +00:00
|
|
|
extern void r5c_make_stripe_write_out(struct stripe_head *sh);
|
|
|
|
extern void r5c_flush_cache(struct r5conf *conf, int num);
|
|
|
|
extern void r5c_check_stripe_cache_usage(struct r5conf *conf);
|
|
|
|
extern void r5c_check_cached_full_stripe(struct r5conf *conf);
|
2016-11-17 23:24:41 +00:00
|
|
|
extern struct md_sysfs_entry r5c_journal_mode;
|
2017-01-24 18:45:30 +00:00
|
|
|
extern void r5c_update_on_rdev_error(struct mddev *mddev);
|
md/r5cache: enable chunk_aligned_read with write back cache
Chunk aligned read significantly reduces CPU usage of raid456.
However, it is not safe to fully bypass the write back cache.
This patch enables chunk aligned read with write back cache.
For chunk aligned read, we track stripes in write back cache at
a bigger granularity, "big_stripe". Each chunk may contain more
than one stripe (for example, a 256kB chunk contains 64 4kB pages,
so this chunk contains 64 stripes). For chunk_aligned_read, these
stripes are grouped into one big_stripe, so we only need one lookup
for the whole chunk.
For each big_stripe, struct big_stripe_info tracks how many stripes
of this big_stripe are in the write back cache. These
counters are tracked in a radix tree (big_stripe_tree).
r5c_tree_index() is used to calculate keys for the radix tree.
chunk_aligned_read() calls r5c_big_stripe_cached() to look up
big_stripe of each chunk in the tree. If this big_stripe is in the
tree, chunk_aligned_read() aborts. This look up is protected by
rcu_read_lock().
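The counting and lookup described above can be modelled in userspace as
follows. This is a sketch under assumptions, not the kernel code: a small
fixed array stands in for the radix tree, there is no RCU or locking, the
key is assumed to be the sector divided by the chunk size in sectors, and
all model_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_SLOTS 16u /* hypothetical fixed table replacing the radix tree */

struct big_stripe_model {
	uint32_t chunk_sectors;            /* chunk size in 512-byte sectors */
	unsigned int cached[MODEL_SLOTS];  /* cached stripes per big_stripe */
};

/* Assumed key: all sectors of one chunk map to the same big_stripe. */
static uint64_t model_tree_index(const struct big_stripe_model *m, uint64_t sect)
{
	assert(m->chunk_sectors > 0);
	return sect / m->chunk_sectors;
}

/* A stripe entered the write back cache: bump its big_stripe counter. */
static void model_stripe_cached(struct big_stripe_model *m, uint64_t sect)
{
	uint64_t key = model_tree_index(m, sect);

	assert(key < MODEL_SLOTS);
	m->cached[key]++;
}

/* A stripe finished write-out: drop its big_stripe counter. */
static void model_stripe_written_out(struct big_stripe_model *m, uint64_t sect)
{
	uint64_t key = model_tree_index(m, sect);

	assert(key < MODEL_SLOTS && m->cached[key] > 0);
	m->cached[key]--;
}

/* One lookup answers "must a chunk aligned read of this chunk abort?" */
static bool model_big_stripe_cached(const struct big_stripe_model *m, uint64_t sect)
{
	uint64_t key = model_tree_index(m, sect);

	return key < MODEL_SLOTS && m->cached[key] > 0;
}
```

For a 256kB chunk (512 sectors), sectors 512 through 1023 share key 1, so
caching any one stripe in that range makes the whole chunk ineligible for
the bypass until that stripe is written out.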
It is necessary to remember whether a stripe is counted in
big_stripe_tree. Instead of adding a new flag, we reuse existing flags:
STRIPE_R5C_PARTIAL_STRIPE and STRIPE_R5C_FULL_STRIPE. If either of these
two flags is set, the stripe is counted in big_stripe_tree. This
requires moving set_bit(STRIPE_R5C_PARTIAL_STRIPE) to
r5c_try_caching_write(); and moving clear_bit of
STRIPE_R5C_PARTIAL_STRIPE and STRIPE_R5C_FULL_STRIPE to
r5c_finish_stripe_write_out().
Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-11 21:39:14 +00:00
|
|
|
extern bool r5c_big_stripe_cached(struct r5conf *conf, sector_t sect);
|
2005-04-16 22:20:36 +00:00
|
|
|
#endif
|