Commit Graph

92 Commits

Author SHA1 Message Date
Song Liu
2d4f468753 md/r5cache: handle R5LOG_PAYLOAD_FLUSH in recovery
This patch adds handling of R5LOG_PAYLOAD_FLUSH in journal recovery.
Next patch will add logic that generate R5LOG_PAYLOAD_FLUSH on flush
finish.

When R5LOG_PAYLOAD_FLUSH is seen in recovery, pending data and parity
will be dropped from recovery. This will reduce the number of stripes
to replay, and thus accelerate the recovery process.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-16 16:55:57 -07:00
Artur Paszkiewicz
ff875738ed raid5: separate header for log functions
Move raid5-cache declarations from raid5.h to raid5-log.h, add inline
wrappers for functions which will be shared with ppl and use them in
raid5 core instead of direct calls to raid5-cache.

Remove unused parameter from r5c_cache_data(), move two duplicated
pr_debug() calls to r5l_init_log().

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-16 16:55:54 -07:00
Song Liu
effe6ee752 md/r5cache: improve recovery with read ahead page pool
In r5cache recovery, the journal device is scanned page by page.
Currently, we use sync_page_io() to read journal device. This is
not efficient when we have to recovery many stripes from the journal.

To improve the speed of recovery, this patch introduces a read ahead
page pool (ra_pool) to recovery_ctx. With ra_pool, multiple consecutive
pages are read in one IO. Then the recovery code read the journal from
ra_pool.

With ra_pool, r5l_recovery_ctx has become much bigger. Therefore,
r5l_recovery_log() is refactored so r5l_recovery_ctx is not using
stack space.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-16 16:55:53 -07:00
Shaohua Li
84890c03b6 md/raid5-cache: bump flush stripe batch size
Bump the flush stripe batch size to 2048. For my 12 disks raid
array, the stripes takes:
12 * 4k * 2048 = 96MB

This is still quite small. A hardware raid card generally has 1GB size,
which we suggest the raid5-cache has similar cache size.

The advantage of a big batch size is we can dispatch a lot of IO in the
same time, then we can do some scheduling to make better IO pattern.

Last patch prioritizes stripes, so we don't worry about a big flush
stripe batch will starve normal stripes.

Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-16 16:55:51 -07:00
Shaohua Li
e33fbb9cc7 md/raid5-cache: exclude reclaiming stripes in reclaim check
stripes which are being reclaimed are still accounted into cached
stripes. The reclaim takes time. r5c_do_reclaim isn't aware of the
stripes and does unnecessary stripe reclaim. In practice, I saw one
stripe is reclaimed one time. This will cause bad IO pattern. Fixing
this by excluding the reclaing stripes in the check.

Cc: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-02-13 09:20:05 -08:00
Shaohua Li
e8fd52eec2 md/raid5-cache: stripe reclaim only counts valid stripes
When log space is tight, we try to reclaim stripes from log head. There
are stripes which can't be reclaimed right now if some conditions are
met. We skip such stripes but accidentally count them, which might cause
no stripes are claimed. Fixing this by only counting valid stripes.

Cc: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-02-13 09:20:02 -08:00
Song Liu
39b99586b3 md/r5cache: improve journal device efficiency
It is important to be able to flush all stripes in raid5-cache.
Therefore, we need reserve some space on the journal device for
these flushes. If flush operation includes pending writes to the
stripe, we need to reserve (conf->raid_disk + 1) pages per stripe
for the flush out. This reduces the efficiency of journal space.
If we exclude these pending writes from flush operation, we only
need (conf->max_degraded + 1) pages per stripe.

With this patch, when log space is critical (R5C_LOG_CRITICAL=1),
pending writes will be excluded from stripe flush out. Therefore,
we can reduce reserved space for flush out and thus improve journal
device efficiency.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-02-13 09:17:52 -08:00
Song Liu
03b047f45c md/r5cache: enable chunk_aligned_read with write back cache
Chunk aligned read significantly reduces CPU usage of raid456.
However, it is not safe to fully bypass the write back cache.
This patch enables chunk aligned read with write back cache.

For chunk aligned read, we track stripes in write back cache at
a bigger granularity, "big_stripe". Each chunk may contain more
than one stripe (for example, a 256kB chunk contains 64 4kB-page,
so this chunk contain 64 stripes). For chunk_aligned_read, these
stripes are grouped into one big_stripe, so we only need one lookup
for the whole chunk.

For each big_stripe, struct big_stripe_info tracks how many stripes
of this big_stripe are in the write back cache. We count how many
stripes of this big_stripe are in the write back cache. These
counters are tracked in a radix tree (big_stripe_tree).
r5c_tree_index() is used to calculate keys for the radix tree.

chunk_aligned_read() calls r5c_big_stripe_cached() to look up
big_stripe of each chunk in the tree. If this big_stripe is in the
tree, chunk_aligned_read() aborts. This look up is protected by
rcu_read_lock().

It is necessary to remember whether a stripe is counted in
big_stripe_tree. Instead of adding new flag, we reuses existing flags:
STRIPE_R5C_PARTIAL_STRIPE and STRIPE_R5C_FULL_STRIPE. If either of these
two flags are set, the stripe is counted in big_stripe_tree. This
requires moving set_bit(STRIPE_R5C_PARTIAL_STRIPE) to
r5c_try_caching_write(); and moving clear_bit of
STRIPE_R5C_PARTIAL_STRIPE and STRIPE_R5C_FULL_STRIPE to
r5c_finish_stripe_write_out().

Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-02-13 09:17:51 -08:00
Song Liu
2e38a37f23 md/r5cache: disable write back for degraded array
write-back cache in degraded mode introduces corner cases to the array.
Although we try to cover all these corner cases, it is safer to just
disable write-back cache when the array is in degraded mode.

In this patch, we disable writeback cache for degraded mode:
1. On device failure, if the array enters degraded mode, raid5_error()
   will submit async job r5c_disable_writeback_async to disable
   writeback;
2. In r5c_journal_mode_store(), it is invalid to enable writeback in
   degraded mode;
3. In r5c_try_caching_write(), stripes with s->failed>0 will be handled
   in write-through mode.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-24 11:26:06 -08:00
Song Liu
a85dd7b8df md/r5cache: flush data only stripes in r5l_recovery_log()
For safer operation, all arrays start in write-through mode, which has been
better tested and is more mature. And actually the write-through/write-mode
isn't persistent after array restarted, so we always start array in
write-through mode. However, if recovery found data-only stripes before the
shutdown (from previous write-back mode), it is not safe to start the array in
write-through mode, as write-through mode can not handle stripes with data in
write-back cache. To solve this problem, we flush all data-only stripes in
r5l_recovery_log(). When r5l_recovery_log() returns, the array starts with
empty cache in write-through mode.

This logic is implemented in r5c_recovery_flush_data_only_stripes():

1. enable write back cache
2. flush all stripes
3. wake up conf->mddev->thread
4. wait for all stripes get flushed (reuse wait_for_quiescent)
5. disable write back cache

The wait in 4 will be waked up in release_inactive_stripe_list()
when conf->active_stripes reaches 0.

It is safe to wake up mddev->thread here because all the resource
required for the thread has been initialized.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-24 11:20:15 -08:00
Song Liu
86aa1397dd md/r5cache: read data into orig_page for prexor of cached data
With write back cache, we use orig_page to do prexor. This patch
makes sure we read data into orig_page for it.

Flag R5_OrigPageUPTDODATE is added to show whether orig_page
has the latest data from raid disk.

We introduce a helper function uptodate_for_rmw() to simplify
the a couple conditions in handle_stripe_dirtying().

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-24 11:20:14 -08:00
Shaohua Li
d46d29f072 md/raid5-cache: delete meaningless code
sector_t is unsigned long, it's never < 0

Reported-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-24 11:20:13 -08:00
Colin Ian King
99f17890f0 md/r5cache: fix spelling mistake on "recoverying"
Trivial fix to spelling mistake "recoverying" to "recovering" in
pr_dbg message.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-05 11:44:39 -08:00
Song Liu
d2250f105f md/r5cache: assign conf->log before r5l_load_log()
r5l_load_log() calls functions that requires a proper conf->log,
for example, r5c_is_writeback(). Therefore, we should set
conf->log before calling r5l_load_log(). If r5l_load_log() fails,
conf->log is set back to NULL.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-05 11:44:38 -08:00
Song Liu
3c66abbaaf md/r5cache: simplify handling of sh->log_start in recovery
We only need to update sh->log_start at the end of recovery,
which is r5c_recovery_rewrite_data_only_stripes(), so it is not
necessary to set it before that. In this patch, log_start is
removed from r5c_recovery_alloc_stripe().

After updating all sh->log_start, rewrite_data_only_stripes()
also updates log->next_checkpoints to the last sh->log_start.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-05 11:44:38 -08:00
JackieLiu
28ca833ecf md/raid5-cache: removes unnecessary write-through mode judgments
The write-through mode has been returned in front of the function,
do not need to do it again.

Signed-off-by: JackieLiu <liuyun01@kylinos.cn>
Reviewed-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-05 11:44:37 -08:00
Shaohua Li
20737738d3 Merge branch 'md-next' into md-linus 2016-12-13 12:40:15 -08:00
Linus Torvalds
36869cb93d Merge branch 'for-4.10/block' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
 "This is the main block pull request this series. Contrary to previous
  release, I've kept the core and driver changes in the same branch. We
  always ended up having dependencies between the two for obvious
  reasons, so makes more sense to keep them together. That said, I'll
  probably try and keep more topical branches going forward, especially
  for cycles that end up being as busy as this one.

  The major parts of this pull request is:

   - Improved support for O_DIRECT on block devices, with a small
     private implementation instead of using the pig that is
     fs/direct-io.c. From Christoph.

   - Request completion tracking in a scalable fashion. This is utilized
     by two components in this pull, the new hybrid polling and the
     writeback queue throttling code.

   - Improved support for polling with O_DIRECT, adding a hybrid mode
     that combines pure polling with an initial sleep. From me.

   - Support for automatic throttling of writeback queues on the block
     side. This uses feedback from the device completion latencies to
     scale the queue on the block side up or down. From me.

   - Support from SMR drives in the block layer and for SD. From Hannes
     and Shaun.

   - Multi-connection support for nbd. From Josef.

   - Cleanup of request and bio flags, so we have a clear split between
     which are bio (or rq) private, and which ones are shared. From
     Christoph.

   - A set of patches from Bart, that improve how we handle queue
     stopping and starting in blk-mq.

   - Support for WRITE_ZEROES from Chaitanya.

   - Lightnvm updates from Javier/Matias.

   - Supoort for FC for the nvme-over-fabrics code. From James Smart.

   - A bunch of fixes from a whole slew of people, too many to name
     here"

* 'for-4.10/block' of git://git.kernel.dk/linux-block: (182 commits)
  blk-stat: fix a few cases of missing batch flushing
  blk-flush: run the queue when inserting blk-mq flush
  elevator: make the rqhash helpers exported
  blk-mq: abstract out blk_mq_dispatch_rq_list() helper
  blk-mq: add blk_mq_start_stopped_hw_queue()
  block: improve handling of the magic discard payload
  blk-wbt: don't throttle discard or write zeroes
  nbd: use dev_err_ratelimited in io path
  nbd: reset the setup task for NBD_CLEAR_SOCK
  nvme-fabrics: Add FC LLDD loopback driver to test FC-NVME
  nvme-fabrics: Add target support for FC transport
  nvme-fabrics: Add host support for FC transport
  nvme-fabrics: Add FC transport LLDD api definitions
  nvme-fabrics: Add FC transport FC-NVME definitions
  nvme-fabrics: Add FC transport error codes to nvme.h
  Add type 0x28 NVME type code to scsi fc headers
  nvme-fabrics: patch target code in prep for FC transport support
  nvme-fabrics: set sqe.command_id in core not transports
  parser: add u64 number parser
  nvme-rdma: align to generic ib_event logging helper
  ...
2016-12-13 10:19:16 -08:00
Shaohua Li
2953079c69 md: separate flags for superblock changes
The mddev->flags are used for different purposes. There are a lot of
places we check/change the flags without masking unrelated flags, we
could check/change unrelated flags. These usage are most for superblock
write, so spearate superblock related flags. This should make the code
clearer and also fix real bugs.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-12-08 22:01:47 -08:00
Song Liu
3c6edc6608 md/r5cache: after recovery, increase journal seq by 10000
Currently, we increase journal entry seq by 10 after recovery.
However, this is not sufficient in the following case.

After crash the journal looks like

| seq+0 | +1 | +2 | +3 | +4 | +5 | +6 | +7 | ... | +11 | +12 |

If +1 is not valid, we dropped all entries from +1 to +12; and
write seq+10:

| seq+0 | +10 | +2 | +3 | +4 | +5 | +6 | +7 | ... | +11 | +12 |

However, if we write a big journal entry with seq+11, it will
connect with some stale journal entry:

| seq+0 | +10 |                     +11                 | +12 |

To reduce the risk of this issue, we increase seq by 10000 instead.

Shaohua: use 10000 instead of 1000. The risk should be very unlikely. The total
stripe cache size is less than 2k typically, and several stripes can fit into
one meta data block. So the total inflight meta data blocks would be quite
small, which means the the total sequence number used should be quite small.
The 10000 sequence number increase should be far more than safe.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-12-08 10:54:47 -08:00
Song Liu
5c88f403a5 md/raid5-cache: fix crc in rewrite_data_only_stripes()
r5l_recovery_create_empty_meta_block() creates crc for the empty
metablock. After the metablock is updated, we need clear the
checksum before recalculate it.

Shaohua: moved checksum calculation out of
r5l_recovery_create_empty_meta_block. We should calculate it after all fields
are updated.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-12-08 10:34:03 -08:00
JackieLiu
d30dfeb9be md/raid5-cache: no recovery is required when create super-block
When create the super-block information, We do not need to do this
recovery stage, only need to initialize some variables.

Signed-off-by: JackieLiu <liuyun01@kylinos.cn>
Reviewed-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-12-08 10:01:17 -08:00
Zhengyuan Liu
3d7e7e1d9d md/r5cache: do r5c_update_log_state after log recovery
We should update log state after we did a log recovery, current completion
may get wrong log state since log->log_start wasn't initalized until we
called r5l_recovery_log.

At log recovery stage, no lock needed as there is no race conditon.
next_checkpoint field will be initialized in r5l_recovery_log too.

Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-12-05 17:11:02 -08:00
JackieLiu
43b9674832 md/raid5-cache: adjust the write position of the empty block if no data blocks
When recovery is complete, we write an empty block and record his
position first, then make the data-only stripes rewritten done,
the location of the empty block as the last checkpoint position
to write into the super block. And we should update last_checkpoint
to this empty block position.

------------------------------------------------------------------
|  old log       | empty block | data only stripes | invalid log |
------------------------------------------------------------------
^                ^                                 ^
|                |- log->last_checkpoint           |- log->log_start
|                |- log->last_cp_seq               |- log->next_checkpoint
|- log->seq=n    |- log->seq=10+n

At the same time, if there is no data-only stripes, this scene may appear,
| meta1 | meta2 | meta3 |
meta 1 is valid, meta 2 is invalid. meta 3 could be valid. so we should
The solution is we create a new meta in meta2 with its seq == meta1's
seq + 10 and let superblock points to meta2.

Signed-off-by: JackieLiu <liuyun01@kylinos.cn>
Reviewed-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Reviewed-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-12-05 17:11:02 -08:00
Song Liu
f687a33ef0 md/r5cache: run_no_space_stripes() when R5C_LOG_CRITICAL == 0
With writeback cache, we define log space critical as

   free_space < 2 * reclaim_required_space

So the deassert of R5C_LOG_CRITICAL could happen when
  1. free_space increases
  2. reclaim_required_space decreases

Currently, run_no_space_stripes() is called when 1 happens, but
not (always) when 2 happens.

With this patch, run_no_space_stripes() is call when
R5C_LOG_CRITICAL is cleared.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-12-02 12:03:52 -08:00
JackieLiu
1a0ec5c30c md/raid5-cache: do not need to set STRIPE_PREREAD_ACTIVE repeatedly
R5c_make_stripe_write_out has set this flag, do not need to set again.

Signed-off-by: JackieLiu <liuyun01@kylinos.cn>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-29 14:46:23 -08:00
JackieLiu
dbd22c8d7f md/raid5-cache: remove the unnecessary next_cp_seq field from the r5l_log
The next_cp_seq field is useless, remove it.

Signed-off-by: JackieLiu <liuyun01@kylinos.cn>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-29 14:46:22 -08:00
JackieLiu
bc8f167f9c md/raid5-cache: release the stripe_head at the appropriate location
If we released the 'stripe_head' in r5c_recovery_flush_log,
ctx->cached_list will both release the data-parity stripes and
data-only stripes, which will become empty.
And we also need to use the data-only stripes in
r5c_recovery_rewrite_data_only_stripes, so we should wait util rewrite
data-only stripes is done before releasing them.

Reviewed-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Reviewed-by: Song Liu <songliubraving@fb.com>
Signed-off-by: JackieLiu <liuyun01@kylinos.cn>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-29 14:46:22 -08:00
JackieLiu
fc833c2a2f md/raid5-cache: use ring add to prevent overflow
'write_pos' must be protected with 'r5l_ring_add', or it may overflow

Signed-off-by: JackieLiu <liuyun01@kylinos.cn>
Reviewed-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-29 14:46:21 -08:00
JackieLiu
9b69173e5c md/raid5-cache: remove unnecessary function parameters
The function parameter 'recovery_list' is not used in
body, we can delete it

Signed-off-by: JackieLiu <liuyun01@kylinos.cn>
Reviewed-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-29 14:46:21 -08:00
Zhengyuan Liu
462eb7d872 raid5-cache: don't set STRIPE_R5C_PARTIAL_STRIPE flag while load stripe into cache
r5c_recovery_load_one_stripe should not set STRIPE_R5C_PARTIAL_STRIPE flag,as
the data-only stripe may be STRIPE_R5C_FULL_STRIPE stripe. The state machine
would release the stripe later and add it into neither r5c_cached_full_stripes
list or r5c_cached_partial_stripes list and set correct flag.

Reviewed-by: JackieLiu <liuyun01@kylinos.cn>
Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-29 14:45:14 -08:00
Zhengyuan Liu
f7b7bee75e raid5-cache: add another check conditon before replaying one stripe
New stripe that was just allocated has no STRIPE_R5C_CACHING state too,
add this check condition could avoid unnecessary replaying for empty stripe.

r5l_recovery_replay_one_stripe would reset stripe for any case, delete it
to make code more clean.

Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-29 11:56:20 -08:00
Dan Carpenter
d3014e21e1 md/r5cache: enable IRQs on error path
We need to re-enable the IRQs here before returning.

Fixes: a39f7afde3 ("md/r5cache: write-out phase and reclaim support")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-27 21:38:08 -08:00
Song Liu
d7bd398e97 md/r5cache: handle alloc_page failure
RMW of r5c write back cache uses an extra page to store old data for
prexor. handle_stripe_dirtying() allocates this page by calling
alloc_page(). However, alloc_page() may fail.

To handle alloc_page() failures, this patch adds an extra page to
disk_info. When alloc_page fails, handle_stripe() trys to use these
pages. When these pages are used by other stripe (R5C_EXTRA_PAGE_IN_USE),
the stripe is added to delayed_list.

Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-27 21:35:38 -08:00
Shaohua Li
ce1ccd079f raid5-cache: suspend reclaim thread instead of shutdown
There is mechanism to suspend a kernel thread. Use it instead of playing
create/destroy game.

Signed-off-by: Shaohua Li <shli@fb.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Cc: Song Liu <songliubraving@fb.com>
2016-11-23 19:30:06 -08:00
Ming Lei
3a83f46775 block: bio: pass bvec table to bio_init()
Some drivers often use external bvec table, so introduce
this helper for this case. It is always safe to access the
bio->bi_io_vec in this way for this case.

After converting to this usage, it will becomes a bit easier
to evaluate the remaining direct access to bio->bi_io_vec,
so it can help to prepare for the following multipage bvec
support.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>

Fixed up the new O_DIRECT cases.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-22 08:57:21 -07:00
Song Liu
3bddb7f8f2 md/r5cache: handle FLUSH and FUA
With raid5 cache, we committing data from journal device. When
there is flush request, we need to flush journal device's cache.
This was not needed in raid5 journal, because we will flush the
journal before committing data to raid disks.

This is similar to FUA, except that we also need flush journal for
FUA. Otherwise, corruptions in earlier meta data will stop recovery
from reaching FUA data.

slightly changed the code by Shaohua

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 17:13:49 -08:00
Song Liu
5aabf7c49d md/r5cache: r5cache recovery: part 2
1. In previous patch, we:
      - add new data to r5l_recovery_ctx
      - add new functions to recovery write-back cache
   The new functions are not used in this patch, so this patch does not
   change the behavior of recovery.

2. In this patchpatch, we:
      - modify main recovery procedure r5l_recovery_log() to call new
        functions
      - remove old functions

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 13:28:28 -08:00
Song Liu
b4c625c673 md/r5cache: r5cache recovery: part 1
Recovery of write-back cache has different logic to write-through only
cache. Specifically, for write-back cache, the recovery need to scan
through all active journal entries before flushing data out. Therefore,
large portion of the recovery logic is rewritten here.

To make the diffs cleaner, we split the rewrite as follows:

1. In this patch, we:
      - add new data to r5l_recovery_ctx
      - add new functions to recovery write-back cache
   The new functions are not used in this patch, so this patch does not
   change the behavior of recovery.

2. In next patch, we:
      - modify main recovery procedure r5l_recovery_log() to call new
        functions
      - remove old functions

With cache feature, there are 2 different scenarios of recovery:
1. Data-Parity stripe: a stripe with complete parity in journal.
2. Data-Only stripe: a stripe with only data in journal (or partial
   parity).

The code differentiate Data-Parity stripe from Data-Only stripe with
flag STRIPE_R5C_CACHING.

For Data-Parity stripes, we use the same procedure as raid5 journal,
where all the data and parity are replayed to the RAID devices.

For Data-Only strips, we need to finish complete calculate parity and
finish the full reconstruct write or RMW write. For simplicity, in
the recovery, we load the stripe to stripe cache. Once the array is
started, the stripe cache state machine will handle these stripes
through normal write path.

r5c_recovery_flush_log contains the main procedure of recovery. The
recovery code first scans through the journal and loads data to
stripe cache. The code keeps tracks of all these stripes in a list
(use sh->lru and ctx->cached_list), stripes in the list are
organized in the order of its first appearance on the journal.
During the scan, the recovery code assesses each stripe as
Data-Parity or Data-Only.

During scan, the array may run out of stripe cache. In these cases,
the recovery code will also call raid5_set_cache_size to increase
stripe cache size. If the array still runs out of stripe cache
because there isn't enough memory, the array will not assemble.

At the end of scan, the recovery code replays all Data-Parity
stripes, and sets proper states for Data-Only stripes. The recovery
code also increases seq number by 10 and rewrites all Data-Only
stripes to journal. This is to avoid confusion after repeated
crashes. More details is explained in raid5-cache.c before
r5c_recovery_rewrite_data_only_stripes().

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 13:28:14 -08:00
Song Liu
9ed988f5dc md/r5cache: refactoring journal recovery code
1. rename r5l_read_meta_block() as r5l_recovery_read_meta_block();
2. pull the code that initialize r5l_meta_block from
   r5l_log_write_empty_meta_block() to a separate function
   r5l_recovery_create_empty_meta_block(), so that we can reuse this
   piece of code.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 13:27:45 -08:00
Song Liu
2c7da14b90 md/r5cache: sysfs entry journal_mode
With write cache, journal_mode is the knob to switch between
write-back and write-through.

Below is an example:

root@virt-test:~/# cat /sys/block/md0/md/journal_mode
[write-through] write-back
root@virt-test:~/# echo write-back > /sys/block/md0/md/journal_mode
root@virt-test:~/# cat /sys/block/md0/md/journal_mode
write-through [write-back]

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 13:27:24 -08:00
Song Liu
a39f7afde3 md/r5cache: write-out phase and reclaim support
There are two limited resources, stripe cache and journal disk space.
For better performance, we priotize reclaim of full stripe writes.
To free up more journal space, we free earliest data on the journal.

In current implementation, reclaim happens when:
1. Periodically (every R5C_RECLAIM_WAKEUP_INTERVAL, 30 seconds) reclaim
   if there is no reclaim in the past 5 seconds.
2. when there are R5C_FULL_STRIPE_FLUSH_BATCH (256) cached full stripes,
   or cached stripes is enough for a full stripe (chunk size / 4k)
   (r5c_check_cached_full_stripe)
3. when there is pressure on stripe cache (r5c_check_stripe_cache_usage)
4. when there is pressure on journal space (r5l_write_stripe, r5c_cache_data)

r5c_do_reclaim() contains new logic of reclaim.

For stripe cache:

When stripe cache pressure is high (more than 3/4 stripes are cached,
or there is empty inactive lists), flush all full stripe. If fewer
than R5C_RECLAIM_STRIPE_GROUP (NR_STRIPE_HASH_LOCKS * 2) full stripes
are flushed, flush some paritial stripes. When stripe cache pressure
is moderate (1/2 to 3/4 of stripes are cached), flush all full stripes.

For log space:

To avoid deadlock due to log space, we need to reserve enough space
to flush cached data. The size of required log space depends on total
number of cached stripes (stripe_in_journal_count). In current
implementation, the writing-out phase automatically include pending
data writes with parity writes (similar to write through case).
Therefore, we need up to (conf->raid_disks + 1) pages for each cached
stripe (1 page for meta data, raid_disks pages for all data and
parity). r5c_log_required_to_flush_cache() calculates log space
required to flush cache. In the following, we refer to the space
calculated by r5c_log_required_to_flush_cache() as
reclaim_required_space.

Two flags are added to r5conf->cache_state: R5C_LOG_TIGHT and
R5C_LOG_CRITICAL. R5C_LOG_TIGHT is set when free space on the log
device is less than 3x of reclaim_required_space. R5C_LOG_CRITICAL
is set when free space on the log device is less than 2x of
reclaim_required_space.

r5c_cache keeps all data in cache (not fully committed to RAID) in
a list (stripe_in_journal_list). These stripes are in the order of their
first appearance on the journal. So the log tail (last_checkpoint)
should point to the journal_start of the first item in the list.

When R5C_LOG_TIGHT is set, r5l_reclaim_thread starts flushing out
stripes at the head of stripe_in_journal. When R5C_LOG_CRITICAL is
set, the state machine only writes data that are already in the
log device (in stripe_in_journal_list).

This patch includes a fix to improve performance by
Shaohua Li <shli@fb.com>.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 13:26:48 -08:00
Song Liu
1e6d690b93 md/r5cache: caching phase of r5cache
As described in previous patch, write back cache operates in two
phases: caching and writing-out. The caching phase works as:
1. write data to journal
   (r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
   (r5c_handle_data_cached, r5c_return_dev_pending_writes).

Then the writing-out phase is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calcualte parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks

This patch implements caching phase. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.

Writing-out phase of the cache is implemented in the next patch.

With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multipile IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.

This patch adds 2 flags to stripe_head.state:
 - STRIPE_R5C_PARTIAL_STRIPE,
 - STRIPE_R5C_FULL_STRIPE,

Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".

For RMW, the code allocates an extra page for each data block
being updated.  This is stored in r5dev->orig_page and the old data
is read into it.  Then the prexor calculation subtracts ->orig_page
from the parity block, and the reconstruct calculation adds the
->page data back into the parity block.

r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.

There are some known limitations of the cache implementation:

1. Write cache only covers full page writes (R5_OVERWRITE). Writes
   of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at anytime. Later
   writes for the same stripe have to wait. This can be improved by
   moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
   is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
   the log, so recovery code has to replay more than necessary data
   (sometimes all the log from last_checkpoint). This reduces
   availability of the array.

This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 13:26:30 -08:00
Song Liu
2ded370373 md/r5cache: State machine for raid5-cache write back mode
This patch adds state machine for raid5-cache. With log device, the
raid456 array could operate in two different modes (r5c_journal_mode):
  - write-back (R5C_MODE_WRITE_BACK)
  - write-through (R5C_MODE_WRITE_THROUGH)

Existing code of raid5-cache only has write-through mode. For write-back
cache, it is necessary to extend the state machine.

With write-back cache, every stripe could operate in two different
phases:
  - caching
  - writing-out

In caching phase, the stripe handles writes as:
  - write to journal
  - return IO

In writing-out phase, the stripe behaviors as a stripe in write through
mode R5C_MODE_WRITE_THROUGH.

STRIPE_R5C_CACHING is added to sh->state to differentiate caching and
writing-out phase.

Please note: this is a "no-op" patch for raid5-cache write-through
mode.

The following detailed explanation is copied from the raid5-cache.c:

/*
 * raid5 cache state machine
 *
 * With rhe RAID cache, each stripe works in two phases:
 *      - caching phase
 *      - writing-out phase
 *
 * These two phases are controlled by bit STRIPE_R5C_CACHING:
 *   if STRIPE_R5C_CACHING == 0, the stripe is in writing-out phase
 *   if STRIPE_R5C_CACHING == 1, the stripe is in caching phase
 *
 * When there is no journal, or the journal is in write-through mode,
 * the stripe is always in writing-out phase.
 *
 * For write-back journal, the stripe is sent to caching phase on write
 * (r5c_handle_stripe_dirtying). r5c_make_stripe_write_out() kicks off
 * the write-out phase by clearing STRIPE_R5C_CACHING.
 *
 * Stripes in caching phase do not write the raid disks. Instead, all
 * writes are committed from the log device. Therefore, a stripe in
 * caching phase handles writes as:
 *      - write to log device
 *      - return IO
 *
 * Stripes in writing-out phase handle writes as:
 *      - calculate parity
 *      - write pending data and parity to journal
 *      - write data and parity to raid disks
 *      - return IO for pending writes
 */

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 13:26:07 -08:00
Song Liu
c757ec95c2 md/r5cache: Check array size in r5l_init_log
Currently, r5l_write_stripe checks meta size for each stripe write,
which is not necessary.

With this patch, r5l_init_log checks maximal meta size of the array,
which is (r5l_meta_block + raid_disks x r5l_payload_data_parity).
If this is too big to fit in one page, r5l_init_log aborts.

With current meta data, r5l_log support raid_disks up to 203.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 13:24:46 -08:00
Shaohua Li
354b445b5f raid5-cache: fix lockdep warning
lockdep reports warning of the rcu_dereference usage. Using normal rdev
access pattern to avoid the warning.

Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-17 11:30:27 -08:00
JackieLiu
3fd880af41 raid5-cache: restrict the use area of the log_offset variable
We can calculate this offset by using ctx->meta_total_blocks,
without passing in from the function

Signed-off-by: JackieLiu <liuyun01@kylinos.cn>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07 15:08:22 -08:00
Christoph Hellwig
70fd76140a block,fs: use REQ_* flags directly
Remove the WRITE_* and READ_SYNC wrappers, and just use the flags
directly.  Where applicable this also drops usage of the
bio_set_op_attrs wrapper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-01 09:43:26 -06:00
Shaohua Li
9a8b27fac5 raid5-cache: correct condition for empty metadata write
As long as we recover one metadata block, we should write the empty metadata
write. The original code could make recovery corrupted if only one meta is
valid.

Reported-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-10-28 22:04:03 -07:00
Zhengyuan Liu
56056c2e7d md/raid5: write an empty meta-block when creating log super-block
If superblock points to an invalid meta block, r5l_load_log will set
create_super with true and create an new superblock, this runtime path
would always happen if we do no writing I/O to this array since it was
created. Writing an empty meta block could avoid this unnecessary
action at the first time we created log superblock.

Another reason is for the corretness of log recovery. Currently we have
bellow code to guarantee log revocery to be correct.

        if (ctx.seq > log->last_cp_seq + 1) {
                int ret;

                ret = r5l_log_write_empty_meta_block(log, ctx.pos, ctx.seq + 10);
                if (ret)
                        return ret;
                log->seq = ctx.seq + 11;
                log->log_start = r5l_ring_add(log, ctx.pos, BLOCK_SECTORS);
                r5l_write_super(log, ctx.pos);
        } else {
                log->log_start = ctx.pos;
                log->seq = ctx.seq;
        }

If we just created a array with a journal device, log->log_start and
log->last_checkpoint should all be 0, then we write three meta block
which are valid except mid one and supposed crash happened. The ctx.seq
would equal to log->last_cp_seq + 1 and log->log_start would be set to
position of mid invalid meta block after we did a recovery, this will
lead to problems which could be avoided with this patch.

Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-10-24 15:28:18 -07:00