We need to return -EINTR after a split because we invalidated iterators
(and freed the btree node) - but if we were finished inserting, we don't
want to redo the traversal.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
When deciding what order to reuse buckets we take into account both the bucket's
priority (which indicates lru order) and also the amount of live data in that
bucket. The way they were scaled together wasn't as correct as it could be...
this patch improves and documents it.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Checks if two keys have equivalent header fields.
(good enough for replacement or merging)
Used in bch_bkey_try_merge, and replacing a key
in the btree.
Signed-off-by: Nicholas Swenson <nks@daterainc.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Added generic header checks to bch_bkey_try_merge,
which then calls the bkey specific function
Removed extraneous checks from bch_extent_merge
Signed-off-by: Nicholas Swenson <nks@daterainc.com>
We're in the process of turning bset.c into library code, so none of the code in
that file should know about struct cache_set or struct btree - so, move the
btree traversal part of the stats code to sysfs.c.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
More disentangling bset.c from the rest of the bcache code - soon, the
sorting routines won't have any dependencies on any outside structs.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Only use extent comparison for comparing extents, so we're not using
START_KEY() on other key types (i.e. btree pointers)
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Getting away from KEY_PTRS and moving toward KEY_U64s - and getting rid of magic
2s
Also - split out the part that checks against journal entry size so as to avoid
a dependancy on struct cache_set in bset.c
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Now that we've got code for raid5/6 stripe awareness, bcache just needs
to know about the stripes and when writing partial stripes is expensive
- we probably don't want to enable this optimization for raid1 or 10,
even though they have stripes. So add a flag to queue_limits.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
This error path shouldn't have been hit in practice.. and we've got reworked
reserve code coming soon so that it shouldn't _ever_ be bit... but if we've got
code for this error path it should be correct.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
We need a reserve for allocating buckets for new btree nodes - and now that
we've got multiple btrees, it really needs to be per btree.
This reworks the reserves so we've got separate freelists for each reserve
instead of watermarks, which seems to make things a bit cleaner, and it adds
some code so that btree_split() can make sure the reserve is available before it
starts.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Unnecessary since a bucket that has dirty pointers pointing to it can
never be invalidated - and skipping it is a measurable performance
boost, since the bucket gen will usually be a cache miss.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
We were unnecessarily waiting on a journal write to complete when we just needed
to start a journal write and start setting up the next one.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
The real fix is where we check the bytes we need against how much is
remaining - we also need to check for a journal entry bigger than our
buffer, we'll never write those and it would be bad if we tried to read
one.
Also improve the diagnostic messages.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
The code that handles overlapping extents that we've just read back in from disk
was depending on the behaviour of the code that handles overlapping extents as
we're inserting into a btree node in the case of an insert that forced an
existing extent to be split: on insert, if we had to split we'd also insert a
new extent to represent the top part of the old extent - and then that new
extent would get written out.
The code that read the extents back in thus not bother with splitting extents -
if it saw an extent that ovelapped in the middle of an older extent, it would
trim the old extent to only represent the bottom part, assuming that the
original insert would've inserted a new extent to represent the top part.
I still haven't figured out _how_ it can happen, but I'm now pretty convinced
(and testing has confirmed) that there's some kind of an obscure corner case
(probably involving extent merging, and multiple overwrites in different sets)
that breaks this. The fix is to change the mergesort fixup code to split extents
itself when required.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJSwLfoAAoJEHm+PkMAQRiGi6QH/1U1B7lmHChDTw3jj1lfm9gA
189Si4QJlnxFWCKHvKEL+pcaVuACU+aMGI8+KyMYK4/JfuWVjjj5fr/SvyHH2/8m
LdSK8aHMhJ46uBS4WJ/l6v46qQa5e2vn8RKSBAyKm/h4vpt+hd6zJdoFrFai4th7
k/TAwOAEHI5uzexUChwLlUBRTvbq4U8QUvDu+DeifC8cT63CGaaJ4qVzjOZrx1an
eP6UXZrKDASZs7RU950i7xnFVDQu4PsjlZi25udsbeiKcZJgPqGgXz5ULf8ZH8RQ
YCi1JOnTJRGGjyIOyLj7pyB01h7XiSM2+eMQ0S7g54F2s7gCJ58c2UwQX45vRWU=
=/4/R
-----END PGP SIGNATURE-----
Merge tag 'v3.13-rc6' into for-3.14/core
Needed to bring blk-mq uptodate, since changes have been going in
since for-3.14/core was established.
Fixup merge issues related to the immutable biovec changes.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Conflicts:
block/blk-flush.c
fs/btrfs/check-integrity.c
fs/btrfs/extent_io.c
fs/btrfs/scrub.c
fs/logfs/dev_bdev.c
The old writeback PD controller could get into states where it had throttled all
the way down and take way too long to recover - it was too complicated to really
understand what it was doing.
This rewrites a good chunk of it to hopefully be simpler and make more sense,
and it also pays more attention to units which should make the behaviour a bit
easier to understand.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
There is a possibility for a bucket to be invalidated by the allocator
while moving_gc was copying it's contents to another bucket, if the
bucket only held cached data. To prevent this moving checks for
a stale ptr (to an invalidated bucket), before and after reads.
It it finds one, it simply ignores moving that data. This only
affects bcache if the moving_gc was turned on, note that it's
off by default.
Signed-off-by: Nicholas Swenson <nks@daterainc.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Garbage collector needs to check keys in the writeback keybuf to
make sure it's not invalidating buckets to which the writeback
keys point to.
Signed-off-by: Nicholas Swenson <nks@daterainc.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Removed gc_move_threshold because picking buckets only by
threshold could lead moving extra buckets (ei. if there are
buckets at the threshold that aren't supposed to be moved
do to space considerations).
This is replaced by a GC_MOVE bit in the gc_mark bitmask.
Now only marked buckets get moved.
Signed-off-by: Nicholas Swenson <nks@daterainc.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Dirty data accounting wasn't quite right - firstly, we were adding the key we're
inserting after it could have merged with another dirty key already in the
btree, and secondly we could sometimes pass the wrong offset to
bcache_dev_sectors_dirty_add() for dirty data we were overwriting - which is
important when tracking dirty data by stripe.
NOTE FOR BACKPORTERS: For 3.10 (and 3.11?) there's other accounting fixes
necessary that got squashed in with other patches; the full patch against 3.10
is 408cc2f47eeac93a, available at:
git://evilpiepirate.org/~kent/linux-bcache.git bcache-3.10-writeback-fixes
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 2a46036..4a12b2f 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1817,7 +1817,8 @@ static bool fix_overlapping_extents(struct btree *b, struct bkey *insert,
if (KEY_START(k) > KEY_START(insert) + sectors_found)
goto check_failed;
- if (KEY_PTRS(replace_key) != KEY_PTRS(k))
+ if (KEY_PTRS(k) != KEY_PTRS(replace_key) ||
+ KEY_DIRTY(k) != KEY_DIRTY(replace_key))
goto check_failed;
/* skip past gen */
at the beginning (schedule_timout_interuptible) and others
do his on their own
This prevents wrong load average calculation (load of 1 per thread)
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Fixes the following sparse warning:
drivers/md/bcache/btree.c:2220:5: warning:
symbol 'btree_insert_fn' was not declared. Should it be static?
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
The new bio_split() can split arbitrary bios - it's not restricted to
single page bios, like the old bio_split() (previously renamed to
bio_pair_split()). It also has different semantics - it doesn't allocate
a struct bio_pair, leaving it up to the caller to handle completions.
Then convert the existing bio_pair_split() users to the new bio_split()
- and also nvme, which was open coding bio splitting.
(We have to take that BUG_ON() out of bio_integrity_trim() because this
bio_split() needs to use it, and there's no reason it has to be used on
bios marked as cloned; BIO_CLONED doesn't seem to have clearly
documented semantics anyways.)
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Neil Brown <neilb@suse.de>
This adds a generic mechanism for chaining bio completions. This is
going to be used for a bio_split() replacement, and it turns out to be
very useful in a fair amount of driver code - a fair number of drivers
were implementing this in their own roundabout ways, often painfully.
Note that this means it's no longer to call bio_endio() more than once
on the same bio! This can cause problems for drivers that save/restore
bi_end_io. Arguably they shouldn't be saving/restoring bi_end_io at all
- in all but the simplest cases they'd be better off just cloning the
bio, and immutable biovecs is making bio cloning cheaper. But for now,
we add a bio_endio_nodec() for these cases.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Now that drivers have been converted to the new bvec_iter primitives,
there's no need to trim the bvec before we submit it; and we can't trim
it once we start sharing bvecs.
It used to be that passing a partially completed bio (i.e. one with
nonzero bi_idx) to generic_make_request() was a dangerous thing -
various drivers would choke on such things. But with immutable biovecs
and our new bio splitting that shares the biovecs, submitting partially
completed bios has to work (and should work, now that all the drivers
have been completed to the new primitives)
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
bio_clone() just got more expensive - however, most users of bio_clone()
don't actually need to modify the biovec. If they aren't modifying the
biovec, and they can guarantee that the original bio isn't freed before
the clone (also true in most cases), we can just point the clone at the
original bio's biovec.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>