Commit Graph

541 Commits

Author SHA1 Message Date
Linus Torvalds
3419b45039 Merge branch 'for-4.4/io-poll' of git://git.kernel.dk/linux-block
Pull block IO poll support from Jens Axboe:
 "Various groups have been doing experimentation around IO polling for
  (really) fast devices.  The code has been reviewed and has been
  sitting on the side for a few releases, but this is now good enough
  for coordinated benchmarking and further experimentation.

  Currently O_DIRECT sync read/write are supported.  A framework is in
  the works that allows scalable stats tracking so we can auto-tune
  this.  And we'll add libaio support as well soon.  Fow now, it's an
  opt-in feature for test purposes"

* 'for-4.4/io-poll' of git://git.kernel.dk/linux-block:
  direct-io: be sure to assign dio->bio_bdev for both paths
  directio: add block polling support
  NVMe: add blk polling support
  block: add block polling support
  blk-mq: return tag/queue combo in the make_request_fn handlers
  block: change ->make_request_fn() and users to return a queue cookie
2015-11-10 17:23:49 -08:00
Linus Torvalds
75021d2859 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial updates from Jiri Kosina:
 "Trivial stuff from trivial tree that can be trivially summed up as:

   - treewide drop of spurious unlikely() before IS_ERR() from Viresh
     Kumar

   - cosmetic fixes (that don't really affect basic functionality of the
     driver) for pktcdvd and bcache, from Julia Lawall and Petr Mladek

   - various comment / printk fixes and updates all over the place"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
  bcache: Really show state of work pending bit
  hwmon: applesmc: fix comment typos
  Kconfig: remove comment about scsi_wait_scan module
  class_find_device: fix reference to argument "match"
  debugfs: document that debugfs_remove*() accepts NULL and error values
  net: Drop unlikely before IS_ERR(_OR_NULL)
  mm: Drop unlikely before IS_ERR(_OR_NULL)
  fs: Drop unlikely before IS_ERR(_OR_NULL)
  drivers: net: Drop unlikely before IS_ERR(_OR_NULL)
  drivers: misc: Drop unlikely before IS_ERR(_OR_NULL)
  UBI: Update comments to reflect UBI_METAONLY flag
  pktcdvd: drop null test before destroy functions
2015-11-07 13:05:44 -08:00
Jens Axboe
dece16353e block: change ->make_request_fn() and users to return a queue cookie
No functional changes in this patch, but it prepares us for returning
a more useful cookie related to the IO that was queued up.

Signed-off-by: Jens Axboe <axboe@fb.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
2015-11-07 10:40:46 -07:00
Petr Mladek
8d090f4731 bcache: Really show state of work pending bit
WORK_STRUCT_PENDING is a mask for testing the pending bit.
test_bit() expects the number of the bit and we need to
use WORK_STRUCT_PENDING_BIT there.

Also work_data_bits() is defined in workqueues.h now.

I have noticed this just by chance when looking how
WORK_STRUCT_PENDING_BIT is used. The change is compile
tested.

Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2015-11-06 15:06:05 +01:00
Kent Overstreet
749b61dab3 bcache: remove driver private bio splitting code
The bcache driver has always accepted arbitrarily large bios and split
them internally.  Now that every driver must accept arbitrarily large
bios this code isn't nessecary anymore.

Cc: linux-bcache@vger.kernel.org
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-13 12:31:40 -06:00
Christoph Hellwig
4246a0b63b block: add a bi_error field to struct bio
Currently we have two different ways to signal an I/O error on a BIO:

 (1) by clearing the BIO_UPTODATE flag
 (2) by returning a Linux errno value to the bi_end_io callback

The first one has the drawback of only communicating a single possible
error (-EIO), and the second one has the drawback of not beeing persistent
when bios are queued up, and are not passed along from child to parent
bio in the ever more popular chaining scenario.  Having both mechanisms
available has the additional drawback of utterly confusing driver authors
and introducing bugs where various I/O submitters only deal with one of
them, and the others have to add boilerplate code to deal with both kinds
of error returns.

So add a new bi_error field to store an errno value directly in struct
bio and remove the existing mechanisms to clean all this up.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-29 08:55:15 -06:00
Jens Axboe
2bb4cd5cc4 block: have drivers use blk_queue_max_discard_sectors()
Some drivers use it now, others just set the limits field manually.
But in preparation for splitting this into a hard and soft limit,
ensure that they all call the proper function for setting the hw
limit for discards.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-17 08:41:53 -06:00
Jens Axboe
77b5a08427 bcache: don't embed 'return' statements in closure macros
This is horribly confusing, it breaks the flow of the code without
it being apparent in the caller.

Signed-off-by: Jens Axboe <axboe@fb.com>
Acked-by: Christoph Hellwig <hch@lst.de>
2015-07-11 09:57:32 -06:00
Joe Perches
d1aa1ab33d MAINTAINERS: BCACHE: Kent Overstreet has changed email address
Kent's email address in MAINTAINERS seems to be invalid.
This was his last sign-off address, so use that if appropriate.

Fix the S: status entry while there.

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-30 19:45:01 -07:00
Pekka Enberg
958b43384e bcache: use kvfree() in various places
Use kvfree() instead of open-coding it.

Signed-off-by: Pekka Enberg <penberg@kernel.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-30 19:45:00 -07:00
Tejun Heo
66114cad64 writeback: separate out include/linux/backing-dev-defs.h
With the planned cgroup writeback support, backing-dev related
declarations will be more widely used across block and cgroup;
unfortunately, including backing-dev.h from include/linux/blkdev.h
makes cyclic include dependency quite likely.

This patch separates out backing-dev-defs.h which only has the
essential definitions and updates blkdev.h to include it.  c files
which need access to more backing-dev details now include
backing-dev.h directly.  This takes backing-dev.h off the common
include dependency chain making it a lot easier to use it across block
and cgroup.

v2: fs/fat build failure fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02 08:33:34 -06:00
Mike Snitzer
326e1dbb57 block: remove management of bi_remaining when restoring original bi_end_io
Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for
non-chains") regressed all existing callers that followed this pattern:
 1) saving a bio's original bi_end_io
 2) wiring up an intermediate bi_end_io
 3) restoring the original bi_end_io from intermediate bi_end_io
 4) calling bio_endio() to execute the restored original bi_end_io

The regression was due to BIO_CHAIN only ever getting set if
bio_inc_remaining() is called.  For the above pattern it isn't set until
step 3 above (step 2 would've needed to establish BIO_CHAIN).  As such
the first bio_endio(), in step 2 above, never decremented __bi_remaining
before calling the intermediate bi_end_io -- leaving __bi_remaining with
the value 1 instead of 0.  When bio_inc_remaining() occurred during step
3 it brought it to a value of 2.  When the second bio_endio() was
called, in step 4 above, it should've called the original bi_end_io but
it didn't because there was an extra reference that wasn't dropped (due
to atomic operations being optimized away since BIO_CHAIN wasn't set
upfront).

Fix this issue by removing the __bi_remaining management complexity for
all callers that use the above pattern -- bio_chain() is the only
interface that _needs_ to be concerned with __bi_remaining.  For the
above pattern callers just expect the bi_end_io they set to get called!
Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls
that aren't associated with the bio_chain() interface.

Also, the bio_inc_remaining() interface has been moved local to bio.c.

Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-22 08:58:55 -06:00
Jens Axboe
dac56212e8 bio: skip atomic inc/dec of ->bi_cnt for most use cases
Struct bio has a reference count that controls when it can be freed.
Most uses cases is allocating the bio, which then returns with a
single reference to it, doing IO, and then dropping that single
reference. We can remove this atomic_dec_and_test() in the completion
path, if nobody else is holding a reference to the bio.

If someone does call bio_get() on the bio, then we flag the bio as
now having valid count and that we must properly honor the reference
count when it's being put.

Tested-by: Robert Elliott <elliott@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-05 13:32:49 -06:00
Gu Zheng
aae4933da9 md/bcache: use generic io stats accounting functions to simplify io stat accounting
Use generic io stats accounting help functions (generic_{start,end}_io_acct)
to simplify io stat accounting.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Acked-by: Kent Overstreet <kmo@datera.io>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-24 08:05:12 -07:00
Mike Snitzer
b277da0a8a block: disable entropy contributions for nonrot devices
Clear QUEUE_FLAG_ADD_RANDOM in all block drivers that set
QUEUE_FLAG_NONROT.

Historically, all block devices have automatically made entropy
contributions.  But as previously stated in commit e2e1a148 ("block: add
sysfs knob for turning off disk entropy contributions"):
    - On SSD disks, the completion times aren't as random as they
      are for rotational drives. So it's questionable whether they
      should contribute to the random pool in the first place.
    - Calling add_disk_randomness() has a lot of overhead.

There are more reliable sources for randomness than non-rotational block
devices.  From a security perspective it is better to err on the side of
caution than to allow entropy contributions from unreliable "random"
sources.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-10-04 10:55:32 -06:00
Kent Overstreet
0781c8748c bcache: Drop unneeded blk_sync_queue() calls
this is needed for the queue/block device we created (it's done by
blk_cleanup_queue() which we do call) - but calling it for the block devices we
only opened is pointless.

Change-Id: I53dfded14ed15b9581d10ca8399d5e1b3abbf9f2
2014-08-04 15:23:04 -07:00
Jianjian Huo
789d21dbd9 bcache: add mutex lock for bch_is_open
Since bch_is_open will iterate linked list bch_cache_sets and
uncached_devices, it needs bch_register_lock.

Signed-off-by: Jianjian Huo <samuel.huo@gmail.com>
2014-08-04 15:23:04 -07:00
Surbhi Palande
5b25abade2 bcache: Correct printing of btree_gc_max_duration_ms
time_stats::btree_gc_max_duration_mc is not bit shifted by 8

Fixes BUG #138

Change-Id: I44fc6e1d0579674016acc533f1a546b080e5371a
Signed-off-by: Surbhi Palande <sap@daterainc.com>
2014-08-04 15:23:04 -07:00
Slava Pestov
2452cc8906 bcache: try to set b->parent properly
bcache_flash_dev.ktest would reliably crash with 8k and 16k bucket size
before; now it passes.

Change-Id: Ib542232235e39298c3a7548fe52b645cabb823d1
2014-08-04 15:23:04 -07:00
Slava Pestov
c9a78332b4 bcache: fix memory corruption in init error path
If register_cache_set() failed, we would touch ca->set after
it had already been freed. Also, fix an assertion to catch
this.

Change-Id: I748e5f5b223e2d9b2602075dec2f997cced2394d
2014-08-04 15:23:04 -07:00
Slava Pestov
bf0c55c986 bcache: fix crash with incomplete cache set
Change-Id: I6abde52afe917633480caaf4e2518f42a816d886
2014-08-04 15:23:04 -07:00
Kent Overstreet
d83353b319 bcache: Fix more early shutdown bugs
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:04 -07:00
Slava Pestov
400ffaa2ac bcache: fix use-after-free in btree_gc_coalesce()
If we goto out_nocoalesce after we free new_nodes[0], we end up freeing
new_nodes[0] again. This was generating a lockdep warning. The fix is
to set new_nodes[0] to NULL, since the out_nocoalesce path safely
ignores NULL entries in the new_nodes array.

This regression was introduced in 2d7f9531.

Change-Id: I76564d7257800583214376b4bacf236cda90c89c
2014-08-04 15:23:04 -07:00
Kent Overstreet
6b708de64a bcache: Fix an infinite loop in journal replay
When running with multiple cache devices, if one of the devices has a completely
empty journal but we'd already found some journal entries on a previosu device
we'd go into an infinite loop.

Change-Id: I1dcdc0d738192746de28f40e8b08825b0dea5e2b
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:03 -07:00
Slava Pestov
913dc33fb2 bcache: fix crash in bcache_btree_node_alloc_fail tracepoint
'b' was NULL.

Change-Id: Icac0fd04afa2d23f213d96d51afd53374e6dd0c0
2014-08-04 15:23:03 -07:00
Slava Pestov
60ae81eee8 bcache: bcache_write tracepoint was crashing
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:03 -07:00
Slava Pestov
8e09480806 bcache: fix typo in bch_bkey_equal_header
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:03 -07:00
Kent Overstreet
501d52a90c bcache: Allocate bounce buffers with GFP_NOWAIT
There's no point in blocking on these allocations, since our fallback paths will
probably go faster than blocking.

Change-Id: I733ca202c25cb36bde02607a0a60552229a4241c
2014-08-04 15:23:03 -07:00
Kent Overstreet
bcf090e004 bcache: Make sure to pass GFP_WAIT to mempool_alloc()
this was very wrong - mempool_alloc() only guarantees success with GFP_WAIT.
bcache uses GFP_NOWAIT in various other places where we have a fallback,
circuits must've gotten crossed when writing this code or something.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:03 -07:00
Slava Pestov
9e5c353510 bcache: fix uninterruptible sleep in writeback thread
There were two issues here:

- writeback thread did not start until the device first became dirty
- writeback thread used uninterruptible sleep once running

Without this patch I see kernel warnings printed and a load average of
1.52 after booting my test VM. With this patch the warnings are gone and
the load average is near 0.00 as expected.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:03 -07:00
Slava Pestov
c5aa4a3157 bcache: wait for buckets when allocating new btree root
Tested:
- sometimes bcache_tier test would hang on startup with a failure
  to allocate the btree root -- no longer seeing this

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:03 -07:00
Slava Pestov
a664d0f05a bcache: fix crash on shutdown in passthrough mode
We never started the writeback thread in this case, so don't stop it.
2014-08-04 15:23:03 -07:00
Slava Pestov
e5112201c1 bcache: fix lockdep warnings on shutdown 2014-08-04 15:23:03 -07:00
Slava Pestov
8b326d3a2a bcache allocator: send discards with correct size 2014-08-04 15:23:03 -07:00
Surbhi Palande
dbd810ab67 bcache: Fix to remove the rcu_sched stalls.
while loop was executing infinitely.
This fix ends the while loop gracefully.

Signed-off-by: Surbhi Palande <sap@daterainc.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:02 -07:00
Kent Overstreet
9aa61a992a bcache: Fix a journal replay bug
journal replay wansn't validating pointers with bch_extent_invalid() before
derefing, fixed

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:02 -07:00
Kent Overstreet
5b1016e62f bcache: Fix a bug when detaching
After detaching a backing device from a cache set, a bit wasn't getting
reset meaning the second detach wouldn't work correctly.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:02 -07:00
Peter Zijlstra
4e857c58ef arch: Mass conversion of smp_mb__*()
Mostly scripted conversion of the smp_mb__* barriers.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-arch@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-18 14:20:48 +02:00
John Sheu
cb85114956 bcache: remove nested function usage
Uninlined nested functions can cause crashes when using ftrace, as they don't
follow the normal calling convention and confuse the ftrace function graph
tracer as it examines the stack.

Also, nested functions are supported as a gcc extension, but may fail on other
compilers (e.g. llvm).

Signed-off-by: John Sheu <john.sheu@gmail.com>
2014-03-18 12:39:28 -07:00
Kent Overstreet
3a2fd9d509 bcache: Kill bucket->gc_gen
gc_gen was a temporary used to recalculate last_gc, but since we only need
bucket->last_gc when gc isn't running (gc_mark_valid = 1), we can just update
last_gc directly.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:24:54 -07:00
Kent Overstreet
2531d9ee61 bcache: Kill unused freelist
This was originally added as at optimization that for various reasons isn't
needed anymore, but it does add a lot of nasty corner cases (and it was
responsible for some recently fixed bugs). Just get rid of it now.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:23:36 -07:00
Kent Overstreet
0a63b66db5 bcache: Rework btree cache reserve handling
This changes the bucket allocation reserves to use _real_ reserves - separate
freelists - instead of watermarks, which if nothing else makes the current code
saner to reason about and is going to be important in the future when we add
support for multiple btrees.

It also adds btree_check_reserve(), which checks (and locks) the reserves for
both bucket allocation and memory allocation for btree nodes; the old code just
kinda sorta assumed that since (e.g. for btree node splits) it had the root
locked and that meant no other threads could try to make use of the same
reserve; this technically should have been ok for memory allocation (we should
always have a reserve for memory allocation (the btree node cache is used as a
reserve and we preallocate it)), but multiple btrees will mean that locking the
root won't be sufficient anymore, and for the bucket allocation reserve it was
technically possible for the old code to deadlock.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:23:35 -07:00
Kent Overstreet
56b30770b2 bcache: Kill btree_io_wq
With the locking rework in the last patch, this shouldn't be needed anymore -
btree_node_write_work() only takes b->write_lock which is never held for very
long.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:23:35 -07:00
Kent Overstreet
2a285686c1 bcache: btree locking rework
Add a new lock, b->write_lock, which is required to actually modify - or write -
a btree node; this lock is only held for short durations.

This means we can write out a btree node without taking b->lock, which _is_ held
for long durations - solving a deadlock when btree_flush_write() (from the
journalling code) is called with a btree node locked.

Right now just occurs in bch_btree_set_root(), but with an upcoming journalling
rework is going to happen a lot more.

This also turns b->lock is now more of a read/intent lock instead of a
read/write lock - but not completely, since it still blocks readers. May turn it
into a real intent lock at some point in the future.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:23:35 -07:00
Kent Overstreet
05335cff9f bcache: Fix a race when freeing btree nodes
This isn't a bulletproof fix; btree_node_free() -> bch_bucket_free() puts the
bucket on the unused freelist, where it can be reused right away without any
ordering requirements. It would be better to wait on at least a journal write to
go down before reusing the bucket. bch_btree_set_root() does this, and inserting
into non leaf nodes is completely synchronous so we should be ok, but future
patches are just going to get rid of the unused freelist - it was needed in the
past for various reasons but shouldn't be anymore.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:23:34 -07:00
Kent Overstreet
4fe6a81670 bcache: Add a real GC_MARK_RECLAIMABLE
This means the garbage collection code can better check for data and metadata
pointers to the same buckets.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:22:36 -07:00
Kent Overstreet
c13f3af924 bcache: Add bch_keylist_init_single()
This will potentially save us an allocation when we've got inode/dirent bkeys
that don't fit in the keylist's inline keys.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:22:36 -07:00
Kent Overstreet
1575402052 bcache: Improve priority_stats
Break down data into clean data/dirty data/metadata.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:22:35 -07:00
Kent Overstreet
7159b1ad3d bcache: Better alloc tracepoints
Change the invalidate tracepoint to indicate how much data we're invalidating,
and change the alloc tracepoints to indicate what offset they're for.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:22:35 -07:00
Kent Overstreet
3f5e0a34da bcache: Kill dead cgroup code
This hasn't been used or even enabled in ages.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:22:35 -07:00
Nicholas Swenson
3f6ef38110 bcache: stop moving_gc marking buckets that can't be moved.
Signed-off-by: Nicholas Swenson <nks@daterainc.com>
2014-03-18 12:22:34 -07:00
Kent Overstreet
10d9dcf6ee bcache: Fix moving_pred()
Avoid a potential null pointer deref (e.g. from check keys for cache misses)

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:22:34 -07:00
Nicholas Swenson
da415a096f bcache: Fix moving_gc deadlocking with a foreground write
Deadlock happened because a foreground write slept, waiting for a bucket
to be allocated. Normally the gc would mark buckets available for invalidation.
But the moving_gc was stuck waiting for outstanding writes to complete.
These writes used the bcache_wq, the same queue foreground writes used.

This fix gives moving_gc its own work queue, so it was still finish moving
even if foreground writes are stuck waiting for allocation. It also makes
work queue a parameter to the data_insert path, so moving_gc can use its
workqueue for writes.

Signed-off-by: Nicholas Swenson <nks@daterainc.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:22:33 -07:00
Kent Overstreet
90db6919f5 bcache: Fix discard granularity
blk_stack_limits() doesn't like a discard granularity of 0.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:22:33 -07:00
Kent Overstreet
487dded86e bcache: Fix another bug recovering from unclean shutdown
The on disk bucket gens are allowed to be out of date, when we reuse buckets
that didn't have any live data in them. To deal with this, the initial gc has to
update the bucket gen when we find a pointer gen newer than the bucket's gen.

Unfortunately we weren't doing this for pointers in the journal that we're about
to replay.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:22:33 -07:00
Kent Overstreet
0bd143fd80 bcache: Fix a bug recovering from unclean shutdown
The code to fixup incorrect bucket prios incorrectly did not skip btree node
freeing keys

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:22:32 -07:00
Kent Overstreet
27201cfdaa bcache: Fix a journalling reclaim after recovery bug
On recovery we weren't correctly keeping track of what journal buckets had open
journal entries, thus it was possible for them to be overwritten until we'd
written all new journal entries.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:21:48 -07:00
Kent Overstreet
65ddf45a31 bcache: Fix a null ptr deref in journal replay
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-17 19:01:03 -07:00
Kent Overstreet
4fa03402cd bcache: Fix a lockdep splat in an error path
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-17 18:59:09 -07:00
Kent Overstreet
dabb443340 bcache: Fix a shutdown bug
Shutdown wasn't cancelling/waiting on journal_write_work()

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-02-25 18:42:49 -08:00
Kent Overstreet
1b4eaf3d38 bcache: Fix flash_dev_cache_miss() for real this time
The code was using sectors to count the number of sectors it was zeroing... but
then it passed it to bio_advance()... after it had been set to 0. Amusing...

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-02-25 18:41:11 -08:00
Kent Overstreet
85cbe1f88c bcache: Fix another compiler warning on m68k
Use a bigger hammer this time

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org>
2014-02-18 08:55:05 -08:00
Jens Axboe
96d2e8b5e2 Merge branch 'bcache-for-3.14' of git://evilpiepirate.org/~kent/linux-bcache into for-linus 2014-01-30 12:57:55 -07:00
Linus Torvalds
53d8ab29f8 Merge branch 'for-3.14/drivers' of git://git.kernel.dk/linux-block
Pull block IO driver changes from Jens Axboe:

 - bcache update from Kent Overstreet.

 - two bcache fixes from Nicholas Swenson.

 - cciss pci init error fix from Andrew.

 - underflow fix in the parallel IDE pg_write code from Dan Carpenter.
   I'm sure the 1 (or 0) users of that are now happy.

 - two PCI related fixes for sx8 from Jingoo Han.

 - floppy init fix for first block read from Jiri Kosina.

 - pktcdvd error return miss fix from Julia Lawall.

 - removal of IRQF_SHARED from the SEGA Dreamcast CD-ROM code from
   Michael Opdenacker.

 - comment typo fix for the loop driver from Olaf Hering.

 - potential oops fix for null_blk from Raghavendra K T.

 - two fixes from Sam Bradshaw (Micron) for the mtip32xx driver, fixing
   an OOM problem and a problem with handling security locked conditions

* 'for-3.14/drivers' of git://git.kernel.dk/linux-block: (47 commits)
  mg_disk: Spelling s/finised/finished/
  null_blk: Null pointer deference problem in alloc_page_buffers
  mtip32xx: Correctly handle security locked condition
  mtip32xx: Make SGL container per-command to eliminate high order dma allocation
  drivers/block/loop.c: fix comment typo in loop_config_discard
  drivers/block/cciss.c:cciss_init_one(): use proper errnos
  drivers/block/paride/pg.c: underflow bug in pg_write()
  drivers/block/sx8.c: remove unnecessary pci_set_drvdata()
  drivers/block/sx8.c: use module_pci_driver()
  floppy: bail out in open() if drive is not responding to block0 read
  bcache: Fix auxiliary search trees for key size > cacheline size
  bcache: Don't return -EINTR when insert finished
  bcache: Improve bucket_prio() calculation
  bcache: Add bch_bkey_equal_header()
  bcache: update bch_bkey_try_merge
  bcache: Move insert_fixup() to btree_keys_ops
  bcache: Convert sorting to btree_keys
  bcache: Convert debug code to btree_keys
  bcache: Convert btree_iter to struct btree_keys
  bcache: Refactor bset_tree sysfs stats
  ...
2014-01-30 11:40:10 -08:00
Linus Torvalds
f568849eda Merge branch 'for-3.14/core' of git://git.kernel.dk/linux-block
Pull core block IO changes from Jens Axboe:
 "The major piece in here is the immutable bio_ve series from Kent, the
  rest is fairly minor.  It was supposed to go in last round, but
  various issues pushed it to this release instead.  The pull request
  contains:

   - Various smaller blk-mq fixes from different folks.  Nothing major
     here, just minor fixes and cleanups.

   - Fix for a memory leak in the error path in the block ioctl code
     from Christian Engelmayer.

   - Header export fix from CaiZhiyong.

   - Finally the immutable biovec changes from Kent Overstreet.  This
     enables some nice future work on making arbitrarily sized bios
     possible, and splitting more efficient.  Related fixes to immutable
     bio_vecs:

        - dm-cache immutable fixup from Mike Snitzer.
        - btrfs immutable fixup from Muthu Kumar.

  - bio-integrity fix from Nic Bellinger, which is also going to stable"

* 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits)
  xtensa: fixup simdisk driver to work with immutable bio_vecs
  block/blk-mq-cpu.c: use hotcpu_notifier()
  blk-mq: for_each_* macro correctness
  block: Fix memory leak in rw_copy_check_uvector() handling
  bio-integrity: Fix bio_integrity_verify segment start bug
  block: remove unrelated header files and export symbol
  blk-mq: uses page->list incorrectly
  blk-mq: use __smp_call_function_single directly
  btrfs: fix missing increment of bi_remaining
  Revert "block: Warn and free bio if bi_end_io is not set"
  block: Warn and free bio if bi_end_io is not set
  blk-mq: fix initializing request's start time
  block: blk-mq: don't export blk_mq_free_queue()
  block: blk-mq: make blk_sync_queue support mq
  block: blk-mq: support draining mq queue
  dm cache: increment bi_remaining when bi_end_io is restored
  block: fixup for generic bio chaining
  block: Really silence spurious compiler warnings
  block: Silence spurious compiler warnings
  block: Kill bio_pair_split()
  ...
2014-01-30 11:19:05 -08:00
Nicholas Swenson
e3b4825b85 bcache: bugfix - gc thread now gets woken when cache is full
Signed-off-by: Nicholas Swenson <nks@daterainc.com>
2014-01-29 13:06:42 -08:00
Kent Overstreet
3572324af0 bcache: Minor fixes from kbuild robot
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-29 13:06:41 -08:00
Darrick J. Wong
9471744767 bcache: fix BUG_ON due to integer overflow with GC_SECTORS_USED
The BUG_ON at the end of __bch_btree_mark_key can be triggered due to
an integer overflow error:

BITMASK(GC_SECTORS_USED, struct bucket, gc_mark, 2, 13);
...
SET_GC_SECTORS_USED(g, min_t(unsigned,
	     GC_SECTORS_USED(g) + KEY_SIZE(k),
	     (1 << 14) - 1));
BUG_ON(!GC_SECTORS_USED(g));

In bcache.h, the SECTORS_USED bitfield is defined to be 13 bits wide.
While the SET_ code tries to ensure that the field doesn't overflow by
clamping it to (1<<14)-1 == 16383, this is incorrect because 16383
requires 14 bits.  Therefore, if GC_SECTORS_USED() + KEY_SIZE() =
8192, the SET_ statement tries to store 8192 into a 13-bit field.  In
a 13-bit field, 8192 becomes zero, thus triggering the BUG_ON.

Therefore, create a field width constant and a max value constant, and
use those to create the bitfield and check the inputs to
SET_GC_SECTORS_USED.  Arguably the BITMASK() template ought to have
BUG_ON checks for too-large values, but that's a separate patch.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2014-01-29 13:06:15 -08:00
Linus Torvalds
f075e0f699 Merge branch 'for-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "The bulk of changes are cleanups and preparations for the upcoming
  kernfs conversion.

   - cgroup_event mechanism which is and will be used only by memcg is
     moved to memcg.

   - pidlist handling is updated so that it can be served by seq_file.

     Also, the list is not sorted if sane_behavior.  cgroup
     documentation explicitly states that the file is not sorted but it
     has been for quite some time.

   - All cgroup file handling now happens on top of seq_file.  This is
     to prepare for kernfs conversion.  In addition, all operations are
     restructured so that they map 1-1 to kernfs operations.

   - Other cleanups and low-pri fixes"

* 'for-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (40 commits)
  cgroup: trivial style updates
  cgroup: remove stray references to css_id
  doc: cgroups: Fix typo in doc/cgroups
  cgroup: fix fail path in cgroup_load_subsys()
  cgroup: fix missing unlock on error in cgroup_load_subsys()
  cgroup: remove for_each_root_subsys()
  cgroup: implement for_each_css()
  cgroup: factor out cgroup_subsys_state creation into create_css()
  cgroup: combine css handling loops in cgroup_create()
  cgroup: reorder operations in cgroup_create()
  cgroup: make for_each_subsys() useable under cgroup_root_mutex
  cgroup: css iterations and css_from_dir() are safe under cgroup_mutex
  cgroup: unify pidlist and other file handling
  cgroup: replace cftype->read_seq_string() with cftype->seq_show()
  cgroup: attach cgroup_open_file to all cgroup files
  cgroup: generalize cgroup_pidlist_open_file
  cgroup: unify read path so that seq_file is always used
  cgroup: unify cgroup_write_X64() and cgroup_write_string()
  cgroup: remove cftype->read(), ->read_map() and ->write()
  hugetlb_cgroup: convert away from cftype->read()
  ...
2014-01-21 17:51:34 -08:00
Hugh Dickins
b3ff8a2f95 cgroup: remove stray references to css_id
Trivial: remove the few stray references to css_id, which itself
was removed in v3.13's 2ff2a7d03b "cgroup: kill css_id".

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-01-13 10:48:18 -05:00
Kent Overstreet
9dd6358a21 bcache: Fix auxiliary search trees for key size > cacheline size
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08 13:05:15 -08:00
Kent Overstreet
3b3e9e50dd bcache: Don't return -EINTR when insert finished
We need to return -EINTR after a split because we invalidated iterators
(and freed the btree node) - but if we were finished inserting, we don't
want to redo the traversal.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08 13:05:14 -08:00
Kent Overstreet
e0a985a4b1 bcache: Improve bucket_prio() calculation
When deciding what order to reuse buckets we take into account both the bucket's
priority (which indicates lru order) and also the amount of live data in that
bucket. The way they were scaled together wasn't as correct as it could be...
this patch improves and documents it.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08 13:05:14 -08:00
Nicholas Swenson
3bdad1e40d bcache: Add bch_bkey_equal_header()
Checks if two keys have equivalent header fields.
(good enough for replacement or merging)

Used in bch_bkey_try_merge, and replacing a key
in the btree.

Signed-off-by: Nicholas Swenson <nks@daterainc.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08 13:05:14 -08:00
Nicholas Swenson
0f49cf3d83 bcache: update bch_bkey_try_merge
Added generic header checks to bch_bkey_try_merge,
which then calls the bkey specific function

Removed extraneous checks from bch_extent_merge

Signed-off-by: Nicholas Swenson <nks@daterainc.com>
2014-01-08 13:05:14 -08:00
Kent Overstreet
829a60b905 bcache: Move insert_fixup() to btree_keys_ops
Now handling overlapping extents/keys is a method that's specific to what the
btree node contains.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08 13:05:14 -08:00
Kent Overstreet
89ebb4a28b bcache: Convert sorting to btree_keys
More work to disentangle various code from struct btree

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08 13:05:13 -08:00
Kent Overstreet
dc9d98d621 bcache: Convert debug code to btree_keys
More work to disentangle various code from struct btree

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08 13:05:13 -08:00
Kent Overstreet
c052dd9a26 bcache: Convert btree_iter to struct btree_keys
More work to disentangle bset.c from struct btree

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08 13:05:13 -08:00
Kent Overstreet
f67342dd34 bcache: Refactor bset_tree sysfs stats
We're in the process of turning bset.c into library code, so none of the code in
that file should know about struct cache_set or struct btree - so, move the
btree traversal part of the stats code to sysfs.c.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08 13:05:13 -08:00
Kent Overstreet
59158fde42 bcache: Add bch_btree_keys_u64s_remaining()
Helper function to explicitly check how much space is free in a btree node

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08 13:05:13 -08:00
Kent Overstreet
a85e968e66 bcache: Add struct btree_keys
Soon, bset.c won't need to depend on struct btree.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08 13:05:13 -08:00
Kent Overstreet
65d45231b5 bcache: Abstract out stuff needed for sorting
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08 13:05:12 -08:00
Kent Overstreet
ee811287c9 bcache: Rename/shuffle various code around
More work to disentangle bset.c from the rest of the code:

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08 13:05:12 -08:00
Kent Overstreet
67539e8528 bcache: Add struct bset_sort_state
More disentangling bset.c from the rest of the bcache code - soon, the
sorting routines won't have any dependencies on any outside structs.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08 13:05:12 -08:00
Kent Overstreet
911c961009 bcache: Split out sort_extent_cmp()
Only use extent comparison for comparing extents, so we're not using
START_KEY() on other key types (i.e. btree pointers)

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08 13:05:12 -08:00
Kent Overstreet
fafff81cea bcache: Bkey indexing renaming
More refactoring:

node() -> bset_bkey_idx()
end() -> bset_bkey_last()

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08 13:05:12 -08:00
Kent Overstreet
085d2a3dd4 bcache: Make bch_keylist_realloc() take u64s, not nptrs
Getting away from KEY_PTRS and moving toward KEY_U64s - and getting rid of magic
2s

Also - split out the part that checks against journal entry size so as to avoid
a dependancy on struct cache_set in bset.c

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08 13:05:11 -08:00
Kent Overstreet
9a02b7eeeb bcache: Remove/fix some header dependencies
In the process of disentagling/libraryizing bset.c from the rest of the
bcache code.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08 13:05:11 -08:00
Kent Overstreet
0a45114534 bcache: Use a mempool for mergesort temporary space
It was a single element mempool before, it's slightly cleaner to just use a real
mempool.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08 13:05:11 -08:00
Kent Overstreet
78b77bf8b2 bcache: Btree verify code improvements
Used this fixed code to find and fix the bug fixed by
a4d885097b0ac0cd1337f171f2d4b83e946094d4.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08 13:05:10 -08:00
Kent Overstreet
88b9f8c426 bcache: kill index()
That was a terrible name for a macro, add some better helpers to replace it.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08 13:05:10 -08:00
Kent Overstreet
5c41c8a713 bcache: Trivial error handling fix
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08 13:05:10 -08:00
Kent Overstreet
c78afc6261 bcache/md: Use raid stripe size
Now that we've got code for raid5/6 stripe awareness, bcache just needs
to know about the stripes and when writing partial stripes is expensive
- we probably don't want to enable this optimization for raid1 or 10,
even though they have stripes. So add a flag to queue_limits.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08 13:05:09 -08:00
Kent Overstreet
5f5837d2d6 bcache: Do bkey_put() in btree_split() error path
This error path shouldn't have been hit in practice.. and we've got reworked
reserve code coming soon so that it shouldn't _ever_ be bit... but if we've got
code for this error path it should be correct.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08 13:05:09 -08:00
Kent Overstreet
78365411b3 bcache: Rework allocator reserves
We need a reserve for allocating buckets for new btree nodes - and now that
we've got multiple btrees, it really needs to be per btree.

This reworks the reserves so we've got separate freelists for each reserve
instead of watermarks, which seems to make things a bit cleaner, and it adds
some code so that btree_split() can make sure the reserve is available before it
starts.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08 13:05:09 -08:00
Kent Overstreet
1dd13c8d3c bcache: kill closure locking code
Also flesh out the documentation a bit

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08 13:05:08 -08:00
Kent Overstreet
cb7a583e6a bcache: kill closure locking usage
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08 13:05:08 -08:00
Kent Overstreet
a5ae4300c1 bcache: Zero less memory
Another minor performance optimization

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08 13:05:08 -08:00
Kent Overstreet
d56d000a1f bcache: Don't touch bucket gen for dirty ptrs
Unnecessary since a bucket that has dirty pointers pointing to it can
never be invalidated - and skipping it is a measurable performance
boost, since the bucket gen will usually be a cache miss.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08 13:05:07 -08:00
Kent Overstreet
b0f32a56f2 bcache: Minor btree cache fix
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08 13:05:07 -08:00
Kent Overstreet
5775e2133d bcache: Performance fix for when journal entry is full
We were unnecessarily waiting on a journal write to complete when we just needed
to start a journal write and start setting up the next one.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08 13:05:07 -08:00
Kent Overstreet
b3fa7e77e6 bcache: Minor journal fix
The real fix is where we check the bytes we need against how much is
remaining - we also need to check for a journal entry bigger than our
buffer, we'll never write those and it would be bad if we tried to read
one.

Also improve the diagnostic messages.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08 13:05:06 -08:00
Kent Overstreet
ef71ec0000 bcache: Data corruption fix
The code that handles overlapping extents that we've just read back in from disk
was depending on the behaviour of the code that handles overlapping extents as
we're inserting into a btree node in the case of an insert that forced an
existing extent to be split: on insert, if we had to split we'd also insert a
new extent to represent the top part of the old extent - and then that new
extent would get written out.

The code that read the extents back in thus not bother with splitting extents -
if it saw an extent that ovelapped in the middle of an older extent, it would
trim the old extent to only represent the bottom part, assuming that the
original insert would've inserted a new extent to represent the top part.

I still haven't figured out _how_ it can happen, but I'm now pretty convinced
(and testing has confirmed) that there's some kind of an obscure corner case
(probably involving extent merging, and multiple overwrites in different sets)
that breaks this. The fix is to change the mergesort fixup code to split extents
itself when required.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
2014-01-08 13:05:06 -08:00
Jens Axboe
b28bc9b38c Linux 3.13-rc6
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJSwLfoAAoJEHm+PkMAQRiGi6QH/1U1B7lmHChDTw3jj1lfm9gA
 189Si4QJlnxFWCKHvKEL+pcaVuACU+aMGI8+KyMYK4/JfuWVjjj5fr/SvyHH2/8m
 LdSK8aHMhJ46uBS4WJ/l6v46qQa5e2vn8RKSBAyKm/h4vpt+hd6zJdoFrFai4th7
 k/TAwOAEHI5uzexUChwLlUBRTvbq4U8QUvDu+DeifC8cT63CGaaJ4qVzjOZrx1an
 eP6UXZrKDASZs7RU950i7xnFVDQu4PsjlZi25udsbeiKcZJgPqGgXz5ULf8ZH8RQ
 YCi1JOnTJRGGjyIOyLj7pyB01h7XiSM2+eMQ0S7g54F2s7gCJ58c2UwQX45vRWU=
 =/4/R
 -----END PGP SIGNATURE-----

Merge tag 'v3.13-rc6' into for-3.14/core

Needed to bring blk-mq uptodate, since changes have been going in
since for-3.14/core was established.

Fixup merge issues related to the immutable biovec changes.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

Conflicts:
	block/blk-flush.c
	fs/btrfs/check-integrity.c
	fs/btrfs/extent_io.c
	fs/btrfs/scrub.c
	fs/logfs/dev_bdev.c
2013-12-31 09:51:02 -07:00
Kent Overstreet
16749c23c0 bcache: New writeback PD controller
The old writeback PD controller could get into states where it had throttled all
the way down and take way too long to recover - it was too complicated to really
understand what it was doing.

This rewrites a good chunk of it to hopefully be simpler and make more sense,
and it also pays more attention to units which should make the behaviour a bit
easier to understand.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-12-16 14:22:59 -08:00
Kent Overstreet
6d3d1a9c54 bcache: bugfix for race between moving_gc and bucket_invalidate
There is a possibility for a bucket to be invalidated by the allocator
while moving_gc was copying it's contents to another bucket, if the
bucket only held cached data. To prevent this moving checks for
a stale ptr (to an invalidated bucket), before and after reads.
It it finds one, it simply ignores moving that data. This only
affects bcache if the moving_gc was turned on, note that it's
off by default.

Signed-off-by: Nicholas Swenson <nks@daterainc.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-12-16 14:22:58 -08:00
Nicholas Swenson
bf0a628a95 bcache: fix for gc and writeback race
Garbage collector needs to check keys in the writeback keybuf to
make sure it's not invalidating buckets to which the writeback
keys point to.

Signed-off-by: Nicholas Swenson <nks@daterainc.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-12-16 14:22:58 -08:00
Nicholas Swenson
981aa8c091 bcache: bugfix - moving_gc now moves only correct buckets
Removed gc_move_threshold because picking buckets only by
threshold could lead moving extra buckets (ei. if there are
buckets at the threshold that aren't supposed to be moved
do to space considerations).

This is replaced by a GC_MOVE bit in the gc_mark bitmask.
Now only marked buckets get moved.

Signed-off-by: Nicholas Swenson <nks@daterainc.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-12-16 14:22:58 -08:00
Nicholas Swenson
bee63f40cb bcache: fix for gc crashing when no sectors are used
Signed-off-by: Nicholas Swenson <nks@daterainc.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-12-16 14:22:57 -08:00
Nicholas Swenson
97d11a660f bcache: Fix heap_peek() macro
Signed-off-by: Nicholas Swenson <nks@daterainc.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-12-16 14:22:57 -08:00
Nicholas Swenson
9eb8ebeb24 bcache: Fix for can_attach_cache()
Signed-off-by: Nicholas Swenson <nks@daterainc.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-12-16 14:22:57 -08:00
Kent Overstreet
d24a6e1087 bcache: Fix dirty_data accounting
Dirty data accounting wasn't quite right - firstly, we were adding the key we're
inserting after it could have merged with another dirty key already in the
btree, and secondly we could sometimes pass the wrong offset to
bcache_dev_sectors_dirty_add() for dirty data we were overwriting - which is
important when tracking dirty data by stripe.

NOTE FOR BACKPORTERS: For 3.10 (and 3.11?) there's other accounting fixes
necessary that got squashed in with other patches; the full patch against 3.10
is 408cc2f47eeac93a, available at:
  git://evilpiepirate.org/~kent/linux-bcache.git bcache-3.10-writeback-fixes

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 2a46036..4a12b2f 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1817,7 +1817,8 @@ static bool fix_overlapping_extents(struct btree *b, struct bkey *insert,
 			if (KEY_START(k) > KEY_START(insert) + sectors_found)
 				goto check_failed;

-			if (KEY_PTRS(replace_key) != KEY_PTRS(k))
+			if (KEY_PTRS(k) != KEY_PTRS(replace_key) ||
+			    KEY_DIRTY(k) != KEY_DIRTY(replace_key))
 				goto check_failed;

 			/* skip past gen */
2013-12-16 14:22:16 -08:00
Kent Overstreet
ce2b3f595e bcache: Use uninterruptible sleep in writeback
We're just waiting on kthread_should_stop(), nothing else, so
interruptible sleep was wrong here.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-12-16 14:04:57 -08:00
Stefan Priebe
f665c0f852 bcache: kthread don't set writeback task to INTERUPTIBLE
at the beginning (schedule_timout_interuptible) and others
do his on their own

This prevents wrong load average calculation (load of 1 per thread)

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-12-16 14:04:57 -08:00
Wei Yongjun
08239ca2a0 bcache: fix sparse non static symbol warning
Fixes the following sparse warning:

drivers/md/bcache/btree.c:2220:5: warning:
 symbol 'btree_insert_fn' was not declared. Should it be static?

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-28 17:05:58 -08:00
Kent Overstreet
20d0189b10 block: Introduce new bio_split()
The new bio_split() can split arbitrary bios - it's not restricted to
single page bios, like the old bio_split() (previously renamed to
bio_pair_split()). It also has different semantics - it doesn't allocate
a struct bio_pair, leaving it up to the caller to handle completions.

Then convert the existing bio_pair_split() users to the new bio_split()
- and also nvme, which was open coding bio splitting.

(We have to take that BUG_ON() out of bio_integrity_trim() because this
bio_split() needs to use it, and there's no reason it has to be used on
bios marked as cloned; BIO_CLONED doesn't seem to have clearly
documented semantics anyways.)

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Neil Brown <neilb@suse.de>
2013-11-23 22:33:57 -08:00
Kent Overstreet
196d38bccf block: Generic bio chaining
This adds a generic mechanism for chaining bio completions. This is
going to be used for a bio_split() replacement, and it turns out to be
very useful in a fair amount of driver code - a fair number of drivers
were implementing this in their own roundabout ways, often painfully.

Note that this means it's no longer to call bio_endio() more than once
on the same bio! This can cause problems for drivers that save/restore
bi_end_io. Arguably they shouldn't be saving/restoring bi_end_io at all
- in all but the simplest cases they'd be better off just cloning the
bio, and immutable biovecs is making bio cloning cheaper. But for now,
we add a bio_endio_nodec() for these cases.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
2013-11-23 22:33:56 -08:00
Kent Overstreet
e90abc8ec3 block: Remove bi_idx hacks
Now that drivers have been converted to the new bvec_iter primitives,
there's no need to trim the bvec before we submit it; and we can't trim
it once we start sharing bvecs.

It used to be that passing a partially completed bio (i.e. one with
nonzero bi_idx) to generic_make_request() was a dangerous thing -
various drivers would choke on such things. But with immutable biovecs
and our new bio splitting that shares the biovecs, submitting partially
completed bios has to work (and should work, now that all the drivers
have been completed to the new primitives)

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
2013-11-23 22:33:55 -08:00
Kent Overstreet
59d276fe02 block: Add bio_clone_fast()
bio_clone() just got more expensive - however, most users of bio_clone()
don't actually need to modify the biovec. If they aren't modifying the
biovec, and they can guarantee that the original bio isn't freed before
the clone (also true in most cases), we can just point the clone at the
original bio's biovec.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-23 22:33:54 -08:00
Kent Overstreet
458b76ed2f block: Kill bio_segments()/bi_vcnt usage
When we start sharing biovecs, keeping bi_vcnt accurate for splits is
going to be error prone - and unnecessary, if we refactor some code.

So bio_segments() has to go - but most of the existing users just needed
to know if the bio had multiple segments, which is easier - add a
bio_multiple_segments() for them.

(Two of the current uses of bio_segments() are going to go away in a
couple patches, but the current implementation of bio_segments() is
unsafe as soon as we start doing driver conversions for immutable
biovecs - so implement a dumb version for bisectability, it'll go away
in a couple patches)

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Neil Brown <neilb@suse.de>
Cc: Nagalakshmi Nandigama <Nagalakshmi.Nandigama@lsi.com>
Cc: Sreekanth Reddy <Sreekanth.Reddy@lsi.com>
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
2013-11-23 22:33:51 -08:00
Kent Overstreet
7988613b0e block: Convert bio_for_each_segment() to bvec_iter
More prep work for immutable biovecs - with immutable bvecs drivers
won't be able to use the biovec directly, they'll need to use helpers
that take into account bio->bi_iter.bi_bvec_done.

This updates callers for the new usage without changing the
implementation yet.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Ed L. Cashin" <ecashin@coraid.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Paul Clements <Paul.Clements@steeleye.com>
Cc: Jim Paris <jim@jtan.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Alex Elder <elder@inktank.com>
Cc: ceph-devel@vger.kernel.org
Cc: Joshua Morris <josh.h.morris@us.ibm.com>
Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux390@de.ibm.com
Cc: Nagalakshmi Nandigama <Nagalakshmi.Nandigama@lsi.com>
Cc: Sreekanth Reddy <Sreekanth.Reddy@lsi.com>
Cc: support@lsi.com
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Guo Chao <yan@linux.vnet.ibm.com>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Quoc-Son Anh <quoc-sonx.anh@intel.com>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Jan Kara <jack@suse.cz>
Cc: linux-m68k@lists.linux-m68k.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: drbd-user@lists.linbit.com
Cc: nbd-general@lists.sourceforge.net
Cc: cbe-oss-dev@lists.ozlabs.org
Cc: xen-devel@lists.xensource.com
Cc: virtualization@lists.linux-foundation.org
Cc: linux-raid@vger.kernel.org
Cc: linux-s390@vger.kernel.org
Cc: DL-MPTFusionLinux@lsi.com
Cc: linux-scsi@vger.kernel.org
Cc: devel@driverdev.osuosl.org
Cc: linux-fsdevel@vger.kernel.org
Cc: cluster-devel@redhat.com
Cc: linux-mm@kvack.org
Acked-by: Geoff Levand <geoff@infradead.org>
2013-11-23 22:33:49 -08:00
Kent Overstreet
a4ad39b1d1 block: Convert bio_iovec() to bvec_iter
For immutable biovecs, we'll be introducing a new bio_iovec() that uses
our new bvec iterator to construct a biovec, taking into account
bvec_iter->bi_bvec_done - this patch updates existing users for the new
usage.

Some of the existing users really do need a pointer into the bvec array
- those uses are all going to be removed, but we'll need the
functionality from immutable to remove them - so for now rename the
existing bio_iovec() -> __bio_iovec(), and it'll be removed in a couple
patches.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Ed L. Cashin" <ecashin@coraid.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
2013-11-23 22:33:49 -08:00
Kent Overstreet
4f024f3797 block: Abstract out bvec iterator
Immutable biovecs are going to require an explicit iterator. To
implement immutable bvecs, a later patch is going to add a bi_bvec_done
member to this struct; for now, this patch effectively just renames
things.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Ed L. Cashin" <ecashin@coraid.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Alex Elder <elder@inktank.com>
Cc: ceph-devel@vger.kernel.org
Cc: Joshua Morris <josh.h.morris@us.ibm.com>
Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux390@de.ibm.com
Cc: Boaz Harrosh <bharrosh@panasas.com>
Cc: Benny Halevy <bhalevy@tonian.com>
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Dave Kleikamp <shaggy@kernel.org>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Ben Myers <bpm@sgi.com>
Cc: xfs@oss.sgi.com
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Guo Chao <yan@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Cc: "Roger Pau Monné" <roger.pau@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchand@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Peng Tao <tao.peng@emc.com>
Cc: Andy Adamson <andros@netapp.com>
Cc: fanchaoting <fanchaoting@cn.fujitsu.com>
Cc: Jie Liu <jeff.liu@oracle.com>
Cc: Sunil Mushran <sunil.mushran@gmail.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Pankaj Kumar <pankaj.km@samsung.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Mel Gorman <mgorman@suse.de>6
2013-11-23 22:33:47 -08:00
Kent Overstreet
ed9c47bebe bcache: Kill unaligned bvec hack
Bcache has a hack to avoid cloning the biovec if it's all full pages -
but with immutable biovecs coming this won't be necessary anymore.

For now, we remove the special case and always clone the bvec array so
that the immutable biovec patches are simpler.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-23 22:33:47 -08:00
Kees Cook
c86949486d bcache: defensively handle format strings
Just to be safe, call the error reporting function with "%s" to avoid
any possible future format string leak.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:43 -08:00
Kent Overstreet
5ceaaad704 bcache: Bypass torture test
More testing ftw! Also, now verify mode doesn't break if you read dirty
data.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:43 -08:00
Kent Overstreet
098fb25498 bcache: Delete some slower inline asm
Never saw a profile of bset_search_tree() where it wasn't bottlenecked
on memory until I got my new Haswell machine, but when I tried it there
it was suddenly burning 20% of the cpu in the inner loop on shrd...

Turns out, the version of shrd that takes 64 bit operands has a 9 cycle
latency. hah.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:42 -08:00
Kent Overstreet
28935ab516 bcache: Use ida for bcache block dev minor
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:42 -08:00
Kent Overstreet
c4d951ddb6 bcache: Fix sysfs splat on shutdown with flash only devs
Whoops.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:41 -08:00
Kent Overstreet
48a915a87f bcache: Better full stripe scanning
The old scanning-by-stripe code burned too much CPU, this should be
better.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:41 -08:00
Kent Overstreet
17e21a9f24 bcache: Have btree_split() insert into parent directly
The flow control in btree_insert_node() was... fragile... before,
this'll use more stack (but since our btrees are never more than depth
1, that shouldn't matter) and it should be significantly clearer and
less fragile.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:40 -08:00
Kent Overstreet
65d22e911b bcache: Move spinlock into struct time_stats
Minor cleanup.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:40 -08:00
Kent Overstreet
8aee122071 bcache: Kill sequential_merge option
It never really made sense to expose this, so just kill it.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:39 -08:00
Kent Overstreet
50310164bc bcache: Kill bch_next_recurse_key()
This dates from before the btree iterator, and now it's finally gone

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:39 -08:00
Kent Overstreet
bc9389eefe bcache: Avoid deadlocking in garbage collection
Not a complete fix - we could still deadlock if btree_insert_node() has
to split...

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:38 -08:00
Kent Overstreet
a1f0358b2b bcache: Incremental gc
Big garbage collection rewrite; now, garbage collection uses the same
mechanisms as used elsewhere for inserting/updating btree node pointers,
instead of rewriting interior btree nodes in place.

This makes the code significantly cleaner and less fragile, and means we
can now make garbage collection incremental - it doesn't have to hold a
write lock on the root of the btree for the entire duration of garbage
collection.

This means that there's less of a latency hit for doing garbage
collection, which means we can gc more frequently (and do a better job
of reclaiming from the cache), and we can coalesce across more btree
nodes (improving our space efficiency).

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:37 -08:00
Kent Overstreet
8835c1234d bcache: Add make_btree_freeing_key()
Refactoring, prep work for incremental garbage collection.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:37 -08:00
Kent Overstreet
f269af5a07 bcache: Add btree_node_write_sync()
More refactoring - mostly making the interfaces more explicit about what
we actually want to do.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:36 -08:00
Kent Overstreet
0eacac2203 bcache: PRECEDING_KEY()
btree_insert_key() was open coding this, this is just refactoring.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:36 -08:00
Kent Overstreet
d5cc66e957 bcache: bch_(btree|extent)_ptr_invalid()
Trying to treat btree pointers and leaf node pointers the same way was a
mistake - going to start being more explicit about the type of
key/pointer we're dealing with. This is the first part of that
refactoring; this patch shouldn't change any actual behaviour.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:35 -08:00
Kent Overstreet
3a3b6a4e07 bcache: Don't bother with bucket refcount for btree node allocations
The bucket refcount (dropped with bkey_put()) is only needed to prevent
the newly allocated bucket from being garbage collected until we've
added a pointer to it somewhere. But for btree node allocations, the
fact that we have btree nodes locked is enough to guard against races
with garbage collection.

Eventually the per bucket refcount is going to be replaced with
something specific to bch_alloc_sectors().

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:34 -08:00
Kent Overstreet
280481d06c bcache: Debug code improvements
Couple changes:
 * Consolidate bch_check_keys() and bch_check_key_order(), and move the
   checks that only check_key_order() could do to bch_btree_iter_next().

 * Get rid of CONFIG_BCACHE_EDEBUG - now, all that code is compiled in
   when CONFIG_BCACHE_DEBUG is enabled, and there's now a sysfs file to
   flip on the EDEBUG checks at runtime.

 * Dropped an old not terribly useful check in rw_unlock(), and
   refactored/improved a some of the other debug code.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:34 -08:00
Kent Overstreet
e58ff15503 bcache: Fix bch_ptr_bad()
Previously, bch_ptr_bad() could return false when there was a pointer to
a nonexistant device... it only filtered out keys with PTR_CHECK_DEV
pointers.

This behaviour was intended for multiple cache device support; for that,
just because the device for one of the pointers has gone away doesn't
mean we want to filter out the rest of the pointers.

But we don't yet explicitly filter/check individual pointers, so without
that this behaviour was wrong - a corrupt bkey with a bad device pointer
could cause us to deref a bad pointer. Doh.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:33 -08:00
Kent Overstreet
81ab4190ac bcache: Pull on disk data structures out into a separate header
Now, the on disk data structures are in a header that can be exported to
userspace - and having them all centralized is nice too.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:33 -08:00
Kent Overstreet
2599b53b7b bcache: Move sector allocator to alloc.c
Just reorganizing things a bit.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:32 -08:00
Kent Overstreet
220bb38c21 bcache: Break up struct search
With all the recent refactoring around struct btree op struct search has
gotten rather large.

But we can now easily break it up in a different way - we break out
struct btree_insert_op which is for inserting data into the cache, and
that's now what the copying gc code uses - struct search is now specific
to request.c

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:32 -08:00
Kent Overstreet
cc7b881921 bcache: Convert bch_btree_insert() to bch_btree_map_leaf_nodes()
Last of the btree_map() conversions. Main visible effect is
bch_btree_insert() is no longer taking a struct btree_op as an argument
anymore - there's no fancy state machine stuff going on, it's just a
normal function.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:31 -08:00
Kent Overstreet
6054c6d4da bcache: Don't use op->insert_collision
When we convert bch_btree_insert() to bch_btree_map_leaf_nodes(), we
won't be passing struct btree_op to bch_btree_insert() anymore - so we
need a different way of returning whether there was a collision (really,
a replace collision).

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:30 -08:00
Kent Overstreet
1b207d80d5 bcache: Kill op->replace
This is prep work for converting bch_btree_insert to
bch_btree_map_leaf_nodes() - we have to convert all its arguments to
actual arguments. Bunch of churn, but should be straightforward.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:29 -08:00
Kent Overstreet
faadf0c965 bcache: Drop some closure stuff
With a the recent bcache refactoring, some of the closure code isn't
needed anymore.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:10 -08:00
Kent Overstreet
b54d6934da bcache: Kill op->cl
This isn't used for waiting asynchronously anymore - so this is a fairly
trivial refactoring.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:09 -08:00
Kent Overstreet
c18536a72d bcache: Prune struct btree_op
Eventual goal is for struct btree_op to contain only what is necessary
for traversing the btree.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:08 -08:00
Kent Overstreet
cc23196631 bcache: Clean up cache_lookup_fn
There was some looping in submit_partial_cache_hit() and
submit_partial_cache_hit() that isn't needed anymore - originally, we
wouldn't necessarily process the full hit or miss all at once because
when splitting the bio, we took into account the restrictions of the
device we were sending it to.

But, device bio size restrictions are now handled elsewhere, with a
wrapper around generic_make_request() - so that looping has been
unnecessary for awhile now and we can now do quite a bit of cleanup.

And if we trim the key we're reading from to match the subset we're
actually reading, we don't have to explicitly calculate bi_sector
anymore. Neat.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:08 -08:00
Kent Overstreet
2c1953e201 bcache: Convert bch_btree_read_async() to bch_btree_map_keys()
This is a fairly straightforward conversion, mostly reshuffling -
op->lookup_done goes away, replaced by MAP_DONE/MAP_CONTINUE. And the
code for handling cache hits and misses wasn't really btree code, so it
gets moved to request.c.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:07 -08:00
Kent Overstreet
df8e89701f bcache: Move some stuff to btree.c
With the new btree_map() functions, we don't need to export the stuff
needed for traversing the btree anymore.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:07 -08:00
Kent Overstreet
48dad8baf9 bcache: Add btree_map() functions
Lots of stuff has been open coding its own btree traversal - which is
generally pretty simple code, but there are a few subtleties.

This adds new new functions, bch_btree_map_nodes() and
bch_btree_map_keys(), which do the traversal for you. Everything that's
open coding btree traversal now (with the exception of garbage
collection) is slowly going to be converted to these two functions;
being able to write other code at a higher level of abstraction  is a
big improvement w.r.t. overall code quality.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:06 -08:00
Kent Overstreet
5e6926daac bcache: Convert writeback to a kthread
This simplifies the writeback flow control quite a bit - previously, it
was conceptually two coroutines, refill_dirty() and read_dirty(). This
makes the code quite a bit more straightforward.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:05 -08:00
Kent Overstreet
72a44517f3 bcache: Convert gc to a kthread
We needed a dedicated rescuer workqueue for gc anyways... and gc was
conceptually a dedicated thread, just one that wasn't running all the
time. Switch it to a dedicated thread to make the code a bit more
straightforward.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:04 -08:00
Kent Overstreet
35fcd848d7 bcache: Convert bucket_wait to wait_queue_head_t
At one point we did do fancy asynchronous waiting stuff with
bucket_wait, but that's all gone (and bucket_wait is used a lot less
than it used to be). So use the standard primitives.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:04 -08:00
Kent Overstreet
e8e1d4682c bcache: Convert try_wait to wait_queue_head_t
We never waited on c->try_wait asynchronously, so just use the standard
primitives.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:03 -08:00
Kent Overstreet
0b93207abb bcache: Move keylist out of btree_op
Slowly working on pruning struct btree_op - the aim is for it to only
contain things that are actually necessary for traversing the btree.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:02 -08:00
Kent Overstreet
a34a8bfd4e bcache: Refactor journalling flow control
Making things less asynchronous that don't need to be - bch_journal()
only has to block when the journal or journal entry is full, which is
emphatically not a fast path. So make it a normal function that just
returns when it finishes, to make the code and control flow easier to
follow.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:02 -08:00
Kent Overstreet
cdd972b164 bcache: Refactor read request code a bit
More refactoring, and renaming.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:01 -08:00
Kent Overstreet
84f0db03ea bcache: Refactor request_write()
Try to improve some of the naming a bit to be more consistent, and also
improve the flow of control in request_write() a bit.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:00 -08:00
Kent Overstreet
c2f95ae2eb bcache: Clean up keylist code
More random refactoring.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:00 -08:00
Kent Overstreet
4f3d40147b bcache: Add explicit keylist arg to btree_insert()
Some refactoring - better to explicitly pass stuff around instead of
having it all in the "big bag of state", struct btree_op. Going to prune
struct btree_op quite a bit over time.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:55:59 -08:00
Kent Overstreet
e7c590eb63 bcache: Convert btree_insert_check_key() to btree_insert_node()
This was the main point of all this refactoring - now,
btree_insert_check_key() won't fail just because the leaf node happened
to be full.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:55:59 -08:00
Kent Overstreet
403b6cdeb1 bcache: Insert multiple keys at a time
We'll often end up with a list of adjacent keys to insert -
because bch_data_insert() may have to fragment the data it writes.

Originally, to simplify things and avoid having to deal with corner
cases bch_btree_insert() would pass keys from this list one at a time to
btree_insert_recurse() - mainly because the list of keys might span leaf
nodes, so it was easier this way.

With the btree_insert_node() refactoring, it's now a lot easier to just
pass down the whole list and have btree_insert_recurse() iterate over
leaf nodes until it's done.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:55:58 -08:00
Kent Overstreet
26c949f806 bcache: Add btree_insert_node()
The flow of control in the old btree insertion code was rather -
backwards; we'd recurse down the btree (in btree_insert_recurse()), and
then if we needed to split the keys to be inserted into the parent node
would be effectively returned up to btree_insert_recurse(), which would
notice there was more work to do and finish the insertion.

The main problem with this was that the full logic for btree insertion
could only be used by calling btree_insert_recurse; if you'd gotten to a
btree leaf some other way and had a key to insert, if it turned out that
node needed to be split you were SOL.

This inverts the flow of control so btree_insert_node() does _full_
btree insertion, including splitting - and takes a (leaf) btree node to
insert into as a parameter.

This means we can now _correctly_ handle cache misses - for cache
misses, we need to insert a fake "check" key into the btree when we
discover we have a cache miss - while we still have the btree locked.
Previously, if the btree node was full inserting a cache miss would just
fail.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:55:57 -08:00
Kent Overstreet
d6fd3b11ce bcache: Explicitly track btree node's parent
This is prep work for the reworked btree insertion code.

The way we set b->parent is ugly and hacky... the problem is, when
btree_split() or garbage collection splits or rewrites a btree node, the
parent changes for all its (potentially already cached) children.

I may change this later and add some code to look through the btree node
cache and find all our cached child nodes and change the parent pointer
then...

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:55:57 -08:00
Kent Overstreet
8304ad4dc8 bcache: Remove unnecessary check in should_split()
Checking i->seq was redundant, because since ages ago we always
initialize the new bset when advancing b->written

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:55:56 -08:00
Kent Overstreet
2d679fc756 bcache: Stripe size isn't necessarily a power of two
Originally I got this right... except that the divides didn't use
do_div(), which broke 32 bit kernels. When I went to fix that, I forgot
that the raid stripe size usually isn't a power of two... doh

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:55:55 -08:00
Kent Overstreet
77c320eb46 bcache: Add on error panic/unregister setting
Works kind of like the ext4 setting, to panic or remount read only on
errors.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:55:55 -08:00
Kent Overstreet
49b1212dfa bcache: Use blkdev_issue_discard()
The old asynchronous discard code was really a relic from when all the
allocation code was asynchronous - now that allocation runs out of a
dedicated thread there's no point in keeping around all that complicated
machinery.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:55:54 -08:00
Kent Overstreet
dd9ec84da5 bcache: Fix a lockdep splat
bch_keybuf_del() takes a spinlock that can't be taken in interrupt context -
whoops. Fortunately, this code isn't enabled by default (you have to toggle a
sysfs thing).

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:55:54 -08:00
Kent Overstreet
7857d5d470 bcache: Fix a journalling performance bug 2013-11-10 21:55:53 -08:00
Kent Overstreet
1fa8455deb bcache: Fix dirty_data accounting
Dirty data accounting wasn't quite right - firstly, we were adding the key we're
inserting after it could have merged with another dirty key already in the
btree, and secondly we could sometimes pass the wrong offset to
bcache_dev_sectors_dirty_add() for dirty data we were overwriting - which is
important when tracking dirty data by stripe.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
2013-11-10 21:55:27 -08:00
Kent Overstreet
d4eddd42f5 bcache: Fixed incorrect order of arguments to bio_alloc_bioset()
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-10-23 07:55:36 +01:00
Kent Overstreet
2fe80d3bbf bcache: Fix a null ptr deref regression
Commit c0f04d88e4 ("bcache: Fix flushes in writeback mode") was fixing
a reported data corruption bug, but it seems some last minute
refactoring or rebasing introduced a null pointer deref.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Reported-by: Gabriel de Perthuis <g2p.code@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-10-10 18:17:39 -07:00
Kent Overstreet
c0f04d88e4 bcache: Fix flushes in writeback mode
In writeback mode, when we get a cache flush we need to make sure we
issue a flush to the backing device.

The code for sending down an extra flush was wrong - by cloning the bio
we were probably getting flags that didn't make sense for a bare flush,
and also the old code was firing for FUA bios, for which we don't need
to send a flush to the backing device.

This was causing data corruption somehow - the mechanism was never
determined, but this patch fixes it for the users that were seeing it.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24 14:41:43 -07:00
Kent Overstreet
84786438ed bcache: Fix for handling overlapping extents when reading in a btree node
btree_sort_fixup() was overly clever, because it was trying to avoid
pulling a key off the btree iterator in more than one place.

This led to a really obscure bug where we'd break early from the loop in
btree_sort_fixup() if the current key overlapped with keys in more than
one older set, and the next key it overlapped with was zero size.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24 14:41:43 -07:00
Kent Overstreet
a698e08c82 bcache: Fix a shrinker deadlock
GFP_NOIO means we could be getting called recursively - mca_alloc() ->
mca_data_alloc() - definitely can't use mutex_lock(bucket_lock) then.
Whoops.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24 14:41:43 -07:00
Kent Overstreet
79e3dab90d bcache: Fix a dumb CPU spinning bug in writeback
schedule_timeout() != schedule_timeout_uninterruptible()

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24 14:41:43 -07:00
Kent Overstreet
1394d6761b bcache: Fix a flush/fua performance bug
bch_journal_meta() was missing the flush to make the journal write
actually go down (instead of waiting up to journal_delay_ms)...

Whoops

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24 14:41:43 -07:00
Kent Overstreet
c2a4f3183a bcache: Fix a writeback performance regression
Background writeback works by scanning the btree for dirty data and
adding those keys into a fixed size buffer, then for each dirty key in
the keybuf writing it to the backing device.

When read_dirty() finishes and it's time to scan for more dirty data, we
need to wait for the outstanding writeback IO to finish - they still
take up slots in the keybuf (so that foreground writes can check for
them to avoid races) - without that wait, we'll continually rescan when
we'll be able to add at most a key or two to the keybuf, and that takes
locks that starves foreground IO.  Doh.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24 14:41:43 -07:00
Geert Uytterhoeven
61cbd250f8 bcache: Correct printf()-style format length modifier
Fix

  drivers/md/bcache/btree.c: In function ‘bch_btree_node_read’:
  drivers/md/bcache/btree.c:259: warning: format ‘%lu’ expects type ‘long unsigned int’, but argument 3 has type ‘size_t’

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24 14:41:43 -07:00
Kent Overstreet
c426c4fd46 bcache: Fix for when no journal entries are found
The journal replay code didn't handle this case, causing it to go into
an infinite loop...

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24 14:41:43 -07:00
Gabriel de Perthuis
aee6f1cfff bcache: Strip endline when writing the label through sysfs
sysfs attributes with unusual characters have crappy failure modes
in Squeeze (udev 164); later versions of udev are unaffected.

This should make these characters more unusual.

Signed-off-by: Gabriel de Perthuis <g2p.code@gmail.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24 14:41:43 -07:00
Kent Overstreet
6d9d21e35f bcache: Fix a dumb journal discard bug
That switch statement was obviously wrong, leading to some sort of weird
spinning on rare occasion with discards enabled...

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24 14:41:43 -07:00
Dave Chinner
7dc19d5aff drivers: convert shrinkers to new count/scan API
Convert the driver shrinkers to the new API.  Most changes are compile
tested only because I either don't have the hardware or it's staging
stuff.

FWIW, the md and android code is pretty good, but the rest of it makes me
want to claw my eyes out.  The amount of broken code I just encountered is
mind boggling.  I've added comments explaining what is broken, but I fear
that some of the code would be best dealt with by being dragged behind the
bike shed, burying in mud up to it's neck and then run over repeatedly
with a blunt lawn mower.

Special mention goes to the zcache/zcache2 drivers.  They can't co-exist
in the build at the same time, they are under different menu options in
menuconfig, they only show up when you've got the right set of mm
subsystem options configured and so even compile testing is an exercise in
pulling teeth.  And that doesn't even take into account the horrible,
broken code...

[glommer@openvz.org: fixes for i915, android lowmem, zcache, bcache]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@openvz.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-10 18:56:32 -04:00
Linus Torvalds
d4c90b1b9f Merge branch 'for-3.11/drivers' of git://git.kernel.dk/linux-block
Pull block IO driver bits from Jens Axboe:
 "As I mentioned in the core block pull request, due to real life
  circumstances the driver pull request would be late.  Now it looks
  like -rc2 late...  On the plus side, apart form the rsxx update, these
  are all things that I could argue could go in later in the cycle as
  they are fixes and not features.  So even though things are late, it's
  not ALL bad.

  The pull request contains:

   - Updates to bcache, all bug fixes, from Kent.

   - A pile of drbd bug fixes (no big features this time!).

   - xen blk front/back fixes.

   - rsxx driver updates, some of them deferred form 3.10.  So should be
     well cooked by now"

* 'for-3.11/drivers' of git://git.kernel.dk/linux-block: (63 commits)
  bcache: Allocation kthread fixes
  bcache: Fix GC_SECTORS_USED() calculation
  bcache: Journal replay fix
  bcache: Shutdown fix
  bcache: Fix a sysfs splat on shutdown
  bcache: Advertise that flushes are supported
  bcache: check for allocation failures
  bcache: Fix a dumb race
  bcache: Use standard utility code
  bcache: Update email address
  bcache: Delete fuzz tester
  bcache: Document shrinker reserve better
  bcache: FUA fixes
  drbd: Allow online change of al-stripes and al-stripe-size
  drbd: Constants should be UPPERCASE
  drbd: Ignore the exit code of a fence-peer handler if it returns too late
  drbd: Fix rcu_read_lock balance on error path
  drbd: fix error return code in drbd_init()
  drbd: Do not sleep inside rcu
  bcache: Refresh usage docs
  ...
2013-07-22 19:02:52 -07:00
Kent Overstreet
79826c35eb bcache: Allocation kthread fixes
The alloc kthread should've been using try_to_freeze() - and also there
was the potential for the alloc kthread to get woken up after it had
shut down, which would have been bad.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-07-12 00:22:49 -07:00
Kent Overstreet
29ebf465b9 bcache: Fix GC_SECTORS_USED() calculation
Part of the job of garbage collection is to add up however many sectors
of live data it finds in each bucket, but that doesn't work very well if
it doesn't reset GC_SECTORS_USED() when it starts. Whoops.

This wouldn't have broken anything horribly, but allocation tries to
preferentially reclaim buckets that are mostly empty and that's not
gonna work with an incorrect GC_SECTORS_USED() value.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
2013-07-12 00:22:48 -07:00
Kent Overstreet
faa5673617 bcache: Journal replay fix
The journal replay code starts by finding something that looks like a
valid journal entry, then it does a binary search over the unchecked
region of the journal for the journal entries with the highest sequence
numbers.

Trouble is, the logic was wrong - journal_read_bucket() returns true if
it found journal entries we need, but if the range of journal entries
we're looking for loops around the end of the journal - in that case
journal_read_bucket() could return true when it hadn't found the highest
sequence number we'd seen yet, and in that case the binary search did
the wrong thing. Whoops.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
2013-07-12 00:22:48 -07:00
Kent Overstreet
5caa52afc5 bcache: Shutdown fix
Stopping a cache set is supposed to make it stop attached backing
devices, but somewhere along the way that code got lost. Fixing this
mainly has the effect of fixing our reboot notifier.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
2013-07-12 00:22:47 -07:00
Kent Overstreet
c9502ea442 bcache: Fix a sysfs splat on shutdown
If we stopped a bcache device when we were already detaching (or
something like that), bcache_device_unlink() would try to remove a
symlink from sysfs that was already gone because the bcache dev kobject
had already been removed from sysfs.

So keep track of whether we've removed stuff from sysfs.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
2013-07-12 00:22:47 -07:00
Kent Overstreet
54d12f2b4f bcache: Advertise that flushes are supported
Whoops - bcache's flush/FUA was mostly correct, but flushes get filtered
out unless we say we support them...

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
2013-07-12 00:22:46 -07:00
Dan Carpenter
d2a65ce2ac bcache: check for allocation failures
There is a missing NULL check after the kzalloc().

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
2013-07-12 00:22:46 -07:00
Kent Overstreet
6aa8f1a6ca bcache: Fix a dumb race
In the far-too-complicated closure code - closures can have destructors,
for probably dubious reasons; they get run after the closure is no
longer waiting on anything but before dropping the parent ref, intended
just for freeing whatever memory the closure is embedded in.

Trouble is, when remaining goes to 0 and we've got nothing more to run -
we also have to unlock the closure, setting remaining to -1. If there's
a destructor, that unlock isn't doing anything - nobody could be trying
to lock it if we're about to free it - but if the unlock _is needed...
that check for a destructor was racy. Argh.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
2013-07-12 00:22:33 -07:00
Linus Torvalds
80cc38b163 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina:
 "The usual stuff from trivial tree"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits)
  treewide: relase -> release
  Documentation/cgroups/memory.txt: fix stat file documentation
  sysctl/net.txt: delete reference to obsolete 2.4.x kernel
  spinlock_api_smp.h: fix preprocessor comments
  treewide: Fix typo in printk
  doc: device tree: clarify stuff in usage-model.txt.
  open firmware: "/aliasas" -> "/aliases"
  md: bcache: Fixed a typo with the word 'arithmetic'
  irq/generic-chip: fix a few kernel-doc entries
  frv: Convert use of typedef ctl_table to struct ctl_table
  sgi: xpc: Convert use of typedef ctl_table to struct ctl_table
  doc: clk: Fix incorrect wording
  Documentation/arm/IXP4xx fix a typo
  Documentation/networking/ieee802154 fix a typo
  Documentation/DocBook/media/v4l fix a typo
  Documentation/video4linux/si476x.txt fix a typo
  Documentation/virtual/kvm/api.txt fix a typo
  Documentation/early-userspace/README fix a typo
  Documentation/video4linux/soc-camera.txt fix a typo
  lguest: fix CONFIG_PAE -> CONFIG_x86_PAE in comment
  ...
2013-07-04 11:40:58 -07:00
Kent Overstreet
8e51e414a3 bcache: Use standard utility code
Some of bcache's utility code has made it into the rest of the kernel,
so drop the bcache versions.

Bcache used to have a workaround for allocating from a bio set under
generic_make_request() (if you allocated more than once, the bios you
already allocated would get stuck on current->bio_list when you
submitted, and you'd risk deadlock) - bcache would mask out __GFP_WAIT
when allocating bios under generic_make_request() so that allocation
could fail and it could retry from workqueue. But bio_alloc_bioset() has
a workaround now, so we can drop this hack and the associated error
handling.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-07-01 14:43:53 -07:00
Kent Overstreet
f3059a5461 bcache: Delete fuzz tester
This code has rotted and it hasn't been used in ages anyways.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-07-01 14:43:48 -07:00
Kent Overstreet
36c9ea9837 bcache: Document shrinker reserve better
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-07-01 14:42:48 -07:00
Kent Overstreet
e49c7c374e bcache: FUA fixes
Journal writes need to be marked FUA, not just REQ_FLUSH. And btree node
writes have... weird ordering requirements.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-07-01 14:42:47 -07:00
Gabriel de Perthuis
ab9e14002e bcache: Send label uevents
Signed-off-by: Gabriel de Perthuis <g2p.code@gmail.com>
Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-06-26 21:58:06 -07:00
Gabriel de Perthuis
a25c32bede bcache: Send a uevent with a cached device's UUID
Signed-off-by: Gabriel de Perthuis <g2p.code@gmail.com>
2013-06-26 21:58:05 -07:00
Kent Overstreet
72c270612b bcache: Write out full stripes
Now that we're tracking dirty data per stripe, we can add two
optimizations for raid5/6:

 * If a stripe is already dirty, force writes to that stripe to
   writeback mode - to help build up full stripes of dirty data

 * When flushing dirty data, preferentially write out full stripes first
   if there are any.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-06-26 21:58:04 -07:00
Kent Overstreet
279afbad4e bcache: Track dirty data by stripe
To make background writeback aware of raid5/6 stripes, we first need to
track the amount of dirty data within each stripe - we do this by
breaking up the existing sectors_dirty into per stripe atomic_ts

Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-06-26 21:57:23 -07:00
Kent Overstreet
444fc0b6b1 bcache: Initialize sectors_dirty when attaching
Previously, dirty_data wouldn't get initialized until the first garbage
collection... which was a bit of a problem for background writeback (as
the PD controller keys off of it) and also confusing for users.

This is also prep work for making background writeback aware of raid5/6
stripes.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-06-26 17:09:16 -07:00
Kent Overstreet
6ded34d1a5 bcache: Improve lazy sorting
The old lazy sorting code was kind of hacky - rewrite in a way that
mathematically makes more sense; the idea is that the size of the sets
of keys in a btree node should increase by a more or less fixed ratio
from smallest to biggest.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-06-26 17:09:16 -07:00
Kent Overstreet
85b1492ee1 bcache: Rip out pkey()/pbtree()
Old gcc doesnt like the struct hack, and it is kind of ugly. So finish
off the work to convert pr_debug() statements to tracepoints, and delete
pkey()/pbtree().

Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-06-26 17:09:15 -07:00
Kent Overstreet
c37511b863 bcache: Fix/revamp tracepoints
The tracepoints were reworked to be more sensible, and fixed a null
pointer deref in one of the tracepoints.

Converted some of the pr_debug()s to tracepoints - this is partly a
performance optimization; it used to be that with DEBUG or
CONFIG_DYNAMIC_DEBUG pr_debug() was an empty macro; but at some point it
was changed to an empty inline function.

Some of the pr_debug() statements had rather expensive function calls as
part of the arguments, so this code was getting run unnecessarily even
on non debug kernels - in some fast paths, too.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-06-26 17:09:15 -07:00
Kent Overstreet
5794351146 bcache: Refactor btree io
The most significant change is that btree reads are now done
synchronously, instead of asynchronously and doing the post read stuff
from a workqueue.

This was originally done because we can't block on IO under
generic_make_request(). But - we already have a mechanism to punt cache
lookups to workqueue if needed, so if we just use that we don't have to
deal with the complexity of doing things asynchronously.

The main benefit is this makes the locking situation saner; we can hold
our write lock on the btree node until we're finished reading it, and we
don't need that btree_node_read_done() flag anymore.

Also, for writes, btree_write() was broken out into btree_node_write()
and btree_leaf_dirty() - the old code with the boolean argument was dumb
and confusing.

The prio_blocked mechanism was improved a bit too, now the only counter
is in struct btree_write, we don't mess with transfering a count from
struct btree anymore.

This required changing garbage collection to block prios at the start
and unblock when it finishes, which is cleaner than what it was doing
anyways (the old code had mostly the same effect, but was doing it in a
convoluted way)

And the btree iter btree_node_read_done() uses was converted to a real
mempool.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-06-26 17:09:14 -07:00
Kent Overstreet
119ba0f828 bcache: Convert allocator thread to kthread
Using a workqueue when we just want a single thread is a bit silly.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-06-26 17:09:13 -07:00
Gabriel de Perthuis
a9dd53adbb bcache: Warn when a device is already registered.
Signed-off-by: Gabriel de Perthuis <g2p.code+bcache@gmail.com>
Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-06-26 17:08:52 -07:00
Kent Overstreet
bbc77aa7fb bcache: fix a spurious gcc complaint, use scnprintf
An old version of gcc was complaining about using a const int as the
size of a stack allocated array. Which should be fine - but using
ARRAY_SIZE() is better, anyways.

Also, refactor the code to use scnprintf().

Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-06-26 17:06:33 -07:00
Kumar Amit Mehta
5c694129c8 md: bcache: io.c: fix a potential NULL pointer dereference
bio_alloc_bioset returns NULL on failure. This fix adds a missing check
for potential NULL pointer dereferencing.

Signed-off-by: Kumar Amit Mehta <gmate.amit@gmail.com>
Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-06-26 17:06:19 -07:00
Phil Viana
48a73025cb md: bcache: Fixed a typo with the word 'arithmetic'
The word 'arithmetic' was typed as 'arithmatic'

Signed-off-by: Phil Viana <phillip.l.viana@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-06-18 13:41:16 +02:00
Kent Overstreet
f59fce847f bcache: Fix error handling in init code
This code appears to have rotted... fix various bugs and do some
refactoring.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-05-15 00:48:14 -07:00
Paul Bolle
bbb1c3b5ae bcache: drop "select CLOSURES"
The Kconfig entry for BCACHE selects CLOSURES. But there's no Kconfig
symbol CLOSURES. That symbol was used in development versions of bcache,
but was removed when the closures code was no longer provided as a
kernel library. It can safely be dropped.

Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
2013-05-15 00:42:51 -07:00
Emil Goode
867e116206 bcache: Fix incompatible pointer type warning
The function pointer release in struct block_device_operations
should point to functions declared as void.

Sparse warnings:

drivers/md/bcache/super.c:656:27: warning:
	incorrect type in initializer (different base types)
	drivers/md/bcache/super.c:656:27:
	expected void ( *release )( ... )
	drivers/md/bcache/super.c:656:27:
	got int ( static [toplevel] *<noident> )( ... )

drivers/md/bcache/super.c:656:2: warning:
	initialization from incompatible pointer type [enabled by default]

drivers/md/bcache/super.c:656:2: warning:
	(near initialization for ‘bcache_ops.release’) [enabled by default]

Signed-off-by: Emil Goode <emilgoode@gmail.com>
Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-05-15 00:42:50 -07:00
Kent Overstreet
ee66850642 bcache: Use bd_link_disk_holder()
Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-04-30 19:14:43 -07:00
Kent Overstreet
86b26b824c bcache: Allocator cleanup/fixes
The main fix is that bch_allocator_thread() wasn't waiting on
garbage collection to finish (if invalidate_buckets had set
ca->invalidate_needs_gc); we need that to make sure the allocator
doesn't spin and potentially block gc from finishing.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-04-30 19:14:40 -07:00
Kent Overstreet
8abb2a5dba bcache: Make sure blocksize isn't smaller than device blocksize
Sanity check to make sure we don't end up doing IO the device doesn't
support.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-04-24 13:07:39 -07:00
Kent Overstreet
a09ded8edf bcache: Fix merge_bvec_fn usage for when it modifies the bvm
Stacked md devices reuse the bvm for the subordinate device, causing
problems...

Reported-by: Michael Balser <michael.balser@profitbricks.com>
Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-04-22 14:44:24 -07:00
Kent Overstreet
1545f13730 bcache: Correctly check against BIO_MAX_PAGES
bch_bio_max_sectors() was checking against BIO_MAX_PAGES as if the limit
was for the total bytes in the bio, not the number of segments.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-04-20 17:57:42 -07:00
Kent Overstreet
bca97adaf5 bcache: Hack around stuff that clones up to bi_max_vecs
Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-04-20 17:57:41 -07:00
Kent Overstreet
4f0fd955cd bcache: Set ra_pages based on backing device's ra_pages
Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-04-20 17:57:26 -07:00
Kent Overstreet
2903381fce bcache: Take data offset from the bdev superblock.
Add a new superblock version, and consolidate related defines.

Signed-off-by: Gabriel de Perthuis <g2p.code+bcache@gmail.com>
Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-04-20 17:56:12 -07:00
Kent Overstreet
cef5279735 bcache: Disable broken btree fuzz tester
Reported-by: <sasha.levin@oracle.com>
Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-04-08 13:33:49 -07:00
Kent Overstreet
91bbcfc361 bcache: Fix a format string overflow
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-04-08 13:33:49 -07:00
Kent Overstreet
8ef747909c bcache: Fix a minor memory leak on device teardown
Reported-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-04-08 13:33:48 -07:00
Kent Overstreet
cc0f4eaa61 bcache: Use WARN_ONCE() instead of __WARN()
Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-04-08 13:33:48 -07:00
Geert Uytterhoeven
cd953ed036 bcache: Add missing #include <linux/prefetch.h>
m68k/allmodconfig:

drivers/md/bcache/bset.c: In function ‘bset_search_tree’:
drivers/md/bcache/bset.c:727: error: implicit declaration of function ‘prefetch’

drivers/md/bcache/btree.c: In function ‘bch_btree_node_get’:
drivers/md/bcache/btree.c:933: error: implicit declaration of function ‘prefetch’

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-04-08 13:33:48 -07:00
Kent Overstreet
c19ed23a0b bcache: Sparse fixes
Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-04-08 13:33:48 -07:00
Kent Overstreet
169ef1cf61 bcache: Don't export utility code, prefix with bch_
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: linux-bcache@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-28 12:50:55 -06:00
Kent Overstreet
29177b8966 bcache: Fix for the build fixes
Commit 82a84eaf7e51ba3da0c36cbc401034a4e943492d left a return 0 in
closure_debug_init(). Whoops.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-25 19:36:39 -06:00
Kent Overstreet
b1a67b0f4c bcache: Style/checkpatch fixes
Took out some nested functions, and fixed some more checkpatch
complaints.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: linux-bcache@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-25 13:06:13 -06:00
Kent Overstreet
07e86ccb54 bcache: Build fixes from test robot
config: make ARCH=i386 allmodconfig

All error/warnings:

   drivers/md/bcache/bset.c: In function 'bch_ptr_bad':
>> drivers/md/bcache/bset.c:164:2: warning: format '%li' expects argument of type 'long int', but argument 4 has type 'size_t' [-Wformat]
--
   drivers/md/bcache/debug.c: In function 'bch_pbtree':
>> drivers/md/bcache/debug.c:86:4: warning: format '%li' expects argument of type 'long int', but argument 4 has type 'size_t' [-Wformat]
--
   drivers/md/bcache/btree.c: In function 'bch_btree_read_done':
>> drivers/md/bcache/btree.c:245:8: warning: format '%lu' expects argument of type 'long unsigned int', but argument 4 has type 'size_t' [-Wformat]
--
   drivers/md/bcache/closure.o: In function `closure_debug_init':
>> (.init.text+0x0): multiple definition of `init_module'
>> drivers/md/bcache/super.o:super.c:(.init.text+0x0): first defined here

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: linux-bcache@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-25 13:06:13 -06:00
Kent Overstreet
cafe563591 bcache: A block layer cache
Does writethrough and writeback caching, handles unclean shutdown, and
has a bunch of other nifty features motivated by real world usage.

See the wiki at http://bcache.evilpiepirate.org for more.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-03-23 16:11:31 -07:00