linux

Author	SHA1	Message	Date
Kent Overstreet	e0a985a4b1	bcache: Improve bucket_prio() calculation When deciding what order to reuse buckets we take into account both the bucket's priority (which indicates lru order) and also the amount of live data in that bucket. The way they were scaled together wasn't as correct as it could be... this patch improves and documents it. Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-01-08 13:05:14 -08:00
Nicholas Swenson	3bdad1e40d	bcache: Add bch_bkey_equal_header() Checks if two keys have equivalent header fields. (good enough for replacement or merging) Used in bch_bkey_try_merge, and replacing a key in the btree. Signed-off-by: Nicholas Swenson <nks@daterainc.com> Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-01-08 13:05:14 -08:00
Nicholas Swenson	0f49cf3d83	bcache: update bch_bkey_try_merge Added generic header checks to bch_bkey_try_merge, which then calls the bkey specific function Removed extraneous checks from bch_extent_merge Signed-off-by: Nicholas Swenson <nks@daterainc.com>	2014-01-08 13:05:14 -08:00
Kent Overstreet	829a60b905	bcache: Move insert_fixup() to btree_keys_ops Now handling overlapping extents/keys is a method that's specific to what the btree node contains. Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-01-08 13:05:14 -08:00
Kent Overstreet	89ebb4a28b	bcache: Convert sorting to btree_keys More work to disentangle various code from struct btree Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-01-08 13:05:13 -08:00
Kent Overstreet	dc9d98d621	bcache: Convert debug code to btree_keys More work to disentangle various code from struct btree Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-01-08 13:05:13 -08:00
Kent Overstreet	c052dd9a26	bcache: Convert btree_iter to struct btree_keys More work to disentangle bset.c from struct btree Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-01-08 13:05:13 -08:00
Kent Overstreet	f67342dd34	bcache: Refactor bset_tree sysfs stats We're in the process of turning bset.c into library code, so none of the code in that file should know about struct cache_set or struct btree - so, move the btree traversal part of the stats code to sysfs.c. Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-01-08 13:05:13 -08:00
Kent Overstreet	59158fde42	bcache: Add bch_btree_keys_u64s_remaining() Helper function to explicitly check how much space is free in a btree node Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-01-08 13:05:13 -08:00
Kent Overstreet	a85e968e66	bcache: Add struct btree_keys Soon, bset.c won't need to depend on struct btree. Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-01-08 13:05:13 -08:00
Kent Overstreet	65d45231b5	bcache: Abstract out stuff needed for sorting Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-01-08 13:05:12 -08:00
Kent Overstreet	ee811287c9	bcache: Rename/shuffle various code around More work to disentangle bset.c from the rest of the code: Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-01-08 13:05:12 -08:00
Kent Overstreet	67539e8528	bcache: Add struct bset_sort_state More disentangling bset.c from the rest of the bcache code - soon, the sorting routines won't have any dependencies on any outside structs. Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-01-08 13:05:12 -08:00
Kent Overstreet	911c961009	bcache: Split out sort_extent_cmp() Only use extent comparison for comparing extents, so we're not using START_KEY() on other key types (i.e. btree pointers) Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-01-08 13:05:12 -08:00
Kent Overstreet	fafff81cea	bcache: Bkey indexing renaming More refactoring: node() -> bset_bkey_idx() end() -> bset_bkey_last() Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-01-08 13:05:12 -08:00
Kent Overstreet	085d2a3dd4	bcache: Make bch_keylist_realloc() take u64s, not nptrs Getting away from KEY_PTRS and moving toward KEY_U64s - and getting rid of magic 2s Also - split out the part that checks against journal entry size so as to avoid a dependancy on struct cache_set in bset.c Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-01-08 13:05:11 -08:00
Kent Overstreet	9a02b7eeeb	bcache: Remove/fix some header dependencies In the process of disentagling/libraryizing bset.c from the rest of the bcache code. Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-01-08 13:05:11 -08:00
Kent Overstreet	0a45114534	bcache: Use a mempool for mergesort temporary space It was a single element mempool before, it's slightly cleaner to just use a real mempool. Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-01-08 13:05:11 -08:00
Kent Overstreet	78b77bf8b2	bcache: Btree verify code improvements Used this fixed code to find and fix the bug fixed by a4d885097b0ac0cd1337f171f2d4b83e946094d4. Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-01-08 13:05:10 -08:00
Kent Overstreet	88b9f8c426	bcache: kill index() That was a terrible name for a macro, add some better helpers to replace it. Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-01-08 13:05:10 -08:00
Kent Overstreet	5c41c8a713	bcache: Trivial error handling fix Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-01-08 13:05:10 -08:00
Kent Overstreet	c78afc6261	bcache/md: Use raid stripe size Now that we've got code for raid5/6 stripe awareness, bcache just needs to know about the stripes and when writing partial stripes is expensive - we probably don't want to enable this optimization for raid1 or 10, even though they have stripes. So add a flag to queue_limits. Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-01-08 13:05:09 -08:00
Kent Overstreet	5f5837d2d6	bcache: Do bkey_put() in btree_split() error path This error path shouldn't have been hit in practice.. and we've got reworked reserve code coming soon so that it shouldn't _ever_ be bit... but if we've got code for this error path it should be correct. Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-01-08 13:05:09 -08:00
Kent Overstreet	78365411b3	bcache: Rework allocator reserves We need a reserve for allocating buckets for new btree nodes - and now that we've got multiple btrees, it really needs to be per btree. This reworks the reserves so we've got separate freelists for each reserve instead of watermarks, which seems to make things a bit cleaner, and it adds some code so that btree_split() can make sure the reserve is available before it starts. Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-01-08 13:05:09 -08:00
Kent Overstreet	1dd13c8d3c	bcache: kill closure locking code Also flesh out the documentation a bit Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-01-08 13:05:08 -08:00
Kent Overstreet	cb7a583e6a	bcache: kill closure locking usage Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-01-08 13:05:08 -08:00
Kent Overstreet	a5ae4300c1	bcache: Zero less memory Another minor performance optimization Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-01-08 13:05:08 -08:00
Kent Overstreet	d56d000a1f	bcache: Don't touch bucket gen for dirty ptrs Unnecessary since a bucket that has dirty pointers pointing to it can never be invalidated - and skipping it is a measurable performance boost, since the bucket gen will usually be a cache miss. Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-01-08 13:05:07 -08:00
Kent Overstreet	b0f32a56f2	bcache: Minor btree cache fix Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-01-08 13:05:07 -08:00
Kent Overstreet	5775e2133d	bcache: Performance fix for when journal entry is full We were unnecessarily waiting on a journal write to complete when we just needed to start a journal write and start setting up the next one. Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-01-08 13:05:07 -08:00
Kent Overstreet	b3fa7e77e6	bcache: Minor journal fix The real fix is where we check the bytes we need against how much is remaining - we also need to check for a journal entry bigger than our buffer, we'll never write those and it would be bad if we tried to read one. Also improve the diagnostic messages. Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-01-08 13:05:06 -08:00
Kent Overstreet	ef71ec0000	bcache: Data corruption fix The code that handles overlapping extents that we've just read back in from disk was depending on the behaviour of the code that handles overlapping extents as we're inserting into a btree node in the case of an insert that forced an existing extent to be split: on insert, if we had to split we'd also insert a new extent to represent the top part of the old extent - and then that new extent would get written out. The code that read the extents back in thus not bother with splitting extents - if it saw an extent that ovelapped in the middle of an older extent, it would trim the old extent to only represent the bottom part, assuming that the original insert would've inserted a new extent to represent the top part. I still haven't figured out _how_ it can happen, but I'm now pretty convinced (and testing has confirmed) that there's some kind of an obscure corner case (probably involving extent merging, and multiple overwrites in different sets) that breaks this. The fix is to change the mergesort fixup code to split extents itself when required. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: linux-stable <stable@vger.kernel.org> # >= v3.10	2014-01-08 13:05:06 -08:00
Joe Thornber	7e664b3dec	dm space map metadata: fix extending the space map When extending a metadata space map we should do the first commit whilst still in bootstrap mode -- a mode where all blocks get allocated in the new area. That way the commit overhead is allocated from the newly added space. Otherwise we risk running out of space. With this fix, and the previous commit "dm space map common: make sure new space is used during extend", the following device mapper testsuite test passes: dmtest run --suite thin-provisioning -n /resize_metadata_no_io/ Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2014-01-07 21:05:18 -05:00
Joe Thornber	12c91a5c2d	dm space map common: make sure new space is used during extend When extending a low level space map we should update nr_blocks at the start so the new space is used for the index entries. Otherwise extend can fail, e.g.: sm_metadata_extend call sequence that fails: -> sm_ll_extend -> dm_tm_new_block -> dm_sm_new_block -> sm_bootstrap_new_block => returns -ENOSPC because smm->begin == smm->ll.nr_blocks Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2014-01-07 21:05:17 -05:00
Mikulas Patocka	be35f48610	dm: wait until embedded kobject is released before destroying a device There may be other parts of the kernel holding a reference on the dm kobject. We must wait until all references are dropped before deallocating the mapped_device structure. The dm_kobject_release method signals that all references are dropped via completion. But dm_kobject_release doesn't free the kobject (which is embedded in the mapped_device structure). This is the sequence of operations: * when destroying a DM device, call kobject_put from dm_sysfs_exit * wait until all users stop using the kobject, when it happens the release method is called * the release method signals the completion and should return without delay * the dm device removal code that waits on the completion continues * the dm device removal code drops the dm_mod reference the device had * the dm device removal code frees the mapped_device structure that contains the kobject Using kobject this way should avoid the module unload race that was mentioned at the beginning of this thread: https://lkml.org/lkml/2014/1/4/83 Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2014-01-07 21:01:43 -05:00
Mikulas Patocka	1ddd641ddc	dm: remove pointless kobject comparison in dm_get_from_kobject The comparison is always true and the compiler optimizes it out anyway. Milan offered additional context relative to the original commit `784aae735d` ("dm: add name and uuid to sysfs") which introduced the code: "I think it is just relict of some experiments before I committed this simple embedded sysfs kobj handling". Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Acked-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-01-07 13:22:32 -05:00
Chuansheng Liu	c1a6416021	dm snapshot: call destroy_work_on_stack() to pair with INIT_WORK_ONSTACK() In case CONFIG_DEBUG_OBJECTS_WORK is defined, it is needed to call destroy_work_on_stack() which frees the debug object to pair with INIT_WORK_ONSTACK(). Signed-off-by: Liu, Chuansheng <chuansheng.liu@intel.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-01-07 10:31:34 -05:00
Joe Thornber	78e03d6973	dm cache policy mq: introduce three promotion threshold tunables Internally the mq policy maintains a promotion threshold variable. If the hit count of a block not in the cache goes above this threshold it gets promoted to the cache. This patch introduces three new tunables that allow you to tweak the promotion threshold by adding a small value. These adjustments depend on the io type: read_promote_adjustment: READ io, default 4 write_promote_adjustment: WRITE io, default 8 discard_promote_adjustment: READ/WRITE io to a discarded block, default 1 If you're trying to quickly warm a new cache device you may wish to reduce these to encourage promotion. Remember to switch them back to their defaults after the cache fills though. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-01-07 10:14:33 -05:00
Wei Yongjun	b815805154	dm cache policy mq: use list_del_init instead of list_del + INIT_LIST_HEAD Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-01-07 10:14:32 -05:00
Mike Snitzer	8b64e881eb	dm thin: fix set_pool_mode exposed pool operation races The pool mode must not be switched until after the corresponding pool process_* methods have been established. Otherwise, because set_pool_mode() isn't interlocked with the IO path for performance reasons, the IO path can end up executing process_* operations that don't match the mode. This patch eliminates problems like the following (as seen on really fast PCIe SSD storage when transitioning the pool's mode from PM_READ_ONLY to PM_WRITE): kernel: device-mapper: thin: 253:2: reached low water mark for data device: sending event. kernel: device-mapper: thin: 253:2: no free data space available. kernel: device-mapper: thin: 253:2: switching pool to read-only mode kernel: device-mapper: thin: 253:2: switching pool to write mode kernel: ------------[ cut here ]------------ kernel: WARNING: CPU: 11 PID: 7564 at drivers/md/dm-thin.c:995 handle_unserviceable_bio+0x146/0x160 [dm_thin_pool]() ... kernel: Workqueue: dm-thin do_worker [dm_thin_pool] kernel: 00000000000003e3 ffff880308831cc8 ffffffff8152ebcb 00000000000003e3 kernel: 0000000000000000 ffff880308831d08 ffffffff8104c46c ffff88032502a800 kernel: ffff880036409000 ffff88030ec7ce00 0000000000000001 00000000ffffffc3 kernel: Call Trace: kernel: [<ffffffff8152ebcb>] dump_stack+0x49/0x5e kernel: [<ffffffff8104c46c>] warn_slowpath_common+0x8c/0xc0 kernel: [<ffffffff8104c4ba>] warn_slowpath_null+0x1a/0x20 kernel: [<ffffffffa001e2c6>] handle_unserviceable_bio+0x146/0x160 [dm_thin_pool] kernel: [<ffffffffa001f276>] process_bio_read_only+0x136/0x180 [dm_thin_pool] kernel: [<ffffffffa0020b75>] process_deferred_bios+0xc5/0x230 [dm_thin_pool] kernel: [<ffffffffa0020d31>] do_worker+0x51/0x60 [dm_thin_pool] kernel: [<ffffffff81067823>] process_one_work+0x183/0x490 kernel: [<ffffffff81068c70>] worker_thread+0x120/0x3a0 kernel: [<ffffffff81068b50>] ? manage_workers+0x160/0x160 kernel: [<ffffffff8106e86e>] kthread+0xce/0xf0 kernel: [<ffffffff8106e7a0>] ? kthread_freezable_should_stop+0x70/0x70 kernel: [<ffffffff8153b3ec>] ret_from_fork+0x7c/0xb0 kernel: [<ffffffff8106e7a0>] ? kthread_freezable_should_stop+0x70/0x70 kernel: ---[ end trace 3f00528e08ffa55c ]--- kernel: device-mapper: thin: pool mode is PM_WRITE not PM_READ_ONLY like expected!? dm-thin.c:995 was the WARN_ON_ONCE(get_pool_mode(pool) != PM_READ_ONLY); at the top of handle_unserviceable_bio(). And as the additional debugging I had conveys: the pool mode was _not_ PM_READ_ONLY like expected, it was already PM_WRITE, yet pool->process_bio was still set to process_bio_read_only(). Also, while fixing this up, reduce logging of redundant pool mode transitions by checking new_mode is different from old_mode. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2014-01-07 10:14:31 -05:00
Mike Snitzer	6d16202be7	dm thin: eliminate the no_free_space flag The pool's error_if_no_space flag can easily serve the same purpose that no_free_space did, namely: control whether handle_unserviceable_bio() will error a bio or requeue it. This is cleaner since error_if_no_space is established when the pool's features are processed during table load. So it avoids managing the no_free_space flag by taking the pool's spinlock. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-01-07 10:14:31 -05:00
Mike Snitzer	787a996cb2	dm thin: add error_if_no_space feature If the pool runs out of data or metadata space, the pool can either queue or error the IO destined to the data device. The default is to queue the IO until more space is added. An admin may now configure the pool to error IO when no space is available by setting the 'error_if_no_space' feature when loading the thin-pool table. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>	2014-01-07 10:14:30 -05:00
Mike Snitzer	8c0f0e8c9f	dm thin: requeue bios to DM core if no_free_space and in read-only mode Now that we switch the pool to read-only mode when the data device runs out of space it causes active writers to get IO errors once we resume after resizing the data device. If no_free_space is set, save bios to the 'retry_on_resume_list' and requeue them on resume (once the data or metadata device may have been resized). With this patch the resize_io test passes again (on slower storage): dmtest run --suite thin-provisioning -n /resize_io/ Later patches fix some subtle races associated with the pool mode transitions done as part of the pool's -ENOSPC handling. These races are exposed on fast storage (e.g. PCIe SSD). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>	2014-01-07 10:14:29 -05:00
Mike Snitzer	399caddfb1	dm thin: cleanup and improve no space handling Factor out_of_data_space() out of alloc_data_block(). Eliminate the use of 'no_free_space' as a latch in alloc_data_block() -- this is no longer needed now that we switch to read-only mode when we run out of data or metadata space. In a later patch, the 'no_free_space' flag will be eliminated entirely (in favor of checking metadata rather than relying on a transient flag). Move no metdata space handling into metdata_operation_failed(). Set no_free_space when metadata space is exhausted too. This is useful, because it offers consistency, for the following patch that will requeue data IOs if no_free_space. Also, rename no_space() to retry_bios_on_resume(). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>	2014-01-07 10:14:28 -05:00
Mike Snitzer	6f7f51d434	dm thin: log info when growing the data or metadata device Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>	2014-01-07 10:14:28 -05:00
Joe Thornber	b533065585	dm thin: handle metadata failures more consistently Introduce metadata_operation_failed() wrappers, around set_pool_mode(), to assist with improving the consistency of how metadata failures are handled. Logging is improved and metadata operation failures trigger read-only mode immediately. Also, eliminate redundant set_pool_mode() calls in the two alloc_data_block() caller's error paths. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-01-07 10:14:27 -05:00
Joe Thornber	88a6621bed	dm thin: factor out check_low_water_mark and use bools Factor check_low_water_mark() out of alloc_data_block(). Change a couple unsigned flags in the pool structure to bool. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-01-07 10:14:26 -05:00
Mike Snitzer	daec338bbd	dm thin: add mappings to end of prepared_* lists Mappings could be processed in descending logical block order, particularly if buffered IO is used. This could adversely affect the latency of IO processing. Fix this by adding mappings to the end of the 'prepared_mappings' and 'prepared_discards' lists. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>	2014-01-07 10:14:25 -05:00
Joe Thornber	8d30abff75	dm thin: return error from alloc_data_block if pool is not in write mode Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-01-07 10:14:24 -05:00
Mike Snitzer	7f21466512	dm thin: use bool rather than unsigned for flags in structures Also, move 'err' member in dm_thin_new_mapping structure to eliminate 4 byte hole (reduces size from 88 bytes to 80). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>	2014-01-07 10:14:18 -05:00
Mike Snitzer	10343180f5	dm persistent data: cleanup dm-thin specific references in text DM's persistent-data library is now used my multiple targets so exclusive references to "pool" or "thin provisioning" need to be cleaned up. Adjust Kconfig's DM_DEBUG_BLOCK_STACK_TRACING text and remove "pool" from a block manager error message. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>	2014-01-07 10:11:54 -05:00
Mike Snitzer	c46985e211	dm space map metadata: limit errors in sm_metadata_new_block The "unable to allocate new metadata block" error can be a particularly verbose error if there is a systemic issue with the metadata device. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>	2014-01-07 10:11:46 -05:00
Mikulas Patocka	42065460ae	dm delay: use per-bio data instead of a mempool and slab cache Starting with commit `c0820cf5ad` ("dm: introduce per_bio_data"), device mapper has the capability to pre-allocate a target-specific structure with the bio. This patch changes dm-delay to use this facility instead of a slab cache and mempool. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-01-07 10:11:45 -05:00
Mikulas Patocka	57a2f23856	dm table: remove unused buggy code that extends the targets array A device mapper table is allocated in the following way: * The function dm_table_create is called, it gets the number of targets as an argument -- it allocates a targets array accordingly. * For each target, we call dm_table_add_target. If we add more targets than were specified in dm_table_create, the function dm_table_add_target reallocates the targets array. However, this reallocation code is wrong - it moves the targets array to a new location, while some target constructors hold pointers to the array in the old location. The following DM target drivers save the pointer to the target structure, so they corrupt memory if the target array is moved: multipath, raid, mirror, snapshot, stripe, switch, thin, verity. Under normal circumstances, the reallocation function is not called (because dm_table_create is called with the correct number of targets), so the buggy reallocation code is not used. Prior to the fix "dm table: fail dm_table_create on dm_round_up overflow", the reallocation code could only be used in case the user specifies too large a value in param->target_count, such as 0xffffffff. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-01-07 10:11:44 -05:00
Joe Thornber	19fa1a6756	dm thin: fix discard support to a previously shared block If a snapshot is created and later deleted the origin dm_thin_device's snapshotted_time will have been updated to reflect the snapshot's creation time. The 'shared' flag in the dm_thin_lookup_result struct returned from dm_thin_find_block() is an approximation based on snapshotted_time -- this is done to avoid 0(n), or worse, time complexity. In this case, the shared flag would be true. But because the 'shared' flag reflects an approximation a block can be incorrectly assumed to be shared (e.g. false positive for 'shared' because the snapshot no longer exists). This could result in discards issued to a thin device not being passed down to the pool's underlying data device. To fix this we double check that a thin block is really still in-use after a mapping is removed using dm_pool_block_is_used(). If the reference count for a block is now zero the discard is allowed to be passed down. Also add a 'definitely_not_shared' member to the dm_thin_new_mapping structure -- reflects that the 'shared' flag in the response from dm_thin_find_block() can only be held as definitive if false is returned. Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1043527 Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2014-01-07 10:11:43 -05:00
Mike Snitzer	16961b042d	dm thin: initialize dm_thin_new_mapping returned by get_next_mapping As additional members are added to the dm_thin_new_mapping structure care should be taken to make sure they get initialized before use. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Cc: stable@vger.kernel.org	2014-01-07 10:10:03 -05:00
Jens Axboe	b28bc9b38c	Linux 3.13-rc6 -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJSwLfoAAoJEHm+PkMAQRiGi6QH/1U1B7lmHChDTw3jj1lfm9gA 189Si4QJlnxFWCKHvKEL+pcaVuACU+aMGI8+KyMYK4/JfuWVjjj5fr/SvyHH2/8m LdSK8aHMhJ46uBS4WJ/l6v46qQa5e2vn8RKSBAyKm/h4vpt+hd6zJdoFrFai4th7 k/TAwOAEHI5uzexUChwLlUBRTvbq4U8QUvDu+DeifC8cT63CGaaJ4qVzjOZrx1an eP6UXZrKDASZs7RU950i7xnFVDQu4PsjlZi25udsbeiKcZJgPqGgXz5ULf8ZH8RQ YCi1JOnTJRGGjyIOyLj7pyB01h7XiSM2+eMQ0S7g54F2s7gCJ58c2UwQX45vRWU= =/4/R -----END PGP SIGNATURE----- Merge tag 'v3.13-rc6' into for-3.14/core Needed to bring blk-mq uptodate, since changes have been going in since for-3.14/core was established. Fixup merge issues related to the immutable biovec changes. Signed-off-by: Jens Axboe <axboe@kernel.dk> Conflicts: block/blk-flush.c fs/btrfs/check-integrity.c fs/btrfs/extent_io.c fs/btrfs/scrub.c fs/logfs/dev_bdev.c	2013-12-31 09:51:02 -07:00
Linus Torvalds	c5fdd531b5	Merge branch 'for-linus' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: - fix for a memory leak on certain unplug events - a collection of bcache fixes from Kent and Nicolas - a few null_blk fixes and updates form Matias - a marking of static of functions in the stec pci-e driver * 'for-linus' of git://git.kernel.dk/linux-block: null_blk: support submit_queues on use_per_node_hctx null_blk: set use_per_node_hctx param to false null_blk: corrections to documentation null_blk: warning on ignored submit_queues param null_blk: refactor init and init errors code paths null_blk: documentation null_blk: mem garbage on NUMA systems during init drivers: block: Mark the functions as static in skd_main.c bcache: New writeback PD controller bcache: bugfix for race between moving_gc and bucket_invalidate bcache: fix for gc and writeback race bcache: bugfix - moving_gc now moves only correct buckets bcache: fix for gc crashing when no sectors are used bcache: Fix heap_peek() macro bcache: Fix for can_attach_cache() bcache: Fix dirty_data accounting bcache: Use uninterruptible sleep in writeback bcache: kthread don't set writeback task to INTERUPTIBLE block: fix memory leaks on unplugging block device bcache: fix sparse non static symbol warning	2013-12-24 10:06:03 -08:00
Greg Kroah-Hartman	5bd2010fbe	Merge 3.13-rc5 into staging-next We want these fixes here to handle some merge issues. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-12-24 09:43:21 -08:00
Jens Axboe	60e53a6701	Merge branch 'bcache-for-3.13' of git://evilpiepirate.org/~kent/linux-bcache into for-linus Kent writes: Jens - small pile of bcache fixes. I've been slacking on the writeback fixes but those definitely need to get into 3.13.	2013-12-17 12:54:03 -07:00
Kent Overstreet	16749c23c0	bcache: New writeback PD controller The old writeback PD controller could get into states where it had throttled all the way down and take way too long to recover - it was too complicated to really understand what it was doing. This rewrites a good chunk of it to hopefully be simpler and make more sense, and it also pays more attention to units which should make the behaviour a bit easier to understand. Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2013-12-16 14:22:59 -08:00
Kent Overstreet	6d3d1a9c54	bcache: bugfix for race between moving_gc and bucket_invalidate There is a possibility for a bucket to be invalidated by the allocator while moving_gc was copying it's contents to another bucket, if the bucket only held cached data. To prevent this moving checks for a stale ptr (to an invalidated bucket), before and after reads. It it finds one, it simply ignores moving that data. This only affects bcache if the moving_gc was turned on, note that it's off by default. Signed-off-by: Nicholas Swenson <nks@daterainc.com> Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2013-12-16 14:22:58 -08:00
Nicholas Swenson	bf0a628a95	bcache: fix for gc and writeback race Garbage collector needs to check keys in the writeback keybuf to make sure it's not invalidating buckets to which the writeback keys point to. Signed-off-by: Nicholas Swenson <nks@daterainc.com> Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2013-12-16 14:22:58 -08:00
Nicholas Swenson	981aa8c091	bcache: bugfix - moving_gc now moves only correct buckets Removed gc_move_threshold because picking buckets only by threshold could lead moving extra buckets (ei. if there are buckets at the threshold that aren't supposed to be moved do to space considerations). This is replaced by a GC_MOVE bit in the gc_mark bitmask. Now only marked buckets get moved. Signed-off-by: Nicholas Swenson <nks@daterainc.com> Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2013-12-16 14:22:58 -08:00
Nicholas Swenson	bee63f40cb	bcache: fix for gc crashing when no sectors are used Signed-off-by: Nicholas Swenson <nks@daterainc.com> Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2013-12-16 14:22:57 -08:00
Nicholas Swenson	97d11a660f	bcache: Fix heap_peek() macro Signed-off-by: Nicholas Swenson <nks@daterainc.com> Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2013-12-16 14:22:57 -08:00
Nicholas Swenson	9eb8ebeb24	bcache: Fix for can_attach_cache() Signed-off-by: Nicholas Swenson <nks@daterainc.com> Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2013-12-16 14:22:57 -08:00
Kent Overstreet	d24a6e1087	bcache: Fix dirty_data accounting Dirty data accounting wasn't quite right - firstly, we were adding the key we're inserting after it could have merged with another dirty key already in the btree, and secondly we could sometimes pass the wrong offset to bcache_dev_sectors_dirty_add() for dirty data we were overwriting - which is important when tracking dirty data by stripe. NOTE FOR BACKPORTERS: For 3.10 (and 3.11?) there's other accounting fixes necessary that got squashed in with other patches; the full patch against 3.10 is 408cc2f47eeac93a, available at: git://evilpiepirate.org/~kent/linux-bcache.git bcache-3.10-writeback-fixes Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: linux-stable <stable@vger.kernel.org> # >= v3.10 diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c index 2a46036..4a12b2f 100644 --- a/drivers/md/bcache/btree.c +++ b/drivers/md/bcache/btree.c @@ -1817,7 +1817,8 @@ static bool fix_overlapping_extents(struct btree b, struct bkey insert, if (KEY_START(k) > KEY_START(insert) + sectors_found) goto check_failed; - if (KEY_PTRS(replace_key) != KEY_PTRS(k)) + if (KEY_PTRS(k) != KEY_PTRS(replace_key) \|\| + KEY_DIRTY(k) != KEY_DIRTY(replace_key)) goto check_failed; /* skip past gen */	2013-12-16 14:22:16 -08:00
Kent Overstreet	ce2b3f595e	bcache: Use uninterruptible sleep in writeback We're just waiting on kthread_should_stop(), nothing else, so interruptible sleep was wrong here. Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2013-12-16 14:04:57 -08:00
Stefan Priebe	f665c0f852	bcache: kthread don't set writeback task to INTERUPTIBLE at the beginning (schedule_timout_interuptible) and others do his on their own This prevents wrong load average calculation (load of 1 per thread) Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2013-12-16 14:04:57 -08:00
Linus Torvalds	93e1585e2c	A set of device-mapper fixes for 3.13. A fix for possible memory corruption during DM table load, fix a possible leak of snapshot space in case of a crash, fix a possible deadlock due to a shared workqueue in the delay target, fix to initialize read-only module parameters that are used to export metrics for dm stats and dm bufio. Quite a few stable fixes were identified for both the thin-provisioning and caching targets as a result of increased regression testing using the device-mapper-test-suite (dmts). The most notable of these are the reference counting fixes for the space map btree that is used by the dm-array interface -- without these the dm-cache metadata will leak, resulting in dm-cache devices running out of metadata blocks. Also, some important fixes related to the thin-provisioning target's transition to read-only mode on error. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.15 (GNU/Linux) iQEcBAABAgAGBQJSq2pVAAoJEMUj8QotnQNaAzEH/0qpz06SVd3+oTbsWeWxXwax B0zIZ4HPnaD9FDfSWN+6IdQ6Rkn51cspMBHPxWX0I8pmqOTg7MUM7wbaZZaKT3UW aDlibzG1O3zBGbkr+Qhbh89fJK5/3xnZXlK0hOyaNI2vz1rA6RThZ2hIeak9xQYr 7USz2bjSMXchwebb0Z01CrRnOkhUFg6yUJURIEu4XFCJmDM/PM9zKogLGYKuZedK pZePnZhwgkLMn7f9l4lPHk3EcWeD4Nf9WR0lS4eKdbYsvMxm1QE7BpDlVJD0Asjg /SVOVkgygW18TINMHuw1K0DPdSCo5UZF2HRj1QUD/X4N2BSkio+E1qwF6kT4ltk= =0Ix+ -----END PGP SIGNATURE----- Merge tag 'dm-3.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: "A set of device-mapper fixes for 3.13. A fix for possible memory corruption during DM table load, fix a possible leak of snapshot space in case of a crash, fix a possible deadlock due to a shared workqueue in the delay target, fix to initialize read-only module parameters that are used to export metrics for dm stats and dm bufio. Quite a few stable fixes were identified for both the thin- provisioning and caching targets as a result of increased regression testing using the device-mapper-test-suite (dmts). The most notable of these are the reference counting fixes for the space map btree that is used by the dm-array interface -- without these the dm-cache metadata will leak, resulting in dm-cache devices running out of metadata blocks. Also, some important fixes related to the thin-provisioning target's transition to read-only mode on error" * tag 'dm-3.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm array: fix a reference counting bug in shadow_ablock dm space map: disallow decrementing a reference count below zero dm stats: initialize read-only module parameter dm bufio: initialize read-only module parameters dm cache: actually resize cache dm cache: update Documentation for invalidate_cblocks's range syntax dm cache policy mq: fix promotions to occur as expected dm thin: allow pool in read-only mode to transition to read-write mode dm thin: re-establish read-only state when switching to fail mode dm thin: always fallback the pool mode if commit fails dm thin: switch to read-only mode if metadata space is exhausted dm thin: switch to read only mode if a mapping insert fails dm space map metadata: return on failure in sm_metadata_new_block dm table: fail dm_table_create on dm_round_up overflow dm snapshot: avoid snapshot space leak on crash dm delay: fix a possible deadlock due to shared workqueue	2013-12-13 13:22:22 -08:00
Joe Thornber	ed9571f0cf	dm array: fix a reference counting bug in shadow_ablock An old array block could have its reference count decremented below zero when it is being replaced in the btree by a new array block. The fix is to increment the old ablock's reference count just before inserting a new ablock into the btree. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 3.9+	2013-12-13 14:22:10 -05:00
Joe Thornber	5b564d80f8	dm space map: disallow decrementing a reference count below zero The old behaviour, returning -EINVAL if a ref_count of 0 would be decremented, was removed in commit `f722063` ("dm space map: optimise sm_ll_dec and sm_ll_inc"). To fix this regression we return an error code from the mutator function pointer passed to sm_ll_mutate() and have dec_ref_count() return -EINVAL if the old ref_count is 0. Add a DMERR to reflect the potential seriousness of this error. Also, add missing dm_tm_unlock() to sm_ll_mutate()'s error path. With this fix the following dmts regression test now passes: dmtest run --suite cache -n /metadata_use_kernel/ The next patch fixes the higher-level dm-array code that exposed this regression. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 3.12+	2013-12-13 14:22:09 -05:00
Tejun Heo	324a56e16e	kernfs: s/sysfs_dirent/kernfs_node/ and rename its friends accordingly kernfs has just been separated out from sysfs and we're already in full conflict mode. Nothing can make the situation any worse. Let's take the chance to name things properly. This patch performs the following renames. * s/sysfs_elem_dir/kernfs_elem_dir/ * s/sysfs_elem_symlink/kernfs_elem_symlink/ * s/sysfs_elem_attr/kernfs_elem_file/ * s/sysfs_dirent/kernfs_node/ * s/sd/kn/ in kernfs proper * s/parent_sd/parent/ * s/target_sd/target/ * s/dir_sd/parent/ * s/to_sysfs_dirent()/rb_to_kn()/ * misc renames of local vars when they conflict with the above Because md, mic and gpio dig into sysfs details, this patch ends up modifying them. All are sysfs_dirent renames and trivial. While we can avoid these by introducing a dummy wrapping struct sysfs_dirent around kernfs_node, given the limited usage outside kernfs and sysfs proper, I don't think such workaround is called for. This patch is strictly rename only and doesn't introduce any functional difference. - mic / gpio renames were missing. Spotted by kbuild test robot. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Neil Brown <neilb@suse.de> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com> Cc: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-12-11 15:28:36 -08:00
Mikulas Patocka	76f5bee5c3	dm stats: initialize read-only module parameter The module parameter stats_current_allocated_bytes in dm-mod is read-only. This parameter informs the user about memory consumption. It is not supposed to be changed by the user. However, despite being read-only, this parameter can be set on modprobe or insmod command line: modprobe dm-mod stats_current_allocated_bytes=12345 The kernel doesn't expect that this variable can be non-zero at module initialization and if the user sets it, it results in warning. This patch initializes the variable in the module init routine, so that user-supplied value is ignored. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 3.12+	2013-12-10 19:13:21 -05:00
Mikulas Patocka	4cb57ab4a2	dm bufio: initialize read-only module parameters Some module parameters in dm-bufio are read-only. These parameters inform the user about memory consumption. They are not supposed to be changed by the user. However, despite being read-only, these parameters can be set on modprobe or insmod command line, for example: modprobe dm-bufio current_allocated_bytes=12345 The kernel doesn't expect that these variables can be non-zero at module initialization and if the user sets them, it results in BUG. This patch initializes the variables in the module init routine, so that user-supplied values are ignored. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 3.2+	2013-12-10 19:13:20 -05:00
Vincent Pelletier	088448007b	dm cache: actually resize cache Commit `f494a9c6b1` ("dm cache: cache shrinking support") broke cache resizing support. dm_cache_resize() is called with cache->cache_size before it gets updated to new_size, so it is a no-op. But the dm-cache superblock is updated with the new_size even though the backing dm-array is not resized. Fix this by passing the new_size to dm_cache_resize(). Signed-off-by: Vincent Pelletier <plr.vincent@gmail.com> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2013-12-10 16:35:15 -05:00
Joe Thornber	af95e7a69b	dm cache policy mq: fix promotions to occur as expected Micro benchmarks that repeatedly issued IO to a single block were failing to cause a promotion from the origin device to the cache. Fix this by not updating the stats during map() if -EWOULDBLOCK will be returned. The mq policy will only update stats, consider migration, etc, once per tick period (a unit of time established between dm-cache core and the policies). When the IO thread calls the policy's map method, if it would like to migrate the associated block it returns -EWOULDBLOCK, the IO then gets handed over to a worker thread which handles the migration. The worker thread calls map again, to check the migration is still needed (avoids a race among other things). BUT, before this fix, if we were still in the same tick period the stats were already updated by the previous map call -- so the migration would no longer be requested. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2013-12-10 16:35:14 -05:00
Joe Thornber	9b7aaa64f9	dm thin: allow pool in read-only mode to transition to read-write mode A thin-pool may be in read-only mode because the pool's data or metadata space was exhausted. To allow for recovery, by adding more space to the pool, we must allow a pool to transition from PM_READ_ONLY to PM_WRITE mode. Otherwise, running out of space will render the pool permanently read-only. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2013-12-10 16:35:13 -05:00
Joe Thornber	5383ef3a92	dm thin: re-establish read-only state when switching to fail mode If the thin-pool transitioned to fail mode and the thin-pool's table were reloaded for some reason: the new table's default pool mode would be read-write, though it will transition to fail mode during resume. When the pool mode transitions directly from PM_WRITE to PM_FAIL we need to re-establish the intermediate read-only state in both the metadata and persistent-data block manager (as is usually done with the normal pool mode transition sequence: PM_WRITE -> PM_READ_ONLY -> PM_FAIL). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2013-12-10 16:35:12 -05:00
Joe Thornber	020cc3b5e2	dm thin: always fallback the pool mode if commit fails Rename commit_or_fallback() to commit(). Now all previous calls to commit() will trigger the pool mode to fallback if the commit fails. Also, check the error returned from commit() in alloc_data_block(). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2013-12-10 16:35:12 -05:00
Mike Snitzer	4a02b34e0c	dm thin: switch to read-only mode if metadata space is exhausted Switch the thin pool to read-only mode in alloc_data_block() if dm_pool_alloc_data_block() fails because the pool's metadata space is exhausted. Differentiate between data and metadata space in messages about no free space available. This issue was noticed with the device-mapper-test-suite using: dmtest run --suite thin-provisioning -n /exhausting_metadata_space_causes_fail_mode/ The quantity of errors logged in this case must be reduced. before patch: device-mapper: thin: 253:4: reached low water mark for metadata device: sending event. device-mapper: space map metadata: unable to allocate new metadata block device-mapper: space map common: dm_tm_shadow_block() failed device-mapper: space map metadata: unable to allocate new metadata block device-mapper: space map common: dm_tm_shadow_block() failed device-mapper: space map metadata: unable to allocate new metadata block device-mapper: space map common: dm_tm_shadow_block() failed device-mapper: space map metadata: unable to allocate new metadata block device-mapper: space map common: dm_tm_shadow_block() failed device-mapper: space map metadata: unable to allocate new metadata block device-mapper: space map common: dm_tm_shadow_block() failed <snip ... these repeat for a _very_ long while ... > device-mapper: space map metadata: unable to allocate new metadata block device-mapper: thin: 253:4: commit failed: error = -28 device-mapper: thin: 253:4: switching pool to read-only mode after patch: device-mapper: thin: 253:4: reached low water mark for metadata device: sending event. device-mapper: space map metadata: unable to allocate new metadata block device-mapper: thin: 253:4: no free metadata space available. device-mapper: thin: 253:4: switching pool to read-only mode Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Cc: stable@vger.kernel.org	2013-12-10 16:35:04 -05:00
Joe Thornber	fafc7a815e	dm thin: switch to read only mode if a mapping insert fails Switch the thin pool to read-only mode when dm_thin_insert_block() fails since there is little reason to expect the cause of the failure to be resolved without further action by user space. This issue was noticed with the device-mapper-test-suite using: dmtest run --suite thin-provisioning -n /exhausting_metadata_space_causes_fail_mode/ The quantity of errors logged in this case must be reduced. before patch: device-mapper: thin: dm_thin_insert_block() failed device-mapper: space map metadata: unable to allocate new metadata block device-mapper: thin: dm_thin_insert_block() failed device-mapper: space map metadata: unable to allocate new metadata block device-mapper: thin: dm_thin_insert_block() failed device-mapper: space map metadata: unable to allocate new metadata block device-mapper: thin: dm_thin_insert_block() failed device-mapper: space map metadata: unable to allocate new metadata block device-mapper: thin: dm_thin_insert_block() failed device-mapper: space map metadata: unable to allocate new metadata block device-mapper: thin: dm_thin_insert_block() failed device-mapper: space map metadata: unable to allocate new metadata block device-mapper: thin: dm_thin_insert_block() failed device-mapper: space map metadata: unable to allocate new metadata block device-mapper: thin: dm_thin_insert_block() failed device-mapper: space map metadata: unable to allocate new metadata block device-mapper: thin: dm_thin_insert_block() failed device-mapper: space map metadata: unable to allocate new metadata block device-mapper: thin: dm_thin_insert_block() failed device-mapper: space map metadata: unable to allocate new metadata block device-mapper: space map metadata: unable to allocate new metadata block device-mapper: space map metadata: unable to allocate new metadata block device-mapper: space map metadata: unable to allocate new metadata block device-mapper: space map metadata: unable to allocate new metadata block device-mapper: space map metadata: unable to allocate new metadata block <snip ... these repeat for a long while ... > device-mapper: space map metadata: unable to allocate new metadata block device-mapper: space map common: dm_tm_shadow_block() failed device-mapper: thin: 253:4: no free metadata space available. device-mapper: thin: 253:4: switching pool to read-only mode after patch: device-mapper: space map metadata: unable to allocate new metadata block device-mapper: thin: 253:4: dm_thin_insert_block() failed: error = -28 device-mapper: thin: 253:4: switching pool to read-only mode Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2013-12-10 16:34:29 -05:00
Mike Snitzer	f62b6b8f49	dm space map metadata: return on failure in sm_metadata_new_block Commit `2fc48021f4` ("dm persistent metadata: add space map threshold callback") introduced a regression to the metadata block allocation path that resulted in errors being ignored. This regression was uncovered by running the following device-mapper-test-suite test: dmtest run --suite thin-provisioning -n /exhausting_metadata_space_causes_fail_mode/ The ignored error codes in sm_metadata_new_block() could crash the kernel through use of either the dm-thin or dm-cache targets, e.g.: device-mapper: thin: 253:4: reached low water mark for metadata device: sending event. device-mapper: space map metadata: unable to allocate new metadata block general protection fault: 0000 [#1] SMP ... Workqueue: dm-thin do_worker [dm_thin_pool] task: ffff880035ce2ab0 ti: ffff88021a054000 task.ti: ffff88021a054000 RIP: 0010:[<ffffffffa0331385>] [<ffffffffa0331385>] metadata_ll_load_ie+0x15/0x30 [dm_persistent_data] RSP: 0018:ffff88021a055a68 EFLAGS: 00010202 RAX: 003fc8243d212ba0 RBX: ffff88021a780070 RCX: ffff88021a055a78 RDX: ffff88021a055a78 RSI: 0040402222a92a80 RDI: ffff88021a780070 RBP: ffff88021a055a68 R08: ffff88021a055ba4 R09: 0000000000000010 R10: 0000000000000000 R11: 00000002a02e1000 R12: ffff88021a055ad4 R13: 0000000000000598 R14: ffffffffa0338470 R15: ffff88021a055ba4 FS: 0000000000000000(0000) GS:ffff88033fca0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007f467c0291b8 CR3: 0000000001a0b000 CR4: 00000000000007e0 Stack: ffff88021a055ab8 ffffffffa0332020 ffff88021a055b30 0000000000000001 ffff88021a055b30 0000000000000000 ffff88021a055b18 0000000000000000 ffff88021a055ba4 ffff88021a055b98 ffff88021a055ae8 ffffffffa033304c Call Trace: [<ffffffffa0332020>] sm_ll_lookup_bitmap+0x40/0xa0 [dm_persistent_data] [<ffffffffa033304c>] sm_metadata_count_is_more_than_one+0x8c/0xc0 [dm_persistent_data] [<ffffffffa0333825>] dm_tm_shadow_block+0x65/0x110 [dm_persistent_data] [<ffffffffa0331b00>] sm_ll_mutate+0x80/0x300 [dm_persistent_data] [<ffffffffa0330e60>] ? set_ref_count+0x10/0x10 [dm_persistent_data] [<ffffffffa0331dba>] sm_ll_inc+0x1a/0x20 [dm_persistent_data] [<ffffffffa0332270>] sm_disk_new_block+0x60/0x80 [dm_persistent_data] [<ffffffff81520036>] ? down_write+0x16/0x40 [<ffffffffa001e5c4>] dm_pool_alloc_data_block+0x54/0x80 [dm_thin_pool] [<ffffffffa001b23c>] alloc_data_block+0x9c/0x130 [dm_thin_pool] [<ffffffffa001c27e>] provision_block+0x4e/0x180 [dm_thin_pool] [<ffffffffa001fe9a>] ? dm_thin_find_block+0x6a/0x110 [dm_thin_pool] [<ffffffffa001c57a>] process_bio+0x1ca/0x1f0 [dm_thin_pool] [<ffffffff8111e2ed>] ? mempool_free+0x8d/0xa0 [<ffffffffa001d755>] process_deferred_bios+0xc5/0x230 [dm_thin_pool] [<ffffffffa001d911>] do_worker+0x51/0x60 [dm_thin_pool] [<ffffffff81067872>] process_one_work+0x182/0x3b0 [<ffffffff81068c90>] worker_thread+0x120/0x3a0 [<ffffffff81068b70>] ? manage_workers+0x160/0x160 [<ffffffff8106eb2e>] kthread+0xce/0xe0 [<ffffffff8106ea60>] ? kthread_freezable_should_stop+0x70/0x70 [<ffffffff8152af6c>] ret_from_fork+0x7c/0xb0 [<ffffffff8106ea60>] ? kthread_freezable_should_stop+0x70/0x70 [<ffffffff8152af6c>] ret_from_fork+0x7c/0xb0 [<ffffffff8106ea60>] ? kthread_freezable_should_stop+0x70/0x70 Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Cc: stable@vger.kernel.org # v3.10+	2013-12-10 16:34:28 -05:00
Mikulas Patocka	5b2d06576c	dm table: fail dm_table_create on dm_round_up overflow The dm_round_up function may overflow to zero. In this case, dm_table_create() must fail rather than go on to allocate an empty array with alloc_targets(). This fixes a possible memory corruption that could be caused by passing too large a number in "param->target_count". Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2013-12-10 16:34:27 -05:00
Mikulas Patocka	230c83afdd	dm snapshot: avoid snapshot space leak on crash There is a possible leak of snapshot space in case of crash. The reason for space leaking is that chunks in the snapshot device are allocated sequentially, but they are finished (and stored in the metadata) out of order, depending on the order in which copying finished. For example, supposed that the metadata contains the following records SUPERBLOCK METADATA (blocks 0 ... 250) DATA 0 DATA 1 DATA 2 ... DATA 250 Now suppose that you allocate 10 new data blocks 251-260. Suppose that copying of these blocks finish out of order (block 260 finished first and the block 251 finished last). Now, the snapshot device looks like this: SUPERBLOCK METADATA (blocks 0 ... 250, 260, 259, 258, 257, 256) DATA 0 DATA 1 DATA 2 ... DATA 250 DATA 251 DATA 252 DATA 253 DATA 254 DATA 255 METADATA (blocks 255, 254, 253, 252, 251) DATA 256 DATA 257 DATA 258 DATA 259 DATA 260 Now, if the machine crashes after writing the first metadata block but before writing the second metadata block, the space for areas DATA 250-255 is leaked, it contains no valid data and it will never be used in the future. This patch makes dm-snapshot complete exceptions in the same order they were allocated, thus fixing this bug. Note: when backporting this patch to the stable kernel, change the version field in the following way: * if version in the stable kernel is {1, 11, 1}, change it to {1, 12, 0} * if version in the stable kernel is {1, 10, 0} or {1, 10, 1}, change it to {1, 10, 2} Userspace reads the version to determine if the bug was fixed, so the version change is needed. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2013-12-10 16:34:25 -05:00
Linus Torvalds	5ee540613d	Merge branch 'for-linus' of git://git.kernel.dk/linux-block Pull block layer fixes from Jens Axboe: "A small collection of fixes for the current series. It contains: - A fix for a use-after-free of a request in blk-mq. From Ming Lei - A fix for a blk-mq bug that could attempt to dereference a NULL rq if allocation failed - Two xen-blkfront small fixes - Cleanup of submit_bio_wait() type uses in the kernel, unifying that. From Kent - A fix for 32-bit blkg_rwstat reading. I apologize for this one looking mangled in the shortlog, it's entirely my fault for missing an empty line between the description and body of the text" * 'for-linus' of git://git.kernel.dk/linux-block: blk-mq: fix use-after-free of request blk-mq: fix dereference of rq->mq_ctx if allocation fails block: xen-blkfront: Fix possible NULL ptr dereference xen-blkfront: Silence pfn maybe-uninitialized warning block: submit_bio_wait() conversions Update of blkg_stat and blkg_rwstat may happen in bh context	2013-12-05 15:33:27 -08:00
Mike Snitzer	8d30726912	dm cache: increment bi_remaining when bi_end_io is restored Move the bio->bi_remaining increment into dm_unhook_bio() so the overwrite_endio() handler works as expected. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-12-03 19:16:04 -07:00
Wei Yongjun	08239ca2a0	bcache: fix sparse non static symbol warning Fixes the following sparse warning: drivers/md/bcache/btree.c:2220:5: warning: symbol 'btree_insert_fn' was not declared. Should it be static? Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2013-11-28 17:05:58 -08:00
NeilBrown	6d183de407	md/raid5: fix newly-broken locking in get_active_stripe. commit `566c09c534` raid5: relieve lock contention in get_active_stripe() modified the locking in get_active_stripe() reducing the range protected by the (highly contended) device_lock. Unfortunately it reduced the range too much opening up some races. One race can occur if get_priority_stripe runs between the test on sh->count and device_lock being taken. This will mean that sh->lru is not empty while get_active_stripe thinks ->count is zero resulting in a 'BUG' firing. Another race happens if __release_stripe is called immediately after sh->count is tested and found to be non-zero. If STRIPE_HANDLE is not set, get_active_stripe should increment ->active_stripes when it increments ->count from 0, but as it didn't think it was 0, it doesn't. Extending device_lock to cover the test on sh->count close these races. While we are here, fix the two BUG tests: -If count is zero, then lru really must not be empty, or we've lock the stripe_head somehow - no other tests are relevant. -STRIPE_ON_RELEASE_LIST is completely independent of ->lru so testing it is pointless. Reported-and-tested-by: Brassow Jonathan <jbrassow@redhat.com> Reviewed-by: Shaohua Li <shli@kernel.org> Fixes: `566c09c534` Signed-off-by: NeilBrown <neilb@suse.de>	2013-11-28 11:00:15 +11:00
NeilBrown	142d44c310	md: test mddev->flags more safely in md_check_recovery. commit `7a0a5355cb` md: Don't test all of mddev->flags at once. made most tests on mddev->flags safer, but missed one. When commit `260fa034ef` md: avoid deadlock when dirty buffers during md_stop. added MD_STILL_CLOSED, this caused md_check_recovery to misbehave. It can think there is something to do but find nothing. This can lead to the md thread spinning during array shutdown. https://bugzilla.kernel.org/show_bug.cgi?id=65721 Reported-and-tested-by: Richard W.M. Jones <rjones@redhat.com> Fixes: `260fa034ef` Cc: stable@vger.kernel.org (3.12) Signed-off-by: NeilBrown <neilb@suse.de>	2013-11-28 11:00:08 +11:00
NeilBrown	0c775d5208	md/raid5: fix new memory-reference bug in alloc_thread_groups. In alloc_thread_groups, worker_groups is a pointer to an array, not an array of pointers. So worker_groups[i] is wrong. It should be &(*worker_groups)[i] Found-by: coverity Fixes: `60aaf93385` Reported-by: Ben Hutchings <bhutchings@solarflare.com> Cc: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2013-11-28 11:00:04 +11:00
Kent Overstreet	c170bbb45f	block: submit_bio_wait() conversions It was being open coded in a few places. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joern Engel <joern@logfs.org> Cc: Prasad Joshi <prasadjoshi.linux@gmail.com> Cc: Neil Brown <neilb@suse.de> Cc: Chris Mason <chris.mason@fusionio.com> Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-11-24 16:33:41 -07:00
Kent Overstreet	20d0189b10	block: Introduce new bio_split() The new bio_split() can split arbitrary bios - it's not restricted to single page bios, like the old bio_split() (previously renamed to bio_pair_split()). It also has different semantics - it doesn't allocate a struct bio_pair, leaving it up to the caller to handle completions. Then convert the existing bio_pair_split() users to the new bio_split() - and also nvme, which was open coding bio splitting. (We have to take that BUG_ON() out of bio_integrity_trim() because this bio_split() needs to use it, and there's no reason it has to be used on bios marked as cloned; BIO_CLONED doesn't seem to have clearly documented semantics anyways.) Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Keith Busch <keith.busch@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Neil Brown <neilb@suse.de>	2013-11-23 22:33:57 -08:00
Kent Overstreet	ee67891bf1	block: Rename bio_split() -> bio_pair_split() This is prep work for introducing a more general bio_split(). Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: NeilBrown <neilb@suse.de> Cc: Alasdair Kergon <agk@redhat.com> Cc: Lars Ellenberg <lars.ellenberg@linbit.com> Cc: Peter Osterlund <petero2@telia.com> Cc: Sage Weil <sage@inktank.com>	2013-11-23 22:33:56 -08:00
Kent Overstreet	196d38bccf	block: Generic bio chaining This adds a generic mechanism for chaining bio completions. This is going to be used for a bio_split() replacement, and it turns out to be very useful in a fair amount of driver code - a fair number of drivers were implementing this in their own roundabout ways, often painfully. Note that this means it's no longer to call bio_endio() more than once on the same bio! This can cause problems for drivers that save/restore bi_end_io. Arguably they shouldn't be saving/restoring bi_end_io at all - in all but the simplest cases they'd be better off just cloning the bio, and immutable biovecs is making bio cloning cheaper. But for now, we add a bio_endio_nodec() for these cases. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk>	2013-11-23 22:33:56 -08:00
Kent Overstreet	e90abc8ec3	block: Remove bi_idx hacks Now that drivers have been converted to the new bvec_iter primitives, there's no need to trim the bvec before we submit it; and we can't trim it once we start sharing bvecs. It used to be that passing a partially completed bio (i.e. one with nonzero bi_idx) to generic_make_request() was a dangerous thing - various drivers would choke on such things. But with immutable biovecs and our new bio splitting that shares the biovecs, submitting partially completed bios has to work (and should work, now that all the drivers have been completed to the new primitives) Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk>	2013-11-23 22:33:55 -08:00
Kent Overstreet	1c3b13e64c	dm: Refactor for new bio cloning/splitting We need to convert the dm code to the new bvec_iter primitives which respect bi_bvec_done; they also allow us to drastically simplify dm's bio splitting code. Also, it's no longer necessary to save/restore the bvec array anymore - driver conversions for immutable bvecs are done, so drivers should never be modifying it. Also kill bio_sector_offset(), dm was the only user and it doesn't make much sense anymore. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Alasdair Kergon <agk@redhat.com> Cc: dm-devel@redhat.com Reviewed-by: Mike Snitzer <snitzer@redhat.com>	2013-11-23 22:33:55 -08:00
Kent Overstreet	59d276fe02	block: Add bio_clone_fast() bio_clone() just got more expensive - however, most users of bio_clone() don't actually need to modify the biovec. If they aren't modifying the biovec, and they can guarantee that the original bio isn't freed before the clone (also true in most cases), we can just point the clone at the original bio's biovec. Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2013-11-23 22:33:54 -08:00
Kent Overstreet	003b5c5719	block: Convert drivers to immutable biovecs Now that we've got a mechanism for immutable biovecs - bi_iter.bi_bvec_done - we need to convert drivers to use primitives that respect it instead of using the bvec array directly. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: NeilBrown <neilb@suse.de> Cc: Alasdair Kergon <agk@redhat.com> Cc: dm-devel@redhat.com	2013-11-23 22:33:51 -08:00

1 2 3 4 5 ...

3154 Commits