linux

Author	SHA1	Message	Date
Joe Thornber	9153df7405	dm cache: fix leaking of deferred bio prison cells There were two cases where dm_cell_visit_release() was being called, which removes the cell from the prison's rbtree, but the callers didn't also return the cell to the mempool. Fix this by having them call free_prison_cell(). This leak manifested as the 'kmalloc-96' slab growing until OOM. Fixes: `651f5fa2a3` ("dm cache: defer whole cells") Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 4.1+	2015-08-31 15:08:14 -04:00
Joe Thornber	e44b6a5a3c	dm cache: move wake_waker() from free_migrations() to where it is needed This stops spurious wake ups from calls to prealloc_free_structs(). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-08-12 11:32:20 -04:00
Mike Snitzer	795e633a2d	dm cache: fix device destroy hang due to improper prealloc_used accounting Commit `665022d72f` ("dm cache: avoid calls to prealloc_free_structs() if possible") introduced a regression that caused the removal of a DM cache device to hang in cache_postsuspend()'s call to wait_for_migrations() with the following stack trace: [<ffffffff81651457>] schedule+0x37/0x80 [<ffffffffa041e21b>] cache_postsuspend+0xbb/0x470 [dm_cache] [<ffffffff810ba970>] ? prepare_to_wait_event+0xf0/0xf0 [<ffffffffa0006f77>] dm_table_postsuspend_targets+0x47/0x60 [dm_mod] [<ffffffffa0001eb5>] __dm_destroy+0x215/0x250 [dm_mod] [<ffffffffa0004113>] dm_destroy+0x13/0x20 [dm_mod] [<ffffffffa00098cd>] dev_remove+0x10d/0x170 [dm_mod] [<ffffffffa00097c0>] ? dev_suspend+0x240/0x240 [dm_mod] [<ffffffffa0009f85>] ctl_ioctl+0x255/0x4d0 [dm_mod] [<ffffffff8127ac00>] ? SYSC_semtimedop+0x280/0xe10 [<ffffffffa000a213>] dm_ctl_ioctl+0x13/0x20 [dm_mod] [<ffffffff811fd432>] do_vfs_ioctl+0x2d2/0x4b0 [<ffffffff81117d5f>] ? __audit_syscall_entry+0xaf/0x100 [<ffffffff81022636>] ? do_audit_syscall_entry+0x66/0x70 [<ffffffff811fd689>] SyS_ioctl+0x79/0x90 [<ffffffff81023e58>] ? syscall_trace_leave+0xb8/0x110 [<ffffffff81654f6e>] entry_SYSCALL_64_fastpath+0x12/0x71 Fix this by accounting for the call to prealloc_data_structs() immediately _before_ the call as opposed to after. This is needed because it is possible to break out of the control loop after the call to prealloc_data_structs() but before prealloc_used was set to true. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-07-29 14:32:09 -04:00
Mike Snitzer	3508e6590d	Revert "dm cache: do not wake_worker() in free_migration()" This reverts commit `386cb7cdee`. Taking the wake_worker() out of free_migration() will slow writeback dramatically, and hence adaptability. Say we have 10k blocks that need writing back, but are only able to issue 5 concurrently due to the migration bandwidth: it's imperative that we wake_worker() immediately after migration completion; waiting for the next 1 second wake up (via do_waker) means it'll take a long time to write that all back. Reported-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-07-29 14:32:08 -04:00
Mike Snitzer	665022d72f	dm cache: avoid calls to prealloc_free_structs() if possible If no work was performed then prealloc_data_structs() wasn't ever called so there isn't any need to call prealloc_free_structs(). Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-07-16 22:32:08 -04:00
Mike Snitzer	e782eff591	dm cache: avoid preallocation if no work in writeback_some_dirty_blocks() Refactor writeback_some_dirty_blocks() to avoid prealloc_data_structs() if the policy doesn't have any dirty blocks ready for writeback. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-07-16 22:32:07 -04:00
Mike Snitzer	386cb7cdee	dm cache: do not wake_worker() in free_migration() All methods that queue work call wake_worker() as you'd expect. E.g. cell_defer, defer_bio, quiesce_migration (which is called by writeback, promote, demote_then_promote, invalidate, discard, etc). Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-07-16 22:32:06 -04:00
Mike Snitzer	255eac2005	dm cache: display 'needs_check' in status if it is set There is currently no way to see that the needs_check flag has been set in the metadata. Display 'needs_check' in the cache status if it is set in the cache metadata. Also, update cache documentation. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-07-16 10:23:50 -04:00
Joe Thornber	fba10109a4	dm cache: age and write back cache entries even without active IO The policy tick() method is normally called from interrupt context. Both the mq and smq policies do some bottom half work for the tick method in their map functions. However if no IO is going through the cache, then that bottom half work doesn't occur. With these policies this means recently hit entries do not age and do not get written back as early as we'd like. Fix this by introducing a new 'can_block' parameter to the tick() method. When this is set the bottom half work occurs immediately. 'can_block' is set when the tick method is called every second by the core target (not in interrupt context). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-06-11 17:13:01 -04:00
Mike Snitzer	b61d950962	dm cache: prefix all DMERR and DMINFO messages with cache device name Having the DM device name associated with the ERR or INFO message is very helpful. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-06-11 17:13:01 -04:00
Joe Thornber	028ae9f76f	dm cache: add fail io mode and needs_check flag If a cache metadata operation fails (e.g. transaction commit) the cache's metadata device will abort the current transaction, set a new needs_check flag, and the cache will transition to "read-only" mode. If aborting the transaction or setting the needs_check flag fails the cache will transition to "fail-io" mode. Once needs_check is set the cache device will not be allowed to activate. Activation requires write access to metadata. Future work is needed to add proper support for running the cache in read-only mode. Once in fail-io mode the cache will report a status of "Fail". Also, add commit() wrapper that will disallow commits if in read_only or fail mode. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-06-11 17:13:00 -04:00
Joe Thornber	88bf5184fa	dm cache: wake the worker thread every time we free a migration object When the cache is idle, writeback work was only being issued every second. With this change outstanding writebacks are streamed constantly. This offers a writeback performance improvement. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-06-11 17:13:00 -04:00
Joe Thornber	40775257b9	dm cache: boost promotion of blocks that will be overwritten When considering whether to move a block to the cache we already give preferential treatment to discarded blocks, since they are cheap to promote (no read of the origin required since the data is junk). The same is true of blocks that are about to be completely overwritten, so we likewise boost their promotion chances. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-05-29 14:19:07 -04:00
Joe Thornber	651f5fa2a3	dm cache: defer whole cells Currently individual bios are deferred to the worker thread if they cannot be processed immediately (eg, a block is in the process of being moved to the fast device). This patch passes whole cells across to the worker. This saves reaquiring the cell, and also collects bios destined for the same block together, which allows them to be mapped with a single look up to the policy. This reduces the overhead of using dm-cache. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-05-29 14:19:06 -04:00
Joe Thornber	451b9e0071	dm cache: pull out some bitset utility functions for reuse Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-05-29 14:19:05 -04:00
Joe Thornber	20f6814b94	dm cache: pass a new 'critical' flag to the policies when requesting writeback work We only allow non critical writeback if the origin is idle. It is up to the policy to decide what writeback work is critical. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-05-29 14:19:04 -04:00
Joe Thornber	066dbaa386	dm cache: track IO to the origin device using io_tracker Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-05-29 14:19:04 -04:00
Joe Thornber	77289d3207	dm cache: add io_tracker A little class that keeps track of the volume of io that is in flight, and the length of time that a device has been idle for. FIXME: rather than jiffes, may be best to use ktime_t (to support faster devices). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-05-29 14:19:03 -04:00
Joe Thornber	fb4100ae7f	dm cache: fix race when issuing a POLICY_REPLACE operation There is a race between a policy deciding to replace a cache entry, the core target writing back any dirty data from this block, and other IO threads doing IO to the same block. This sort of problem is avoided most of the time by the core target grabbing a bio prison cell before making the request to the policy. But for a demotion the core target doesn't know which block will be demoted, so can't do this in advance. Fix this demotion race by introducing a callback to the policy interface that allows the policy to grab the cell on behalf of the core target. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-05-29 14:19:03 -04:00
Mike Snitzer	326e1dbb57	block: remove management of bi_remaining when restoring original bi_end_io Commit `c4cf5261` ("bio: skip atomic inc/dec of ->bi_remaining for non-chains") regressed all existing callers that followed this pattern: 1) saving a bio's original bi_end_io 2) wiring up an intermediate bi_end_io 3) restoring the original bi_end_io from intermediate bi_end_io 4) calling bio_endio() to execute the restored original bi_end_io The regression was due to BIO_CHAIN only ever getting set if bio_inc_remaining() is called. For the above pattern it isn't set until step 3 above (step 2 would've needed to establish BIO_CHAIN). As such the first bio_endio(), in step 2 above, never decremented __bi_remaining before calling the intermediate bi_end_io -- leaving __bi_remaining with the value 1 instead of 0. When bio_inc_remaining() occurred during step 3 it brought it to a value of 2. When the second bio_endio() was called, in step 4 above, it should've called the original bi_end_io but it didn't because there was an extra reference that wasn't dropped (due to atomic operations being optimized away since BIO_CHAIN wasn't set upfront). Fix this issue by removing the __bi_remaining management complexity for all callers that use the above pattern -- bio_chain() is the only interface that _needs_ to be concerned with __bi_remaining. For the above pattern callers just expect the bi_end_io they set to get called! Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls that aren't associated with the bio_chain() interface. Also, the bio_inc_remaining() interface has been moved local to bio.c. Fixes: `c4cf5261` ("bio: skip atomic inc/dec of ->bi_remaining for non-chains") Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-05-22 08:58:55 -06:00
Jens Axboe	c4cf5261f8	bio: skip atomic inc/dec of ->bi_remaining for non-chains Struct bio has an atomic ref count for chained bio's, and we use this to know when to end IO on the bio. However, most bio's are not chained, so we don't need to always introduce this atomic operation as part of ending IO. Add a helper to elevate the bi_remaining count, and flag the bio as now actually needing the decrement at end_io time. Rename the field to __bi_remaining to catch any current users of this doing the incrementing manually. For high IOPS workloads, this reduces the overhead of bio_endio() substantially. Tested-by: Robert Elliott <elliott@hp.com> Acked-by: Kent Overstreet <kent.overstreet@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-05-05 13:32:47 -06:00
Linus Torvalds	802ea9d864	- Most significant change this cycle is request-based DM now supports stacking ontop of blk-mq devices. This blk-mq support changes the model request-based DM uses for cloning a request to relying on calling blk_get_request() directly from the underlying blk-mq device. Early consumer of this code is Intel's emerging NVMe hardware; thanks to Keith Busch for working on, and pushing for, these changes. - A few other small fixes and cleanups across other DM targets. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJU3NRnAAoJEMUj8QotnQNavG0H/3yogMcHvKg9H+w0WmUQdwhN w99Wj3nkquAw2sm9yahKlAMBNY53iu/LHmC6/PaTpJetgdH7y1foTrRa0qjyeB2D DgNr8mOzxSxzX6CX9V8JMwqzky9XoG2IOt/7FeQQOpMqp4T1M2zgvbZtpl0lK/f3 lNaNBFpl+47NbGssD/WbtfI4Yy3hX0u406yGmQN5DxRyGTWD2AFqpA76g2mp8vrp wmw259gPr4oLhj3pDc0GkuiVn59ZR2Zp+2gs0jD5uKlDL84VP/nE+WNB+ny1Mnmt cOg8Q+W6/OosL66MKBHNsF0QS6DXNo5UvsN9fHGa5IUJw7Tsa11ZEPKHZGEbQw4= =RiN2 -----END PGP SIGNATURE----- Merge tag 'dm-3.20-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper changes from Mike Snitzer: - The most significant change this cycle is request-based DM now supports stacking ontop of blk-mq devices. This blk-mq support changes the model request-based DM uses for cloning a request to relying on calling blk_get_request() directly from the underlying blk-mq device. An early consumer of this code is Intel's emerging NVMe hardware; thanks to Keith Busch for working on, and pushing for, these changes. - A few other small fixes and cleanups across other DM targets. * tag 'dm-3.20-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm: inherit QUEUE_FLAG_SG_GAPS flags from underlying queues dm snapshot: remove unnecessary NULL checks before vfree() calls dm mpath: simplify failure path of dm_multipath_init() dm thin metadata: remove unused dm_pool_get_data_block_size() dm ioctl: fix stale comment above dm_get_inactive_table() dm crypt: update url in CONFIG_DM_CRYPT help text dm bufio: fix time comparison to use time_after_eq() dm: use time_in_range() and time_after() dm raid: fix a couple integer overflows dm table: train hybrid target type detection to select blk-mq if appropriate dm: allocate requests in target when stacking on blk-mq devices dm: prepare for allocating blk-mq clone requests in target dm: submit stacked requests in irq enabled context dm: split request structure out from dm_rq_target_io structure dm: remove exports for request-based interfaces without external callers	2015-02-12 16:36:31 -08:00
Manuel Schölling	0f30af98cb	dm: use time_in_range() and time_after() To be future-proof and for better readability the time comparisons are modified to use time_in_range() and time_after() instead of plain, error-prone math. Signed-off-by: Manuel Schölling <manuel.schoelling@gmx.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-02-09 13:06:48 -05:00
Joe Thornber	a59db67656	dm cache: fix problematic dual use of a single migration count variable Introduce a new variable to count the number of allocated migration structures. The existing variable cache->nr_migrations became overloaded. It was used to: i) track of the number of migrations in flight for the purposes of quiescing during suspend. ii) to estimate the amount of background IO occuring. Recent discard changes meant that REQ_DISCARD bios are processed with a migration. Discards are not background IO so nr_migrations was not incremented. However this could cause quiescing to complete early. (i) is now handled with a new variable cache->nr_allocated_migrations. cache->nr_migrations has been renamed cache->nr_io_migrations. cleanup_migration() is now called free_io_migration(), since it decrements that variable. Also, remove the unused cache->next_migration variable that got replaced with with prealloc_structs a while ago. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-01-23 11:06:08 -05:00
Joe Thornber	f824a2af3d	dm cache: fix spurious cell_defer when dealing with partial block at end of device We never bother caching a partial block that is at the back end of the origin device. No cell ever gets locked, but the calling code was assuming it was and trying to release it. Now the code only releases if the cell has been set to a non NULL value. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2014-12-01 11:30:13 -05:00
Joe Thornber	1e32134a5a	dm cache: dirty flag was mistakenly being cleared when promoting via overwrite If the incoming bio is a WRITE and completely covers a block then we don't bother to do any copying for a promotion operation. Once this is done the cache block and origin block will be different, so we need to set it to 'dirty'. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2014-12-01 11:30:12 -05:00
Joe Thornber	f29a3147e2	dm cache: only use overwrite optimisation for promotion when in writeback mode Overwrite causes the cache block and origin blocks to diverge, which is only allowed in writeback mode. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2014-12-01 11:30:12 -05:00
Joe Thornber	2bb812df63	dm cache: discard block size must be a multiple of cache block size Otherwise the cache blocks may span two discard blocks, which we don't handle when doing the discard lookup. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-12-01 11:30:11 -05:00
Joe Thornber	43c32bf2b0	dm cache: fix a harmless race when working out if a block is discarded It is more correct to hold the cell before checking the discard state. These flags are only used as hints to the policy so this change will have negligable effect. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-12-01 11:30:11 -05:00
Joe Thornber	3e2e1c3098	dm cache: when reloading a discard bitset allow for a different discard block size The discard block size can change if the origin changes size or if an old DM cache is upgraded from using a discard block size that was equal to cache block size. To fix this an extent of discarded blocks is established for the purpose of translating the old discard block size to the new in-core discard block size and set bits. The old (potentially huge) discard bitset is left ondisk until it is re-written using the new in-core information on the next successful DM cache shutdown. Fixes: `7ae34e7778` ("dm cache: improve discard support") Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-12-01 11:30:10 -05:00
Joe Thornber	2572629a13	dm cache: fix some issues with the new discard range support Commit `7ae34e777` ("dm cache: improve discard support") needed to also: - discontinue having DM core split the discard bios on cache block boundaries - calculate the cache's discard_nr_blocks relative to the determined discard_block_size rather than using oblock_to_dblock() Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-12-01 11:30:09 -05:00
Joe Thornber	d1d9220cba	dm cache: emit a warning message if there are a lot of cache blocks Loading and saving millions of block mappings takes time. We may as well explain what's going on, and encourage people to use a larger cache block size. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-11-12 20:14:59 -05:00
Joe Thornber	7ae34e7778	dm cache: improve discard support Safely allow the discard blocksize to be larger than the cache blocksize by using the bio prison's range locking support. This also improves discard performance considerly because larger discards are issued to the dm-cache device. The discard blocksize was always intended to be greater than the cache blocksize. But until now it wasn't implemented safely. Also, by safely restoring the ability to have discard blocksize larger than cache blocksize we're able to significantly reduce the memory used for the cache's discard bitset. Before, with a small discard blocksize, the discard bitset could get quite large because its size is a function of the discard blocksize and the origin device's size. For example, previously, using a 32KB cache blocksize with a 40TB origin resulted in 1280MB of incore memory use for the discard bitset! Now, the discard blocksize is scaled up accordingly to ensure the discard bitset is capped at 2**14 bits, or 16KB. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-11-10 15:25:30 -05:00
Joe Thornber	08b184514f	dm cache: revert "prevent corruption caused by discard_block_size > cache_block_size" This reverts commit `d132cc6d9e` because we actually do want to allow the discard blocksize to be larger than the cache blocksize. Further dm-cache discard changes will make this possible. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-11-10 15:25:30 -05:00
Joe Thornber	1bad9bc4ee	dm cache: revert "remove remainder of distinct discard block size" This reverts commit `64ab346a36` because we actually do want to allow the discard blocksize to be larger than the cache blocksize. Further dm-cache discard changes will make this possible. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-11-10 15:25:30 -05:00
Joe Thornber	5f274d8865	dm bio prison: introduce support for locking ranges of blocks Ranges will be placed in the same cell if they overlap. Range locking is a prerequisite for more efficient multi-block discard support in both the cache and thin-provisioning targets. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-11-10 15:25:30 -05:00
Joe Thornber	a195db2d29	dm bio prison: switch to using a red black tree Previously it was using a fixed sized hash table. There are times when very many concurrent cells are held (such as when processing a very large discard). When this happens the hash table performance becomes very poor. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-11-10 15:25:26 -05:00
Anssi Hannula	40aa978ecc	dm cache: fix race causing dirty blocks to be marked as clean When a writeback or a promotion of a block is completed, the cell of that block is removed from the prison, the block is marked as clean, and the clear_dirty() callback of the cache policy is called. Unfortunately, performing those actions in this order allows an incoming new write bio for that block to come in before clearing the dirty status is completed and therefore possibly causing one of these two scenarios: Scenario A: Thread 1 Thread 2 cell_defer() . - cell removed from prison . - detained bios queued . . incoming write bio . remapped to cache . set_dirty() called, . but block already dirty . => it does nothing clear_dirty() . - block marked clean . - policy clear_dirty() called . Result: Block is marked clean even though it is actually dirty. No writeback will occur. Scenario B: Thread 1 Thread 2 cell_defer() . - cell removed from prison . - detained bios queued . clear_dirty() . - block marked clean . . incoming write bio . remapped to cache . set_dirty() called . - block marked dirty . - policy set_dirty() called - policy clear_dirty() called . Result: Block is properly marked as dirty, but policy thinks it is clean and therefore never asks us to writeback it. This case is visible in "dmsetup status" dirty block count (which normally decreases to 0 on a quiet device). Fix these issues by calling clear_dirty() before calling cell_defer(). Incoming bios for that block will then be detained in the cell and released only after clear_dirty() has completed, so the race will not occur. Found by inspecting the code after noticing spurious dirty counts (scenario B). Signed-off-by: Anssi Hannula <anssi.hannula@iki.fi> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2014-09-10 11:20:47 -04:00
Mike Snitzer	b02465308f	dm cache: set minimum_io_size to cache's data block size Before, if the block layer's limit stacking didn't establish an optimal_io_size that was compatible with the cache's data block size we'd set optimal_io_size to the data block size and minimum_io_size to 0 (which the block layer adjusts to be physical_block_size). Update cache_io_hints() to set both minimum_io_size and optimal_io_size to the cache's data block size. This fixes an issue where mkfs.xfs would create more XFS Allocation Groups on cache volumes than on a normal linear LV of comparable size. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-08-01 12:30:36 -04:00
Mike Snitzer	895b47d798	dm cache metadata: use dm-space-map-metadata.h defined size limits Commit `7d48935e` cleaned up the persistent-data's space-map-metadata limits by elevating them to dm-space-map-metadata.h. Update dm-cache-metadata to use these same limits. The calculation for DM_CACHE_METADATA_MAX_SECTORS didn't account for the sizeof the disk_bitmap_header. So the supported maximum metadata size is a bit smaller (reduced from 33423360 to 33292800 sectors). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>	2014-08-01 12:30:33 -04:00
Joe Thornber	304affaa88	dm cache: fail migrations in the do_worker error path Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-08-01 12:30:33 -04:00
Joe Thornber	8c081b52c6	dm cache: simplify deferred set reference count increments Factor out inc_and_issue and inc_ds helpers to simplify deferred set reference count increments. Also cleanup cache_map to consistently call cell_defer and inc_ds when the bio is DM_MAPIO_REMAPPED. No functional change. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-08-01 12:30:32 -04:00
Anssi Hannula	44fa816bb7	dm cache: fix race affecting dirty block count nr_dirty is updated without locking, causing it to drift so that it is non-zero (either a small positive integer, or a very large one when an underflow occurs) even when there are no actual dirty blocks. This was due to a race between the workqueue and map function accessing nr_dirty in parallel without proper protection. People were seeing under runs due to a race on increment/decrement of nr_dirty, see: https://lkml.org/lkml/2014/6/3/648 Fix this by using an atomic_t for nr_dirty. Reported-by: roma1390@gmail.com Signed-off-by: Anssi Hannula <anssi.hannula@iki.fi> Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2014-08-01 12:25:22 -04:00
Heinz Mauelshagen	f1daa838e8	dm cache: always split discards on cache block boundaries The DM cache target cannot cope with discards that span multiple cache blocks, so each discard bio that spans more than one cache block must get split by the DM core. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # v3.9+	2014-05-27 10:33:05 -04:00
Mike Snitzer	131cd131a9	dm cache: fix writethrough mode quiescing in cache_map Commit `2ee57d5873` ("dm cache: add passthrough mode") inadvertently removed the deferred set reference that was taken in cache_map()'s writethrough mode support. Restore taking this reference. This issue was found with code inspection. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Cc: stable@vger.kernel.org # 3.13+	2014-05-01 16:14:24 -04:00
Joe Thornber	0596661f0a	dm cache: fix a lock-inversion When suspending a cache the policy is walked and the individual policy hints written to the metadata via sync_metadata(). This led to this lock order: policy->lock cache_metadata->root_lock When loading the cache target the policy is populated while the metadata lock is held: cache_metadata->root_lock policy->lock Fix this potential lock-inversion (ABBA) deadlock in sync_metadata() by ensuring the cache_metadata root_lock is held whilst all the hints are written, rather than being repeatedly locked while policy->lock is held (as was the case with each callout that policy_walk_mappings() made to the old save_hint() method). Found by turning on the CONFIG_PROVE_LOCKING ("Lock debugging: prove locking correctness") build option. However, it is not clear how the LOCKDEP reported paths can lead to a deadlock since the two paths, suspending a target and loading a target, never occur at the same time. But that doesn't mean the same lock-inversion couldn't have occurred elsewhere. Reported-by: Marian Csontos <mcsontos@redhat.com> Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2014-04-04 14:53:05 -04:00
Heinz Mauelshagen	64ab346a36	dm cache: remove remainder of distinct discard block size Discard block size not being equal to cache block size causes data corruption by erroneously avoiding migrations in issue_copy() because the discard state is being cleared for a group of cache blocks when it should not. Completely remove all code that enabled a distinction between the cache block size and discard block size. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-03-27 16:56:23 -04:00
Mike Snitzer	d132cc6d9e	dm cache: prevent corruption caused by discard_block_size > cache_block_size If the discard block size is larger than the cache block size we will not properly quiesce IO to a region that is about to be discarded. This results in a race between a cache migration where no copy is needed, and a write to an adjacent cache block that's within the same large discard block. Workaround this by limiting the discard_block_size to cache_block_size. Also limit the max_discard_sectors to cache_block_size. A more comprehensive fix that introduces range locking support in the bio_prison and proper quiescing of a discard range that spans multiple cache blocks is already in development. Reported-by: Morgan Mears <Morgan.Mears@netapp.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Acked-by: Heinz Mauelshagen <heinzm@redhat.com> Cc: stable@vger.kernel.org	2014-03-27 16:56:23 -04:00
Heinz Mauelshagen	e893fba90c	dm cache: fix access beyond end of origin device In order to avoid wasting cache space a partial block at the end of the origin device is not cached. Unfortunately, the check for such a partial block at the end of the origin device was flawed. Fix accesses beyond the end of the origin device that occured due to attempted promotion of an undetected partial block by: - initializing the per bio data struct to allow cache_end_io to work properly - recognizing access to the partial block at the end of the origin device - avoiding out of bounds access to the discard bitset Otherwise, users can experience errors like the following: attempt to access beyond end of device dm-5: rw=0, want=20971520, limit=20971456 ... device-mapper: cache: promotion failed; couldn't copy block Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2014-03-12 13:52:00 -04:00
Heinz Mauelshagen	8b9d966665	dm cache: fix truncation bug when copying a block to/from >2TB fast device During demotion or promotion to a cache's >2TB fast device we must not truncate the cache block's associated sector to 32bits. The 32bit temporary result of from_cblock() caused a 32bit multiplication when calculating the sector of the fast device in issue_copy_real(). Use an intermediate 64bit type to store the 32bit from_cblock() to allow for proper 64bit multiplication. Here is an example of how this bug manifests on an ext4 filesystem: EXT4-fs error (device dm-0): ext4_mb_generate_buddy:756: group 17136, 32768 clusters in bitmap, 30688 in gd; block bitmap corrupt. JBD2: Spotted dirty metadata buffer (dev = dm-0, blocknr = 0). There's a risk of filesystem corruption in case of system crash. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2014-03-12 13:49:27 -04:00

1 2

91 Commits