This DM multipath NVMe bio-based support requires CONFIG_NVME_MULTIPATH
to not be set. In the future hopefully NVMe multipath and DM multipath
can co-exist more seemlessly. But as is, if CONFIG_NVME_MULTIPATH=Y
then all the individal NVMe paths will remain hidden to upper layers and
as such DM multipath will not be able to manage them.
Though NVMe's native multipathing doesn't multipath namespaces across
subsystems; so technically a user _could_ use CONFIG_NVME_MULTIPATH=Y
and also use DM multipath to multipath across subsystems.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Moving the dm_bio_restore() to process_queued_bios() avoids doing that
work in multipath_end_io_bio().
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
All underlying members are initialized directly so the memset() calls
are not needed. Also, initialize mpio->nr_bytes from the start since it
never changes.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Upper level bio-based drivers that stack immediately ontop of NVMe can
leverage direct_make_request(). In addition DM's NVMe bio-based
will initially only ever have one NVMe device that it submits IO to at a
time. There is no splitting needed. Enhance DM core so that
DM_TYPE_NVME_BIO_BASED's IO submission takes advantage of both of these
characteristics.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
If dm_table_determine_type() establishes DM_TYPE_NVME_BIO_BASED then
all devices in the DM table do not support partial completions. Also,
the table has a single immutable target that doesn't require DM core to
split bios.
This will enable adding NVMe optimizations to bio-based DM.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
No apparent need to generic_start_io_acct() until before the IO is ready
for submission. start_io_acct() is the proper place to do this
accounting -- it is also where DM accounts for pending IO and, if
enabled, starts dm-stats accounting.
Replace start_io_acct()'s part_round_stats() with generic_start_io_acct().
This eliminates needing to take part_stat_lock() multiple times when
starting an IO on bio-based devices.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Eliminates need for a separate mempool to allocate 'struct dm_io'
objects from. As such, it saves an extra mempool allocation for each
original bio that DM core is issued.
This complicates the per-bio-data accessor functions by needing to
conditonally add extra padding to get to a target's per-bio-data. But
in the end this provides a decent performance improvement for all
bio-based DM devices.
On an NVMe-loop based testbed to a ramdisk (~3100 MB/s): bio-based
DM linear performance improved by 2% (went from 2665 to 2777 MB/s).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
These CRUD comments have worn out their welcome. The code is what it
is, over time it'll hopefully get better. But these comments serve no
purpose whatsoever.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
__send_changing_extent_only() must follow the same pattern that was
established with commit "dm: ensure bio submission follows a depth-first
tree walk". That is: submit first bio up to split boundary and then
split the remainder to further submissions.
Suggested-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
alloc_multiple_bios() assumes it can allocate the requested number of
bios but until now there was no gaurantee that the mempools would be
accomodating.
Suggested-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Now that all of DM has been revised and/or verified to no longer require
the use of BIOSET_NEED_RESCUER the dm_offload code may be removed.
Suggested-by: NeilBrown <neilb@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
DM targets can request multiple bios be sent to them by DM core (see:
num_{flush,discard,write_same,write_zeroes}_bios). But until now these
bios were allocated in an unsafe manner than could potentially exhaust
the DM device's bioset -- in the face of multiple threads each trying to
do multiple allocations from the same DM device's bioset.
Fix __send_duplicate_bios() by using the new alloc_multiple_bios(). The
allocation strategy used by alloc_multiple_bios() models that used by
dm-crypt.c:crypt_alloc_buffer().
Neil Brown initially proposed this fix but the implementation has been
revised enough that it inappropriate to attribute the entirety of it to
him.
Suggested-by: NeilBrown <neilb@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
No DM target provides num_write_bios and none has since dm-cache's
brief use in 2013.
Having the possibility of num_write_bios > 1 complicates bio
allocation. So remove the interface and assume there is only one bio
needed.
If a target ever needs more, it must provide a suitable bioset and
allocate itself based on its particular needs.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
A dm device can, in general, represent a tree of targets, each of which
handles a sub-range of the range of blocks handled by the parent.
The bio sequencing managed by generic_make_request() requires that bios
are generated and handled in a depth-first manner. Each call to a
make_request_fn() may submit bios to a single member device, and may
submit bios for a reduced region of the same device as the
make_request_fn.
In particular, any bios submitted to member devices must be expected to
be processed in order, so a later one must never wait for an earlier
one.
This ordering is usually achieved by using bio_split() to reduce a bio
to a size that can be completely handled by one target, and resubmitting
the remainder to the originating device. bio_queue_split() shows the
canonical approach.
dm doesn't follow this approach, largely because it has needed to split
bios since long before bio_split() was available. It currently can
submit bios to separate targets within the one dm_make_request() call.
Dependencies between these targets, as can happen with dm-snap, can
cause deadlocks if either bios gets stuck behind the other in the queues
managed by generic_make_request(). This requires the 'rescue'
functionality provided by dm_offload_{start,end}.
Some of this requirement can be removed by changing the order of bio
submission to follow the canonical approach. That is, if dm finds that
it needs to split a bio, the remainder should be sent to
generic_make_request() rather than being handled immediately. This
delays the handling until the first part is completely processed, so the
deadlock problems do not occur.
__split_and_process_bio() can be called both from dm_make_request() and
from dm_wq_work(). When called from dm_wq_work() the current approach
is perfectly satisfactory as each bio will be processed immediately.
When called from dm_make_request(), current->bio_list will be non-NULL,
and in this case it is best to create a separate "clone" bio for the
remainder.
When we use bio_clone_bioset() to split off the front part of a bio
and chain the two together and submit the remainder to
generic_make_request(), it is important that the newly allocated
bio is used as the head to be processed immediately, and the original
bio gets "bio_advance()"d and sent to generic_make_request() as the
remainder. Otherwise, if the newly allocated bio is used as the
remainder, and if it then needs to be split again, then the next
bio_clone_bioset() call will be made while holding a reference a bio
(result of the first clone) from the same bioset. This can potentially
exhaust the bioset mempool and result in a memory allocation deadlock.
Note that there is no race caused by reassigning cio.io->bio after already
calling __map_bio(). This bio will only be dereferenced again after
dec_pending() has found io->io_count to be zero, and this cannot happen
before the dec_pending() call at the end of __split_and_process_bio().
To provide the clone bio when splitting, we use q->bio_split. This
was previously being freed by bio-based dm to avoid having excess
rescuer threads. As bio_split bio sets no longer create rescuer
threads, there is little cost and much gain from restoring the
q->bio_split bio set.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The BIOSET_NEED_RESCUER flag is only needed when a make_request_fn might
do two allocations from the one bioset, and the second one could block
until the first bio completes.
dm_io() is called from make_request_fn() context. The closest it comes
to multiple allocations is in chunk_io() in dm-snap-persistent. But
there the code uses a separate thread to avoid problems.
So BIOSET_NEED_RESCUER is not needed.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The BIOSET_NEED_RESCUER flag is only needed when a make_request_fn might
do two allocations from the one bioset, and the second one could block
until the first bio completes.
dm-crypt does allocate from this bioset inside the dm make_request_fn,
but does so using GFP_NOWAIT so that the allocation will not block.
So BIOSET_NEED_RESCUER is not needed.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Clarify that dm_accept_partial_bio isn't allowed for REQ_OP_ZONE_RESET
bios.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
No need to calculate the reshaping progress because
mddev->curr_resync_completed holds it.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
During reshape, 'A' chars were reported in status rather than 'a'.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
In order to avoid redoing synchronization/recovery/reshape partially,
the raid set got frozen until after all passed in table line flags had
been cleared. The related table reload sequence had to be precisely
followed, or reshaping may lead to data corruption caused by the active
mapping carrying on with a reshape when the inactive mapping already
had retrieved a stale reshape position.
Harden by retrieving the actual resync/recovery/reshape position
during resume whilst the active table is suspended thus avoiding
to keep the raid set frozen altogether. This prevents superfluous
redoing of an already resynchronized or recovered segment and,
most importantly, potential for redoing of an already reshaped
segment causing data corruption.
Fixes: d39f0010e ("dm raid: fix raid_resume() to keep raid set frozen as needed")
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Verifying the current raid sets redundancy based on retrieved
superblock content has to use the superblock's raid level (e.g. raid0),
not the constructor requested one (e.g. raid10).
Using the requested raid level of raid10 lead to a "divide error"
on raid0 which defines data copies divided by to be zero.
Also check for bogus data copies.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Move raid_resume()'s setting of 'rw' and 'in_sync' to just prior to
mddev_resume().
Also, remove unused 'bitmap_loaded' member from "struct raid_set".
No functional changes.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Fix various sync state issues causing racy/bogus sync ratio,
sync_action ad health chars in dm_status() info output.
Sync ratio could be N/N (i.e. 100%) shortly after raid set
creation, i.e. creating a new RaidLV or upconverting a linear LV to
raid1 thus:
"0 2097152 raid raid1 2 Aa 2097162/2097152 recover 0 0 -"
instead of:
"0 2097152 raid raid1 2 Aa 0/2097152 idle 0 0 -"
Sync action could be non-idle, when the MD thread was done with io.
Health chars could be 'A' when they should be 'a' for a short time
before a resynchonization started.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The raid_status() function passes the bool array_in_sync variable around
providing synchronization state of the MD array. Replace it with a
runtime flag. This will avoid a pattern of having to pass discrete
variables to various functions.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The MD sync thread updates recovery flags providing state of any
running, idle, frozen, recovering, reshaping, ... activity it performs
and updates respective flags asynchronously versus dm processing
raid_status(). To close that race window, take a single copy of the
flags and pass it into its callees.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
During a reshape request: if userspace reloads a "raid" table multiple
times, resulting in multiple superblock reads, the raid set needs to
stay frozen until all config changes (chunk size, layout data_offset,
delta_disks) have been stored in the superblocks and respective flags
cleared.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Check all component data device sizes versus calculated size.
Reject if device(s) are too small. Otherwise, MD will fail the
operation by accessing beyond the end of the data device.
An example use-case is that growing bitmap won't fit any more and the MD
runtime will report an error when DM raid should catch this earlier.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The raid set size is being revalidated unconditionally before a
reshaping conversion is started. MD requires the size to only be
reduced in case of a stripe removing (i.e. shrinking) reshape but not
when growing because the raid array has to stay small until after the
growing reshape finishes.
Fix by avoiding the size revalidation in preresume unless a shrinking
reshape is requested.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Pay attention to existing reshape space to define if a raid set needs
resizing. Otherwise we can hit "Can't resize a reshaping raid set"
when a reshape is being requested.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The md raid personalities call md_finish_reshape() at the end of a
reshape conversion which adjusts rdev->sectors.
Correct/check rdev->sectors before initiating a reshape and raise the
recovery pointer accordingly.
Otherwise, the DM raid coordinated reshape will fail.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
md_stop_writes() is called in raid_presuspend() causing deadlocks on
bios submitted afterwards -- which happens on loaded raid sets with
conversion requests.
Fix by moving md_stop_writes() to raid_postsuspend(). NOTE: when the
recovery's frozen (MD_RECOVERY_FROZEN), writes haven't been started (or
are already stopped) so don't stop them again.
Also remove superfluous readonly setting.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
When system is under memory pressure it is observed that dm bufio
shrinker often reclaims only one buffer per scan. This change fixes
the following two issues in dm bufio shrinker that cause this behavior:
1. ((nr_to_scan - freed) <= retain_target) condition is used to
terminate slab scan process. This assumes that nr_to_scan is equal
to the LRU size, which might not be correct because do_shrink_slab()
in vmscan.c calculates nr_to_scan using multiple inputs.
As a result when nr_to_scan is less than retain_target (64) the scan
will terminate after the first iteration, effectively reclaiming one
buffer per scan and making scans very inefficient. This hurts vmscan
performance especially because mutex is acquired/released every time
dm_bufio_shrink_scan() is called.
New implementation uses ((LRU size - freed) <= retain_target)
condition for scan termination. LRU size can be safely determined
inside __scan() because this function is called after dm_bufio_lock().
2. do_shrink_slab() uses value returned by dm_bufio_shrink_count() to
determine number of freeable objects in the slab. However dm_bufio
always retains retain_target buffers in its LRU and will terminate
a scan when this mark is reached. Therefore returning the entire LRU size
from dm_bufio_shrink_count() is misleading because that does not
represent the number of freeable objects that slab will reclaim during
a scan. Returning (LRU size - retain_target) better represents the
number of freeable objects in the slab. This way do_shrink_slab()
returns 0 when (LRU size < retain_target) and vmscan will not try to
scan this shrinker avoiding scans that will not reclaim any memory.
Test: tested using Android device running
<AOSP>/system/extras/alloc-stress that generates memory pressure
and causes intensive shrinker scans
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Commit ca5beb76 ("dm mpath: micro-optimize the hot path relative to
MPATHF_QUEUE_IF_NO_PATH") caused bio-based DM-multipath to fail mptest's
"test_02_sdev_delete".
Restoring the logic that existed prior to commit ca5beb76 fixes this
bio-based DM-multipath regression. Also verified all mptest tests pass
with request-based DM-multipath.
This commit effectively reverts commit ca5beb76 -- but it does so
without reintroducing the need to take the m->lock spinlock in
must_push_back_{rq,bio}.
Fixes: ca5beb76 ("dm mpath: micro-optimize the hot path relative to MPATHF_QUEUE_IF_NO_PATH")
Cc: stable@vger.kernel.org # 4.12+
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
A NULL pointer is seen if two concurrent "vgchange -ay -K <vg name>"
processes race to load the dm-thin-pool module:
PID: 25992 TASK: ffff883cd7d23500 CPU: 4 COMMAND: "vgchange"
#0 [ffff883cd743d600] machine_kexec at ffffffff81038fa9
0000001 [ffff883cd743d660] crash_kexec at ffffffff810c5992
0000002 [ffff883cd743d730] oops_end at ffffffff81515c90
0000003 [ffff883cd743d760] no_context at ffffffff81049f1b
0000004 [ffff883cd743d7b0] __bad_area_nosemaphore at ffffffff8104a1a5
0000005 [ffff883cd743d800] bad_area at ffffffff8104a2ce
0000006 [ffff883cd743d830] __do_page_fault at ffffffff8104aa6f
0000007 [ffff883cd743d950] do_page_fault at ffffffff81517bae
0000008 [ffff883cd743d980] page_fault at ffffffff81514f95
[exception RIP: kmem_cache_alloc+108]
RIP: ffffffff8116ef3c RSP: ffff883cd743da38 RFLAGS: 00010046
RAX: 0000000000000004 RBX: ffffffff81121b90 RCX: ffff881bf1e78cc0
RDX: 0000000000000000 RSI: 00000000000000d0 RDI: 0000000000000000
RBP: ffff883cd743da68 R8: ffff881bf1a4eb00 R9: 0000000080042000
R10: 0000000000002000 R11: 0000000000000000 R12: 00000000000000d0
R13: 0000000000000000 R14: 00000000000000d0 R15: 0000000000000246
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
0000009 [ffff883cd743da70] mempool_alloc_slab at ffffffff81121ba5
0000010 [ffff883cd743da80] mempool_create_node at ffffffff81122083
0000011 [ffff883cd743dad0] mempool_create at ffffffff811220f4
0000012 [ffff883cd743dae0] pool_ctr at ffffffffa08de049 [dm_thin_pool]
0000013 [ffff883cd743dbd0] dm_table_add_target at ffffffffa0005f2f [dm_mod]
0000014 [ffff883cd743dc30] table_load at ffffffffa0008ba9 [dm_mod]
0000015 [ffff883cd743dc90] ctl_ioctl at ffffffffa0009dc4 [dm_mod]
The race results in a NULL pointer because:
Process A (vgchange -ay -K):
a. send DM_LIST_VERSIONS_CMD ioctl;
b. pool_target not registered;
c. modprobe dm_thin_pool and wait until end.
Process B (vgchange -ay -K):
a. send DM_LIST_VERSIONS_CMD ioctl;
b. pool_target registered;
c. table_load->dm_table_add_target->pool_ctr;
d. _new_mapping_cache is NULL and panic.
Note:
1. process A and process B are two concurrent processes.
2. pool_target can be detected by process B but
_new_mapping_cache initialization has not ended.
To fix dm-thin-pool, and other targets (cache, multipath, and snapshot)
with the same problem, simply dm_register_target() after all resources
created during module init (as labelled with __init) are finished.
Cc: stable@vger.kernel.org
Signed-off-by: monty <monty_pavel@sina.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Multiple refcounts are needed if the device was already added. The
micro-optimization of setting the refcount to 1 on first added (rather
than fall thru to a common refcount_inc) lost sight of the fact that the
refcount_inc is also needed for the case when the device already exists
and the mode need not be upgraded.
Fixes: 2a0b4682e0 ("dm: convert dm_dev_internal.count from atomic_t to refcount_t")
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
register_shrinker is now __must_check, so check it to kill a warning.
Caller of bch_btree_cache_alloc in super.c appropriately checks return
value so this is fully plumbed through.
This V2 fixes checkpatch warnings and improves the commit description,
as I was too hasty getting the previous version out.
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Vojtech Pavlik <vojtech@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When we send a read request and hit the clean data in cache device, there
is a situation called cache read race in bcache(see the commit in the tail
of cache_look_up(), the following explaination just copy from there):
The bucket we're reading from might be reused while our bio is in flight,
and we could then end up reading the wrong data. We guard against this
by checking (in bch_cache_read_endio()) if the pointer is stale again;
if so, we treat it as an error (s->iop.error = -EINTR) and reread from
the backing device (but we don't pass that error up anywhere)
It should be noted that cache read race happened under normal
circumstances, not the circumstance when SSD failed, it was counted
and shown in /sys/fs/bcache/XXX/internal/cache_read_races.
Without this patch, when we use writeback mode, we will never reread from
the backing device when cache read race happened, until the whole cache
device is clean, because the condition
(s->recoverable && (dc && !atomic_read(&dc->has_dirty))) is false in
cached_dev_read_error(). In this situation, the s->iop.error(= -EINTR)
will be passed up, at last, user will receive -EINTR when it's bio end,
this is not suitable, and wield to up-application.
In this patch, we use s->read_dirty_data to judge whether the read
request hit dirty data in cache device, it is safe to reread data from
the backing device when the read request hit clean data. This can not
only handle cache read race, but also recover data when failed read
request from cache device.
[edited by mlyle to fix up whitespace, commit log title, comment
spelling]
Fixes: d59b237959 ("bcache: only permit to recovery read error when cache device is clean")
Cc: <stable@vger.kernel.org> # 4.14
Signed-off-by: Hua Rui <huarui.dev@gmail.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch try to fix the building error on MIPS. The reason is MIPS
has already defined the PTR macro, which conflicts with the PTR macro
in include/uapi/linux/bcache.h.
[fixed by mlyle: corrected a line-length issue]
Cc: stable@vger.kernel.org
Signed-off-by: Huacai Chen <chenhc@lemote.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Journal bucket is a circular buffer, the bucket
can be like YYYNNNYY, which means the first valid journal in
the 7th bucket, and the latest valid journal in third bucket, in
this case, if we do not try we the zero index first, We
may get a valid journal in the 7th bucket, then we call
find_next_bit(bitmap,ca->sb.njournal_buckets, l + 1) to get the
first invalid bucket after the 7th bucket, because all these
buckets is valid, so no bit 1 in bitmap, thus find_next_bit()
function would return with ca->sb.njournal_buckets (8). So, after
that, bcache only read journal in 7th and 8the bucket,
the first to the third buckets are lost.
So, it is important to let developer know that, we need to try
the zero index at first in the hash-search, and avoid any breaks
in future's code modification.
[ML: Fixed whitespace & formatting & file permissions]
Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull more block layer updates from Jens Axboe:
"A followup pull request, with some parts that either needed a bit more
testing before going in, merge sync, or just later arriving fixes.
This contains:
- Timer related updates from Kees. These were purposefully delayed
since I didn't want to pull in a later v4.14-rc tag to my block
tree.
- ide-cd prep sense buffer fix from Bart. Also delayed, as not to
clash with the late fix we put into 4.14-rc.
- Small BFQ updates series from Luca and Paolo.
- Single nvmet fix from James, fixing a non-functional case there.
- Bio fast clone fix from Michael, which made bcache return the wrong
data for some cases.
- Legacy IO path regression hang fix from Ming"
* 'for-linus' of git://git.kernel.dk/linux-block:
bio: ensure __bio_clone_fast copies bi_partno
nvmet_fc: fix better length checking
block: wake up all tasks blocked in get_request()
block, bfq: move debug blkio stats behind CONFIG_DEBUG_BLK_CGROUP
block, bfq: update blkio stats outside the scheduler lock
block, bfq: add missing invocations of bfqg_stats_update_io_add/remove
doc, block, bfq: update max IOPS sustainable with BFQ
ide: Make ide_cdrom_prep_fs() initialize the sense buffer pointer
md: Convert timers to use timer_setup()
block: swim3: Convert timers to use timer_setup()
block/aoe: Convert timers to use timer_setup()
amifloppy: Convert timers to use timer_setup()
block/floppy: Convert callback to pass timer_list
isn't _really_ an error
- A DM core @stable fix for discard support that was enabled for an
entire DM device despite only having partial support for discards due
to a mix of discard capabilities across the underlying devices.
- A couple other DM core discard fixes.
- A DM bufio @stable fix that resolves a 32-bit overflow
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJaDglsAAoJEMUj8QotnQNaaFwIAMLjV27BYtHBYWnvMlROiXAD
2aPSEoGHEGcq6BQyTlXyew1CNl0xXOcb8KMFhQMR/IjPuKyLl47OXbavE3TIwVoT
Lw+XUvXUuxK1Qd34fUvPoPd94w1aJBoY9Wlv5YxCp+U0WQ2SH3kHo/FOFvLPJ6wY
OhHZiByGvxXWc8tso86zx0pq6j5Nghk18D2lQvaGU28BtElfWE3/xoDr6FrwDqEb
MvzmUMKs/M5EoJt3HT4SNDFqujkCP69PGjqpHxV9mFT8HaonX+MF61Kr96/Tc6cO
c+DOkw7kaqnjJsrdKu3KIdtXf3cyoHYqtExXRdzap8QoCQvosNR4r78svcfY0i8=
=QKXY
-----END PGP SIGNATURE-----
Merge tag 'for-4.15/dm-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull more device mapper updates from Mike Snitzer:
"Given your expected travel I figured I'd get these fixes to you sooner
rather than later.
- a DM multipath stable@ fix to silence an annoying error message
that isn't _really_ an error
- a DM core @stable fix for discard support that was enabled for an
entire DM device despite only having partial support for discards
due to a mix of discard capabilities across the underlying devices.
- a couple other DM core discard fixes.
- a DM bufio @stable fix that resolves a 32-bit overflow"
* tag 'for-4.15/dm-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm bufio: fix integer overflow when limiting maximum cache size
dm: clear all discard attributes in queue_limits when discards are disabled
dm: do not set 'discards_supported' in targets that do not need it
dm: discard support requires all targets in a table support discards
dm mpath: remove annoying message of 'blk_get_request() returned -11'
The default max_cache_size_bytes for dm-bufio is meant to be the lesser
of 25% of the size of the vmalloc area and 2% of the size of lowmem.
However, on 32-bit systems the intermediate result in the expression
(VMALLOC_END - VMALLOC_START) * DM_BUFIO_VMALLOC_PERCENT / 100
overflows, causing the wrong result to be computed. For example, on a
32-bit system where the vmalloc area is 520093696 bytes, the result is
1174405 rather than the expected 130023424, which makes the maximum
cache size much too small (far less than 2% of lowmem). This causes
severe performance problems for dm-verity users on affected systems.
Fix this by using mult_frac() to correctly multiply by a percentage. Do
this for all places in dm-bufio that multiply by a percentage. Also
replace (VMALLOC_END - VMALLOC_START) with VMALLOC_TOTAL, which contrary
to the comment is now defined in include/linux/vmalloc.h.
Depends-on: 9993bc635 ("sched/x86: Fix overflow in cyc2ns_offset")
Fixes: 95d402f057 ("dm: add bufio")
Cc: <stable@vger.kernel.org> # v3.2+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>