Set QUEUE_FLAG_NONROT. Following commit b277da0a8a ("block: disable
entropy contributions for nonrot devices") we should also clear
QUEUE_FLAG_ADD_RANDOM, but it's off by default for blk-mq drivers, so
just note it in the comment.
Also remove physical block size assignment - no sense in repeating
defaults that are not going to change.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This converts the rbd driver to use the blk-mq infrastructure. Except
for switching to a per-request work item this is almost mechanical.
This was tested by Alexandre DERUMIER in November, and found to give
him 120000 iops, although the only comparism available was an old
3.10 kernel which gave 80000iops.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <elder@linaro.org>
[idryomov@gmail.com: context, blk_mq_init_queue() EH]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
If the clone is resized down to 0, it becomes standalone. If such
resize is carried over while an image is mapped we would detect this
and call rbd_dev_parent_put() which means "let go of all parent state,
including the spec(s) of parent images(s)". This leads to a mismatch
between "rbd info" and sysfs parent fields, so a fix is in order.
# rbd create --image-format 2 --size 1 foo
# rbd snap create foo@snap
# rbd snap protect foo@snap
# rbd clone foo@snap bar
# DEV=$(rbd map bar)
# rbd resize --allow-shrink --size 0 bar
# rbd resize --size 1 bar
# rbd info bar | grep parent
parent: rbd/foo@snap
Before:
# cat /sys/bus/rbd/devices/0/parent
(no parent image)
After:
# cat /sys/bus/rbd/devices/0/parent
pool_id 0
pool_name rbd
image_id 10056b8b4567
image_name foo
snap_id 2
snap_name snap
overlap 0
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
header_rwsem should be released on errors. Also remove useless
rbd_dev->mapping.size != rbd_dev->header.image_size test.
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
It's been largely superseded by dup_token() and unused for over
2 years, identified by cppcheck.
Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
[idryomov@redhat.com: changelog]
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
This effectively reverts the last hunk of 392a9dad7e ("rbd: detect
when clone image is flattened").
The problem with parent_overlap != 0 condition is that it's possible
and completely valid to have an image with parent_overlap == 0 whose
parent state needs to be cleaned up on unmap. The next commit, which
drops the "clone image now standalone" logic, opens up another window
of opportunity to hit this, but even without it
# cat parent-ref.sh
#!/bin/bash
rbd create --image-format 2 --size 1 foo
rbd snap create foo@snap
rbd snap protect foo@snap
rbd clone foo@snap bar
rbd resize --allow-shrink --size 0 bar
rbd resize --size 1 bar
DEV=$(rbd map bar)
rbd unmap $DEV
leaves rbd_device/rbd_spec/etc and rbd_client along with ceph_client
hanging around.
My thinking behind calling rbd_dev_parent_put() unconditionally is that
there shouldn't be any requests in flight at that point in time as we
are deep into unmap sequence. Hence, even if rbd_dev_unparent() caused
by flatten is delayed by in-flight requests, it will have finished by
the time we reach rbd_dev_unprobe() caused by unmap, thus turning
unconditional rbd_dev_parent_put() into a no-op.
Fixes: http://tracker.ceph.com/issues/10352
Cc: stable@vger.kernel.org # 3.11+
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
The comment for rbd_dev_parent_get() said
* We must get the reference before checking for the overlap to
* coordinate properly with zeroing the parent overlap in
* rbd_dev_v2_parent_info() when an image gets flattened. We
* drop it again if there is no overlap.
but the "drop it again if there is no overlap" part was missing from
the implementation. This lead to absurd parent_ref values for images
with parent_overlap == 0, as parent_ref was incremented for each
img_request and virtually never decremented.
Fix this by leveraging the fact that refresh path calls
rbd_dev_v2_parent_info() under header_rwsem and use it for read in
rbd_dev_parent_get(), instead of messing around with atomics. Get rid
of barriers in rbd_dev_v2_parent_info() while at it - I don't see what
they'd pair with now and I suspect we are in a pretty miserable
situation as far as proper locking goes regardless.
Cc: stable@vger.kernel.org # 3.11+
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
CEPH_OSD_OP_DELETE is not an extent op, stop treating it as such. This
sneaked in with discard patches - it's one of the three osd ops (the
other two are CEPH_OSD_OP_TRUNCATE and CEPH_OSD_OP_ZERO) that discard
is implemented with.
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
The functions ceph_put_snap_context() and iput() test whether their
argument is NULL and then return immediately. Thus the test around the
call is not needed.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
[idryomov@redhat.com: squashed rbd.c hunk, changelog]
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
When we fail to allocate page vector in rbd_obj_read_sync() we just
basically ignore the problem and continue which will result in an oops
later. Fix the problem by returning proper error.
CC: Yehuda Sadeh <yehuda@inktank.com>
CC: Sage Weil <sage@inktank.com>
CC: ceph-devel@vger.kernel.org
CC: stable@vger.kernel.org
Coverity-id: 1226882
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
Using one queue per device doesn't make much sense given that our
workfn processes "devices" and not "requests". Switch to a single
workqueue for all devices.
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Need to use WQ_MEM_RECLAIM for our workqueues to prevent I/O lockups
under memory pressure - we sit on the memory reclaim path.
Cc: stable@vger.kernel.org # 3.17, needs backporting for 3.16
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
Tested-by: Micha Krause <micha@krausam.de>
Reviewed-by: Sage Weil <sage@redhat.com>
max_discard_sectors must be set for the queue to support discard.
Operations implementing discard for rbd zero data, so report that.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Only allocate two osd ops for discard requests, since the
preallocation hint is only added for regular writes. Use
rbd_img_obj_request_fill() to recreate the original write or discard
osd operations, isolating that logic to one place, and change the
assert in rbd_osd_req_create_copyup() to accept discard requests as
well.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
rbd_img_request_fill() creates a ceph_osd_request and has logic for
adding the appropriate osd ops to it based on the request type and
image properties.
For layered images, the original rbd_obj_request is resent with a
copyup operation in front, using a new ceph_osd_request. The logic for
adding the original operations should be the same as when first
sending them, so move it to a helper function.
op_type only needs to be checked once, so create a helper for that as
well and call it outside the loop in rbd_img_request_fill().
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Discard requests are a form of write, so they should go through the
same process as plain write requests and trigger copy-on-write for
layered images.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Discard may try to delete an object from a non-layered image that does not exist.
If this occurs, the image already has no data in that range, so change the
result to success.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Discards take a reference to the snapshot context of an image when
they are created. This reference needs to be cleaned up when the
request is done just as it is for regular writes.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
In rbd_img_request_fill() the image size is only checked to determine
whether we can truncate an object instead of zeroing it for discard
requests. Take rbd_dev->header_rwsem while reading the image size, and
move this read into the discard check, so that non-discard ops don't
need to take the semaphore in this function.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
This patch add the discard support for rbd driver.
There are three types operation in the driver:
1. The objects would be removed if they completely contained
within the discard range.
2. The objects would be truncated if they partly contained within
the discard range, and align with their boundary.
3. Others would be zeroed.
A discard request from blkdev_issue_discard() is defined which
REQ_WRITE and REQ_DISCARD both marked and no data, so we must
check the REQ_DISCARD first when getting the request type.
This resolve:
http://tracker.ceph.com/issues/190
[ Ilya Dryomov: This is incomplete and somewhat buggy, see follow up
commits by Josh Durgin for refinements and fixes which weren't
folded in to preserve authorship. ]
Signed-off-by: Guangliang Zhao <lucienchao@gmail.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
It could only handle the read and write operations now,
extend it for the coming discard support.
Signed-off-by: Guangliang Zhao <lucienchao@gmail.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
It need to copyup the parent's content when layered writing,
but an entire object write would overwrite it, so skip it.
Signed-off-by: Guangliang Zhao <lucienchao@gmail.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
These fields may both change while the image is mapped if a snapshot
is created or deleted or the image is resized. They are guarded by
rbd_dev->header_rwsem, so hold that while reading them, and store a
local copy to refer to outside of the critical section. The local copy
will stay consistent since the snapshot context is reference counted,
and the mapping size is just a u64. This prevents torn loads from
giving us inconsistent values.
Move reading header.snapc into the caller of rbd_img_request_create()
so that we only need to take the semaphore once. The read-only caller,
rbd_parent_request_create() can just pass NULL for snapc, since the
snapshot context is only relevant for writes.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Trying to map an image out of a pool for which we don't have an 'x'
permission bit fails with -ERANGE from ceph_extract_encoded_string()
due to an unsigned vs signed bug. Fix it and get rid of the -EINVAL
sink, thus propagating rbd::get_id cls method errors. (I've seen
a bunch of unexplained -ERANGE reports, I bet this is it).
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Fix to return -ENOMEM from the workqueue alloc error handling
case instead of 0, as done elsewhere in this function.
Reviewed-by: Alex Elder <elder@linaro.org>
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
drivers/block/rbd.c: In function ‘rbd_dev_device_setup’:
drivers/block/rbd.c:5090:19: warning: format not a string literal and no format arguments [-Wformat-security]
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Now that rbd_img_request_create() is called from work functions, no
need to use GFP_ATOMIC.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
While it was never a good idea to sleep in request_fn(), commit
34c6bc2c91 ("locking/mutexes: Add extra reschedule point") made it
a *bad* idea. mutex_lock() since 3.15 may reschedule *before* putting
task on the mutex wait queue, which for tasks in !TASK_RUNNING state
means block forever. request_fn() may be called with !TASK_RUNNING on
the way to schedule() in io_schedule().
Offload request handling to a workqueue, one per rbd device, to avoid
calling blocking primitives from rbd_request_fn().
Fixes: http://tracker.ceph.com/issues/8818
Cc: stable@vger.kernel.org # 3.16, needs backporting for 3.15
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Tested-by: Eric Eastman <eric0e@aol.com>
Tested-by: Greg Wilson <greg.wilson@keepertech.com>
Reviewed-by: Alex Elder <elder@linaro.org>
If we are mapping a snapshot, we must read in the parent_overlap value
of that snapshot instead of that of the base image. Not doing so may
in particular result in us returning zeros instead of user data:
# cat overlap-snap.sh
#!/bin/bash
rbd create --size 10 --image-format 2 foo
FOO_DEV=$(rbd map foo)
dd if=/dev/urandom of=$FOO_DEV bs=1M &>/dev/null
echo "Base image"
dd if=$FOO_DEV bs=1 count=16 skip=$(((4 << 20) - 8)) 2>/dev/null | xxd
rbd snap create foo@snap
rbd snap protect foo@snap
rbd clone foo@snap bar
rbd snap create bar@snap
BAR_DEV=$(rbd map bar@snap)
echo "Snapshot"
dd if=$BAR_DEV bs=1 count=16 skip=$(((4 << 20) - 8)) 2>/dev/null | xxd
rbd resize --allow-shrink --size 4 bar
echo "Snapshot after base image resize"
dd if=$BAR_DEV bs=1 count=16 skip=$(((4 << 20) - 8)) 2>/dev/null | xxd
# ./overlap-snap.sh
Base image
0000000: e781 e33b d34b 2225 6034 2845 a2e3 36ed ...;.K"%`4(E..6.
Snapshot
0000000: e781 e33b d34b 2225 6034 2845 a2e3 36ed ...;.K"%`4(E..6.
Resizing image: 100% complete...done.
Snapshot after base image resize
0000000: e781 e33b d34b 2225 0000 0000 0000 0000 ...;.K"%........
Even though bar@snap is taken with the old bar parent_overlap (8M),
reads from bar@snap beyond the new bar parent_overlap (4M) return
zeroes. Fix it.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Currently rbd_dev_v2_header_info() reads in parent info before the snap
context is read in. This is wrong, because we may need to look at the
the parent_overlap value of the snapshot instead of that of the base
image, for example when mapping a snapshot - see next commit. (When
mapping a snapshot, all we got is its name and we need the snap context
to translate that name into an id to know which parent info to look
for.)
The approach taken here is to make sure rbd_dev_v2_parent_info() is
called after the snap context has been read in. The other approach
would be to add a parent_overlap field to struct rbd_mapping and
maintain it the same way rbd_mapping::size is maintained. The reason
I chose the first approach is that the value of keeping around both
base image values and the actual mapping values is unclear to me.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
There is no sense in trying to update the mapping size before it's even
been set.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Recently discovered watch/notify problems showed that we really can't
ignore errors in anything refresh related. Alas, currently there is
not much we can do in response to those errors, except print warnings.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
rbd_dev_spec_update() has two modes of operation, with nothing in
common between them. Split it into two functions, one for each mode
and make our expectations more clear.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
spec->image_id assert doesn't buy us much and image_format is asserted
in rbd_dev_header_name() and rbd_dev_header_info() anyway.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
A wrapper around rbd_dev_v{1,2}_header_info() to reduce duplication.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Make /sys/bus/rbd/devices/<id>/parent show the entire chain of parent
images. While at it, kernel sprintf() doesn't return negative values,
casting to unsigned long long is no longer necessary and there is no
good reason to split into multiple sprintf() calls.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Free memory allocated using kmem_cache_zalloc using kmem_cache_free
rather than kfree. The helper rbd_segment_name_free does the job here.
Its position is shifted above the calling function.
The Coccinelle semantic patch that detects this change is as follows:
// <smpl>
@@
expression x,E,c;
@@
x = \(kmem_cache_alloc\|kmem_cache_zalloc\|kmem_cache_alloc_node\)(c,...)
... when != x = E
when != &x
?-kfree(x)
+kmem_cache_free(c,x)
// </smpl>
Signed-off-by: Himangi Saraogi <himangi774@gmail.com>
Acked-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
image_id is leaked if the parent happens to have been recorded already.
Fix it.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Switch rbd_dev_header_{un,}watch_sync() to use the new helper and fix
rbd_dev_header_unwatch_sync() to destroy watch_request structures
before queuing watch-remove message while at it. This mistake slipped
into commit b30a01f2a3 ("rbd: fix osd_request memory leak in
__rbd_dev_header_watch_sync()") and could lead to "image still in use"
errors on image removal.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
In the past, rbd_dev_header_watch_sync() used to handle both watch and
unwatch requests and was entangled and leaky. Commit b30a01f2a3
("rbd: fix osd_request memory leak in __rbd_dev_header_watch_sync()")
split it into two separate functions. This commit cleanly abstracts
the common bits, relying on the fixed rbd_obj_request_wait().
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
rbd_obj_request_wait() should cancel the underlying OSD request if
interrupted. Otherwise libceph will hold onto it indefinitely, causing
assert failures or leaking the original object request.
This also adds an rbd wrapper around ceph_osdc_cancel_request() to
match rbd_obj_request_submit() and rbd_obj_request_wait().
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
The following check in rbd_img_obj_request_submit()
rbd_dev->parent_overlap <= obj_request->img_offset
allows the fall through to the non-layered write case even if both
parent_overlap and obj_request->img_offset belong to the same RADOS
object. This leads to data corruption, because the area to the left of
parent_overlap ends up unconditionally zero-filled instead of being
populated with parent data. Suppose we want to write 1M to offset 6M
of image bar, which is a clone of foo@snap; object_size is 4M,
parent_overlap is 5M:
rbd_data.<id>.0000000000000001
---------------------|----------------------|------------
| should be copyup'ed | should be zeroed out | write ...
---------------------|----------------------|------------
4M 5M 6M
parent_overlap obj_request->img_offset
4..5M should be copyup'ed from foo, yet it is zero-filled, just like
5..6M is.
Given that the only striping mode kernel client currently supports is
chunking (i.e. stripe_unit == object_size, stripe_count == 1), round
parent_overlap up to the next object boundary for the purposes of the
overlap check.
Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
rbd_open(), called every time the device is opened, calls
set_device_ro(). There's no reason to set the device read-only or
read-write every time it is opened. Just do this once during device
setup, using set_disk_ro() instead because the struct block_device
isn't available to us there.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
get_user() and set_disk_ro() may allocate memory, leading to a
potential deadlock if theye are called while a spin lock is held.
Move the acquisition and release of rbd_dev->lock from rbd_ioctl()
into rbd_ioctl_set_ro(), so it can occur between get_user() and
set_disk_ro().
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
When running the following commands:
[root@ceph0 mnt]# blockdev --setro /dev/rbd1
[root@ceph0 mnt]# blockdev --getro /dev/rbd1
0
The block setro didn't take effect, it is because
the rbd doesn't support ioctl of block driver.
This resolves:
http://tracker.ceph.com/issues/6265
Signed-off-by: Guangliang Zhao <guangliang@unitedstack.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
ida_destroy() needs to be called on module exit to release ida caches.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Each image request contains a reference count, but to date it has
not actually been used. (I think this was just an oversight.) A
recent report involving rbd failing an assertion shed light on why
and where we need to use these reference counts.
Every OSD request associated with an object request uses
rbd_osd_req_callback() as its callback function. That function will
call a helper function (dependent on the type of OSD request) that
will set the object request's "done" flag if the object request if
appropriate. If that "done" flag is set, the object request is
passed to rbd_obj_request_complete().
In rbd_obj_request_complete(), requests are processed in sequential
order. So if an object request completes before one of its
predecessors in the image request, the completion is deferred.
Otherwise, if it's a completing object's "turn" to be completed, it
is passed to rbd_img_obj_end_request(), which records the result of
the operation, accumulates transferred bytes, and so on. Next, the
successor to this request is checked and if it is marked "done",
(deferred) completion processing is performed on that request, and
so on. If the last object request in an image request is completed,
rbd_img_request_complete() is called, which (typically) destroys
the image request.
There is a race here, however. The instant an object request is
marked "done" it can be provided (by a thread handling completion of
one of its predecessor operations) to rbd_img_obj_end_request(),
which (for the last request) can then lead to the image request
getting torn down. And this can happen *before* that object has
itself entered rbd_img_obj_end_request(). As a result, once it
*does* enter that function, the image request (and even the object
request itself) may have been freed and become invalid.
All that's necessary to avoid this is to properly count references
to the image requests. We tear down an image request's object
requests all at once--only when the entire image request has
completed. So there's no need for an image request to count
references for its object requests. However, we don't want an
image request to go away until the last of its object requests
has passed through rbd_img_obj_callback(). In other words,
we don't want rbd_img_request_complete() to necessarily
result in the image request being destroyed, because it may
get called before we've finished processing on all of its
object requests.
So the fix is to add a reference to an image request for
each of its object requests. The reference can be viewed
as representing an object request that has not yet finished
its call to rbd_img_obj_callback(). That is emphasized by
getting the reference right after assigning that as the image
object's callback function. The corresponding release of that
reference is done at the end of rbd_img_obj_callback(), which
every image object request passes through exactly once.
Cc: stable@vger.kernel.org
Signed-off-by: Alex Elder <elder@linaro.org>
Reviewed-by: Ilya Dryomov <ilya.dryomov@inktank.com>