Commit Graph

4385 Commits

Author SHA1 Message Date
Lars Ellenberg
45d2933c92 drbd: also keep track of trim -> zero-out fallback peer_requests
To be able to find and present such zero-out fallback peer_requests
in debugfs, we add those to "active_ee", once that list drained.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:35:09 +02:00
Lars Ellenberg
b9ed7080d7 drbd: consistently use list_add_tail for peer_request tracking
Keep the epoch entry lists (active_ee, read_ee, sync_ee, ...)
consistently "oldest first".  That way finding the oldest not yet
successfully processed request is simply list_first_entry_or_null.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:35:08 +02:00
Lars Ellenberg
41d9f7cd5b drbd: drop drbd_md_flush
The only user of drbd_md_flush was bm_rw(),
and it is always followed by either a drbd_md_sync(),
or an al_write_transaction(), which, if so configured,
both end up submiting a FLUSH|FUA request anyways.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:35:07 +02:00
Lars Ellenberg
15e26f6a3c drbd: add drbd_queue_work_if_unqueued helper
We sometimes do
    if (list_empty(&w.list))
	drbd_queue_work(&q, &w.list);

Removal (list_del_init) may happen outside all locks, after all
pending work entries have been moved to an on-stack local work list.

For not dynamically allocated, but embeded, work structs,
we must avoid to re-add until it really was removed.

Move that list_empty check inside the spin_lock(&q->q_lock)
within the helper function, and change to list_empty_careful().

This may have been the reason for a list_add corruption
inside drbd_queue_work().

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:35:07 +02:00
Lars Ellenberg
7f34f61490 drbd: drbd_rs_number_requests: fix unit mismatch in comparison
We try to limit the number of "in-flight" resync requests.
One condition for that is the amount of requested data should not exceed
half of what can be covered by our "max-buffers" setting.

However we compared number of 4k pages with number of in-flight 512 Byte
sectors, and this extra throttle triggered much earlier than intended.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:35:06 +02:00
Lars Ellenberg
f88c5d90cc drbd: cosmetic: change all printk(level, ...) to pr_<level>(...)
Cosmetic change only.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:35:05 +02:00
Lars Ellenberg
2f2abeae3c drbd: clear CRASHED_PRIMARY only after successful resync
If we lost a disk during the first resync after primary crash,
we could have prematurely cleared the CRASHED_PRIMARY flag.
Testing on C_CONNECTED is not what we meant there,
but testing for both peers to become D_UP_TO_DATE.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:35:05 +02:00
Lars Ellenberg
506afb6248 drbd: improve resync request throttling due to sendbuf size
If we throttle resync because the socket sendbuffer is filling up,
tell TCP about it, so it may expand the sendbuffer for us.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:35:04 +02:00
Joe Perches
659b2e3bb8 block: Convert last uses of __FUNCTION__ to __func__
Just about all of these have been converted to __func__,
so convert the last uses.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:35:04 +02:00
Monam Agarwal
ccdd6a93ee drivers/block: Use RCU_INIT_POINTER(x, NULL) in drbd/drbd_state.c
This patch replaces rcu_assign_pointer(x, NULL) with RCU_INIT_POINTER(x, NULL)

The rcu_assign_pointer() ensures that the initialization of a structure
is carried out before storing a pointer to that structure.
And in the case of the NULL pointer, there is no structure to initialize.
So, rcu_assign_pointer(p, NULL) can be safely converted to RCU_INIT_POINTER(p, NULL)

Signed-off-by: Monam Agarwal <monamagarwal123@gmail.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:35:03 +02:00
Lars Ellenberg
0c066bc39e drbd: short-circuit in maybe_pull_ahead
If we already "pulled ahead", we can short-circuit,
and avoid logging the same messages over and over again.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:35:02 +02:00
Lars Ellenberg
08d0dabf48 drbd: application writes may set-in-sync in protocol != C
If "dirty" blocks are written to during resync,
that brings them in-sync.

By explicitly requesting write-acks during resync even in protocol != C,
we now can actually respect this.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:35:02 +02:00
Philipp Reisner
5d0b17f1a2 drbd: New net configuration option socket-check-timeout
In setups involving a DRBD-proxy and connections that experience a lot of
buffer-bloat it might be necessary to set ping-timeout to an
unusual high value. By default DRBD uses the same value to wait if a newly
established TCP-connection is stable. Since the DRBD-proxy is usually located
in the same data center such a long wait time may hinder DRBD's connect process.

In such setups socket-check-timeout should be set to
at least to the round trip time between DRBD and DRBD-proxy. I.e. in most
cases to 1.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:35:01 +02:00
Philipp Reisner
4920e37a9e drbd: Limit the time we are waiting for the first packet on an accepted socket
Before the patch
'drbd: Keep the listening socket open while trying to connect to the peer'

the newly created socket inherited the receive timeout from the listen
socket. The listen socket had a receive timeout of connect-intervall
+- 30% random jitter.

The real issue is that after the mentioned patch we had no timeout at all.
Now use 4 times the ping-timeout.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:35:00 +02:00
Lars Ellenberg
aaaba34576 drbd: implement csums-after-crash-only
Checksum based resync trades CPU cycles for network bandwidth,
in situations where we expect much of the to-be-resynced blocks
to be actually identical on both sides already.

In a "network hickup" scenario, it won't help:
all to-be-resynced blocks will typically be different.

The use case is for the resync of *potentially* different blocks
after crash recovery -- the crash recovery had marked larger areas
(those covered by the activity log) as need-to-be-resynced,
just in case. Most of those blocks will be identical.

This option makes it possible to configure checksum based resync,
but only actually use it for the first resync after primary crash.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:35:00 +02:00
Lars Ellenberg
6a8d68b187 drbd: don't implicitly resize Diskless node beyond end of device
During handshake, we compare backend sizes, and user set limits,
and agree on what device size we are going to expose.

We remember that last-agreed-size in our meta data.

But if we come up diskless, we have to accept what the peer
presents us with. We used to accept the peers maximum potential
capacity (backend size), which is wrong, and could lead to IO errors
due to access beyond end of device.

Instead, we need to accept the peer's current size.
Unless that is communicated as 0, in which case we
accept the backend size, or the user set limit, if set.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:34:59 +02:00
Lars Ellenberg
a5655dac75 drbd: fix bogus resync stats in /proc/drbd
We intentionally do not serialize /proc/drbd access with
internal state changes or statistic updates.

Because of that, cat /proc/drbd  may race with resync just being
finished, still see the sync state, and find information about
number of blocks still to go, but then find the total number
of blocks within this resync has just been reset to 0
when accessing it.

This now produces bogus numbers in the resync speed estimates.

Fix by accessing all relevant data only once,
and fixing it up if "still to go" happens to be more than "total".

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:34:59 +02:00
Andreas Gruenbacher
caa3db0e14 drbd: Remove unnecessary/unused code
Get rid of dump_stack() debug statements.

There is no point whatsoever in registering and unregistering a reboot
notifier that doesn't do anything.

The intention was to switch to an "emergency read-only" mode,
so we won't have to resync the full activity log just because
we had been Primary before the reboot.

Once we have that implemented, we may re-introduce the reboot notifier.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:34:58 +02:00
Lars Ellenberg
8ce953aa39 drbd: silence -Wmissing-prototypes warnings
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:34:57 +02:00
Lars Ellenberg
3e0c78d346 drbd: drop wrong debugging aid
The textual representation of resync extents in /proc/drbd presented
with proc_details >= 3 was wrong, it used bitnumbers as bitmasks.

It was not particularly useful either, and I doubt anyone has even tried
to look at it in the last few years. Drop it.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:34:57 +02:00
Lars Ellenberg
4dd726f029 drbd: get rid of drbd_queue_work_front
The last user was al_write_transaction, if called with "delegate",
and the last user to call it with "delegate = true" was the receiver
thread, which has no need to delegate, but can call it himself.

Finally drop the delegate parameter, drop the extra
w_al_write_transaction callback, and drop drbd_queue_work_front.

Do not (yet) change dequeue_work_item to dequeue_work_batch, though.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:34:56 +02:00
Lars Ellenberg
ac0acb9e39 drbd: use drbd_device_post_work() in more places
This replaces the md_sync_work member of struct drbd_device
by a new MD_SYNC "work bit" in device->flags.

This replaces the resync_start_work member of struct drbd_device
by a new RS_START "work bit" in device->flags.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:34:55 +02:00
Lars Ellenberg
e334f55095 drbd: make sure disk cleanup happens in worker context
The recent fix to put_ldev() (correct ordering of access to local_cnt
and state.disk; memory barrier in __drbd_set_state) guarantees
that the cleanup happens exactly once.

However it does not yet guarantee that the cleanup happens from worker
context, the last put_ldev() may still happen from atomic context,
which must not happen: blkdev_put() may sleep.

Fix this by scheduling the cleanup to the worker instead,
using a couple more bits in device->flags and a new helper,
drbd_device_post_work().

Generalized the "resync progress" work to cover these new work bits.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:34:55 +02:00
Lars Ellenberg
ba3c6fb87d drbd: close race when detaching from disk
BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
IP: bd_release+0x21/0x70
Process drbd_w_t7146
Call Trace:
 close_bdev_exclusive
 drbd_free_ldev		[drbd]
 drbd_ldev_destroy	[drbd]
 w_after_state_ch	[drbd]

Race probably went like this:
  state.disk = D_FAILED

... first one to hit zero during D_FAILED:
   put_ldev() /* ----------------> 0 */
     i = atomic_dec_return()
     if (i == 0)
       if (state.disk == D_FAILED)
         schedule_work(go_diskless)
                                /* 1 <------ */ get_ldev_if_state()
   go_diskless()
      do_some_pre_cleanup()                     corresponding put_ldev():
      force_state(D_DISKLESS)   /* 0 <------ */ i = atomic_dec_return()
                                                if (i == 0)
        atomic_inc() /* ---------> 1 */
        state.disk = D_DISKLESS
        schedule_work(after_state_ch)           /* execution pre-empted by IRQ ? */

   after_state_ch()
     put_ldev()
       i = atomic_dec_return()  /* 0 */
       if (i == 0)
         if (state.disk == D_DISKLESS)            if (state.disk == D_DISKLESS)
           drbd_ldev_destroy()                      drbd_ldev_destroy();

Trying to fix this by checking the disk state *before* the
atomic_dec_return(), which implies memory barriers, and by inserting
extra memory barriers around the state assignment in __drbd_set_state().

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:34:54 +02:00
Lars Ellenberg
2ed912e9d3 drbd: explicitly submit meta data requests with REQ_NOIDLE
For some reason we have assumed NOIDLE was implied
by one of the other flags we set. It is not (anymore?).
Explicitly set REQ_NOIDLE for synchronous meta data updates,
or we can seriously starve random writes when using CFQ.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:34:54 +02:00
Lars Ellenberg
720979fb90 drbd: move set_disk_ro() to after we persisted the new role
This probably does not have any real life impact,
but we should first persist any potentially new UUID
and other meta data flags, as well as our new role,
before we allow/disallow write access.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:34:53 +02:00
Lars Ellenberg
123ff122ad drbd: trigger tcp_push_pending_frames() for PING and PING_ACK
This should reduce latency for such in-DRBD-protocol "pings",
and may help reduce spurious disconnect/reconnect cycles due to
 "PingAck did not arrive in time."

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:34:52 +02:00
Lars Ellenberg
66ce6dbce2 drbd: re-add lost conf_mutex protection in drbd_set_role
The conf_update mutex used to be held while clearing the
net_conf->discard_my_data flag inside drbd_set_role.

It was moved into drbd_adm_set_role with
    drbd: allow parallel promote/demote actions
but then replaced at that location by the newly introduced adm_mutex with
    drbd: Fix a potential deadlock in drbdsetup, introduce resource->adm_mutex

And I simply forgot to put it back in at the original location.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:34:52 +02:00
Lars Ellenberg
fcb096740a drbd: stop the meta data sync timer before open coded meta data sync
If we re-write all meta data due to resize, we have open-coded write-out
of our meta data super block. Stop the md_sync_timer, it would just
trigger scary but in this case spurious "timer expired" messages.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:34:51 +02:00
Lars Ellenberg
5ab7d2c005 drbd: fix resync finished detection
This fixes one recent regresion,
and one long existing bug.

The bug:
drbd_try_clear_on_disk_bm() assumed that all "count" bits have to be
accounted in the resync extent corresponding to the start sector.

Since we allow application requests to cross our "extent" boundaries,
this assumption is no longer true, resulting in possible misaccounting,
scary messages
("BAD! sector=12345s enr=6 rs_left=-7 rs_failed=0 count=58 cstate=..."),
and potentially, if the last bit to be cleared during resync would
reside in previously misaccounted resync extent, the resync would never
be recognized as finished, but would be "stalled" forever, even though
all blocks are in sync again and all bits have been cleared...

The regression was introduced by
    drbd: get rid of atomic update on disk bitmap works

For an "empty" resync (rs_total == 0), we must not "finish" the
resync on the SyncSource before the SyncTarget knows all relevant
information (sync uuid).  We need to wait for the full round-trip,
the SyncTarget will then explicitly notify us.

Also for normal, non-empty resyncs (rs_total > 0), the resync-finished
condition needs to be tested before the schedule() in wait_for_work, or
it is likely to be missed.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:34:50 +02:00
Lars Ellenberg
a80ca1ae81 drbd: fix a race stopping the worker thread
We may implicitly call drbd_send() from inside wait_for_work(),
via maybe_send_barrier().

If the "stop" signal was send just before that, drbd_send() would call
flush_signals(), and we would run an unbounded schedule() afterwards.

Fix: check for thread_state == RUNNING before we schedule()

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:34:50 +02:00
Lars Ellenberg
c7a58db4e9 drbd: get rid of atomic update on disk bitmap works
Just trigger the occasional lazy bitmap write-out during resync
from the central wait_for_work() helper.

Previously, during resync, bitmap pages would be written out separately,
synchronously, one at a time, at least 8 times each (every 512 bytes
worth of bitmap cleared).

Now we trigger "merge friendly" bulk write out of all cleared pages
every two seconds during resync, and once the resync is finished.
Most pages will be written out only once.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:34:49 +02:00
Lars Ellenberg
70df70927b drbd: allow write-ordering policy to be bumped up again
Previously, once you disabled flushes as a means of enforcing
write-ordering, you'd need to detach/re-attach to enable them again.

Allow drbdsetup disk-options to re-enable previously disabled
write-ordering policy options at runtime.

While at it fix RCU in drbd_bump_write_ordering()
max_allowed_wo() uses rcu_dereference, therefore it must
be called within rcu_read_lock()/rcu_read_unlock()

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 15:22:22 +02:00
Lars Ellenberg
44a4d55184 drbd: refactor use of first_peer_device()
Reduce the number of calls to first_peer_device(). Instead, call
first_peer_device() just once to assign a local variable peer_device.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 15:22:22 +02:00
Lars Ellenberg
35b5ed5bba drbd: reduce number of spinlock drop/re-aquire cycles
Instead of dropping and re-aquiring the spinlock around the submit,
just remember that we want to submit, and do that only once we have
dropped the spinlock for good.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 15:22:21 +02:00
Philipp Reisner
28995af5cf drbd: rename drbd_free_bc() to drbd_free_ldev()
Since the member of drbd_device is called ldev

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 15:22:21 +02:00
Philipp Reisner
8fe39aac05 drbd: device->ldev is not guaranteed on an D_ATTACHING disk
Some parts of the code assumed that get_ldev_if_state(device, D_ATTACHING)
is sufficient to access the ldev member of the device object. That was
wrong. ldev may not be there or might be freed at any time if the device
has a disk state of D_ATTACHING.

bm_rw()
  Documented that drbd_bm_read() is only called from drbd_adm_attach.
  drbd_bm_write() is only called when a reference is held, and it is
  documented that a caller has to hold a reference before calling
  drbd_bm_write()

drbd_bm_write_page()
  Use get_ldev() instead of get_ldev_if_state(device, D_ATTACHING)

drbd_bmio_set_n_write()
  No longer use get_ldev_if_state(device, D_ATTACHING). All callers
  hold a reference to ldev now.

drbd_bmio_clear_n_write()
  All callers where holding a reference of ldev anyways. Remove the
  misleading get_ldev_if_state(device, D_ATTACHING)

drbd_reconsider_max_bio_size()
  Removed the get_ldev_if_state(device, D_ATTACHING). All callers
  now pass a struct drbd_backing_dev* when they have a proper
  reference, or a NULL pointer.
  Before this fix, the receiver could trigger a NULL pointer
  deref when in drbd_reconsider_max_bio_size()

drbd_bump_write_ordering()
  Used get_ldev_if_state(device, D_ATTACHING) with the wrong assumption.
  Remove it, and allow the caller to pass in a struct drbd_backing_dev*
  when the caller knows that accessing this bdev is safe.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 15:22:20 +02:00
Philipp Reisner
e952658020 drbd: Move write_ordering from connection to resource
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 15:22:19 +02:00
Ming Lei
6a27b656fc block: virtio-blk: support multi virt queues per virtio-blk device
Firstly this patch supports more than one virtual queues for virtio-blk
device.

Secondly this patch maps the virtual queue to blk-mq's hardware queue.

With this approach, both scalability and performance can be improved.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-07-01 10:51:03 -06:00
Linus Torvalds
3493860c76 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "A small collection of fixes/changes for the current series.  This
  contains:

   - Removal of dead code from Gu Zheng.

   - Revert of two bad fixes that went in earlier in this round, marking
     things as __init that were not purely used from init.

   - A fix for blk_mq_start_hw_queue() using the __blk_mq_run_hw_queue(),
     which could place us wrongly.  Make it use the non __ variant,
     which handles cases where we are called from the wrong CPU set.
     From me.

   - A fix for drbd, which allocates discard requests without room for
     the SCSI payload.  From Lars Ellenberg.

   - A fix for user-after-free in the blkcg code from Tejun.

   - Addition of limiting gaps in SG lists, if the hardware needs it.
     This is the last pre-req patch for blk-mq to enable the full NVMe
     conversion.  Could wait until 3.17, but it's simple enough so would
     be nice to have everything we need for the NVMe port in the 3.17
     release.  From me"

* 'for-linus' of git://git.kernel.dk/linux-block:
  drbd: fix NULL pointer deref in blk_add_request_payload
  blk-mq: blk_mq_start_hw_queue() should use blk_mq_run_hw_queue()
  block: add support for limiting gaps in SG lists
  bio: remove unused macro bip_vec_idx()
  Revert "block: add __init to elv_register"
  Revert "block: add __init to blkcg_policy_register"
  blkcg: fix use-after-free in __blkg_release_rcu() by making blkcg_gq refcnt an atomic_t
  floppy: format block0 read error message properly
2014-06-26 13:06:13 -07:00
Lars Ellenberg
54ed4ed8f9 drbd: fix NULL pointer deref in blk_add_request_payload
Discards don't have any payload.
But the scsi layer still expects a bio_vec it can use internally,
see sd_setup_discard_cmnd() and blk_add_request_payload().

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-06-25 09:53:47 -06:00
Ilya Dryomov
9638556a27 rbd: handle parent_overlap on writes correctly
The following check in rbd_img_obj_request_submit()

    rbd_dev->parent_overlap <= obj_request->img_offset

allows the fall through to the non-layered write case even if both
parent_overlap and obj_request->img_offset belong to the same RADOS
object.  This leads to data corruption, because the area to the left of
parent_overlap ends up unconditionally zero-filled instead of being
populated with parent data.  Suppose we want to write 1M to offset 6M
of image bar, which is a clone of foo@snap; object_size is 4M,
parent_overlap is 5M:

    rbd_data.<id>.0000000000000001
     ---------------------|----------------------|------------
    | should be copyup'ed | should be zeroed out | write ...
     ---------------------|----------------------|------------
   4M                    5M                     6M
                    parent_overlap    obj_request->img_offset

4..5M should be copyup'ed from foo, yet it is zero-filled, just like
5..6M is.

Given that the only striping mode kernel client currently supports is
chunking (i.e. stripe_unit == object_size, stripe_count == 1), round
parent_overlap up to the next object boundary for the purposes of the
overlap check.

Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2014-06-23 12:55:37 +04:00
Linus Torvalds
f1d702487b Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "A smaller collection of fixes for the block core that would be nice to
  have in -rc2.  This pull request contains:

   - Fixes for races in the wait/wakeup logic used in blk-mq from
     Alexander.  No issues have been observed, but it is definitely a
     bit flakey currently.  Alternatively, we may drop the cyclic
     wakeups going forward, but that needs more testing.

   - Some cleanups from Christoph.

   - Fix for an oops in null_blk if queue_mode=1 and softirq completions
     are used.  From me.

   - A fix for a regression caused by the chunk size setting.  It
     inadvertently used max_hw_sectors instead of max_sectors, which is
     incorrect, and causes hangs on btrfs multi-disk setups (where hw
     sectors apparently isn't set).  From me.

   - Removal of WQ_POWER_EFFICIENT in the kblockd creation.  This was a
     recent addition as well, but it actually breaks blk-mq which relies
     on strict scheduling.  If the workqueue power_efficient mode is
     turned on, this breaks blk-mq.  From Matias.

   - null_blk module parameter description fix from Mike"

* 'for-linus' of git://git.kernel.dk/linux-block:
  blk-mq: bitmap tag: fix races in bt_get() function
  blk-mq: bitmap tag: fix race on blk_mq_bitmap_tags::wake_cnt
  blk-mq: bitmap tag: fix races on shared ::wake_index fields
  block: blk_max_size_offset() should check ->max_sectors
  null_blk: fix softirq completions for queue_mode == 1
  blk-mq: merge blk_mq_drain_queue and __blk_mq_drain_queue
  blk-mq: properly drain stopped queues
  block: remove WQ_POWER_EFFICIENT from kblockd
  null_blk: fix name and description of 'queue_mode' module parameter
  block: remove elv_abort_queue and blk_abort_flushes
2014-06-19 17:56:43 -10:00
Jens Axboe
e64d468773 Merge branch 'for-jens' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/linux-block into for-linus 2014-06-18 10:30:22 -07:00
Jiri Kosina
1c65df3d7b floppy: format block0 read error message properly
In case reading of block 0 fails, line without trailing newline
is printed causing dmesg to look horrible.

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2014-06-18 13:44:18 +02:00
Jens Axboe
d891fa7087 null_blk: fix softirq completions for queue_mode == 1
Only blk-mq completions have payload attached to the request, for
request_fn mode we have stored it in req->special. This fixes an
oops with queue_mode=1 and softirq completions.

Signed-off-by: Jens Axboe <axboe@fb.com>
2014-06-16 11:40:25 -06:00
Linus Torvalds
b55b390202 Merge git://git.infradead.org/users/willy/linux-nvme
Pull NVMe update from Matthew Wilcox:
 "Mostly bugfixes again for the NVMe driver.  I'd like to call out the
  exported tracepoint in the block layer; I believe Keith has cleared
  this with Jens.

  We've had a few reports from people who're really pounding on NVMe
  devices at scale, hence the timeout changes (and new module
  parameters), hotplug cpu deadlock, tracepoints, and minor performance
  tweaks"

[ Jens hadn't seen that tracepoint thing, but is ok with it - it will
  end up going away when mq conversion happens ]

* git://git.infradead.org/users/willy/linux-nvme: (22 commits)
  NVMe: Fix START_STOP_UNIT Scsi->NVMe translation.
  NVMe: Use Log Page constants in SCSI emulation
  NVMe: Define Log Page constants
  NVMe: Fix hot cpu notification dead lock
  NVMe: Rename io_timeout to nvme_io_timeout
  NVMe: Use last bytes of f/w rev SCSI Inquiry
  NVMe: Adhere to request queue block accounting enable/disable
  NVMe: Fix nvme get/put queue semantics
  NVMe: Delete NVME_GET_FEAT_TEMP_THRESH
  NVMe: Make admin timeout a module parameter
  NVMe: Make iod bio timeout a parameter
  NVMe: Prevent possible NULL pointer dereference
  NVMe: Fix the buffer size passed in GetLogPage(CDW10.NUMD)
  NVMe: Update data structures for NVMe 1.2
  NVMe: Enable BUILD_BUG_ON checks
  NVMe: Update namespace and controller identify structures to the 1.1a spec
  NVMe: Flush with data support
  NVMe: Configure support for block flush
  NVMe: Add tracepoints
  NVMe: Protect against badly formatted CQEs
  ...
2014-06-15 15:58:03 -10:00
Dan McLeran
b8e080847a NVMe: Fix START_STOP_UNIT Scsi->NVMe translation.
This patch contains several fixes for Scsi START_STOP_UNIT. The previous
code did not account for signed vs. unsigned arithmetic which resulted
in an invalid lowest power state caculation when the device only supports
1 power state.

The code for Power Condition == 2 (Idle) was not following the spec. The
spec calls for setting the device to specific power states, depending
upon Power Condition Modifier, without accounting for the number of
power states supported by the device.

The code for Power Condition == 3 (Standby) was using a hard-coded '0'
which is replaced with the macro POWER_STATE_0.

Signed-off-by: Dan McLeran <daniel.mcleran@intel.com>
Reviewed-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-06-13 13:11:00 -04:00
Matthew Wilcox
ef351b97de NVMe: Use Log Page constants in SCSI emulation
The nvme-scsi file defined its own Log Page constant.  Use the
newly-defined one from the header file instead.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-06-13 10:54:21 -04:00
Keith Busch
f3db22feb5 NVMe: Fix hot cpu notification dead lock
There is a potential dead lock if a cpu event occurs during nvme probe
since it registered with hot cpu notification. This fixes the race by
having the module register with notification outside of probe rather
than have each device register.

The actual work is done in a scheduled work queue instead of in the
notifier since assigning IO queues has the potential to block if the
driver creates additional queues.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-06-13 10:43:34 -04:00