Commit Graph

1113 Commits

Author SHA1 Message Date
Max Gurtovoy
d9d53ed3f7 nvme: add get-feature to admin cmds tracer
This will print get-feature cmd in more informative way. For example,
run "nvme get-feature /dev/nvme0 -n 1 -f 0x9 -c 10" will trace:

 nvme-3907  [008] ....  1763.635054: nvme_setup_cmd: nvme0: qid=0, cmdid=6, nsid=1, flags=0x0, meta=0x0, cmd=(nvme_admin_get_features fid=0x9 sel=0x0 cdw11=0xa)
<idle>-0     [001] d.h.  1763.635112: nvme_sq: nvme0: qid=0, head=27, tail=27
<idle>-0     [008] ..s.  1763.635121: nvme_complete_rq: nvme0: qid=0, cmdid=6, res=10, retries=0, flags=0x2, status=0

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-03-13 12:05:39 -06:00
Chaitanya Kulkarni
34e08191b1 nvme-rdma: use nr_phys_segments when map rq to sgl
Use blk_rq_nr_phys_segments() instead of blk_rq_payload_bytes() to check
if a command contains data to be mapped.  This fixes the case where
a struct request contains LBAs, but it has no payload, such as
Write Zeroes support.

Fixes: 6e02318eae ("nvme: add support for the Write Zeroes command")
Reported-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Tested-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-02-21 06:39:20 -07:00
Christoph Hellwig
bc50ad7501 nvme: convert to SPDX identifiers
Update license to use SPDX-License-Identifier instead of verbose license
text.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2019-02-20 07:22:28 -07:00
Christoph Hellwig
5f37396dff nvme-pci: convert to SPDX identifiers
Update license to use SPDX-License-Identifier instead of verbose license
text.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2019-02-20 07:22:25 -07:00
Christoph Hellwig
115aa7abd7 nvme-lightnvm: convert to SPDX identifiers
Update license to use SPDX-License-Identifier instead of verbose license
text.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2019-02-20 07:22:22 -07:00
Christoph Hellwig
5d8762d568 nvme-rdma: convert to SPDX identifiers
Update license to use SPDX-License-Identifier instead of verbose license
text.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2019-02-20 07:22:19 -07:00
Christoph Hellwig
8638b24614 nvme-fc: convert to SPDX identifiers
Update license to use SPDX-License-Identifier instead of verbose license
text.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2019-02-20 07:22:17 -07:00
Christoph Hellwig
9002c4e5ff nvme-fabrics: convert to SPDX identifiers
Update license to use SPDX-License-Identifier instead of verbose license
text.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2019-02-20 07:22:13 -07:00
Hannes Reinecke
ab4ab09cbd nvme: return error from nvme_alloc_ns()
nvme_alloc_ns() might fail, so we should be returning an error code.

Signed-off-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-02-20 07:21:00 -07:00
Bart Van Assche
b9c77583b0 nvme: avoid that deleting a controller triggers a circular locking complaint
Rework nvme_delete_ctrl_sync() such that it does not have to wait for
queued work. This patch avoids that test nvme/008 triggers the following
complaint:

WARNING: possible circular locking dependency detected
5.0.0-rc6-dbg+ #10 Not tainted
------------------------------------------------------
nvme/7918 is trying to acquire lock:
000000009a1a7b69 ((work_completion)(&ctrl->delete_work)){+.+.}, at: __flush_work+0x379/0x410

but task is already holding lock:
00000000ef5a45b4 (kn->count#389){++++}, at: kernfs_remove_self+0x196/0x210

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (kn->count#389){++++}:
       lock_acquire+0xc5/0x1e0
       __kernfs_remove+0x42a/0x4a0
       kernfs_remove_by_name_ns+0x45/0x90
       remove_files.isra.1+0x3a/0x90
       sysfs_remove_group+0x5c/0xc0
       sysfs_remove_groups+0x39/0x60
       device_remove_attrs+0x68/0xb0
       device_del+0x24d/0x570
       cdev_device_del+0x1a/0x50
       nvme_delete_ctrl_work+0xbd/0xe0
       process_one_work+0x4f1/0xa40
       worker_thread+0x67/0x5b0
       kthread+0x1cf/0x1f0
       ret_from_fork+0x24/0x30

-> #0 ((work_completion)(&ctrl->delete_work)){+.+.}:
       __lock_acquire+0x1323/0x17b0
       lock_acquire+0xc5/0x1e0
       __flush_work+0x399/0x410
       flush_work+0x10/0x20
       nvme_delete_ctrl_sync+0x65/0x70
       nvme_sysfs_delete+0x4f/0x60
       dev_attr_store+0x3e/0x50
       sysfs_kf_write+0x87/0xa0
       kernfs_fop_write+0x186/0x240
       __vfs_write+0xd7/0x430
       vfs_write+0xfa/0x260
       ksys_write+0xab/0x130
       __x64_sys_write+0x43/0x50
       do_syscall_64+0x71/0x210
       entry_SYSCALL_64_after_hwframe+0x49/0xbe

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(kn->count#389);
                               lock((work_completion)(&ctrl->delete_work));
                               lock(kn->count#389);
  lock((work_completion)(&ctrl->delete_work));

 *** DEADLOCK ***

3 locks held by nvme/7918:
 #0: 00000000e2223b44 (sb_writers#6){.+.+}, at: vfs_write+0x1eb/0x260
 #1: 000000003404976f (&of->mutex){+.+.}, at: kernfs_fop_write+0x128/0x240
 #2: 00000000ef5a45b4 (kn->count#389){++++}, at: kernfs_remove_self+0x196/0x210

stack backtrace:
CPU: 4 PID: 7918 Comm: nvme Not tainted 5.0.0-rc6-dbg+ #10
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
Call Trace:
 dump_stack+0x86/0xca
 print_circular_bug.isra.36.cold.54+0x173/0x1d5
 check_prev_add.constprop.45+0x996/0x1110
 __lock_acquire+0x1323/0x17b0
 lock_acquire+0xc5/0x1e0
 __flush_work+0x399/0x410
 flush_work+0x10/0x20
 nvme_delete_ctrl_sync+0x65/0x70
 nvme_sysfs_delete+0x4f/0x60
 dev_attr_store+0x3e/0x50
 sysfs_kf_write+0x87/0xa0
 kernfs_fop_write+0x186/0x240
 __vfs_write+0xd7/0x430
 vfs_write+0xfa/0x260
 ksys_write+0xab/0x130
 __x64_sys_write+0x43/0x50
 do_syscall_64+0x71/0x210
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-02-20 07:19:42 -07:00
Bart Van Assche
a686ed75c0 nvme: introduce a helper function for controller deletion
This patch does not change any functionality but makes the next patch
in this series easier to read.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-02-20 07:18:21 -07:00
Bart Van Assche
d84c4b024a nvme: unexport nvme_delete_ctrl_sync()
Since nvme_delete_ctrl_sync() is not called from any other kernel module,
unexport it.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-02-20 07:18:15 -07:00
Bart Van Assche
e895fedf12 nvme-pci: check kstrtoint() return value in queue_count_set()
This patch avoids that the compiler complains about 'ret' being set
but not being used when building with W=1.

Fixes: 3b6592f70a ("nvme: utilize two queue maps, one for reads and one for writes") # v5.0-rc1
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-02-20 07:18:07 -07:00
Bart Van Assche
a467fc55fc nvme-fabrics: document the poll function argument
This patch avoids that the kernel-doc tool reports a warning when
building with W=1.

Fixes: 26c682274e ("nvme-fabrics: allow nvmf_connect_io_queue to poll") # v5.0-rc1
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-02-20 07:18:01 -07:00
Hannes Reinecke
75c10e7327 nvme-multipath: round-robin I/O policy
Implement a simple round-robin I/O policy for multipathing.  Path
selection is done in two rounds, first iterating across all optimized
paths, and if that doesn't return any valid paths, iterate over all
optimized and non-optimized paths.  If no paths are found, use the
existing algorithm.  Also add a sysfs attribute 'iopolicy' to switch
between the current NUMA-aware I/O policy and the 'round-robin' I/O
policy.

Signed-off-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-02-20 07:17:49 -07:00
Jens Axboe
6fb845f0e7 Linux 5.0-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFRBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlxgqNUeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGwsoH+OVXu0NQofwTvVru
 8lgF3BSDG2mhf7mxbBBlBizGVy9jnjRNGCFMC+Jq8IwiFLwprja/G27kaDTkpuF1
 PHC3yfjKvjTeUP5aNdHlmxv6j1sSJfZl0y46DQal4UeTG/Giq8TFTi+Tbz7Wb/WV
 yCx4Lr8okAwTuNhnL8ojUCVIpd3c8QsyR9v6nEQ14Mj+MvEbokyTkMJV0bzOrM38
 JOB+/X1XY4JPZ6o3MoXrBca3bxbAJzMneq+9CWw1U5eiIG3msg4a+Ua3++RQMDNr
 8BP0yCZ6wo32S8uu0PI6HrZaBnLYi5g9Wh7Q7yc0mn1Uh1zWFykA6TtqK90agJeR
 A6Ktjw==
 =scY4
 -----END PGP SIGNATURE-----

Merge tag 'v5.0-rc6' into for-5.1/block

Pull in 5.0-rc6 to avoid a dumb merge conflict with fs/iomap.c.
This is needed since io_uring is now based on the block branch,
to avoid a conflict between the multi-page bvecs and the bits
of io_uring that touch the core block parts.

* tag 'v5.0-rc6': (525 commits)
  Linux 5.0-rc6
  x86/mm: Make set_pmd_at() paravirt aware
  MAINTAINERS: Update the ocores i2c bus driver maintainer, etc
  blk-mq: remove duplicated definition of blk_mq_freeze_queue
  Blk-iolatency: warn on negative inflight IO counter
  blk-iolatency: fix IO hang due to negative inflight counter
  MAINTAINERS: unify reference to xen-devel list
  x86/mm/cpa: Fix set_mce_nospec()
  futex: Handle early deadlock return correctly
  futex: Fix barrier comment
  net: dsa: b53: Fix for failure when irq is not defined in dt
  blktrace: Show requests without sector
  mips: cm: reprime error cause
  mips: loongson64: remove unreachable(), fix loongson_poweroff().
  sit: check if IPv6 enabled before calling ip6_err_gen_icmpv6_unreach()
  geneve: should not call rt6_lookup() when ipv6 was disabled
  KVM: nVMX: unconditionally cancel preemption timer in free_nested (CVE-2019-7221)
  KVM: x86: work around leak of uninitialized stack contents (CVE-2019-7222)
  kvm: fix kvm_ioctl_create_device() reference counting (CVE-2019-6974)
  signal: Better detection of synchronous signals
  ...
2019-02-15 08:43:59 -07:00
Keith Busch
5c959d73db nvme-pci: fix rapid add remove sequence
A surprise removal may fail to tear down request queues if it is racing
with the initial asynchronous probe. If that happens, the remove path
won't see the queue resources to tear down, and the controller reset
path may create a new request queue on a removed device, but will not
be able to make forward progress, deadlocking the pci removal.

Protect setting up non-blocking resources from a shutdown by holding the
same mutex, and transition to the CONNECTING state after these resources
are initialized so the probe path may see the dead controller state
before dispatching new IO.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=202081
Reported-by: Alex Gagniuc <Alex_Gagniuc@Dellteam.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Tested-by: Alex Gagniuc <mr.nuke.me@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-02-06 16:35:33 +01:00
Keith Busch
e7ad43c3ed nvme: lock NS list changes while handling command effects
If a controller supports the NS Change Notification, the namespace
scan_work is automatically triggered after attaching a new namespace.

Occasionally the namespace scan_work may append the new namespace to the
list before the admin command effects handling is completed. The effects
handling unfreezes namespaces, but if it unfreezes the newly attached
namespace, its request_queue freeze depth will be off and we'll hit the
warning in blk_mq_unfreeze_queue().

On the next namespace add, we will fail to freeze that queue due to the
previous bad accounting and deadlock waiting for frozen.

Fix that by preventing scan work from altering the namespace list while
command effects handling needs to pair freeze with unfreeze.

Reported-by: Wen Xiong <wenxiong@us.ibm.com>
Tested-by: Wen Xiong <wenxiong@us.ibm.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-02-06 16:35:33 +01:00
Sagi Grimberg
794a4cb3d2 nvme: remove the .stop_ctrl callout
It is used now just to flush error recovery and reconnect work items in
the RDMA and TCP transports, which can simply be moved to the
corresponding teardown routines.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-02-04 15:41:25 +01:00
Chaitanya Kulkarni
6e02318eae nvme: add support for the Write Zeroes command
Allow write zeroes operations (REQ_OP_WRITE_ZEROES) on the block
device, if the device supports an optional command bit set for write
zeroes. Add support to setup write zeroes command. Set maximum possible
write zeroes sectors in one write zeroes command according to
nvme write zeroes command definition.

This patch was posted as a part of block-write-zeroes support
implementation (https://patchwork.kernel.org/patch/9454859/),
but did not make into mainline kernel as it got reverted due to
failure on the Linus's machine.

In this patch in order to be more cautious, we use NVMe controller's
maximum hardware sector size which is calculated based on the
controller's MDTS (Maximum Data Transfer Size) field to calculate
the maximum sectors for the write zeroes request.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
[folded a fix from Keith Busch to properly respect
 NVME_QUIRK_DEALLOCATE_ZEROES]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-02-04 15:41:25 +01:00
Linus Torvalds
6b8f915916 for-linus-20190125
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlxLdgsQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpoVGD/4sGYQqfiXogQIJYbPH2RRPrJuLIIITjiAv
 lPXX1wx/tvz/ktwKJE2OiWTES0JjdH1HlC+7/0L/fLb8DXiKBmFuUHlwhureoL9Y
 o//BIQKuaje35kyHITTy2UAJOOqnNtJTaAP2AfkL+eOcj/V/G5rJIfLGs9QtuAR7
 sRJ+uhg1EbW/+uO0bULDmG6WUjxFu8mqcw3i6g0VVLVOnXB2EKcZTl3KPrdAXrUp
 XtmouERga6OfAUSJyZmTSV136mL+opRB2WFFVeIzjQfLmyItDGbSX/YPS8oJ2pow
 v7630F+CMrd4aKpqqtnAhfWpGqd0Xw7cYfZ9MKTJmZPmGzf9a1fQFpmgZosD4Dh3
 7MrhboU4TUt9PdXESA7CmE7LkTp99ghfj5/ysKrSV5h3HsH2RbLxJk91Rx3vmAWD
 u1xWRYL+GYLH6ZwOLvM1iqBrrLN3mUyrx98SaMgoXuqNzmQmgz9LPeA0Gt09FJbo
 uj+ebg4dRwuThjni4xQhl3zL2RQy7nlTDFDdKOz/XoiYk2NUVksss+sxGjNarHj0
 b5pCD4HOp57OreGExaOARpBRah5HSNdQtBRsIOnbyEq6f/e1LsIY23Z9nNF0deGO
 sZzgsbnsn+zg8bC6T/Gk4UY6XdCcgaS3SL04SVKAE3lO6A4C/Awo8DgD9Bl1zpC1
 HQlNkl5fBg==
 =iucY
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20190125' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "A collection of fixes for this release. This contains:

   - Silence sparse rightfully complaining about non-static wbt
     functions (Bart)

   - Fixes for the zoned comments/ioctl documentation (Damien)

   - direct-io fix that's been lingering for a while (Ernesto)

   - cgroup writeback fix (Tejun)

   - Set of NVMe patches for nvme-rdma/tcp (Sagi, Hannes, Raju)

   - Block recursion tracking fix (Ming)

   - Fix debugfs command flag naming for a few flags (Jianchao)"

* tag 'for-linus-20190125' of git://git.kernel.dk/linux-block:
  block: Fix comment typo
  uapi: fix ioctl documentation
  blk-wbt: Declare local functions static
  blk-mq: fix the cmd_flag_name array
  nvme-multipath: drop optimization for static ANA group IDs
  nvmet-rdma: fix null dereference under heavy load
  nvme-rdma: rework queue maps handling
  nvme-tcp: fix timeout handler
  nvme-rdma: fix timeout handler
  writeback: synchronize sync(2) against cgroup writeback membership switches
  block: cover another queue enter recursion via BIO_QUEUE_ENTERED
  direct-io: allow direct writes to empty inodes
2019-01-26 12:42:41 -08:00
Hannes Reinecke
78a61cd42a nvme-multipath: drop optimization for static ANA group IDs
Bit 6 in the ANACAP field is used to indicate that the ANA group ID
doesn't change while the namespace is attached to the controller.
There is an optimisation in the code to only allocate space
for the ANA group header, as the namespace list won't change and
hence would not need to be refreshed.
However, this optimisation was never carried over to the actual
workflow, which always assumes that the buffer is large enough
to hold the ANA header _and_ the namespace list.
So drop this optimisation and always allocate enough space.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-23 17:16:59 -07:00
Sagi Grimberg
b1064d3e33 nvme-rdma: rework queue maps handling
If the device supports less queues than provided (if the device has less
completion vectors), we might hit a bug due to the fact that we ignore
that in nvme_rdma_map_queues (we override the maps nr_queues with user
opts).

Instead, keep track of how many default/read/poll queues we actually
allocated (rather than asked by the user) and use that to assign our
queue mappings.

Fixes: b65bb777ef (" nvme-rdma: support separate queue maps for read and write")
Reported-by: Saleem, Shiraz <shiraz.saleem@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-23 17:16:59 -07:00
Sagi Grimberg
39d5775746 nvme-tcp: fix timeout handler
Currently, we have several problems with the timeout
handler:
1. If we timeout on the controller establishment flow, we will hang
because we don't execute the error recovery (and we shouldn't because
the create_ctrl flow needs to fail and cleanup on its own)
2. We might also hang if we get a disconnet on a queue while the
controller is already deleting. This racy flow can cause the controller
disable/shutdown admin command to hang.

We cannot complete a timed out request from the timeout handler without
mutual exclusion from the teardown flow (e.g. nvme_rdma_error_recovery_work).
So we serialize it in the timeout handler and teardown io and admin
queues to guarantee that no one races with us from completing the
request.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-23 17:16:59 -07:00
Sagi Grimberg
4c174e6366 nvme-rdma: fix timeout handler
Currently, we have several problems with the timeout
handler:
1. If we timeout on the controller establishment flow, we will hang
because we don't execute the error recovery (and we shouldn't because
the create_ctrl flow needs to fail and cleanup on its own)
2. We might also hang if we get a disconnet on a queue while the
controller is already deleting. This racy flow can cause the controller
disable/shutdown admin command to hang.

We cannot complete a timed out request from the timeout handler without
mutual exclusion from the teardown flow (e.g. nvme_rdma_error_recovery_work).
So we serialize it in the timeout handler and teardown io and admin
queues to guarantee that no one races with us from completing the
request.

Reported-by: Jaesoo Lee <jalee@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-23 17:16:59 -07:00
Linus Torvalds
0facb89245 for-linus-20190118
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlxCLxcQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpmBID/0SpsLsnoeki37Wei0HzBZeRdFGIXFfg1XB
 +s66nJUL+igHIIwH+EO7Ct+vw8jfCWZfasI+ThGtkgRg88bqJO3HEbE1MQY8KlTh
 Dm9LRHH9Jp4OYbbag2RXammaD4XKi980fIyVTp9R0keszQroSUTXXS40fksszTtB
 K4LEBiAxGJgin60W9RgbbK93n/pw17U5Wu2WID6ZfVxZYVzpXl0iIQXUz1zmnpy5
 3aP7ORqJGnH7WQiV57LMpq5/QFBoKrfUsjtxpnJ+i2694sLYX8NCqnH82SUliSU+
 NN3Vxic1ozmZEADFKg4dfmgVqqyJaoVCRvsA5i5YjqG9m65G/S7/BmsaWbq5Ixf2
 N1KQr/36vkYGYWqtb2rbYRVf9XfbxWKp1wuSaX3N1Vh1/Ee180qD68+wpa/0mtkh
 2dXV6CQP9De4BXH6mxj3Yr7LI2a2r7KLYhULZCAvLMvDtsBrPyHAmRRgXxaLlKL8
 QMFYYO/1wjlzpGnn4IxP8sDMH6N2shtZHAmC2n9TM2gmk5tA0OmajJJ4uxNbtyhw
 +ZlDco3aw/x0v2onjlbTiPij2cReRgwN+tVaEvn0dj3uxaisCru60iYdnr9FX4nL
 g3NxNUJbUhl6YMwvNMI0GKfz1IILzjPgJIhDqSHvYhtCA1w0DMVXs3Qv4HXoVh3m
 ffdby6/Yqw==
 =5yJp
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20190118' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - block size setting fixes for loop/nbd (Jan Kara)

 - md bio_alloc_mddev() cleanup (Marcos)

 - Ensure we don't lose the REQ_INTEGRITY flag (Ming)

 - Two NVMe fixes by way of Christoph:
    - Fix NVMe IRQ calculation (Ming)
    - Uninitialized variable in nvmet-tcp (Sagi)

 - BFQ comment fix (Paolo)

 - License cleanup for recently added blk-mq-debugfs-zoned (Thomas)

* tag 'for-linus-20190118' of git://git.kernel.dk/linux-block:
  block: Cleanup license notice
  nvme-pci: fix nvme_setup_irqs()
  nvmet-tcp: fix uninitialized variable access
  block: don't lose track of REQ_INTEGRITY flag
  blockdev: Fix livelocks on loop device
  nbd: Use set_blocksize() to set device blocksize
  md: Make bio_alloc_mddev use bio_alloc_bioset
  block, bfq: fix comments on __bfq_deactivate_entity
2019-01-20 09:12:50 +12:00
Ming Lei
c45b1fa243 nvme-pci: fix nvme_setup_irqs()
When -ENOSPC is returned from pci_alloc_irq_vectors_affinity(),
we still try to allocate multiple irq vectors again, so irq queues
covers the admin queue actually. But we don't consider that, then
number of the allocated irq vector may be same with sum of
io_queues[HCTX_TYPE_DEFAULT] and io_queues[HCTX_TYPE_READ], this way
is obviously wrong, and finally breaks nvme_pci_map_queues(), and
warning from pci_irq_get_affinity() is triggered.

IRQ queues should cover admin queues, this patch makes this
point explicitely in nvme_calc_io_queues().

We got severl boot failure internal report on aarch64, so please
consider to fix it in v4.20.

Fixes: 6451fe73fa ("nvme: fix irq vs io_queue calculations")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Tested-by: fin4478 <fin4478@hotmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-16 09:44:28 -07:00
Linus Torvalds
b8c3b8992f for-linus-20190112
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlw6VoIQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgptaHD/wLDzVDkptMK7Essk73vjBaz6w2MYl/Bvb4
 2YtktHlrCB8S/0Xze7t5tvsGxym2fv1ymsV48Sdnl+lW+2tf6e3LJEfAp7yLWoEs
 K6UdFkhYK7fdKx84dHIcslvyY0id9qjRteRiFHrd/023N4FZDkbKOtfeRBZi7e4Q
 yTP2FzQEnSV5a0ngh9J+xxHUXF7a4LfeVj9ddhwcBIAeJ1MUfLTaHxRR8Qc3nw3F
 7nGvLHu2vrWK3EDBBA/PsBqMJhsoW2AV0G7Xbv6Ai7e14GWVYr+MGjIYC41yIlRm
 46oBoveIRI5dkhR5/mO685vB774yAN6RExOP9Xlg/cA3XxbpdWP9uDtfRbuvo64O
 WOm2btWKjxOldCao4wClaJARcjSyocPTKwlWQE7OfVcN+3WbsD26sd+GdCvrSG7a
 +E3xcyPUURN1e7t4YyDGDL1wKBPuZP6jvcGUfoqLMr3O12aIjlAiJFPhm1Hqv+dA
 KaJjsCwShSDJfeWLvP/Lt1KtVoBinrKA/5qkkB7zJA7I+qNP7bAR25BtuXKo1U7t
 5vRGk7rfvdtsE5oH7y1we9AxoA7jD6t5Z60mPRdkVaXzi9SSw3th/YAuUhcU2g0V
 0I2icAGQI8bbZzFgJe+d4f1L57EQ/7fWfcwENNLkLu57E/gZlYYqt//tSrF8W0ur
 wcthJWqOZg==
 =OFV3
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20190112' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - NVMe pull request from Christoph, with little fixes all over the map

 - Loop caching fix for offset/bs change (Jaegeuk Kim)

 - Block documentation tweaks (Jeff, Jon, Weiping, John)

 - null_blk zoned tweak (John)

 - ahch mvebu suspend/resume support. Should have gone into the merge
   window, but there was some confusion on which tree had it. (Miquel)

* tag 'for-linus-20190112' of git://git.kernel.dk/linux-block: (22 commits)
  ata: ahci: mvebu: request PHY suspend/resume for Armada 3700
  ata: ahci: mvebu: add Armada 3700 initialization needed for S2RAM
  ata: ahci: mvebu: do Armada 38x configuration only on relevant SoCs
  ata: ahci: mvebu: remove stale comment
  ata: libahci_platform: comply to PHY framework
  loop: drop caches if offset or block_size are changed
  block: fix kerneldoc comment for blk_attempt_plug_merge()
  nvme: don't initlialize ctrl->cntlid twice
  nvme: introduce NVME_QUIRK_IGNORE_DEV_SUBNQN
  nvme: pad fake subsys NQN vid and ssvid with zeros
  nvme-multipath: zero out ANA log buffer
  nvme-fabrics: unset write/poll queues for discovery controllers
  nvme-tcp: don't ask if controller is fabrics
  nvme-tcp: remove dead code
  nvme-pci: fix out of bounds access in nvme_cqe_pending
  nvme-pci: rerun irq setup on IO queue init errors
  nvme-pci: use the same attributes when freeing host_mem_desc_bufs.
  nvme-pci: fix the wrong setting of nr_maps
  block: doc: add slice_idle_us to bfq documentation
  block: clarify documentation for blk_{start|finish}_plug
  ...
2019-01-12 13:40:51 -08:00
Andrey Smirnov
b8a38ea64d nvme: don't initlialize ctrl->cntlid twice
ctrl->cntlid will already be initialized from id->cntlid for
non-NVME_F_FABRICS controllers few lines below. For NVME_F_FABRICS
controllers this field should already be initialized, otherwise the
check

	if (ctrl->cntlid != le16_to_cpu(id->cntlid))

below will always be a no-op.

Signed-off-by: Andrey Smirnov <andrew.smirnov@gmail.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-01-09 13:47:08 -05:00
James Dingwall
6299358d19 nvme: introduce NVME_QUIRK_IGNORE_DEV_SUBNQN
If a device provides an NQN it is expected to be globally unique.
Unfortunately some firmware revisions for Intel 760p/Pro 7600p devices did
not satisfy this requirement.  In these circumstances if a system has >1
affected device then only one device is enabled.  If this quirk is enabled
then the device supplied subnqn is ignored and we fallback to generating
one as if the field was empty.  In this case we also suppress the version
check so we don't print a warning when the quirk is enabled.

Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: James Dingwall <james@dingwall.me.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-01-09 13:47:08 -05:00
Keith Busch
3da584f571 nvme: pad fake subsys NQN vid and ssvid with zeros
We need to preserve the leading zeros in the vid and ssvid when generating
a unique NQN. Truncating these may lead to naming collisions.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-01-09 13:47:06 -05:00
Hannes Reinecke
c7055fd15f nvme-multipath: zero out ANA log buffer
When nvme_init_identify() fails the ANA log buffer is deallocated
but _not_ set to NULL. This can cause double free oops when this
controller is deleted without ever being reconnected.

Signed-off-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-01-09 13:47:06 -05:00
Sagi Grimberg
9846ac0143 nvme-fabrics: unset write/poll queues for discovery controllers
Even if user-space sent it to us, it got it wrong so lets
help by disallowing it.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-01-09 13:47:06 -05:00
Sagi Grimberg
e85037a2e9 nvme-tcp: don't ask if controller is fabrics
For sure we are a fabric driver.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-01-09 13:47:06 -05:00
Sagi Grimberg
e9c2edc098 nvme-tcp: remove dead code
We should never touch the opal device from the transport driver.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-01-09 13:47:06 -05:00
Hongbo Yao
dcca166272 nvme-pci: fix out of bounds access in nvme_cqe_pending
There is an out of bounds array access in nvme_cqe_peding().

When enable irq_thread for nvme interrupt, there is racing between the
nvmeq->cq_head updating and reading.

nvmeq->cq_head is updated in nvme_update_cq_head(), if nvmeq->cq_head
equals nvmeq->q_depth and before its value set to zero, nvme_cqe_pending()
uses its value as an array index, the index will be out of bounds.

Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
[hch: slight coding style update]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-01-09 13:47:05 -05:00
Keith Busch
8fae268b40 nvme-pci: rerun irq setup on IO queue init errors
If the driver is unable to create a subset of IO queues for any reason,
the read/write and polled queue sets will not match the actual allocated
hardware contexts. This leaves gaps in the CPU affinity mappings and
causes the following kernel panic after blk_mq_map_queue_type() returns
a NULL hctx.

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000198
  #PF error: [normal kernel read fault]
  PGD 0 P4D 0
  Oops: 0000 [#1] SMP
  CPU: 64 PID: 1171 Comm: kworker/u259:1 Not tainted 4.20.0+ #241
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
  Workqueue: nvme-wq nvme_scan_work [nvme_core]
  RIP: 0010:blk_mq_init_allocated_queue+0x2d9/0x440
  RSP: 0018:ffffb1bf0abc3cd0 EFLAGS: 00010286
  RAX: 000000000000001f RBX: ffff8ea744cf0718 RCX: 0000000000000000
  RDX: 0000000000000002 RSI: 000000000000007c RDI: ffffffff9109a820
  RBP: ffff8ea7565f7008 R08: 000000000000001f R09: 000000000000003f
  R10: ffffb1bf0abc3c00 R11: 0000000000000000 R12: 000000000001d008
  R13: ffff8ea7565f7008 R14: 000000000000003f R15: 0000000000000001
  FS:  0000000000000000(0000) GS:ffff8ea757200000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000198 CR3: 0000000013058000 CR4: 00000000000006e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   blk_mq_init_queue+0x35/0x60
   nvme_validate_ns+0xc6/0x7c0 [nvme_core]
   ? nvme_identify_ctrl.isra.56+0x7e/0xc0 [nvme_core]
   nvme_scan_work+0xc8/0x340 [nvme_core]
   ? __wake_up_common+0x6d/0x120
   ? try_to_wake_up+0x55/0x410
   process_one_work+0x1e9/0x3d0
   worker_thread+0x2d/0x3d0
   ? process_one_work+0x3d0/0x3d0
   kthread+0x111/0x130
   ? kthread_park+0x90/0x90
   ret_from_fork+0x1f/0x30
  Modules linked in: nvme nvme_core serio_raw
  CR2: 0000000000000198

Fix by re-running the interrupt vector setup from scratch using a reduced
count that may be successful until the created queues matches the irq
affinity plus polling queue sets.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-01-09 13:47:02 -05:00
Liviu Dudau
cc667f6d5d nvme-pci: use the same attributes when freeing host_mem_desc_bufs.
When using HMB the PCIe host driver allocates host_mem_desc_bufs using
dma_alloc_attrs() but frees them using dma_free_coherent(). Use the
correct dma_free_attrs() function to free the buffers.

Signed-off-by: Liviu Dudau <liviu@dudau.co.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-01-09 13:47:02 -05:00
Jianchao Wang
c61e678f30 nvme-pci: fix the wrong setting of nr_maps
We only set the nr_maps to 3 if poll queues are supported.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-01-09 13:47:02 -05:00
Luis Chamberlain
750afb08ca cross-tree: phase out dma_zalloc_coherent()
We already need to zero out memory for dma_alloc_coherent(), as such
using dma_zalloc_coherent() is superflous. Phase it out.

This change was generated with the following Coccinelle SmPL patch:

@ replace_dma_zalloc_coherent @
expression dev, size, data, handle, flags;
@@

-dma_zalloc_coherent(dev, size, handle, flags)
+dma_alloc_coherent(dev, size, handle, flags)

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
[hch: re-ran the script on the latest tree]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-01-08 07:58:37 -05:00
yupeng
604c01d567 nvme-pci: trace SQ status on completions
Export the disk name, queue id, sq_head, sq_tail to a trace event in
completion handling.

Usage example:

cd /sys/kernel/debug/tracing/events/nvme/nvme_sq

echo 'disk=="nvme1n1"' > filter

echo 1 > enable

cat /sys/kernel/debug/tracing/trace_pipe

Signed-off-by: yupeng <yupeng0921@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Keith Busch <keith.busch@intel.com>
[hch: slight formatting tweaks, use standard nvme tracepoint
 conventions]
Signed-off-by: Christoph Hellwig <hch@lst.de>

wip
2018-12-19 08:35:36 +01:00
Sagi Grimberg
ff8519f9e9 nvme-rdma: implement polling queue map
When passed with nr_poll_queues setup additional queues with cq polling
context IB_POLL_DIRECT (no interrupts) and make sure to set
QUEUE_FLAG_POLL on the connect_q. In addition add the third queue
mapping for polling queues.

nvmf connect on this queue is polled for like all other requests so make
nvmf_connect_io_queue poll for polling queues.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-12-18 17:50:49 +01:00
Sagi Grimberg
89d43802b0 nvme-fabrics: allow user to pass in nr_poll_queues
This argument will specify how many polling I/O queues to connect when
creating the controller. These I/O queues will host I/O that is set with
REQ_HIPRI.

Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-12-18 17:50:49 +01:00
Sagi Grimberg
26c682274e nvme-fabrics: allow nvmf_connect_io_queue to poll
Preparation for polling support for fabrics. Polling support
means that our completion queues are not generating any interrupts
which means we need to poll for the nvmf io queue connect as well.

Reviewed by Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-12-18 17:50:48 +01:00
Sagi Grimberg
6287b51c77 nvme-core: optionally poll sync commands
Pass poll bool to indicate that we need it to poll. This prepares us for
polling support in nvmf since connect is an I/O that will be queued
and has to be polled in order to complete. If poll is passed,
we call nvme_execute_rq_polled which sends the requests and polls
for its completion.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-12-18 17:50:48 +01:00
Colin Ian King
56a77d26d6 nvme-tcp: fix spelling mistake "attepmpt" -> "attempt"
There is a spelling mistake in a dev_info message, fix it.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-12-18 17:50:47 +01:00
Christoph Hellwig
a7273d4023 nvme-tcp: fix endianess annotations
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2018-12-18 17:50:46 +01:00
Christoph Hellwig
91a509f8b7 nvme-pci: refactor nvme_poll_irqdisable to make sparse happy
By duplicating the nvme_process_cq in both branches we keep the
sparse lock context checking happy, so do it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2018-12-18 17:50:45 +01:00
Christoph Hellwig
ed92ad37e8 nvme-pci: only set nr_maps to 2 if poll queues are supported
The block layer now enables polling support on a queue if nr_maps
includes the poll map, so we should only set that if we actually
support poll queues.

Fixes:  6544d229bf ("block: enable polling by default if a poll map is initalized")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2018-12-18 17:50:44 +01:00
Christoph Hellwig
7e849dd9cf nvme-pci: don't share queue maps
Now that the block layer checks if a queue map has any queues inside
it there is no more reason to duplicate the maps for the non-default
types.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-17 05:44:45 -07:00