Commit Graph

6417 Commits

Author SHA1 Message Date
Michal Kubecek
37eb86c450 mlx5: avoid 64-bit division
Commit 25c13324d0 ("IB/mlx5: Add steering SW ICM device memory type")
breaks i386 build by introducing three 64-bit divisions. As the divisor is
MLX5_SW_ICM_BLOCK_SIZE() which is always a power of 2, we can replace the
division with bit operations.

Fixes: 25c13324d0 ("IB/mlx5: Add steering SW ICM device memory type")
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-29 13:03:21 -03:00
Dennis Dalessandro
5f5e4eb4fb IB/hfi1: Remove extra brackets from an if
A recent patch to hfi1 left behind a checkpatch error.

Fixes: fb24ea52f7 ("drivers: Remove explicit invocations of mmiowb()")
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-29 12:59:48 -03:00
Kamenee Arumugam
97736f36db IB/hfi1: Validate page aligned for a given virtual address
User applications can register memory regions for TID buffers that are not
aligned on page boundaries. Hfi1 is expected to pin those pages in memory
and cache the pages with mmu_rb. The rb tree will fail to insert pages
that are not aligned correctly.

Validate whether a given virtual address is page aligned before pinning.

Fixes: 7e7a436ecb ("staging/hfi1: Add TID entry program function body")
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Kamenee Arumugam <kamenee.arumugam@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-29 12:56:05 -03:00
Mike Marciniszyn
35164f5259 IB/{qib, hfi1, rdmavt}: Correct ibv_devinfo max_mr value
The command 'ibv_devinfo -v' reports 0 for max_mr.

Fix by assigning the query values after the mr lkey_table has been built
rather than early on in the driver.

Fixes: 7b1e2099ad ("IB/rdmavt: Move memory registration into rdmavt")
Reviewed-by: Josh Collier <josh.d.collier@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-29 12:56:05 -03:00
Mike Marciniszyn
6d517353c7 IB/hfi1: Insure freeze_work work_struct is canceled on shutdown
By code inspection, the freeze_work is never canceled.

Fix by adding a cancel_work_sync in the shutdown path to insure it is no
longer running.

Fixes: 7724105686 ("IB/hfi1: add driver files")
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-29 12:56:05 -03:00
John Hubbard
ea99697458 RDMA: Convert put_page() to put_user_page*()
For infiniband code that retains pages via get_user_pages*(), release
those pages via the new put_user_page(), or put_user_pages*(), instead of
put_page()

This is a tiny part of the second step of fixing the problem described in
[1]. The steps are:

1) Provide put_user_page*() routines, intended to be used for releasing
   pages that were pinned via get_user_pages*().

2) Convert all of the call sites for get_user_pages*(), to invoke
   put_user_page*(), instead of put_page(). This involves dozens of call
   sites, and will take some time.

3) After (2) is complete, use get_user_pages*() and put_user_page*() to
   implement tracking of these pages. This tracking will be separate from
   the existing struct page refcounting.

4) Use the tracking and identification of these pages, to implement
   special handling (especially in writeback paths) when the pages are
   backed by a filesystem. Again, [1] provides details as to why that is
   desirable.

[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Acked-by: Jason Gunthorpe <jgg@mellanox.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-27 20:11:11 -03:00
YueHaibing
cfcc048ca7 IB/hfi1: Remove set but not used variables 'offset' and 'fspsn'
Fixes gcc '-Wunused-but-set-variable' warning:

drivers/infiniband/hw/hfi1/tid_rdma.c: In function tid_rdma_rcv_error:
drivers/infiniband/hw/hfi1/tid_rdma.c:2029:7: warning: variable offset set but not used [-Wunused-but-set-variable]
drivers/infiniband/hw/hfi1/tid_rdma.c: In function hfi1_rc_rcv_tid_rdma_ack:
drivers/infiniband/hw/hfi1/tid_rdma.c:4555:35: warning: variable fspsn set but not used [-Wunused-but-set-variable]

'offset' is never used since introduction in
commit d0d564a1ca ("IB/hfi1: Add functions to receive TID RDMA READ request")

'fspsn' is never used since introduciotn in
commit 9e93e967f7 ("IB/hfi1: Add a function to receive TID RDMA ACK packet")

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-27 20:09:16 -03:00
Lijun Ou
2a3d923f87 RDMA/hns: Replace magic numbers with #defines
This patch makes the code more readable by removing magic numbers.

Signed-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-27 17:31:00 -03:00
Lang Cheng
669cefb654 RDMA/hns: Remove jiffies operation in disable interrupt context
In some functions, the jiffies operation is unnecessary, and we can
control delay using mdelay and udelay functions only.  Especially, in
hns_roce_v1_clear_hem, the function calls spin_lock_irqsave, the context
disables interrupt, so we can not use jiffies and msleep functions.

Signed-off-by: Lang Cheng <chenglang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-27 17:28:39 -03:00
Lang Cheng
780f33962e RDMA/hns: Move spin_lock_irqsave to the correct place
When hip08 set gid, it will call spin_unlock_bh when send cmq.  if main.ko
call spin_lock_irqsave firstly, and the kernel is before commit
f71b74bca6 ("irq/softirqs: Use lockdep to assert IRQs are
disabled/enabled"), it will cause WARN_ON_ONCE because of calling
spin_unlock_bh in disable context.

In fact, the spin_lock_irqsave in main.ko is only used for hip06, and
should be placed in hns_roce_hw_v1.c. hns_roce_hw_v2.c uses its own
spin_unlock_bh and does not need main.ko manage spin_lock.

Signed-off-by: Lang Cheng <chenglang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-27 17:27:59 -03:00
Lijun Ou
0502849d0b RDMA/hns: Update CQE specifications
According to hip08 UM, the maximum number of CQEs supported by each CQ is
4M.

Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-27 17:27:59 -03:00
Yixian Liu
8ffb813255 RDMA/hns: Remove unnecessary print message in aeq
There is no need to print when communication is established, especially
while lots of qp used by application.

Signed-off-by: Yixian Liu <liuyixian@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-27 17:27:59 -03:00
Nirranjan Kirubaharan
f70baa7ee3 iw_cxgb4: Fix qpid leak
Add await in destroy_qp() so that all references to qp are dereferenced
and qp is freed in destroy_qp() itself.  This ensures freeing of all QPs
before invocation of dealloc_ucontext(), which prevents loss of in use
qpids stored in the ucontext.

Signed-off-by: Nirranjan Kirubaharan <nirranjan@chelsio.com>
Reviewed-by: Potnuri Bharat Teja <bharat@chelsio.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-27 14:58:24 -03:00
Leon Romanovsky
cae626b978 RDMA/cxgb4: Don't expose DMA addresses
Change unconditional print of DMA address to be printed with special
printk format type specifier.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-27 14:32:17 -03:00
Leon Romanovsky
34d568930b RDMA/cxgb4: Use sizeof() notation
Convert various sizeof call sites to be written in standard format
sizeof().

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-27 14:32:17 -03:00
Leon Romanovsky
a80287c813 RDMA/cxgb3: Delete and properly mark unimplemented resize CQ function
Resize CQ implementation was guarded by undeclared "notyet" define while
cxgb3 was added to the kernel. Twelve years later, this call is still
unimplemented, so safely delete it and fix improper return error code when
.resize_cq() is not implemented.

Fixes: b038ced7b3 ("RDMA/cxgb3: Add driver for Chelsio T3 RNIC")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-27 14:24:04 -03:00
Leon Romanovsky
0ddf8f6267 RDMA/cxgb3: Don't expose DMA addresses
DMA addresses like all other kernel addresses should be printed with
special %p* formatter. It is needed to allow control of exposure of such
information through a dedicated knob.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-27 14:24:03 -03:00
Leon Romanovsky
d34d37d5a1 RDMA/cxgb3: Use sizeof() notation instead of plain sizeof
sizeof(a), sizeof a and sizeof (a) are all valid notations, but first is
more readable format recommended by checkpatch.pl.

Let's canonize it in cxgb3 drivers, so latter patches won't emit
checkpatch warnings. As part of this change, a redundant memset() was
removed.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-27 14:15:26 -03:00
Michal Kalderon
3576e99e08 qed*: Add iWARP 100g support
Add iWARP engine affinity setting for supporting iWARP over 100g.
iWARP cannot be distinguished by the LLH from L2, hence the
engine division will affect L2 as well. For this reason we add
a parameter to devlink to determine the engine division.

Signed-off-by: Ariel Elior <ariel.elior@marvell.com>
Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-26 13:04:12 -07:00
Michal Kalderon
443473d2f3 qedr: Change the MSI-X vectors selection to be based on affined engine
Use the msix vectors of the affined hwfn and not the
leading one.

Signed-off-by: Ariel Elior <ariel.elior@marvell.com>
Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-26 13:04:11 -07:00
Michal Kalderon
08eb1fb0f7 qed*: Change hwfn used for sb initialization
When initializing status blocks use the affined hwfn
instead of the leading one for RDMA / Storage

Signed-off-by: Ariel Elior <ariel.elior@marvell.com>
Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-26 13:04:11 -07:00
Leon Romanovsky
62a38e704d RDMA/efa: Remove check that prevents destroy of resources in error flows
Drivers cannot check the udata for validity when doing destroy as there
will be no way to report this error back to the uverbs.

Since udata is new for destroy no driver should start to use it - instead
drivers should opt for the ioctl interface and define it in a way where it
cannot fail due to incorrect data.

Remove the checks on udata construction so EFA is consistent with
everything else.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Acked-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-21 21:14:28 -03:00
Linus Torvalds
2c1212de6f SPDX update for 5.2-rc2, round 1
Here are series of patches that add SPDX tags to different kernel files,
 based on two different things:
   - SPDX entries are added to a bunch of files that we missed a year ago
     that do not have any license information at all.
 
     These were either missed because the tool saw the MODULE_LICENSE()
     tag, or some EXPORT_SYMBOL tags, and got confused and thought the
     file had a real license, or the files have been added since the last
     big sweep, or they were Makefile/Kconfig files, which we didn't
     touch last time.
 
   - Add GPL-2.0-only or GPL-2.0-or-later tags to files where our scan
     tools can determine the license text in the file itself.  Where this
     happens, the license text is removed, in order to cut down on the
     700+ different ways we have in the kernel today, in a quest to get
     rid of all of these.
 
 These patches have been out for review on the linux-spdx@vger mailing
 list, and while they were created by automatic tools, they were
 hand-verified by a bunch of different people, all whom names are on the
 patches are reviewers.
 
 The reason for these "large" patches is if we were to continue to
 progress at the current rate of change in the kernel, adding license
 tags to individual files in different subsystems, we would be finished
 in about 10 years at the earliest.
 
 There will be more series of these types of patches coming over the next
 few weeks as the tools and reviewers crunch through the more "odd"
 variants of how to say "GPLv2" that developers have come up with over
 the years, combined with other fun oddities (GPL + a BSD disclaimer?)
 that are being unearthed, with the goal for the whole kernel to be
 cleaned up.
 
 These diffstats are not small, 3840 files are touched, over 10k lines
 removed in just 24 patches.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXOP8uw8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ynmGQCgy3evqzleuOITDpuWaxewFdHqiJYAnA7KRw4H
 1KwtfRnMtG6dk/XaS7H7
 =O9lH
 -----END PGP SIGNATURE-----

Merge tag 'spdx-5.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull SPDX update from Greg KH:
 "Here is a series of patches that add SPDX tags to different kernel
  files, based on two different things:

   - SPDX entries are added to a bunch of files that we missed a year
     ago that do not have any license information at all.

     These were either missed because the tool saw the MODULE_LICENSE()
     tag, or some EXPORT_SYMBOL tags, and got confused and thought the
     file had a real license, or the files have been added since the
     last big sweep, or they were Makefile/Kconfig files, which we
     didn't touch last time.

   - Add GPL-2.0-only or GPL-2.0-or-later tags to files where our scan
     tools can determine the license text in the file itself. Where this
     happens, the license text is removed, in order to cut down on the
     700+ different ways we have in the kernel today, in a quest to get
     rid of all of these.

  These patches have been out for review on the linux-spdx@vger mailing
  list, and while they were created by automatic tools, they were
  hand-verified by a bunch of different people, all whom names are on
  the patches are reviewers.

  The reason for these "large" patches is if we were to continue to
  progress at the current rate of change in the kernel, adding license
  tags to individual files in different subsystems, we would be finished
  in about 10 years at the earliest.

  There will be more series of these types of patches coming over the
  next few weeks as the tools and reviewers crunch through the more
  "odd" variants of how to say "GPLv2" that developers have come up with
  over the years, combined with other fun oddities (GPL + a BSD
  disclaimer?) that are being unearthed, with the goal for the whole
  kernel to be cleaned up.

  These diffstats are not small, 3840 files are touched, over 10k lines
  removed in just 24 patches"

* tag 'spdx-5.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (24 commits)
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 25
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 24
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 23
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 22
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 21
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 20
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 19
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 18
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 17
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 15
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 14
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 13
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 12
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 11
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 10
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 9
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 7
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 5
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 4
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 3
  ...
2019-05-21 12:33:38 -07:00
Leon Romanovsky
dab99af99c RDMA/nes: Remove second wait queue initialization call
The same wait queue is initialized a couple of lines above.

Fixes: 3c2d774cad ("RDMA/nes: Add a driver for NetEffect RNICs")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-21 15:50:53 -03:00
Leon Romanovsky
3bb58cfe07 RDMA/i40iw: Remove useless NULL checks
There is no need to check existence of structures to be destroyed.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-21 15:50:53 -03:00
Leon Romanovsky
269c97fd48 RDMA/nes: Remove useless NULL checks
The destroy functions are always called with relevant structs, there is no
need to check their existence.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-21 15:50:53 -03:00
Yuval Shaia
8ce0048f76 IB/mlx4: Delete unused func arg
The function argument virt_addr is not in use - delete it.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-21 15:27:25 -03:00
Leon Romanovsky
619122be3d RDMA/hns: Fix PD memory leak for internal allocation
free_pd is allocated internally by the driver hence needs to be freed
internally too or it leaks.

Fixes: 21a428a019 ("RDMA: Handle PD allocations by IB/core")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Acked-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-21 15:25:37 -03:00
Jason Gunthorpe
d2183c6f19 RDMA/umem: Move page_shift from ib_umem to ib_odp_umem
This value has always been set to PAGE_SHIFT in the core code, the only
thing that does differently was the ODP path. Move the value into the ODP
struct and still use it for ODP, but change all the non-ODP things to just
use PAGE_SHIFT/PAGE_SIZE/PAGE_MASK directly.

Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2019-05-21 15:23:24 -03:00
Sagiv Ozeri
69054666df RDMA/qedr: Fix incorrect device rate.
Use the correct enum value introduced in commit 12113a35ad ("IB/core:
Add HDR speed enum") Prior to this change a 50Gbps port would show 40Gbps.

This patch also cleaned up the redundant redefiniton of ib speeds for
qedr.

Fixes: 12113a35ad ("IB/core: Add HDR speed enum")
Signed-off-by: Sagiv Ozeri <sagiv.ozeri@marvell.com>
Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-21 15:04:53 -03:00
Thomas Gleixner
ec8f24b7fa treewide: Add SPDX license identifier - Makefile/Kconfig
Add SPDX license identifiers to all Make/Kconfig files which:

 - Have no license information of any form

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

  GPL-2.0-only

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21 10:50:46 +02:00
Linus Torvalds
78e0365184 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:1) Use after free in __dev_map_entry_free(), from Eric Dumazet.

 1) Use after free in __dev_map_entry_free(), from Eric Dumazet.

 2) Fix TCP retransmission timestamps on passive Fast Open, from Yuchung
    Cheng.

 3) Orphan NFC, we'll take the patches directly into my tree. From
    Johannes Berg.

 4) We can't recycle cloned TCP skbs, from Eric Dumazet.

 5) Some flow dissector bpf test fixes, from Stanislav Fomichev.

 6) Fix RCU marking and warnings in rhashtable, from Herbert Xu.

 7) Fix some potential fib6 leaks, from Eric Dumazet.

 8) Fix a _decode_session4 uninitialized memory read bug fix that got
    lost in a merge. From Florian Westphal.

 9) Fix ipv6 source address routing wrt. exception route entries, from
    Wei Wang.

10) The netdev_xmit_more() conversion was not done %100 properly in mlx5
    driver, fix from Tariq Toukan.

11) Clean up botched merge on netfilter kselftest, from Florian
    Westphal.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (74 commits)
  of_net: fix of_get_mac_address retval if compiled without CONFIG_OF
  net: fix kernel-doc warnings for socket.c
  net: Treat sock->sk_drops as an unsigned int when printing
  kselftests: netfilter: fix leftover net/net-next merge conflict
  mlxsw: core: Prevent reading unsupported slave address from SFP EEPROM
  mlxsw: core: Prevent QSFP module initialization for old hardware
  vsock/virtio: Initialize core virtio vsock before registering the driver
  net/mlx5e: Fix possible modify header actions memory leak
  net/mlx5e: Fix no rewrite fields with the same match
  net/mlx5e: Additional check for flow destination comparison
  net/mlx5e: Add missing ethtool driver info for representors
  net/mlx5e: Fix number of vports for ingress ACL configuration
  net/mlx5e: Fix ethtool rxfh commands when CONFIG_MLX5_EN_RXNFC is disabled
  net/mlx5e: Fix wrong xmit_more application
  net/mlx5: Fix peer pf disable hca command
  net/mlx5: E-Switch, Correct type to u16 for vport_num and int for vport_index
  net/mlx5: Add meaningful return codes to status_to_err function
  net/mlx5: Imply MLXFW in mlx5_core
  Revert "tipc: fix modprobe tipc failed after switch order of device registration"
  vsock/virtio: free packets during the socket release
  ...
2019-05-20 08:21:07 -07:00
Parav Pandit
02f3afd975 net/mlx5: E-Switch, Correct type to u16 for vport_num and int for vport_index
To avoid any ambiguity between vport index and vport number,
rename functions that had vport, to vport_num or vport_index appropriately.

vport_num is u16 hence change mlx5_eswitch_index_to_vport_num() return
type to u16.

vport_index is an int in vport array. Hence change input type of vport
index in mlx5_eswitch_index_to_vport_num() to int.

Correct multiple eswitch representor interfaces use type u16 of
rep->vport as type int vport_index.

Send vport FW commands with correct eswitch u16 vport_num instead
host int vport_index.

Fixes: 5ae5162066 ("net/mlx5: E-Switch, Assign a different position for uplink rep and vport")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Vu Pham <vuhuong@mellanox.com>
Reviewed-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-05-17 13:16:47 -07:00
Linus Torvalds
5ac9433224 5.2 Merge Window second pull request
This is being sent to get a fix for the gcc 9.1 build warnings, and I've
 also pulled in some bug fix patches that were posted in the last two
 weeks.
 
 - Avoid the gcc 9.1 warning about overflowing a union member
 
 - Fix the wrong callback type for a single response netlink to doit
 
 - Bug fixes from more usage of the mlx5 devx interface
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAlzbYAsACgkQOG33FX4g
 mxpzYw/9HxKMpU5QmHIpV17sVV5SSepfWVQ6YmrNMG5BTBI8by0zj58fJ9TLuNu+
 OYMD6dS/baLeiN6jszec6zWufjUVfMU5aw1ja+iwF78fS8NmVXlrLz/xWmkLu4fi
 pBN3PCt90ziCnVXOlsn55dKAcgmiaRws+TzGjGGvQP9IYpfO6kyj8HIrP6im910E
 j41HcGrD1fMLy0js9Aq6OzMswbop8uFTV/UBp5onKASNPwAGlnigvjTKqnSlt+Vo
 rswc/h8uIz1jnuH1s8EfggFY7nGqxNmq9G/UNBo/86JcLI97SaYN9pqQJ+HcEtDR
 tJYoDr8PFDJcDaFpm0gbNK5pO9cS7X/I/NWZrdePywZAPAMFKXWgnUejLXVcPKd9
 EdkWyg7sJxPHoo6CXrNECu7t/57q3E3qOG93HnXt64pJqv9C9lUmpGrvdv7PBVRK
 6nVBysrkV0/27sBeZzul0teRbEqRii/RJ/iphE3w3hPx696Bi5uFzN/8M3tfavj1
 pBX7eLAevA+yPlN7+sZiefPjeP0jsvwlzNdrP+9CmB5iIlj0yNlmTvT2rbv+hte0
 0JTQvDilmC0e/W0KqQ6fGGfmPFBbHm/UDLu0h24qdw1qQXGOaDH6RRMslrtgNYNw
 Mkc++uIC6/KdiehEzolht87FH4sMJrd0DS540WVqJqje7K3jyY8=
 =Lo/s
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull more rdma updates from Jason Gunthorpe:
 "This is being sent to get a fix for the gcc 9.1 build warnings, and
  I've also pulled in some bug fix patches that were posted in the last
  two weeks.

   - Avoid the gcc 9.1 warning about overflowing a union member

   - Fix the wrong callback type for a single response netlink to doit

   - Bug fixes from more usage of the mlx5 devx interface"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
  net/mlx5: Set completion EQs as shared resources
  IB/mlx5: Verify DEVX general object type correctly
  RDMA/core: Change system parameters callback from dumpit to doit
  RDMA: Directly cast the sockaddr union to sockaddr
2019-05-14 20:56:31 -07:00
Ira Weiny
f3b4fdb18c IB/mthca: use the new FOLL_LONGTERM flag to get_user_pages_fast()
Use the new FOLL_LONGTERM to get_user_pages_fast() to protect against FS
DAX pages being mapped.

Link: http://lkml.kernel.org/r/20190328084422.29911-8-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190317183438.2057-8-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:46 -07:00
Ira Weiny
664b21e717 IB/qib: use the new FOLL_LONGTERM flag to get_user_pages_fast()
Use the new FOLL_LONGTERM to get_user_pages_fast() to protect against FS
DAX pages being mapped.

Link: http://lkml.kernel.org/r/20190328084422.29911-7-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190317183438.2057-7-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:46 -07:00
Ira Weiny
9fdf4aa156 IB/hfi1: use the new FOLL_LONGTERM flag to get_user_pages_fast()
Use the new FOLL_LONGTERM to get_user_pages_fast() to protect against FS
DAX pages being mapped.

[ira.weiny@intel.com: v3]
  Link: http://lkml.kernel.org/r/20190328084422.29911-6-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190328084422.29911-6-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190317183438.2057-6-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:46 -07:00
Ira Weiny
73b0140bf0 mm/gup: change GUP fast to use flags rather than a write 'bool'
To facilitate additional options to get_user_pages_fast() change the
singular write parameter to be gup_flags.

This patch does not change any functionality.  New functionality will
follow in subsequent patches.

Some of the get_user_pages_fast() call sites were unchanged because they
already passed FOLL_WRITE or 0 for the write parameter.

NOTE: It was suggested to change the ordering of the get_user_pages_fast()
arguments to ensure that callers were converted.  This breaks the current
GUP call site convention of having the returned pages be the final
parameter.  So the suggestion was rejected.

Link: http://lkml.kernel.org/r/20190328084422.29911-4-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190317183438.2057-4-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Mike Marshall <hubcap@omnibond.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:46 -07:00
Ira Weiny
932f4a630a mm/gup: replace get_user_pages_longterm() with FOLL_LONGTERM
Pach series "Add FOLL_LONGTERM to GUP fast and use it".

HFI1, qib, and mthca, use get_user_pages_fast() due to its performance
advantages.  These pages can be held for a significant time.  But
get_user_pages_fast() does not protect against mapping FS DAX pages.

Introduce FOLL_LONGTERM and use this flag in get_user_pages_fast() which
retains the performance while also adding the FS DAX checks.  XDP has also
shown interest in using this functionality.[1]

In addition we change get_user_pages() to use the new FOLL_LONGTERM flag
and remove the specialized get_user_pages_longterm call.

[1] https://lkml.org/lkml/2019/3/19/939

"longterm" is a relative thing and at this point is probably a misnomer.
This is really flagging a pin which is going to be given to hardware and
can't move.  I've thought of a couple of alternative names but I think we
have to settle on if we are going to use FL_LAYOUT or something else to
solve the "longterm" problem.  Then I think we can change the flag to a
better name.

Secondly, it depends on how often you are registering memory.  I have
spoken with some RDMA users who consider MR in the performance path...
For the overall application performance.  I don't have the numbers as the
tests for HFI1 were done a long time ago.  But there was a significant
advantage.  Some of which is probably due to the fact that you don't have
to hold mmap_sem.

Finally, architecturally I think it would be good for everyone to use
*_fast.  There are patches submitted to the RDMA list which would allow
the use of *_fast (they reworking the use of mmap_sem) and as soon as they
are accepted I'll submit a patch to convert the RDMA core as well.  Also
to this point others are looking to use *_fast.

As an aside, Jasons pointed out in my previous submission that *_fast and
*_unlocked look very much the same.  I agree and I think further cleanup
will be coming.  But I'm focused on getting the final solution for DAX at
the moment.

This patch (of 7):

This patch starts a series which aims to support FOLL_LONGTERM in
get_user_pages_fast().  Some callers who would like to do a longterm (user
controlled pin) of pages with the fast variant of GUP for performance
purposes.

Rather than have a separate get_user_pages_longterm() call, introduce
FOLL_LONGTERM and change the longterm callers to use it.

This patch does not change any functionality.  In the short term
"longterm" or user controlled pins are unsafe for Filesystems and FS DAX
in particular has been blocked.  However, callers of get_user_pages_fast()
were not "protected".

FOLL_LONGTERM can _only_ be supported with get_user_pages[_fast]() as it
requires vmas to determine if DAX is in use.

NOTE: In merging with the CMA changes we opt to change the
get_user_pages() call in check_and_migrate_cma_pages() to a call of
__get_user_pages_locked() on the newly migrated pages.  This makes the
code read better in that we are calling __get_user_pages_locked() on the
pages before and after a potential migration.

As a side affect some of the interfaces are cleaned up but this is not the
primary purpose of the series.

In review[1] it was asked:

<quote>
> This I don't get - if you do lock down long term mappings performance
> of the actual get_user_pages call shouldn't matter to start with.
>
> What do I miss?

A couple of points.

First "longterm" is a relative thing and at this point is probably a
misnomer.  This is really flagging a pin which is going to be given to
hardware and can't move.  I've thought of a couple of alternative names
but I think we have to settle on if we are going to use FL_LAYOUT or
something else to solve the "longterm" problem.  Then I think we can
change the flag to a better name.

Second, It depends on how often you are registering memory.  I have spoken
with some RDMA users who consider MR in the performance path...  For the
overall application performance.  I don't have the numbers as the tests
for HFI1 were done a long time ago.  But there was a significant
advantage.  Some of which is probably due to the fact that you don't have
to hold mmap_sem.

Finally, architecturally I think it would be good for everyone to use
*_fast.  There are patches submitted to the RDMA list which would allow
the use of *_fast (they reworking the use of mmap_sem) and as soon as they
are accepted I'll submit a patch to convert the RDMA core as well.  Also
to this point others are looking to use *_fast.

As an asside, Jasons pointed out in my previous submission that *_fast and
*_unlocked look very much the same.  I agree and I think further cleanup
will be coming.  But I'm focused on getting the final solution for DAX at
the moment.

</quote>

[1] https://lore.kernel.org/lkml/20190220180255.GA12020@iweiny-DESK2.sc.intel.com/T/#md6abad2569f3bf6c1f03686c8097ab6563e94965

[ira.weiny@intel.com: v3]
  Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190317183438.2057-2-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: James Hogan <jhogan@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:45 -07:00
Yishai Hadas
cd5d20f13f IB/mlx5: Verify DEVX general object type correctly
As the obj_id in the firmware is not globally unique in general_object,
the object type must be considered upon checking for a valid object id.

Fixes: 2351776e87 ("IB/mlx5: Verify DEVX object type")
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-14 10:22:09 -03:00
Jason Gunthorpe
641114d2af RDMA: Directly cast the sockaddr union to sockaddr
gcc 9 now does allocation size tracking and thinks that passing the member
of a union and then accessing beyond that member's bounds is an overflow.

Instead of using the union member, use the entire union with a cast to
get to the sockaddr. gcc will now know that the memory extends the full
size of the union.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-13 22:16:38 -03:00
Linus Torvalds
dce45af5c2 5.2 Merge Window pull request
This has been a smaller cycle than normal. One new driver was accepted,
 which is unusual, and at least one more driver remains in review on the
 list.
 
 - Driver fixes for hns, hfi1, nes, rxe, i40iw, mlx5, cxgb4, vmw_pvrdma
 
 - Many patches from MatthewW converting radix tree and IDR users to use
   xarray
 
 - Introduction of tracepoints to the MAD layer
 
 - Build large SGLs at the start for DMA mapping and get the driver to
   split them
 
 - Generally clean SGL handling code throughout the subsystem
 
 - Support for restricting RDMA devices to net namespaces for containers
 
 - Progress to remove object allocation boilerplate code from drivers
 
 - Change in how the mlx5 driver shows representor ports linked to VFs
 
 - mlx5 uapi feature to access the on chip SW ICM memory
 
 - Add a new driver for 'EFA'. This is HW that supports user space packet
   processing through QPs in Amazon's cloud
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAlzTIU0ACgkQOG33FX4g
 mxrGKQ/8CqpyvuCyZDW5ovO4DI4YlzYSPXehWlwxA4CWhU1AYTujutnNOdZdngnz
 atTthOlJpZWJV26orvvzwIOi4qX/5UjLXEY3HYdn07JP1Z4iT7E3P4W2sdU3vdl3
 j8bU7xM7ZWmnGxrBZ6yQlVRadEhB8+HJIZWMw+wx66cIPnvU+g9NgwouH67HEEQ3
 PU8OCtGBwNNR508WPiZhjqMDfi/3BED4BfCihFhMbZEgFgObjRgtCV0M33SSXKcR
 IO2FGNVuDAUBlND3vU9guW1+M77xE6p1GvzkIgdCp6qTc724NuO5F2ngrpHKRyZT
 CxvBhAJI6tAZmjBVnmgVJex7rA8p+y/8M/2WD6GE3XSO89XVOkzNBiO2iTMeoxXr
 +CX6VvP2BWwCArxsfKMgW3j0h/WVE9w8Ciej1628m1NvvKEV4AGIJC1g93lIJkRN
 i3RkJ5PkIrdBrTEdKwDu1FdXQHaO7kGgKvwzJ7wBFhso8BRMrMfdULiMbaXs2Bw1
 WdL5zoSe/bLUpPZxcT9IjXRxY5qR0FpIOoo6925OmvyYe/oZo1zbitS5GGbvV90g
 tkq6Jb+aq8ZKtozwCo+oMcg9QPLYNibQsnkL3QirtURXWCG467xdgkaJLdF6s5Oh
 cp+YBqbR/8HNMG/KQlCfnNQKp1ci8mG3EdthQPhvdcZ4jtbqnSI=
 =TS64
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "This has been a smaller cycle than normal. One new driver was
  accepted, which is unusual, and at least one more driver remains in
  review on the list.

  Summary:

   - Driver fixes for hns, hfi1, nes, rxe, i40iw, mlx5, cxgb4,
     vmw_pvrdma

   - Many patches from MatthewW converting radix tree and IDR users to
     use xarray

   - Introduction of tracepoints to the MAD layer

   - Build large SGLs at the start for DMA mapping and get the driver to
     split them

   - Generally clean SGL handling code throughout the subsystem

   - Support for restricting RDMA devices to net namespaces for
     containers

   - Progress to remove object allocation boilerplate code from drivers

   - Change in how the mlx5 driver shows representor ports linked to VFs

   - mlx5 uapi feature to access the on chip SW ICM memory

   - Add a new driver for 'EFA'. This is HW that supports user space
     packet processing through QPs in Amazon's cloud"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (186 commits)
  RDMA/ipoib: Allow user space differentiate between valid dev_port
  IB/core, ipoib: Do not overreact to SM LID change event
  RDMA/device: Don't fire uevent before device is fully initialized
  lib/scatterlist: Remove leftover from sg_page_iter comment
  RDMA/efa: Add driver to Kconfig/Makefile
  RDMA/efa: Add the efa module
  RDMA/efa: Add EFA verbs implementation
  RDMA/efa: Add common command handlers
  RDMA/efa: Implement functions that submit and complete admin commands
  RDMA/efa: Add the ABI definitions
  RDMA/efa: Add the com service API definitions
  RDMA/efa: Add the efa_com.h file
  RDMA/efa: Add the efa.h header file
  RDMA/efa: Add EFA device definitions
  RDMA: Add EFA related definitions
  RDMA/umem: Remove hugetlb flag
  RDMA/bnxt_re: Use core helpers to get aligned DMA address
  RDMA/i40iw: Use core helpers to get aligned DMA address within a supported page size
  RDMA/verbs: Add a DMA iterator to return aligned contiguous memory blocks
  RDMA/umem: Add API to find best driver supported page size in an MR
  ...
2019-05-09 09:02:46 -07:00
Linus Torvalds
80f232121b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Highlights:

   1) Support AES128-CCM ciphers in kTLS, from Vakul Garg.

   2) Add fib_sync_mem to control the amount of dirty memory we allow to
      queue up between synchronize RCU calls, from David Ahern.

   3) Make flow classifier more lockless, from Vlad Buslov.

   4) Add PHY downshift support to aquantia driver, from Heiner
      Kallweit.

   5) Add SKB cache for TCP rx and tx, from Eric Dumazet. This reduces
      contention on SLAB spinlocks in heavy RPC workloads.

   6) Partial GSO offload support in XFRM, from Boris Pismenny.

   7) Add fast link down support to ethtool, from Heiner Kallweit.

   8) Use siphash for IP ID generator, from Eric Dumazet.

   9) Pull nexthops even further out from ipv4/ipv6 routes and FIB
      entries, from David Ahern.

  10) Move skb->xmit_more into a per-cpu variable, from Florian
      Westphal.

  11) Improve eBPF verifier speed and increase maximum program size,
      from Alexei Starovoitov.

  12) Eliminate per-bucket spinlocks in rhashtable, and instead use bit
      spinlocks. From Neil Brown.

  13) Allow tunneling with GUE encap in ipvs, from Jacky Hu.

  14) Improve link partner cap detection in generic PHY code, from
      Heiner Kallweit.

  15) Add layer 2 encap support to bpf_skb_adjust_room(), from Alan
      Maguire.

  16) Remove SKB list implementation assumptions in SCTP, your's truly.

  17) Various cleanups, optimizations, and simplifications in r8169
      driver. From Heiner Kallweit.

  18) Add memory accounting on TX and RX path of SCTP, from Xin Long.

  19) Switch PHY drivers over to use dynamic featue detection, from
      Heiner Kallweit.

  20) Support flow steering without masking in dpaa2-eth, from Ioana
      Ciocoi.

  21) Implement ndo_get_devlink_port in netdevsim driver, from Jiri
      Pirko.

  22) Increase the strict parsing of current and future netlink
      attributes, also export such policies to userspace. From Johannes
      Berg.

  23) Allow DSA tag drivers to be modular, from Andrew Lunn.

  24) Remove legacy DSA probing support, also from Andrew Lunn.

  25) Allow ll_temac driver to be used on non-x86 platforms, from Esben
      Haabendal.

  26) Add a generic tracepoint for TX queue timeouts to ease debugging,
      from Cong Wang.

  27) More indirect call optimizations, from Paolo Abeni"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1763 commits)
  cxgb4: Fix error path in cxgb4_init_module
  net: phy: improve pause mode reporting in phy_print_status
  dt-bindings: net: Fix a typo in the phy-mode list for ethernet bindings
  net: macb: Change interrupt and napi enable order in open
  net: ll_temac: Improve error message on error IRQ
  net/sched: remove block pointer from common offload structure
  net: ethernet: support of_get_mac_address new ERR_PTR error
  net: usb: smsc: fix warning reported by kbuild test robot
  staging: octeon-ethernet: Fix of_get_mac_address ERR_PTR check
  net: dsa: support of_get_mac_address new ERR_PTR error
  net: dsa: sja1105: Fix status initialization in sja1105_get_ethtool_stats
  vrf: sit mtu should not be updated when vrf netdev is the link
  net: dsa: Fix error cleanup path in dsa_init_module
  l2tp: Fix possible NULL pointer dereference
  taprio: add null check on sched_nest to avoid potential null pointer dereference
  net: mvpp2: cls: fix less than zero check on a u32 variable
  net_sched: sch_fq: handle non connected flows
  net_sched: sch_fq: do not assume EDT packets are ordered
  net: hns3: use devm_kcalloc when allocating desc_cb
  net: hns3: some cleanup for struct hns3_enet_ring
  ...
2019-05-07 22:03:58 -07:00
Gal Pressman
f23afd75fc RDMA/efa: Add driver to Kconfig/Makefile
Add EFA Makefile and Kconfig.

Signed-off-by: Gal Pressman <galpress@amazon.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-07 12:47:47 -03:00
Gal Pressman
b7f5e880f3 RDMA/efa: Add the efa module
Add the main EFA module file which takes care of device
probe/initialization/registration/etc.

Signed-off-by: Gal Pressman <galpress@amazon.com>
Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-07 12:47:47 -03:00
Gal Pressman
40909f664d RDMA/efa: Add EFA verbs implementation
Add a file that implements the EFA verbs.

Signed-off-by: Gal Pressman <galpress@amazon.com>
Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-07 12:47:47 -03:00
Linus Torvalds
dd4e5d6106 Remove Mysterious Macro Intended to Obscure Weird Behaviours (mmiowb())
Remove mmiowb() from the kernel memory barrier API and instead, for
 architectures that need it, hide the barrier inside spin_unlock() when
 MMIO has been performed inside the critical section.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCgAdFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAlzMFaUACgkQt6xw3ITB
 YzRICQgAiv7wF/yIbBhDOmCNCAKDO59chvFQWxXWdGk/aAB56kwKAMXJgLOvlMG/
 VRuuLyParTFQETC3jaxKgnO/1hb+PZLDt2Q2KqixtjIzBypKUPWvK2sf6THhSRF1
 GK0DBVUd1rCrWrR815+SPb8el4xXtdBzvAVB+Fx35PXVNpdRdqCkK+EQ6UnXGokm
 rXXHbnfsnquBDtmb4CR4r2beH+aNElXbdt0Kj8VcE5J7f7jTdW3z6Q9WFRvdKmK7
 yrsxXXB2w/EsWXOwFp0SLTV5+fgeGgTvv8uLjDw+SG6t0E0PebxjNAflT7dPrbYL
 WecjKC9WqBxrGY+4ew6YJP70ijLBCw==
 =aC8m
 -----END PGP SIGNATURE-----

Merge tag 'arm64-mmiowb' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull mmiowb removal from Will Deacon:
 "Remove Mysterious Macro Intended to Obscure Weird Behaviours (mmiowb())

  Remove mmiowb() from the kernel memory barrier API and instead, for
  architectures that need it, hide the barrier inside spin_unlock() when
  MMIO has been performed inside the critical section.

  The only relatively recent changes have been addressing review
  comments on the documentation, which is in a much better shape thanks
  to the efforts of Ben and Ingo.

  I was initially planning to split this into two pull requests so that
  you could run the coccinelle script yourself, however it's been plain
  sailing in linux-next so I've just included the whole lot here to keep
  things simple"

* tag 'arm64-mmiowb' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (23 commits)
  docs/memory-barriers.txt: Update I/O section to be clearer about CPU vs thread
  docs/memory-barriers.txt: Fix style, spacing and grammar in I/O section
  arch: Remove dummy mmiowb() definitions from arch code
  net/ethernet/silan/sc92031: Remove stale comment about mmiowb()
  i40iw: Redefine i40iw_mmiowb() to do nothing
  scsi/qla1280: Remove stale comment about mmiowb()
  drivers: Remove explicit invocations of mmiowb()
  drivers: Remove useless trailing comments from mmiowb() invocations
  Documentation: Kill all references to mmiowb()
  riscv/mmiowb: Hook up mmwiob() implementation to asm-generic code
  powerpc/mmiowb: Hook up mmwiob() implementation to asm-generic code
  ia64/mmiowb: Add unconditional mmiowb() to arch_spin_unlock()
  mips/mmiowb: Add unconditional mmiowb() to arch_spin_unlock()
  sh/mmiowb: Add unconditional mmiowb() to arch_spin_unlock()
  m68k/io: Remove useless definition of mmiowb()
  nds32/io: Remove useless definition of mmiowb()
  x86/io: Remove useless definition of mmiowb()
  arm64/io: Remove useless definition of mmiowb()
  ARM/io: Remove useless definition of mmiowb()
  mmiowb: Hook up mmiowb helpers to spinlocks and generic I/O accessors
  ...
2019-05-06 16:57:52 -07:00
Gal Pressman
e9c6c53730 RDMA/efa: Add common command handlers
Add the EFA common commands implementation.

Signed-off-by: Gal Pressman <galpress@amazon.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-06 15:18:18 -03:00
Gal Pressman
0420e54256 RDMA/efa: Implement functions that submit and complete admin commands
Add admin commands submissions/completions implementation.

Signed-off-by: Gal Pressman <galpress@amazon.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-06 15:18:18 -03:00
Gal Pressman
cd9b3d5970 RDMA/efa: Add the com service API definitions
Header file for the various commands that can be sent through admin queue.
This includes queue create/modify/destroy, setting up and remove
protection domains, address handlers, and memory registration, etc.

Signed-off-by: Gal Pressman <galpress@amazon.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-06 15:18:17 -03:00
Gal Pressman
43eaa49d51 RDMA/efa: Add the efa_com.h file
A helper header file for EFA admin queue, admin queue completion,
asynchronous notification queue, and various hardware configuration data
structures and functions.

Signed-off-by: Gal Pressman <galpress@amazon.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-06 15:18:17 -03:00
Gal Pressman
853f565235 RDMA/efa: Add the efa.h header file
Add EFA driver generic header file defining driver's device independent
internal data structures and definitions.

Signed-off-by: Gal Pressman <galpress@amazon.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-06 15:18:17 -03:00
Gal Pressman
01edac3aa2 RDMA/efa: Add EFA device definitions
EFA PCIe device implements a single Admin Queue (AQ) and Admin Completion
Queue (ACQ) pair to initialize and communicate configuration with the
device.  Through this pair, we run set/get commands for querying and
configuring the device, create/modify/destroy queues, and IB specific
commands like Address Handler (AH), Memory Registration (MR) and
Protection Domains (PD).

In addition to admin (AQ/ACQ), we have data path queues that get
classified as Queue Pairs (QP) and Completion Queues (CQ).

Signed-off-by: Gal Pressman <galpress@amazon.com>
Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-06 13:47:50 -03:00
Shiraz Saleem
d85582517e RDMA/bnxt_re: Use core helpers to get aligned DMA address
Call the core helpers to retrieve the HW aligned address to use for the
MR, within a supported bnxt_re page size.

Remove checking the umem->hugtetlb flag as it is no longer required. The
new DMA block iterator will return the 2M aligned address if the MR is
backed by 2M huge pages.

Acked-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-06 13:08:11 -03:00
Shiraz Saleem
eb52c0333f RDMA/i40iw: Use core helpers to get aligned DMA address within a supported page size
Call the core helpers to retrieve the HW aligned address to use for the
MR, within a supported i40iw page size.

Remove code in i40iw to determine when MR is backed by 2M huge pages which
involves checking the umem->hugetlb flag and VMA inspection.  The new DMA
iterator will return the 2M aligned address if the MR is backed by 2M
pages.

Fixes: f26c7c8339 ("i40iw: Add 2MB page support")
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-06 13:08:11 -03:00
Mike Marciniszyn
4c4b1996b5 IB/hfi1: Fix WQ_MEM_RECLAIM warning
The work_item cancels that occur when a QP is destroyed can elicit the
following trace:

 workqueue: WQ_MEM_RECLAIM ipoib_wq:ipoib_cm_tx_reap [ib_ipoib] is flushing !WQ_MEM_RECLAIM hfi0_0:_hfi1_do_send [hfi1]
 WARNING: CPU: 7 PID: 1403 at kernel/workqueue.c:2486 check_flush_dependency+0xb1/0x100
 Call Trace:
  __flush_work.isra.29+0x8c/0x1a0
  ? __switch_to_asm+0x40/0x70
  __cancel_work_timer+0x103/0x190
  ? schedule+0x32/0x80
  iowait_cancel_work+0x15/0x30 [hfi1]
  rvt_reset_qp+0x1f8/0x3e0 [rdmavt]
  rvt_destroy_qp+0x65/0x1f0 [rdmavt]
  ? _cond_resched+0x15/0x30
  ib_destroy_qp+0xe9/0x230 [ib_core]
  ipoib_cm_tx_reap+0x21c/0x560 [ib_ipoib]
  process_one_work+0x171/0x370
  worker_thread+0x49/0x3f0
  kthread+0xf8/0x130
  ? max_active_store+0x80/0x80
  ? kthread_bind+0x10/0x10
  ret_from_fork+0x35/0x40

Since QP destruction frees memory, hfi1_wq should have the WQ_MEM_RECLAIM.

The hfi1_wq does not allocate memory with GFP_KERNEL or otherwise become
entangled with memory reclaim, so this flag is appropriate.

Fixes: 0a226edd20 ("staging/rdma/hfi1: Use parallel workqueue for SDMA engines")
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-06 12:57:45 -03:00
Leon Romanovsky
10bf13c334 RDMA/mlx5: Remove MAYEXEC flag
MAYEXEC flag was mistakenly added in the commit cited in the fixes line.

Fixes: 4eb6ab13b9 ("RDMA: Remove rdma_user_mmap_page")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-06 12:56:55 -03:00
Ariel Levkovich
33cde96fb5 IB/mlx5: Device resource control for privileged DEVX user
For DEVX users who have SYS_RAWIO capability, we set the internal device
resources capability when creating the UCTX.  This will allow the device
to restrict the allocation of internal device resources such as SW ICM
memory to privileged DEVX users only.

Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Reviewed-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-06 12:51:51 -03:00
Ariel Levkovich
25c13324d0 IB/mlx5: Add steering SW ICM device memory type
This patch adds support for allocating, deallocating and registering a new
device memory type, STEERING_SW_ICM.  This memory can be allocated and
used by a privileged user for direct rule insertion and management of the
device's steering tables.

The type is provided by the user via the dedicated attribute in the
alloc_dm ioctl command.

Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Reviewed-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-06 12:51:51 -03:00
Ariel Levkovich
4056b12efd IB/mlx5: Warn on allocated MEMIC buffers during cleanup
Adding a warning on allocated MEMIC buffers that weren't freed prior to
driver tear down.

Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Reviewed-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-06 12:51:51 -03:00
Ariel Levkovich
3b113a1ec3 IB/mlx5: Support device memory type attribute
This patch intoruduces a new mlx5_ib driver attribute to the DM allocation
method - the DM type.

In order to allow addition of new types in downstream patches this patch
also refactors the allocation, deallocation and registration handlers to
consider the requested type and perform the necessary actions according to
it.

Since not all future device memory types will be such that are mapped to
user memory, the mandatory page index output attribute is modified to be
optional.

Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Reviewed-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-06 12:51:50 -03:00
David S. Miller
f3f050a4df mlx5-updates-2019-04-30
mlx5 misc updates:
 
 1) Bodong Wang and Parav Pandit (6):
    - Remove unused mlx5_query_nic_vport_vlans
    - vport macros refactoring
    - Fix vport access in E-Switch
    - Use atomic rep state to serialize state change
 
 2) Eli Britstein (2):
    - prio tag mode support, added ACLs and replace TC vlan pop with
      vlan 0 rewrite when prio tag mode is enabled.
 
 3) Erez Alfasi (2):
    - ethtool: Add SFF-8436 and SFF-8636 max EEPROM length definitions
    - mlx5e: ethtool, Add support for EEPROM high pages query
 
 4) Masahiro Yamada (1):
    - remove meaningless CFLAGS_tracepoint.o
 
 5) Maxim Mikityanskiy (1):
    - Put the common XDP code into a function
 
 6) Tariq Toukan (2):
    - Turn on HW tunnel offload in all TIRs
 
 7) Vlad Buslov (1):
    - Return error when trying to insert existing flower filter
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJcyhIFAAoJEEg/ir3gV/o+LgsH/idNT42AQewm2gn1NAt/njRx
 hA/ILH4ZmqYD8tgme5q3lByGrGRTweCPQ92+/tYP1i90PL8EJKNFbRPXuORp+hUk
 m+ywoeyBHx0ZyDlAIGNDCFprY//jZV/3XQKuJhLUliGfN77lUSkVtIz2UY+cDr2U
 XBn0B3Fy54+XP7EqVHXdxRkLiwDCsDwZBF6O9/1cw/rKsly6fIzw1b7UVjFaFA8f
 1g5Ca/+v4X0Rsky1KOGLv8HVB4bxbiSZspAjKwVGJagPUNJMRR6xZyL+VNHWX71R
 N68VMQQbwg7XDDFQNtYAFSpxOkAY+wilkRDe7+3A50cFE8ZYYskwVJunvb75fCA=
 =oqb8
 -----END PGP SIGNATURE-----

Merge tag 'mlx5-updates-2019-04-30' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2019-04-30

mlx5 misc updates:

1) Bodong Wang and Parav Pandit (6):
   - Remove unused mlx5_query_nic_vport_vlans
   - vport macros refactoring
   - Fix vport access in E-Switch
   - Use atomic rep state to serialize state change

2) Eli Britstein (2):
   - prio tag mode support, added ACLs and replace TC vlan pop with
     vlan 0 rewrite when prio tag mode is enabled.

3) Erez Alfasi (2):
   - ethtool: Add SFF-8436 and SFF-8636 max EEPROM length definitions
   - mlx5e: ethtool, Add support for EEPROM high pages query

4) Masahiro Yamada (1):
   - remove meaningless CFLAGS_tracepoint.o

5) Maxim Mikityanskiy (1):
   - Put the common XDP code into a function

6) Tariq Toukan (2):
   - Turn on HW tunnel offload in all TIRs

7) Vlad Buslov (1):
   - Return error when trying to insert existing flower filter
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-04 00:25:02 -04:00
Parav Pandit
a70c07397f RDMA: Introduce and use GID attr helper to read RoCE L2 fields
Instead of RoCE drivers figuring out vlan, smac fields while working on
QP/AH, provide a helper routine to read the L2 fields such as vlan_id and
source mac address.

This moves logic from mlx5 driver to core for wider usage for RoCE ports.

This is a preparation patch to allow detaching netdev in subsequent patch.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-03 11:10:02 -03:00
Kamal Heib
dd05cb828d RDMA: Get rid of iw_cm_verbs
Integrate iw_cm_verbs data members into ib_device_ops and ib_device
structs, this is done to achieve the following:

1) Avoid memory related bugs durring error unwind
2) Make the code more cleaner
3) Reduce code duplication

Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-03 10:56:56 -03:00
Jack Morgenstein
8f4426aa19 IB/mlx5: Add missing XRC options to QP optional params mask
The QP transition optional parameters for the various transition for XRC
QPs are identical to those for RC QPs.

Many of the XRC QP transition optional parameter bits are missing from the
QP optional mask table.  These omissions caused failures when doing XRC QP
state transitions.

For example, when trying to change the response timer of an XRC receive QP
via the RTS2RTS transition, the new timer value was ignored because
MLX5_QP_OPTPAR_RNR_TIMEOUT bit was missing from the optional params mask
for XRC qps for the RTS2RTS transition.

Fix this by adding the missing XRC optional parameters for all QP
transitions to the opt_mask table.

Fixes: e126ba97db ("mlx5: Add driver for Mellanox Connect-IB adapters")
Fixes: a4774e9095 ("IB/mlx5: Fix opt param mask according to firmware spec")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-03 10:15:13 -03:00
David S. Miller
ff24e4980a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Three trivial overlapping conflicts.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-02 22:14:21 -04:00
Saeed Mahameed
c515e70d67 Merge branch 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
This merge commit includes some misc shared code updates from mlx5-next branch needed
for net-next.

1) From Aya: Enable general events on all physical link types and
   restrict general event handling of subtype DELAY_DROP_TIMEOUT in mlx5 rdma
   driver to ethernet links only as it was intended.

2) From Eli: Introduce low level bits for prio tag mode

3) From Maor: Low level steering updates to support RDMA RX flow
   steering and enables RoCE loopback traffic when switchdev is enabled.

4) From Vu and Parav: Two small mlx5 core cleanups

5) From Yevgeny add HW definitions of geneve offloads

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-05-01 13:57:48 -07:00
Aya Levin
6cfdc7e468 IB/mlx5: Restrict 'DELAY_DROP_TIMEOUT' subtype to Ethernet interfaces
Subtype 'DELAY_DROP_TIMEOUT' (under 'GENERAL' event) is restricted to
Ethernet interfaces. This patch doesn't change functionality or breaks
current flow. In the downstream patch, non Ethernet (like IB) interfaces
will receive 'GENERAL' event.

Fixes: 5d3c537f90 ("net/mlx5: Handle event of power detection in the PCIE slot")
Signed-off-by: Aya Levin <ayal@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-04-29 16:55:05 -07:00
Vu Pham
c42260f195 net/mlx5: Separate and generalize dma device from pci device
The mlx5 Sub-Function (SF) sub device will be introduced in
subsequent patches. It will be created as mediated device and
belong to mdev bus. It is necessary to treat dma operations on
PF, VF and SF in uniform way, hence reduce the dependency on
pdev pci dev struct and work directly out of newly introduced
'struct device' from previous patch.

This patch does not change any functionality.

Signed-off-by: Vu Pham <vuhuong@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-04-29 16:55:05 -07:00
Michal Kubecek
ae0be8de9a netlink: make nla_nest_start() add NLA_F_NESTED flag
Even if the NLA_F_NESTED flag was introduced more than 11 years ago, most
netlink based interfaces (including recently added ones) are still not
setting it in kernel generated messages. Without the flag, message parsers
not aware of attribute semantics (e.g. wireshark dissector or libmnl's
mnl_nlmsg_fprintf()) cannot recognize nested attributes and won't display
the structure of their contents.

Unfortunately we cannot just add the flag everywhere as there may be
userspace applications which check nlattr::nla_type directly rather than
through a helper masking out the flags. Therefore the patch renames
nla_nest_start() to nla_nest_start_noflag() and introduces nla_nest_start()
as a wrapper adding NLA_F_NESTED. The calls which add NLA_F_NESTED manually
are rewritten to use nla_nest_start().

Except for changes in include/net/netlink.h, the patch was generated using
this semantic patch:

@@ expression E1, E2; @@
-nla_nest_start(E1, E2)
+nla_nest_start_noflag(E1, E2)

@@ expression E1, E2; @@
-nla_nest_start_noflag(E1, E2 | NLA_F_NESTED)
+nla_nest_start(E1, E2)

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27 17:03:44 -04:00
Matthew Wilcox
a7b36d5fa8 ib/bnxt: Remove mention of idr_alloc from comment
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-25 12:02:18 -03:00
Lijun Ou
2557fabd6e RDMA/hns: Bugfix for mapping user db
When the maximum send wr delivered by the user is zero, the qp does not
have a sq.

When allocating the sq db buffer to store the user sq pi pointer and map
it to the kernel mode, max_send_wr is used as the trigger condition, while
the kernel does not consider the max_send_wr trigger condition when
mapmping db. It will cause sq record doorbell map fail and create qp fail.

The failed print information as follows:

 hns3 0000:7d:00.1: Send cmd: tail - 418, opcode - 0x8504, flag - 0x0011, retval - 0x0000
 hns3 0000:7d:00.1: Send cmd: 0xe59dc000 0x00000000 0x00000000 0x00000000 0x00000116 0x0000ffff
 hns3 0000:7d:00.1: sq record doorbell map failed!
 hns3 0000:7d:00.1: Create RC QP failed

Fixes: 0425e3e6e0 ("RDMA/hns: Support flush cqe for hip08 in kernel space")
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-25 10:40:04 -03:00
Jason Gunthorpe
1d045aa76f Merge branch 'mlx5_tir_icm' into rdma.git for-next
Ariel Levkovich says:

====================
The series exposes the ICM address of the receive transport
interface (TIR) of Raw Packet and RSS QPs to the user since they are
required to properly create and insert steering rules that direct flows to
these QPs.
====================

For dependencies this branch is based on mlx5-next from
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux

* branch 'mlx5_tir_icm':
  IB/mlx5: Expose TIR ICM address to user space
  net/mlx5: Introduce new TIR creation core API
  net/mlx5: Expose TIR ICM address in command outbox

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-25 10:33:00 -03:00
Ariel Levkovich
1f1d6abbf0 IB/mlx5: Expose TIR ICM address to user space
This patch exposes the TIR ICM address of raw packet and RSS
QPs to user space.

In order to pass the new field, the patch extends the mlx5
specific QP creation response structure and fills it with
the icm address returned by the FW command, if available.

Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-25 10:03:19 -03:00
Jason Gunthorpe
449a224c10 Merge branch 'rdma_mmap' into rdma.git for-next
Jason Gunthorpe says:

====================
Upon review it turns out there are some long standing problems in BAR
mapping area:
 * BAR pages intended for read-only can be switched to writable via mprotect.
 * Missing use of rdma_user_mmap_io for the mlx5 clock BAR page.
 * Disassociate causes SIGBUS when touching the pages.
 * CPU pages are being mapped through to the process via remap_pfn_range
   instead of the more appropriate vm_insert_page, causing weird behaviors
   during disassociation.

This series adds the missing VM_* flag manipulation, adds faulting a zero
page for disassociation and revises the CPU page mappings to use
vm_insert_page.
====================

For dependencies this branch is based on for-rc from
git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git

* branch 'rdma_mmap':
  RDMA: Remove rdma_user_mmap_page
  RDMA/mlx5: Use get_zeroed_page() for clock_info
  RDMA/ucontext: Fix regression with disassociate
  RDMA/mlx5: Use rdma_user_map_io for mapping BAR pages
  RDMA/mlx5: Do not allow the user to write to the clock page

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-24 16:20:34 -03:00
Jason Gunthorpe
4eb6ab13b9 RDMA: Remove rdma_user_mmap_page
Upon further research drivers that want this should simply call the core
function vm_insert_page(). The VMA holds a reference on the page and it
will be automatically freed when the last reference drops. No need for
disassociate to sequence the cleanup.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-24 16:18:36 -03:00
Jason Gunthorpe
ddcdc368b1 RDMA/mlx5: Use get_zeroed_page() for clock_info
get_zeroed_page() returns a virtual address for the page which is better
than allocating a struct page and doing a permanent kmap on it.

Cc: stable@vger.kernel.org
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-24 13:40:50 -03:00
Jason Gunthorpe
d5e560d3f7 RDMA/mlx5: Use rdma_user_map_io for mapping BAR pages
Since mlx5 supports device disassociate it must use this API for all
BAR page mmaps, otherwise the pages can remain mapped after the device
is unplugged causing a system crash.

Cc: stable@vger.kernel.org
Fixes: 5f9794dc94 ("RDMA/ucontext: Add a core API for mmaping driver IO memory")
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2019-04-24 13:06:40 -03:00
Jason Gunthorpe
c660133c33 RDMA/mlx5: Do not allow the user to write to the clock page
The intent of this VMA was to be read-only from user space, but the
VM_MAYWRITE masking was missed, so mprotect could make it writable.

Cc: stable@vger.kernel.org
Fixes: 5c99eaecb1 ("IB/mlx5: Mmap the HCA's clock info to user-space")
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2019-04-24 13:06:24 -03:00
John Fleck
3c176c9d72 IB/hfi1: Remove reference to RHF.VCRCErr
The bit VCRCErr in the receive header flag is actually a
reserved field. Remove bit operations on this field.

Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: John Fleck <john.fleck@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-24 11:48:11 -03:00
Mike Marciniszyn
a9c62e0078 IB/hfi1: Add selected Rcv counters
These counters are required for error analysis and debug.

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-24 11:48:10 -03:00
Mike Marciniszyn
d40f69c9b9 IB/{rdmavt, qib, hfi1}: Use new routine to release reference counts
The reference count adjustments on reference count completion
are open coded throughout.

Add a routine to do all reference count adjustments and use.

Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-24 11:31:49 -03:00
Mike Marciniszyn
715ab1a862 IB/rdmavt: Fix ab/ba include issues
The currently include file ordering for rdmavt headers has an
ab/ba include issue the precludes using inlines from rdma_vt.h
in rdmavt_qp.h.

At the heart of the issue is that rdma_vt.h includes rdmavt_qp.h.

Fix the ordering issue by adjusting rdma_vt.h to not require rdmavt_qp.h
and move qp related inlines to rdmavt_qp.h.

Additionally, promote rvt_mmap_info to rdma_vt.h since it is shared
by rdmavt_cq.h and rdmavt_qp.h.

Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-24 11:31:49 -03:00
Mike Marciniszyn
62644c1d2b IB/hfi1: Make opfn.h self sufficient
The opfn.h include file build-ablility depends on the including file
having the correct includes.

Fix by making opfn.h self sufficient.

Reviewed-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-24 11:31:49 -03:00
Kaike Wan
ea752bc5e5 IB/{rdmavt, hfi1): Miscellaneous comment fixes
This patch fixes miscellaneous comment errors.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-24 11:31:48 -03:00
Josh Collier
07c5ba9124 IB/hfi1: Add debugfs to control expansion ROM write protect
Some kernels now enable CONFIG_IO_STRICT_DEVMEM which prevents multiple
handles to PCI resource0. In order to continue to support expansion ROM
updates while the driver is loaded, the driver must now provide an
interface to control the expansion ROM write protection.

This patch adds an exprom_wp debugfs interface that allows the hfi1_eprom
user tool to disable the expansion ROM write protection by opening the
file and writing a '1'.  The write protection is released when writing a
'0' or automatically re-enabled when the file handle is closed.  The
current implementation will only allow one handle to be opened at a time
across all hfi1 devices.

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Josh Collier <josh.d.collier@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-24 11:26:41 -03:00
Leon Romanovsky
5742582222 RDMA/hns: Remove asynchronic QP destroy
Verbs destroy callbacks are synchronous operations and can't be delayed.
The expectation is that after driver returned from destroy function, the
memory can be freed and user won't be able to access it again.

Ditch workqueue implementation used in HNS driver.

Fixes: d838c481e0 ("IB/hns: Fix the bug when destroy qp")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Acked-by: oulijun <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-24 10:55:31 -03:00
Saeed Mahameed
c3bdd5e651 Linux 5.1-rc1
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlyOup0eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGHKoIAIKVuBSyD+m65TaM
 pjoAFa56weEc67Mmai2A84EOm0MVy9C6L7EOcOgVsJiLxDCYyWQ7xYwV2kceKJpW
 H5xauhb3+TxpxYeaeKdPPPHmBdejRwOPYvGAfnDMCqCCWQTad52sQUPCLI+yhF1t
 wgnuMi+SwNBWP9aYCXdFPK4fVhh27AcEAOEsRVCh4tIBH/wkf4GwrDr3IX1MFeMX
 jE/R43la4hu1swcWBsjkErWUasVPCgJSSQTfKDo9PQTVnoh0PHFp4fkOInVKLymQ
 7AGo+Knc+1he+sFsB2IbZwea0xqtJtjtr1oC+at8gNx66qVG+o7UZNi5LR1uPW4Z
 4+dwGBk=
 =pyXR
 -----END PGP SIGNATURE-----

Merge tag 'v5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux into mlx5-next

Linux 5.1-rc1

We forgot to reset the branch last merge window thus mlx5-next is outdated
and still based on 5.0-rc2. This merge commit is needed to sync mlx5-next
branch with 5.1-rc1.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-04-22 15:25:39 -07:00
Mark Bloch
5fb58c9e2f RDMA/mlx5: Don't create IB representors when in multiport RoCE mode
Switchdev mode and mutiport RoCE mode aren't compatible at this point.
Don't create IB reps when a user switches to switchdev mode and the driver
operates in that mode.

Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-22 15:24:05 -03:00
Mark Bloch
d3b5cc1cd9 RDMA/mlx5: Initialize roce port info before multiport master init
When working in mutliport RoCE mode it is possible to attach a slave
before the master. In that case the slave is waiting for a master to be
attached.  When the master is attached it goes over the list of waiting
slaves, finds a slave that is compatible and tries to bind it to itself.

The call stack is:
mlx5_ib_init_multiport_master() -> mlx5_ib_bind_slave_port()

In the bind function we will create a netdev notifier, but this is done
before we initialize the RoCE structure (this is done at a later stage by
the master in the ROCE stage).

Once events are delivered to that notifier we will use
mlx5_ib_get_native_port_mdev() to get the actual port and as the native
port is zero we will access an invalid index in the port structure.

Move the RoCE structure initialization to an earlier stage.

Fixes: 32f69e4be2 ("{net, IB}/mlx5: Manage port association for multiport RoCE")
Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-22 15:24:05 -03:00
Mark Bloch
7f575103b0 RDMA/mlx5: Allow DEVX and raw creation flow on reps
Remove the limitations that were in place and provide support for DEVX and
raw flow creation on reps.

Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-22 15:24:05 -03:00
Maor Gottlieb
56e5acd405 RDMA/mlx5: Add query e-switch vport context to devx white list
Add MLX5_OP_QUERY_ESW_VPORT_CONTEXT to devx white list. It will be allowed
only if HCA_CAP.eswitch_manager==1.

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-22 15:24:05 -03:00
Mark Bloch
52438be441 RDMA/mlx5: Allow inserting a steering rule to the FDB
Allow this only via mlx5 raw create flow API, legacy verbs are not
supported. To accommodate that, we add a new attribute to matcher creation
to indicate the type of flow table to be used.
	MLX5_IB_ATTR_FLOW_MATCHER_FT_TYPE
With this new attribute MLX5_IB_ATTR_FLOW_MATCHER_FLOW_FLAGS is no longer
needed, we keep it for compatibility but at most only a single attribute can
be passed of the two.

When inserting a flow rule to the FDB we require that a DEVX FT is
provided as a destination, no other configuration is allowed.

Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-22 15:24:05 -03:00
Mark Bloch
3b70508a6b RDMA/mlx5: Create flow table with max size supported
Instead of failing the request, just use the supported number of flow
entries.

Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-22 15:24:05 -03:00
Mark Bloch
13a4376568 RDMA/mlx5: Access the prio bypass inside the FDB flow table namespace
Now that we have a specific prio inside the FDB namespace allow retrieving
it from the RDMA side.

Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-22 15:24:05 -03:00
Chengguang Xu
2d95984977 infiniband/qib: Fix typo in comment
Fix typo 'faspath' -> 'pastpath'.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-22 14:08:15 -03:00
Colin Ian King
ff5eefe6d3 RDMA/cxgb4: Fix spelling mistake "immedate" -> "immediate"
There is a spelling mistake in a module parameter description. Fix it.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-18 03:17:42 -03:00
Guy Levi
7249c8ea22 IB/mlx5: Fix scatter to CQE in DCT QP creation
When scatter to CQE is enabled on a DCT QP it corrupts the mailbox command
since it tried to treat it as as QP create mailbox command instead of a
DCT create command.

The corrupted mailbox command causes userspace to malfunction as the
device doesn't create the QP as expected.

A new mlx5 capability is exposed to user-space which ensures that it will
not enable the feature on DCT without this fix in the kernel.

Fixes: 5d6ff1babe ("IB/mlx5: Support scatter to CQE for DC transport type")
Signed-off-by: Guy Levi <guyle@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-18 03:13:41 -03:00
David S. Miller
6b0a7f84ea Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflict resolution of af_smc.c from Stephen Rothwell.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-17 11:26:25 -07:00
Colin Ian King
a6d2a5a92e RDMA/cxgb4: Fix null pointer dereference on alloc_skb failure
Currently if alloc_skb fails to allocate the skb a null skb is passed to
t4_set_arp_err_handler and this ends up dereferencing the null skb.  Avoid
the NULL pointer dereference by checking for a NULL skb and returning
early.

Addresses-Coverity: ("Dereference null return")
Fixes: b38a0ad8ec ("RDMA/cxgb4: Set arp error handler for PASS_ACCEPT_RPL messages")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Potnuri Bharat Teja <bharat@chelsio.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-16 08:03:17 -03:00
Colin Ian King
1db86318c4 RDMA/mlx5: Check for error return in flow_rule rather than err
Currently when the call to create_flow_rule_vport_sq fails, the error
check is being performed on err rather than on the return pointer
flow_rule.  The return flow_rule maybe NULL (which is not considered an
error) or an error code, so check for the error on flow_rule.

Addresses-Coverity: ("Uninitialized scalar variable")
Fixes: d5ed8ac34c ("RDMA/mlx5: Move default representors SQ steering to rule to modify QP")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-12 11:19:49 -03:00
Devesh Sharma
1c00d7bc96 RDMA/ocrdma: Remove use of idr use pci bdf instead
Removing the use of IDR variable just to name the function ids. Using the
PCI_FUNC(pdev->devfn) instead to create the device name, associated
resources and to print driver into at various places.

Reported-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-12 10:59:02 -03:00
Kaike Wan
d737b25b1a IB/hfi1: Do not flush send queue in the TID RDMA second leg
When a QP is put into error state, the send queue will be flushed.
This mechanism is implemented in both the first and the second leg
of the send engine. Since the second leg is only responsible for
data transactions in the KDETH space for the TID RDMA WRITE request,
it should not perform the flushing of the send queue.

This patch removes the flushing function of the second leg, but
still keeps the bailing out of the QP if it is put into error state.

Fixes: 70dcb2e3dc ("IB/hfi1: Add the TID second leg send packet builder")
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-10 15:09:30 -03:00
Mark Bloch
fb652d3299 RDMA/mlx5: Remove VF representor profile
Now that we have a single IB device with multiple ports we can remove the
VF representor profile.

Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-10 15:05:40 -03:00
Mark Bloch
26628e2d58 RDMA/mlx5: Move to single device multiport ports in switchdev mode
Move from IB device (representor) per virtual function to single IB device
with port per virtual function (port 1 represents the uplink). As number
of ports is a static property of an IB device, declare the IB device with
as many port as the possible according to the PCI bus.

Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-10 15:05:39 -03:00
Mark Bloch
a989ea01cb RDMA/mlx5: Move SMI caps logic
We store the SMI information in the core device's struct, make sure we set
that information only once (and not per port), while here make the for
loop based on the actual size of the array.

Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-10 15:05:39 -03:00
Mark Bloch
35b0aa67b2 RDMA/mlx5: Refactor netdev affinity code
The design of representors is such that once an IB representor is created,
the netdev of representor already exists, we can use that fact to simplify
the netdev affinity code.

Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-10 15:05:39 -03:00
Mark Bloch
d5ed8ac34c RDMA/mlx5: Move default representors SQ steering to rule to modify QP
Currently the steering for SQs created on representors is done on
creation, once we move to representors as ports of an IB device we need
the port argument which is given only at the modify QP stage, adjust the
code appropriately.

Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-10 15:05:39 -03:00
Mark Bloch
6a4d00be08 RDMA/mlx5: Move rep into port struct
In preparation of moving into a model of single IB device multiple ports
move rep to be part of the port structure. We mark a representor device by
setting is_rep, no functional change with this patch.

Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-10 15:05:39 -03:00
Mark Bloch
5d8f6a0e92 RDMA/mlx5: Use correct size for device resources
On allocation we use the array size and on destruction num_ports, use the
array size of destruction as well, in this context the array corresponds
to the native/actual ports on the NIC so no need to adjust this logic for
representors.

Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-10 15:05:39 -03:00
Mark Bloch
da796ccb3e RDMA/mlx5: Move ports allocation to outside of INIT stage
In downstream patches we will need access to the ports before doing any
stages, in order to set net device per representor.

Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-10 15:05:39 -03:00
Mark Bloch
4a6dc8552a RDMA/mlx5: Free IB device on remove
Simplify the code and move the deallocation of the IB device into the
remove function.

Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-10 15:05:39 -03:00
Mark Bloch
95579e785a RDMA/mlx5: Move netdev info into the port struct
Netdev info is stored in a separate array and holds data relevant on a per
port basis, move it to be part of the port struct.

Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-10 15:05:39 -03:00
Jason Gunthorpe
5331fa0db7 Merge branch 'mlx5-next' into rdma.git for-next
From
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux

Required for dependencies on the next series

* branch 'mlx5-next':
  net/mlx5: E-Switch, add a new prio to be used by the RDMA side
  net/mlx5: E-Switch, don't use hardcoded values for FDB prios
  net/mlx5: Fix false compilation warning
  net/mlx5: Expose MPEIN (Management PCIE INfo) register layout
  net/mlx5: Add rate limit print macros
  net/mlx5: Add explicit bar address field
  net/mlx5: Replace dev_err/warn/info by mlx5_core_err/warn/info
  net/mlx5: Use dev->priv.name instead of dev_name
  net/mlx5: Make mlx5_core messages independent from mdev->pdev
  net/mlx5: Break load_one into three stages
  net/mlx5: Function setup/teardown procedures
  net/mlx5: Move health and page alloc init to mdev_init
  net/mlx5: Split mdev init and pci init
  net/mlx5: Remove redundant init functions parameter
  net/mlx5: Remove spinlock support from mlx5_write64
  net/mlx5: Remove unused MLX5_*_DOORBELL_LOCK macros

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-10 14:59:27 -03:00
Steve Wise
ab7efbe24b RDMA/cxgb4: Use ib_device_set_netdev()
cxgb4 has a simple non-dynamic use of get_netdev, so conversion is
straightforward.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-09 10:14:54 -03:00
Jason Gunthorpe
4b38da75e0 RDMA/drivers: Convert easy drivers to use ib_device_set_netdev()
Drivers that never change their ndev dynamically do not need to use
the get_netdev callback.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Acked-by: Selvin Xavier <selvin.xavier@broadcom.com>
Acked-by: Michal Kalderon <michal.kalderon@marvell.com>
Acked-by: Adit Ranadive <aditr@vmware.com>
2019-04-09 10:14:54 -03:00
David Ahern
1550c17193 ipv4: Prepare rtable for IPv6 gateway
To allow the gateway to be either an IPv4 or IPv6 address, remove
rt_uses_gateway from rtable and replace with rt_gw_family. If
rt_gw_family is set it implies rt_uses_gateway. Rename rt_gateway
to rt_gw4 to represent the IPv4 version.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08 15:22:40 -07:00
Yangyang Li
00fb67ec6b RDMA/hns: Bugfix for SCC hem free
The method of hem free for SCC context is different from qp context.

In the current version, if free SCC hem during the execution of qp free,
there may be smmu error as below:

 arm-smmu-v3 arm-smmu-v3.1.auto: event 0x10 received:
 arm-smmu-v3 arm-smmu-v3.1.auto:  0x00007d0000000010
 arm-smmu-v3 arm-smmu-v3.1.auto:  0x000012000000017c
 arm-smmu-v3 arm-smmu-v3.1.auto:  0x00000000000009e0
 arm-smmu-v3 arm-smmu-v3.1.auto:  0x0000000000000000

As SCC context is still used by hardware after qp free, we can solve this
problem by removing SCC hem free from hns_roce_qp_free.

Fixes: 6a157f7d1b ("RDMA/hns: Add SCC context allocation support for hip08")
Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-08 13:42:16 -03:00
Lijun Ou
4772e03d23 RDMA/hns: Fix bug that caused srq creation to fail
Due to the incorrect use of the seg and obj information, the position of
the mtt is calculated incorrectly, and the free space of the page is not
enough to store the entire mtt, resulting in access to the next page. This
patch fixes this problem.

 Unable to handle kernel paging request at virtual address ffff00006e3cd000
 ...
 Call trace:
  hns_roce_write_mtt+0x154/0x2f0 [hns_roce]
  hns_roce_buf_write_mtt+0xa8/0xd8 [hns_roce]
  hns_roce_create_srq+0x74c/0x808 [hns_roce]
  ib_create_srq+0x28/0xc8

Fixes: 0203b14c4f ("RDMA/hns: Unify the calculation for hem index in hip08")
Signed-off-by: chenglang <chenglang@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-08 13:27:58 -03:00
chenglang
2b277dae06 RDMA/hns: Support to create 1M srq queue
In mhop 0 mode, 64*bt_num queues can be supported.
In mhop 1 mode, 32K*bt_num queues can be supported.
Config srqc_hop_num to 1 to support 1M SRQ queues.

Signed-off-by: chenglang <chenglang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-08 13:26:44 -03:00
Lijun Ou
e1c9a0dc29 RDMA/hns: Dump detailed driver-specific CQ
This patch adds support of resource track for hip08 and take dumping cq
context state used for debugging as an example.  More resources track
supports for hns driver will be added in future.

The output should be as follows.
$ rdma res show cq dev hnseth0 -d
dev hnseth0 cqe 1023 users 2 poll-ctx WORKQUEUE pid 0 comm [ib_core] drv_state 2 drv_ceq
n 0 drv_cqn 0 drv_hopnum 1 drv_pi 0 drv_ci 0 drv_coalesce 0 drv_period 0 drv_cnt 0

Signed-off-by: Tao Tian <tiantao6@huawei.com>
Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
Signed-off-by: chenglang <chenglang@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-08 13:05:25 -03:00
Leon Romanovsky
68e326dea1 RDMA: Handle SRQ allocations by IB/core
Convert SRQ allocation from drivers to be in the IB/core

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-08 13:05:25 -03:00
Leon Romanovsky
d345691471 RDMA: Handle AH allocations by IB/core
Simplify drivers by ensuring lifetime of ib_ah object. The changes
in .create_ah() go hand in hand with relevant update in .destroy_ah().

We will use this opportunity and convert .destroy_ah() to don't fail, as
it was suggested a long time ago, because there is nothing to do in case
of failure during destroy.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-08 13:05:25 -03:00
Jason Gunthorpe
e79c9c6062 IB/mlx5: Remove references to uboject->context
These should all go through udata now. Add mlx5_udata_to_mdev to convert
a udata into the struct mlx5_ib_dev as these call sites require.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-08 13:05:25 -03:00
Shiraz Saleem
d10bcf947a RDMA/umem: Combine contiguous PAGE_SIZE regions in SGEs
Combine contiguous regions of PAGE_SIZE pages into single scatter list
entry while building the scatter table for a umem. This minimizes the
number of the entries in the scatter list and reduces the DMA mapping
overhead, particularly with the IOMMU.

Set default max_seg_size in core for IB devices to 2G and do not combine
if we exceed this limit.

Also, purge npages in struct ib_umem as we now DMA map the umem SGL with
sg_nents and npage computation is not needed. Drivers should now be using
ib_umem_num_pages(), so fix the last stragglers.

Move npages tracking to ib_umem_odp as ODP drivers still need it.

Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Adit Ranadive <aditr@vmware.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Tested-by: Gal Pressman <galpress@amazon.com>
Tested-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-08 13:05:24 -03:00
Kamal Heib
ea7a5c706f RDMA/vmw_pvrdma: Fix memory leak on pvrdma_pci_remove
Make sure to free the DSR on pvrdma_pci_remove() to avoid the memory leak.

Fixes: 29c8d9eba5 ("IB: Add vmw_pvrdma driver")
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Acked-by: Adit Ranadive <aditr@vmware.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-08 13:04:46 -03:00
Will Deacon
1b8546d7e2 i40iw: Redefine i40iw_mmiowb() to do nothing
mmiowb() is now implicit in spin_unlock(), so there's no reason to call
it from driver code. Redefine i40iw_mmiowb() to do nothing instead.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
2019-04-08 12:09:15 +01:00
Will Deacon
fb24ea52f7 drivers: Remove explicit invocations of mmiowb()
mmiowb() is now implied by spin_unlock() on architectures that require
it, so there is no reason to call it from driver code. This patch was
generated using coccinelle:

	@mmiowb@
	@@
	- mmiowb();

and invoked as:

$ for d in drivers include/linux/qed sound; do \
spatch --include-headers --sp-file mmiowb.cocci --dir $d --in-place; done

NOTE: mmiowb() has only ever guaranteed ordering in conjunction with
spin_unlock(). However, pairing each mmiowb() removal in this patch with
the corresponding call to spin_unlock() is not at all trivial, so there
is a small chance that this change may regress any drivers incorrectly
relying on mmiowb() to order MMIO writes between CPUs using lock-free
synchronisation. If you've ended up bisecting to this commit, you can
reintroduce the mmiowb() calls using wmb() instead, which should restore
the old behaviour on all architectures other than some esoteric ia64
systems.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
2019-04-08 12:01:02 +01:00
Will Deacon
949b8c7276 drivers: Remove useless trailing comments from mmiowb() invocations
In preparation for using coccinelle to remove all mmiowb() instances
from drivers, remove all trailing comments since they won't be picked up
by spatch later on and will end up being preserved in the code.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
2019-04-08 12:00:56 +01:00
Saeed Mahameed
b6460c72c3 Merge branch 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
This merge commit includes some misc shared code updates from mlx5-next branch needed
for net-next.

1) From Maxim, Remove un-used macros and spinlock from mlx5 code.

2) From Aya, Expose Management PCIE info register layout and add rate limit
print macros.

3) From Tariq, Compilation warning fix in fs_core.c

4) From Vu, Huy and Saeed, Improve mlx5 initialization flow:
The goal is to provide a better logical separation of mlx5 core
device initialization flow and will help to seamlessly support
creating different mlx5 device types such as PF, VF and SF
mlx5 sub-function virtual devices.

Mlx5_core driver needs to separate HCA resources from pci resources.
Its initialize/load/unload will be broken into stages:
1. Initialize common data structures
2. Setup function which initializes pci resources (for PF/VF)
   or some other specific resources for virtual device
3. Initialize software objects according to hardware capabilities
4. Load all mlx5_core components

It is also necessary to detach mlx5_core mdev name/message from pci
device mdev->pdev name/message for a clearer report/debug of
different mlx5 device types.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-04-05 14:10:16 -07:00
Potnuri Bharat Teja
d2c33370ae RDMA/iw_cxgb4: Always disconnect when QP is transitioning to TERMINATE state
On receiving a TERM from tje peer, Host moves the QP to TERMINATE state
and then moves the adapter out of RDMA mode. After issuing a TERM, peer
issues a CLOSE and at this point of time if the connectivity between peer
and host is lost for a significant amount of time, the QP remains in
TERMINATE state.

Therefore c4iw_modify_qp() needs to initiate a close on entering terminate
state.

Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-04 08:29:04 -03:00
Leon Romanovsky
0f51427bd0 RDMA/mlx5: Cleanup WQE page fault handler
Refactor the page fault handler to be more readable and extensible, this
cleanup was triggered by the error reported below. The code structure made
it unclear to the automatic tools to identify that such a flow is not
possible in real life because "requestor != NULL" means that "qp != NULL"
too.

    drivers/infiniband/hw/mlx5/odp.c:1254 mlx5_ib_mr_wqe_pfault_handler()
    error: we previously assumed 'qp' could be null (see line 1230)

Fixes: 08100fad5c ("IB/mlx5: Add ODP SRQ support")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-04 08:28:59 -03:00
Jason Gunthorpe
1c726c4421 Merge HFI1 updates into k.o/for-next
Based on rdma.git for-rc for dependencies.

From Dennis Dalessandro:

====================

Here are some code improvement patches and fixes for less serious bugs to
TID RDMA than we sent for RC.

====================

* HFI1 updates:
  IB/hfi1: Implement CCA for TID RDMA protocol
  IB/hfi1: Remove WARN_ON when freeing expected receive groups
  IB/hfi1: Unify the software PSN check for TID RDMA READ/WRITE
  IB/hfi1: Add a function to read next expected psn from hardware flow
  IB/hfi1: Delay the release of destination mr for TID RDMA WRITE DATA

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-03 15:28:05 -03:00
Kaike Wan
747b931fbe IB/hfi1: Implement CCA for TID RDMA protocol
Currently, FECN handling is not implemented on TID RDMA expected receive
packets and therefore CCA can't be turned on when TID RDMA is
enabled. This patch adds the CCA support to TID RDMA protocol by:

- modifying FECN RSM rule to include kernel receive contexts

- For TID_RDMA READ RESP or TID RDMA ACK packet, a CNP will be sent out if
  the FECN bit is set. For other TID RDMA packets that generate at least
  one response packet, the BECN bit will be set in the first response
  packet

- Copying expected packet data to destination buffer when FECN bit is set
  in the TID RDMA READ RESP or TID RDMA WRITE DATA packet. In this case,
  the expected packet is received as an eager packet

- Handling the TID sequence error for subsequent normal expected packets.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-03 15:27:39 -03:00
Kaike Wan
8da0f0f26f IB/hfi1: Remove WARN_ON when freeing expected receive groups
When PSM user receive context is freed, the expected receive groups
allocated by the receive context will also been freed. However, if there
are still TID entries in use, the receive groups rcd->tid_full_list or
rcd->tid_used_list will not be empty, and thus triggering the WARN_ONs in
the function hfi1_free_ctxt_rcv_groups().  Even if the two lists may not
be empty, the hfi1 driver will free all TID entries and receive groups
associated with the receive context to prevent any resource leakage. Since
a clean user application exit is not controlled by the hfi1 driver, this
patch will remove the WARN_ONs in hfi1_free_ctxt_rcv_groups().

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-03 15:27:30 -03:00
Kaike Wan
b885d5be9c IB/hfi1: Unify the software PSN check for TID RDMA READ/WRITE
For expected packet receiving, the hfi1 hardware checks the KDETH PSN
automatically. However, when sequence error occurs, the hfi1 driver can
check the sequence instead until the hardware flow generation is reloaded.

TID RDMA READ and WRITE protocols implement similar software checking
mechanisms, but with different flags and different local variables to
store next expected PSN.

Unify the handling by using only one set of flag and local variable for
both TID RDMA READ and WRITE protocols.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-03 15:27:30 -03:00
Kaike Wan
6a40693a88 IB/hfi1: Add a function to read next expected psn from hardware flow
This patch adds a function to read next expected KDETH PSN from hardware
flow to simplify the code.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-03 15:27:30 -03:00
Kaike Wan
f6f3f53255 IB/hfi1: Delay the release of destination mr for TID RDMA WRITE DATA
The reference of destination memory region is first obtained when TID RDMA
WRITE request is first received on the responder side. This reference is
released once all TID RDMA WRITE RESP packets are sent to the requester
side, even though not all TID RDMA WRITE DATA packets may have been
received. This early release will especially be undesired if the software
needs to access the destination memory before the last data packet is
received.

This patch delays the release of the MR until all TID RDMA DATA packets
have been received. A helper function to release the reference is also
created to simplify the code.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-03 15:27:30 -03:00
Huy Nguyen
aa8106f137 net/mlx5: Add explicit bar address field
Add bar_addr field to store bar-0 address to avoid calling
pci_resource_start with hard-coded bar-0 as parameter.
Also note that different mlx5 device types will have bar_addr
on different bars.

This patch does not change any functionality.

Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Signed-off-by: Vu Pham <vuhuong@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-04-02 12:49:38 -07:00
Maxim Mikityanskiy
bbf29f618e net/mlx5: Remove spinlock support from mlx5_write64
As there is no user of mlx5_write64 that passes a spinlock to
mlx5_write64, remove this functionality and simplify the function.

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-04-02 12:49:37 -07:00
Leon Romanovsky
6734b29735 RDMA/hns: Fix bad endianess of port_pd variable
port_pd is treated as le32 in declaration and read, fix assignment to be
in le32 too. This change fixes the following compilation warnings.

drivers/infiniband/hw/hns/hns_roce_ah.c:67:24: warning: incorrect type
in assignment (different base types)
drivers/infiniband/hw/hns/hns_roce_ah.c:67:24: expected restricted __le32 [usertype] port_pd
drivers/infiniband/hw/hns/hns_roce_ah.c:67:24: got restricted __be32 [usertype]

Fixes: 9a4435375c ("IB/hns: Add driver files for hns RoCE driver")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Gal Pressman <galpress@amazon.com>
Reviewed-by: Lijun Ou <ouliun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-02 10:40:41 -03:00
Shamir Rabinovitch
ff23dfa134 IB: Pass only ib_udata in function prototypes
Now when ib_udata is passed to all the driver's object create/destroy APIs
the ib_udata will carry the ib_ucontext for every user command. There is
no need to also pass the ib_ucontext via the functions prototypes.

Make ib_udata the only argument psssed.

Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-01 15:00:47 -03:00
Shamir Rabinovitch
bdeacabd1a IB: Remove 'uobject->context' dependency in object destroy APIs
Now that we have the udata passed to all the ib_xxx object destroy APIs
and the additional macro 'rdma_udata_to_drv_context' to get the
ib_ucontext from ib_udata stored in uverbs_attr_bundle, we can finally
start to remove the dependency of the drivers in the
ib_xxx->uobject->context.

Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-01 14:59:35 -03:00
Shamir Rabinovitch
c4367a2635 IB: Pass uverbs_attr_bundle down ib_x destroy path
The uverbs_attr_bundle with the ucontext is sent down to the drivers ib_x
destroy path as ib_udata. The next patch will use the ib_udata to free the
drivers destroy path from the dependency in 'uobject->context' as we
already did for the create path.

Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-01 14:57:35 -03:00
Shamir Rabinovitch
a6a3797df2 IB: Pass uverbs_attr_bundle down uobject destroy path
Pass uverbs_attr_bundle down the uobject destroy path. The next patch will
use this to eliminate the dependecy of the drivers in ib_x->uobject
pointers.

Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-01 14:55:36 -03:00
Matthew Wilcox
059d48fbf6 qib: Convert qib_unit_table to XArray
Also remove qib_devs_list.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-01 13:31:35 -03:00
Matthew Wilcox
03b92789e5 hfi1: Convert hfi1_unit_table to XArray
Also remove hfi1_devs_list.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-01 13:27:35 -03:00
Matthew Wilcox
9fd15987ed qedr: Convert srqidr to XArray
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-29 14:54:51 -03:00
Matthew Wilcox
b6014f9e5f qedr: Convert qpidr to XArray
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-29 14:53:57 -03:00
Matthew Wilcox
0ee3b915b1 hfi1: Convert vesw_idr to XArray
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-29 14:51:50 -03:00
Matthew Wilcox
736b5a70db RDMA/hns: Convert qp_table_tree to XArray
Also fully initialise the qp before storing it in the XArray.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-29 14:51:50 -03:00
Matthew Wilcox
27e19f4510 RDMA/hns: Convert cq_table to XArray
Change the locking to not disable interrupts as the lookup in interrupt
context will not see a freed CQ, thanks to the synchronize_irq() call
before freeing the CQ.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-29 14:40:40 -03:00
Shiraz Saleem
41d34865b2 RDMA/mthca: Use correct sizing on buffers holding page DMA addresses
The buffer that holds the page DMA addresses is sized off umem->nmap.
This can potentially cause out of bound accesses on the PBL array when
iterating the umem DMA-mapped SGL. This is because if umem pages are
combined, umem->nmap can be much lower than the number of system pages
in umem.

Use ib_umem_num_pages() to size this buffer.

Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-28 14:13:27 -03:00
Shiraz Saleem
5f818d676a RDMA/cxbg: Use correct sizing on buffers holding page DMA addresses
The PBL array that hold the page DMA address is sized off umem->nmap.
This can potentially cause out of bound accesses on the PBL array when
iterating the umem DMA-mapped SGL. This is because if umem pages are
combined, umem->nmap can be much lower than the number of system pages
in umem.

Use ib_umem_num_pages() to size this array.

Cc: Potnuri Bharat Teja <bharat@chelsio.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-28 14:13:27 -03:00
Selvin Xavier
5aa8484080 RDMA/bnxt_re: Use correct sizing on buffers holding page DMA addresses
umem->nmap is used while allocating internal buffer for storing
page DMA addresses. This causes out of bounds array access while iterating
the umem DMA-mapped SGL with umem page combining as umem->nmap can be
less than number of system pages in umem.

Use ib_umem_num_pages() instead of umem->nmap to size the page array.
Add a new structure (bnxt_qplib_sg_info) to pass sglist, npages and nmap.

Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-28 14:13:27 -03:00
Bart Van Assche
196b4ce57d IB/qib: Remove a set-but-not-used variable
This patch avoids that a compiler warning is reported when building with
W=1.

Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Fixes: 49c0e2414b ("IB/qib: Change SDMA progression mode depending on single- or multi-rail")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-28 11:03:49 -03:00
Bart Van Assche
920d10e458 IB/hfi1: Fix two format strings
Enable format string checking for hfi1_cdbg() and fix the resulting
compiler warnings.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-28 11:03:49 -03:00
Bart Van Assche
1f687edee2 IB/mlx5: Declare devx_async_cmd_event_fops static
Avoid that sparse complains about a missing declaration.

Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Fixes: 6bf8f22aea ("IB/mlx5: Introduce MLX5_IB_OBJECT_DEVX_ASYNC_CMD_FD")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-28 10:22:48 -03:00
David S. Miller
356d71e00d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2019-03-27 17:37:58 -07:00
Yuval Shaia
6a1096611c RDMA/vmw_pvrdma: Skip zeroing device attrs
Caller already clears props before calling query_device.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Acked-by: Adit Ranadive <aditr@vmware.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-27 15:30:43 -03:00
Artemy Kovalyov
d623dfd283 IB/mlx5: Compare only index part of a memory window rkey
The InfiniBand Architecture Specification section 10.6.7.2.4 TYPE 2 MEMORY
WINDOWS says that if the CI supports the Base Memory Management Extensions
defined in this specification, the R_Key format for a Type 2 Memory Window
must consist of:

* 24 bit index in the most significant bits of the R_Key, which is owned
  by the CI, and
* 8 bit key in the least significant bits of the R_Key, which is owned by
  the Consumer.

This means that the kernel should compare only the index part of a R_Key
to determine equality with another R_Key.

Fixes: db570d7dea ("IB/mlx5: Add ODP support to MW")
Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-27 15:27:56 -03:00
Artemy Kovalyov
1e5887b700 IB/mlx5: WQE dump jumps over first 16 bytes
Move index increment after its is used or otherwise it will start the dump
of the WQE from second WQE BB.

Fixes: 34f4c9554d ("IB/mlx5: Use fragmented QP's buffer for in-kernel users")
Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-27 15:27:33 -03:00
Moni Shoua
1abe186ed8 IB/mlx5: Reset access mask when looping inside page fault handler
If page-fault handler spans multiple MRs then the access mask needs to
be reset before each MR handling or otherwise write access will be
granted to mapped pages instead of read-only.

Cc: <stable@vger.kernel.org> # 3.19
Fixes: 7bdf65d411 ("IB/mlx5: Handle page faults")
Reported-by: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-27 15:27:11 -03:00
Leon Romanovsky
e95e52a178 RDMA/hns: Limit scope of hns_roce_cmq_send()
The forgotten static keyword causes to the following error to appear while
building HNS driver. Declare hns_roce_cmq_send() to be static function to
fix this warning.

drivers/infiniband/hw/hns/hns_roce_hw_v2.c:1089:5: warning: no previous
prototype for _hns_roce_cmq_send_ [-Wmissing-prototypes]
 int hns_roce_cmq_send(struct hns_roce_dev *hr_dev,

Fixes: 6a04aed6af ("RDMA/hns: Fix the chip hanging caused by sending mailbox&CMQ during reset")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-27 15:25:50 -03:00
Kaike Wan
d029434447 IB/hfi1: Fix the allocation of RSM table
The receive side mapping (RSM) on hfi1 hardware is a special
matching mechanism to direct an incoming packet to a given
hardware receive context. It has 4 instances of matching capabilities
(RSM0 - RSM3) that share the same RSM table (RMT). The RMT has a total of
256 entries, each of which points to a receive context.

Currently, three instances of RSM have been used:
1. RSM0 by QOS;
2. RSM1 by PSM FECN;
3. RSM2 by VNIC.

Each RSM instance should reserve enough entries in RMT to function
properly. Since both PSM and VNIC could allocate any receive context
between dd->first_dyn_alloc_ctxt and dd->num_rcv_contexts, PSM FECN must
reserve enough RMT entries to cover the entire receive context index
range (dd->num_rcv_contexts - dd->first_dyn_alloc_ctxt) instead of only
the user receive contexts allocated for PSM
(dd->num_user_contexts). Consequently, the sizing of
dd->num_user_contexts in set_up_context_variables is incorrect.

Fixes: 2280740f01 ("IB/hfi1: Virtual Network Interface Controller (VNIC) HW support")
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-27 14:34:31 -03:00
Kaike Wan
a8639a79e8 IB/hfi1: Eliminate opcode tests on mr deref
When an old ack_queue entry is used to store an incoming request, it may
need to clean up the old entry if it is still referencing the
MR. Originally only RDMA READ request needed to reference MR on the
responder side and therefore the opcode was tested when cleaning up the
old entry. The introduction of tid rdma specific operations in the
ack_queue makes the specific opcode tests wrong.  Multiple opcodes (RDMA
READ, TID RDMA READ, and TID RDMA WRITE) may need MR ref cleanup.

Remove the opcode specific tests associated with the ack_queue.

Fixes: f48ad614c1 ("IB/hfi1: Move driver out of staging")
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-27 14:34:31 -03:00
Kaike Wan
93b289b9af IB/hfi1: Clear the IOWAIT pending bits when QP is put into error state
When a QP is put into error state, it may be waiting for send engine
resources. In this case, the QP will be removed from the send engine's
waiting list, but its IOWAIT pending bits are not cleared. This will
normally not have any major impact as the QP is being destroyed.  However,
the QP still needs to wind down its operations, such as draining the send
queue by scheduling the send engine. Clearing the pending bits will avoid
any potential complications. In addition, if the QP will eventually hang,
clearing the pending bits can help debugging by presenting a consistent
picture if the user dumps the qp_stats.

This patch clears a QP's IOWAIT_PENDING_IB and IO_PENDING_TID bits in
priv->s_iowait.flags in this case.

Fixes: 5da0fc9dbf ("IB/hfi1: Prepare resource waits for dual leg")
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Alex Estrin <alex.estrin@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-27 14:34:31 -03:00
Kaike Wan
662d664666 IB/hfi1: Failed to drain send queue when QP is put into error state
When a QP is put into error state, all pending requests in the send work
queue should be drained. The following sequence of events could lead to a
failure, causing a request to hang:

(1) The QP builds a packet and tries to send through SDMA engine.
    However, PIO engine is still busy. Consequently, this packet is put on
    the QP's tx list and the QP is put on the PIO waiting list. The field
    qp->s_flags is set with HFI1_S_WAIT_PIO_DRAIN;

(2) The QP is put into error state by the user application and
    notify_error_qp() is called, which removes the QP from the PIO waiting
    list and the packet from the QP's tx list. In addition, qp->s_flags is
    cleared of RVT_S_ANY_WAIT_IO bits, which does not include
    HFI1_S_WAIT_PIO_DRAIN bit;

(3) The hfi1_schdule_send() function is called to drain the QP's send
    queue. Subsequently, hfi1_do_send() is called. Since the flag bit
    HFI1_S_WAIT_PIO_DRAIN is set in qp->s_flags, hfi1_send_ok() fails.  As
    a result, hfi1_do_send() bails out without draining any request from
    the send queue;

(4) The PIO engine completes the sending and tries to wake up any QP on
    its waiting list. But the QP has been removed from the PIO waiting
    list and therefore is kept in sleep forever.

The fix is to clear qp->s_flags of HFI1_S_ANY_WAIT_IO bits in step (2).
HFI1_S_ANY_WAIT_IO includes RVT_S_ANY_WAIT_IO and HFI1_S_WAIT_PIO_DRAIN.

Fixes: 2e2ba09e48 ("IB/rdmavt, IB/hfi1: Create device dependent s_flags")
Cc: <stable@vger.kernel.org> # 4.19.x+
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Alex Estrin <alex.estrin@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-27 14:34:31 -03:00
Kangjie Lu
e2a438bd71 RDMA/i40iw: Handle workqueue allocation failure
alloc_ordered_workqueue may fail and return NULL.  The fix captures the
failure and handles it properly to avoid potential NULL pointer
dereferences.

Signed-off-by: Kangjie Lu <kjlu@umn.edu>
Reviewed-by: Shiraz, Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-27 10:19:07 -03:00
Enrico Weigelt, metux IT consult
1a2e158327 drivers: infiniband: Fix whitespace in kconfig
Adjust the kconfig whitespace in bnxt_re/iser to match the kernel
standard.

Signed-off-by: Enrico Weigelt, metux IT consult <info@metux.net>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-26 12:49:33 -03:00
Colin Ian King
a6a9274a7c RDMA/nes: remove redundant check on udata
The non-null check on udata is redundant as this check was performed just
a few statements earlier and the check is always true as udata must be
non-null at this point. Remove redundant the check on udata and the
redundant else part that can never be executed.

Detected by CoverityScan, CID#1477317 ("Logically dead code")

Fixes: 8994445054 ("IB/{hw,sw}: Remove 'uobject->context' dependency in object creation APIs")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-26 12:07:45 -03:00
Matthew Wilcox
f1430536e0 mlx4: Convert pv_id_table to XArray
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-26 10:09:54 -03:00
Matthew Wilcox
b02a29eb88 mlx5: Convert mlx5_srq_table to XArray
Remove the custom spinlock as the XArray handles its own locking.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-26 09:39:42 -03:00
Mike Marciniszyn
270a9833b2 IB/hfi1: Add running average for adaptive pio
The adaptive PIO implementation only considers the current packet size
when deciding between SDMA and pio for a packet.

This causes credit return forces if small and large packets are
interleaved.

Add a running average to avoid costly credit forces so that a large
sequence of small packets is required to go below the threshold that
chooses pio.

Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-26 09:33:21 -03:00
Erez Alfasi
19b1a294b0 RDMA: Use __packed annotation instead of __attribute__ ((packed))
"__attribute__" set of macros has been standardized, have became more
potentially portable and consistent code back in v2.6.21 by commit
82ddcb040 ("[PATCH] extend the set of "__attribute__" shortcut macros").
Moreover, nowadays checkpatch.pl warns about using __attribute__((packed))
instead of __packed.

This patch converts all the "__attribute__ ((packed))" annotations to
"__packed" within the RDMA subsystem.

Signed-off-by: Erez Alfasi <ereza@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-25 21:14:12 -03:00
Lijun Ou
d0a935563b RDMA/hns: Delete unused variable in hns_roce_v2_modify_qp function
The src_mac array is not used in hns_roce_v2_modify_qp function.

Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-25 21:11:30 -03:00
Lijun Ou
82342e493b RDMA/hns: Bugfix for sending with invalidate
According to IB protocol, the send with invalidate operation will not
invalidate mr that was created through a register mr or reregister mr.

Fixes: e93df01085 ("RDMA/hns: Support local invalidate for hip08 in kernel space")
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-25 20:59:56 -03:00
Lijun Ou
07c2339a91 RDMA/hns: Hide error print information with roce vf device
The driver should not print the error information when the hip08 driver
not support virtual function.

Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-25 20:59:55 -03:00
Lijun Ou
5b01b243b0 RDMA/hns: Only assgin some fields if the relatived attr_mask is set
According to IB protocol, some fields of qp context are filled with
optional when the relatived attr_mask are set. The relatived attr_mask
include IB_QP_TIMEOUT, IB_QP_RETRY_CNT, IB_QP_RNR_RETRY and
IB_QP_MIN_RNR_TIMER.  Besides, we move some assignments of the fields of
qp context into the outside of the specific qp state jump function.

Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-25 20:59:55 -03:00
Lijun Ou
834fa8cf6f RDMA/hns: Update the range of raq_psn field of qp context
According to hip08 UM(User Manual), the raq_psn field size is [23:0].

Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-25 20:59:55 -03:00
Lijun Ou
601f3e6d06 RDMA/hns: Only assign the fields of the rq psn if IB_QP_RQ_PSN is set
Only when the IB_QP_RQ_PSN flags of attr_mask is set is it valid to assign
the relatived fields of rq'psn into the qp context when modified qp.

Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-25 20:59:55 -03:00
Lijun Ou
f04cc17878 RDMA/hns: Only assign the relatived fields of psn if IB_QP_SQ_PSN is set
Only when the IB_QP_SQ_PSN flags of attr_mask is set is it valid to assign
the relatived fields of psn into the qp context when modified qp.

Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-25 20:59:55 -03:00
Matthew Wilcox
401b44804c cxgb4: Convert stid_idr to XArray
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-25 20:56:36 -03:00
Matthew Wilcox
9f5a9632e4 cxgb4: Convert atid_idr to XArray
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-25 20:56:36 -03:00
Matthew Wilcox
f254ba6ae5 cxgb4: Convert hwtid_idr to XArray
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-25 15:42:12 -03:00
Matthew Wilcox
7a268a9397 cxgb4: Convert mmidr to XArray
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-25 15:40:37 -03:00
Matthew Wilcox
2f43129127 cxgb4: Convert qpidr to XArray
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-25 15:39:18 -03:00
Matthew Wilcox
52e124c27e cxgb4: Convert cqidr to XArray
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-25 15:38:18 -03:00
Matthew Wilcox
e64a7c02f1 cxgb3: Convert mmidr to XArray
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-25 15:36:29 -03:00
Matthew Wilcox
27114876ce cxgb3: Convert qpidr to XArray
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-25 15:35:08 -03:00
Matthew Wilcox
a2f409713e cxgb3: Convert cqidr to XArray
It would make sense to convert this to an allocating XArray and remove the
kfifo that is currently used to allocate the CQID, but that work is better
done by someone who has the hardware to test with.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-25 15:34:22 -03:00
Paolo Abeni
a350eccee5 net: remove 'fallback' argument from dev->ndo_select_queue()
After the previous patch, all the callers of ndo_select_queue()
provide as a 'fallback' argument netdev_pick_tx.
The only exceptions are nested calls to ndo_select_queue(),
which pass down the 'fallback' available in the current scope
- still netdev_pick_tx.

We can drop such argument and replace fallback() invocation with
netdev_pick_tx(). This avoids an indirect call per xmit packet
in some scenarios (TCP syn, UDP unconnected, XDP generic, pktgen)
with device drivers implementing such ndo. It also clean the code
a bit.

Tested with ixgbe and CONFIG_FCOE=m

With pktgen using queue xmit:
threads		vanilla 	patched
		(kpps)		(kpps)
1		2334		2428
2		4166		4278
4		7895		8100

 v1 -> v2:
 - rebased after helper's name change

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-20 11:18:55 -07:00
Feng Tang
ec4fe4bcc5 i40iw: Avoid panic when handling the inetdev event
There is a panic reported that on a system with x722 ethernet, when doing
the operations like:

	# ip link add br0 type bridge
	# ip link set eno1 master br0
	# systemctl restart systemd-networkd

The system will panic "BUG: unable to handle kernel null pointer
dereference at 0000000000000034", with call chain:

	i40iw_inetaddr_event
	notifier_call_chain
	blocking_notifier_call_chain
	notifier_call_chain
	__inet_del_ifa
	inet_rtm_deladdr
	rtnetlink_rcv_msg
	netlink_rcv_skb
	rtnetlink_rcv
	netlink_unicast
	netlink_sendmsg
	sock_sendmsg
	__sys_sendto

It is caused by "local_ipaddr = ntohl(in->ifa_list->ifa_address)", while
the in->ifa_list is NULL.

So add a check for the "in->ifa_list == NULL" case, and skip the ARP
operation accordingly.

Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-17 21:40:40 -03:00
Aya Levin
cd27287562 IB/mlx5: Fix mapping of link-mode to IB width and speed
Add mapping of link mode: CAUI4 100Gbps CR4/KR4 with 4 lines and 25Gbps.
Fix mapping of link mode: GAUI2 50Gbps CR2/KR2 to be 2 lines with 25Gbps.

Fixes: 08e8676f16 ("IB/mlx5: Add support for 50Gbps per lane link modes")
Signed-off-by: Aya Levin <ayal@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-17 21:40:39 -03:00
Yishai Hadas
c5ae1954c4 IB/mlx5: Use mlx5 core to create/destroy a DEVX DCT
To prevent a hardware memory leak when a DEVX DCT object is destroyed
without calling DRAIN DCT before, (e.g. under cleanup flow), need to
manage its creation and destruction via mlx5 core.

In that case the DRAIN DCT command will be called and only once that it
will be completed the DESTROY DCT command will be called.  Otherwise, the
DESTROY DCT may fail and a hardware leak may occur.

As of that change the DRAIN DCT command should not be exposed any more
from DEVX, it's managed internally by the driver to work as expected by
the device specification.

Fixes: 7efce3691d ("IB/mlx5: Add obj create and destroy functionality")
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-17 21:40:39 -03:00
Jack Morgenstein
587443e777 IB/mlx4: Fix race condition between catas error reset and aliasguid flows
Code review revealed a race condition which could allow the catas error
flow to interrupt the alias guid query post mechanism at random points.
Thiis is fixed by doing cancel_delayed_work_sync() instead of
cancel_delayed_work() during the alias guid mechanism destroy flow.

Fixes: a0c64a17ab ("mlx4: Add alias_guid mechanism")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-17 21:40:39 -03:00
Linus Torvalds
a50243b1dd 5.1 Merge Window Pull Request
This has been a slightly more active cycle than normal with ongoing core
 changes and quite a lot of collected driver updates.
 
 - Various driver fixes for bnxt_re, cxgb4, hns, mlx5, pvrdma, rxe
 
 - A new data transfer mode for HFI1 giving higher performance
 
 - Significant functional and bug fix update to the mlx5 On-Demand-Paging MR
   feature
 
 - A chip hang reset recovery system for hns
 
 - Change mm->pinned_vm to an atomic64
 
 - Update bnxt_re to support a new 57500 chip
 
 - A sane netlink 'rdma link add' method for creating rxe devices and fixing
   the various unregistration race conditions in rxe's unregister flow
 
 - Allow lookup up objects by an ID over netlink
 
 - Various reworking of the core to driver interface:
   * Drivers should not assume umem SGLs are in PAGE_SIZE chunks
   * ucontext is accessed via udata not other means
   * Start to make the core code responsible for object memory
     allocation
   * Drivers should convert struct device to struct ib_device
     via a helper
   * Drivers have more tools to avoid use after unregister problems
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAlyAJYYACgkQOG33FX4g
 mxrWwQ/+OyAx4Moru7Aix0C6GWxTJp/wKgw21CS3reZxgLai6x81xNYG/s2wCNjo
 IccObVd7mvzyqPdxOeyHBsJBbQDqWvoD6O2duH8cqGMgBRgh3CSdUep2zLvPpSAx
 2W1SvWYCLDnCuarboFrCA8c4AN3eCZiqD7z9lHyFQGjy3nTUWzk1uBaOP46uaiMv
 w89N8EMdXJ/iY6ONzihvE05NEYbMA8fuvosKLLNdghRiHIjbMQU8SneY23pvyPDd
 ZziPu9NcO3Hw9OVbkwtJp47U3KCBgvKHmnixyZKkikjiD+HVoABw2IMwcYwyBZwP
 Bic/ddONJUvAxMHpKRnQaW7znAiHARk21nDG28UAI7FWXH/wMXgicMp6LRcNKqKF
 vqXdxHTKJb0QUR4xrYI+eA8ihstss7UUpgSgByuANJ0X729xHiJtlEvPb1DPo1Dz
 9CB4OHOVRl5O8sA5Jc6PSusZiKEpvWoyWbdmw0IiwDF5pe922VLl5Nv88ta+sJ38
 v2Ll5AgYcluk7F3599Uh9D7gwp5hxW2Ph3bNYyg2j3HP4/dKsL9XvIJPXqEthgCr
 3KQS9rOZfI/7URieT+H+Mlf+OWZhXsZilJG7No0fYgIVjgJ00h3SF1/299YIq6Qp
 9W7ZXBfVSwLYA2AEVSvGFeZPUxgBwHrSZ62wya4uFeB1jyoodPk=
 =p12E
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "This has been a slightly more active cycle than normal with ongoing
  core changes and quite a lot of collected driver updates.

   - Various driver fixes for bnxt_re, cxgb4, hns, mlx5, pvrdma, rxe

   - A new data transfer mode for HFI1 giving higher performance

   - Significant functional and bug fix update to the mlx5
     On-Demand-Paging MR feature

   - A chip hang reset recovery system for hns

   - Change mm->pinned_vm to an atomic64

   - Update bnxt_re to support a new 57500 chip

   - A sane netlink 'rdma link add' method for creating rxe devices and
     fixing the various unregistration race conditions in rxe's
     unregister flow

   - Allow lookup up objects by an ID over netlink

   - Various reworking of the core to driver interface:
       - drivers should not assume umem SGLs are in PAGE_SIZE chunks
       - ucontext is accessed via udata not other means
       - start to make the core code responsible for object memory
         allocation
       - drivers should convert struct device to struct ib_device via a
         helper
       - drivers have more tools to avoid use after unregister problems"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (280 commits)
  net/mlx5: ODP support for XRC transport is not enabled by default in FW
  IB/hfi1: Close race condition on user context disable and close
  RDMA/umem: Revert broken 'off by one' fix
  RDMA/umem: minor bug fix in error handling path
  RDMA/hns: Use GFP_ATOMIC in hns_roce_v2_modify_qp
  cxgb4: kfree mhp after the debug print
  IB/rdmavt: Fix concurrency panics in QP post_send and modify to error
  IB/rdmavt: Fix loopback send with invalidate ordering
  IB/iser: Fix dma_nents type definition
  IB/mlx5: Set correct write permissions for implicit ODP MR
  bnxt_re: Clean cq for kernel consumers only
  RDMA/uverbs: Don't do double free of allocated PD
  RDMA: Handle ucontext allocations by IB/core
  RDMA/core: Fix a WARN() message
  bnxt_re: fix the regression due to changes in alloc_pbl
  IB/mlx4: Increase the timeout for CM cache
  IB/core: Abort page fault handler silently during owning process exit
  IB/mlx5: Validate correct PD before prefetch MR
  IB/mlx5: Protect against prefetch of invalid MR
  RDMA/uverbs: Store PR pointer before it is overwritten
  ...
2019-03-09 15:53:03 -08:00
Michael J. Ruhl
bc5add0976 IB/hfi1: Close race condition on user context disable and close
When disabling and removing a receive context, it is possible for an
asynchronous event (i.e IRQ) to occur.  Because of this, there is a race
between cleaning up the context, and the context being used by the
asynchronous event.

cpu 0  (context cleanup)
    rc->ref_count-- (ref_count == 0)
    hfi1_rcd_free()
cpu 1  (IRQ (with rcd index))
	rcd_get_by_index()
	lock
	ref_count+++     <-- reference count race (WARNING)
	return rcd
	unlock
cpu 0
    hfi1_free_ctxtdata() <-- incorrect free location
    lock
    remove rcd from array
    unlock
    free rcd

This race will cause the following WARNING trace:

WARNING: CPU: 0 PID: 175027 at include/linux/kref.h:52 hfi1_rcd_get_by_index+0x84/0xa0 [hfi1]
CPU: 0 PID: 175027 Comm: IMB-MPI1 Kdump: loaded Tainted: G OE ------------ 3.10.0-957.el7.x86_64 #1
Hardware name: Intel Corporation S2600KP/S2600KP, BIOS SE5C610.86B.11.01.0076.C4.111920150602 11/19/2015
Call Trace:
  dump_stack+0x19/0x1b
  __warn+0xd8/0x100
  warn_slowpath_null+0x1d/0x20
  hfi1_rcd_get_by_index+0x84/0xa0 [hfi1]
  is_rcv_urgent_int+0x24/0x90 [hfi1]
  general_interrupt+0x1b6/0x210 [hfi1]
  __handle_irq_event_percpu+0x44/0x1c0
  handle_irq_event_percpu+0x32/0x80
  handle_irq_event+0x3c/0x60
  handle_edge_irq+0x7f/0x150
  handle_irq+0xe4/0x1a0
  do_IRQ+0x4d/0xf0
  common_interrupt+0x162/0x162

The race can also lead to a use after free which could be similar to:

general protection fault: 0000 1 SMP
CPU: 71 PID: 177147 Comm: IMB-MPI1 Kdump: loaded Tainted: G W OE ------------ 3.10.0-957.el7.x86_64 #1
Hardware name: Intel Corporation S2600KP/S2600KP, BIOS SE5C610.86B.11.01.0076.C4.111920150602 11/19/2015
task: ffff9962a8098000 ti: ffff99717a508000 task.ti: ffff99717a508000 __kmalloc+0x94/0x230
Call Trace:
  ? hfi1_user_sdma_process_request+0x9c8/0x1250 [hfi1]
  hfi1_user_sdma_process_request+0x9c8/0x1250 [hfi1]
  hfi1_aio_write+0xba/0x110 [hfi1]
  do_sync_readv_writev+0x7b/0xd0
  do_readv_writev+0xce/0x260
  ? handle_mm_fault+0x39d/0x9b0
  ? pick_next_task_fair+0x5f/0x1b0
  ? sched_clock_cpu+0x85/0xc0
  ? __schedule+0x13a/0x890
  vfs_writev+0x35/0x60
  SyS_writev+0x7f/0x110
  system_call_fastpath+0x22/0x27

Use the appropriate kref API to verify access.

Reorder context cleanup to ensure context removal before cleanup occurs
correctly.

Cc: stable@vger.kernel.org # v4.14.0+
Fixes: f683c80ca6 ("IB/hfi1: Resolve kernel panics by reference counting receive contexts")
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-06 14:47:09 -04:00
Anshuman Khandual
98fa15f34c mm: replace all open encodings for NUMA_NO_NODE
Patch series "Replace all open encodings for NUMA_NO_NODE", v3.

All these places for replacement were found by running the following
grep patterns on the entire kernel code.  Please let me know if this
might have missed some instances.  This might also have replaced some
false positives.  I will appreciate suggestions, inputs and review.

1. git grep "nid == -1"
2. git grep "node == -1"
3. git grep "nid = -1"
4. git grep "node = -1"

This patch (of 2):

At present there are multiple places where invalid node number is
encoded as -1.  Even though implicitly understood it is always better to
have macros in there.  Replace these open encodings for an invalid node
number with the global macro NUMA_NO_NODE.  This helps remove NUMA
related assumptions like 'invalid node' from various places redirecting
them to a common definition.

Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	[ixgbe]
Acked-by: Jens Axboe <axboe@kernel.dk>			[mtip32xx]
Acked-by: Vinod Koul <vkoul@kernel.org>			[dmaengine.c]
Acked-by: Michael Ellerman <mpe@ellerman.id.au>		[powerpc]
Acked-by: Doug Ledford <dledford@redhat.com>		[drivers/infiniband]
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Hans Verkuil <hverkuil@xs4all.nl>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:14 -08:00
YueHaibing
4e69cf1fe2 RDMA/hns: Use GFP_ATOMIC in hns_roce_v2_modify_qp
The the below commit, hns_roce_v2_modify_qp is called inside spinlock
while using GFP_KERNEL. Change it to GFP_ATOMIC.

Fixes: 0425e3e6e0 ("RDMA/hns: Support flush cqe for hip08 in kernel space")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-04 16:41:30 -04:00
Shaobo He
952a3cc9c0 cxgb4: kfree mhp after the debug print
In function `c4iw_dealloc_mw`, variable mhp's value is printed after
freed, it is clearer to have the print before the kfree.

Otherwise racing threads could allocate another mhp with the same pointer
value and create confusing tracing.

Signed-off-by: Shaobo He <shaobo@cs.utah.edu>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-04 16:41:30 -04:00
Moni Shoua
7095ec3ca0 IB/mlx5: Set correct write permissions for implicit ODP MR
The write access of an implicit MR is inherited to all of its children.
Therefore we must set the correct write access to the parent MR.

Pass full access_flags when creating umem to let it calculate write access
correctly.

Fixes: da6a496a34 ("IB/mlx5: Ranges in implicit ODP MR inherit its write access")
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-04 10:57:19 -04:00
Devesh Sharma
0fca467e81 bnxt_re: Clean cq for kernel consumers only
Kernel space provider driver should clean the CQs belonging to kernel
space consumers only. The current implementation is doing reverse of it.

Fixing the same by avoiding the call to __clean_cq on a kernel qp during
destroy.

Fixes: c50866e285 ("bnxt_re: fix the regression due to changes in alloc_pbl")
Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-04 10:51:14 -04:00
Jakub Kicinski
f4b6bcc700 net: devlink: turn devlink into a built-in
Being able to build devlink as a module causes growing pains.
First all drivers had to add a meta dependency to make sure
they are not built in when devlink is built as a module.  Now
we are struggling to invoke ethtool compat code reliably.

Make devlink code built-in, users can still not build it at
all but the dynamically loadable module option is removed.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-26 08:49:05 -08:00
David S. Miller
70f3522614 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Three conflicts, one of which, for marvell10g.c is non-trivial and
requires some follow-up from Heiner or someone else.

The issue is that Heiner converted the marvell10g driver over to
use the generic c45 code as much as possible.

However, in 'net' a bug fix appeared which makes sure that a new
local mask (MDIO_AN_10GBT_CTRL_ADV_NBT_MASK) with value 0x01e0
is cleared.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-24 12:06:19 -08:00
Leon Romanovsky
a2a074ef39 RDMA: Handle ucontext allocations by IB/core
Following the PD conversion patch, do the same for ucontext allocations.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-22 14:11:37 -07:00
Devesh Sharma
c50866e285 bnxt_re: fix the regression due to changes in alloc_pbl
While adding the use of for_each_sg_dma_page iterator for Brodcom's rdma
driver, there was a regression added in the __alloc_pbl path. The change
left bnxt_re in DOA state in for-next branch.

Fixing the regression to avoid the host crash when a user space object is
created. Restricting the unconditional access to hwq.pg_arr when hwq is
initialized for user space objects.

Fixes: 161ebe2498 ("RDMA/bnxt_re: Use for_each_sg_dma_page iterator on umem SGL")
Reported-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-22 11:17:22 -07:00
Håkon Bugge
2612d723aa IB/mlx4: Increase the timeout for CM cache
Using CX-3 virtual functions, either from a bare-metal machine or
pass-through from a VM, MAD packets are proxied through the PF driver.

Since the VF drivers have separate name spaces for MAD Transaction Ids
(TIDs), the PF driver has to re-map the TIDs and keep the book keeping
in a cache.

Following the RDMA Connection Manager (CM) protocol, it is clear when
an entry has to evicted form the cache. But life is not perfect,
remote peers may die or be rebooted. Hence, it's a timeout to wipe out
a cache entry, when the PF driver assumes the remote peer has gone.

During workloads where a high number of QPs are destroyed concurrently,
excessive amount of CM DREQ retries has been observed

The problem can be demonstrated in a bare-metal environment, where two
nodes have instantiated 8 VFs each. This using dual ported HCAs, so we
have 16 vPorts per physical server.

64 processes are associated with each vPort and creates and destroys
one QP for each of the remote 64 processes. That is, 1024 QPs per
vPort, all in all 16K QPs. The QPs are created/destroyed using the
CM.

When tearing down these 16K QPs, excessive CM DREQ retries (and
duplicates) are observed. With some cat/paste/awk wizardry on the
infiniband_cm sysfs, we observe as sum of the 16 vPorts on one of the
nodes:

cm_rx_duplicates:
      dreq  2102
cm_rx_msgs:
      drep  1989
      dreq  6195
       rep  3968
       req  4224
       rtu  4224
cm_tx_msgs:
      drep  4093
      dreq 27568
       rep  4224
       req  3968
       rtu  3968
cm_tx_retries:
      dreq 23469

Note that the active/passive side is equally distributed between the
two nodes.

Enabling pr_debug in cm.c gives tons of:

[171778.814239] <mlx4_ib> mlx4_ib_multiplex_cm_handler: id{slave:
1,sl_cm_id: 0xd393089f} is NULL!

By increasing the CM_CLEANUP_CACHE_TIMEOUT from 5 to 30 seconds, the
tear-down phase of the application is reduced from approximately 90 to
50 seconds. Retries/duplicates are also significantly reduced:

cm_rx_duplicates:
      dreq  2460
[]
cm_tx_retries:
      dreq  3010
       req    47

Increasing the timeout further didn't help, as these duplicates and
retries stems from a too short CMA timeout, which was 20 (~4 seconds)
on the systems. By increasing the CMA timeout to 22 (~17 seconds), the
numbers fell down to about 10 for both of them.

Adjustment of the CMA timeout is not part of this commit.

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Acked-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-22 09:35:03 -07:00
Moni Shoua
81dd4c4be3 IB/mlx5: Validate correct PD before prefetch MR
When prefetching odp mr it is required to verify that pd of the mr is
identical to the pd for which the advise_mr request arrived with.

This check was missing from synchronous flow and is added now.

Fixes: 813e90b1ae ("IB/mlx5: Add advise_mr() support")
Reported-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-21 16:32:45 -07:00
Moni Shoua
a6bc3875f1 IB/mlx5: Protect against prefetch of invalid MR
When deferring a prefetch request we need to protect against MR or PD
being destroyed while the request is still enqueued.

The first step is to validate that PD owns the lkey that describes the MR
and that the MR that the lkey refers to is owned by that PD.

The second step is to dequeue all requests when MR is destroyed.

Since PD can't be destroyed while it owns MRs it is guaranteed that when a
worker wakes up the request it refers to is still valid.

Now, it is possible to refrain from taking a reference on the device since
it is assured to be present as pd.

While that, replace the dedicated ordered workqueue with the system
unbound workqueue to reuse an existing resource and improve
performance. This will also fix a bug of queueing to the wrong workqueue.

Fixes: 813e90b1ae ("IB/mlx5: Add advise_mr() support")
Reported-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-21 16:32:45 -07:00
Gustavo A. R. Silva
7264235ee7 IB/hfi1: Add missing break in switch statement
Fix the following warning by adding a missing break:

drivers/infiniband/hw/hfi1/tid_rdma.c: In function ‘hfi1_tid_rdma_wqe_interlock’:
drivers/infiniband/hw/hfi1/tid_rdma.c:3251:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
   switch (prev->wr.opcode) {
   ^~~~~~
drivers/infiniband/hw/hfi1/tid_rdma.c:3259:2: note: here
  case IB_WR_RDMA_READ:
  ^~~~

Warning level 3 was used: -Wimplicit-fallthrough=3

This patch is part of the ongoing efforts to enable
-Wimplicit-fallthrough.

Fixes: c6c231175c ("IB/hfi1: Add interlock between TID RDMA WRITE and other requests")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Kaike Wan <Kaike.wan@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-21 14:08:12 -07:00
Jason Gunthorpe
815f748037 Merge branch 'mlx5-next' into rdma.git for-next
From
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux

To resolve conflicts with net-next and pick up the first patch.

* branch 'mlx5-next':
  net/mlx5: Factor out HCA capabilities functions
  IB/mlx5: Add support for 50Gbps per lane link modes
  net/mlx5: Add support to ext_* fields introduced in Port Type and Speed register
  net/mlx5: Add new fields to Port Type and Speed register
  net/mlx5: Refactor queries to speed fields in Port Type and Speed register
  net/mlx5: E-Switch, Avoid magic numbers when initializing offloads mode
  net/mlx5: Relocate vport macros to the vport header file
  net/mlx5: E-Switch, Normalize the name of uplink vport number
  net/mlx5: Provide an alternative VF upper bound for ECPF
  net/mlx5: Add host params change event
  net/mlx5: Add query host params command
  net/mlx5: Update enable HCA dependency
  net/mlx5: Introduce Mellanox SmartNIC and modify page management logic
  IB/mlx5: Use unified register/load function for uplink and VF vports
  net/mlx5: Use consistent vport num argument type
  net/mlx5: Use void pointer as the type in address_of macro
  net/mlx5: Align ODP capability function with netdev coding style
  mlx5: use RCU lock in mlx5_eq_cq_get()

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-21 12:40:18 -07:00
Davidlohr Bueso
ec95e0fa21 drivers/IB,qib: Fix pinned/locked limit check in qib_get_user_pages()
The current check does not take into account the previous value of
pinned_vm; thus it is quite bogus as is. Fix this by checking the
new value after the (optimistic) atomic inc.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-20 14:42:41 -07:00
Wei Yongjun
3b8f8b95d9 iw_cxgb4: Make function read_tcb() static
Fixes the following sparse warning:

drivers/infiniband/hw/cxgb4/cm.c:658:6: warning:
 symbol 'read_tcb' was not declared. Should it be static?

Fixes: 11a27e2121 ("iw_cxgb4: complete the cached SRQ buffers")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Acked-by: Raju Rangoju <rajur@chelsio.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-19 20:52:19 -07:00
Yangyang Li
6ac16e4039 RDMA/hns: Bugfix for set hem of SCC
The method of set hem for scc context is different from other contexts. It
should notify the hardware with the detailed idx in bt0 for scc, while for
other contexts, it only need to notify the bt step and the hardware will
calculate the idx.

Here fixes the following error when unloading the hip08 driver:

[  123.570768] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[  123.579023] {1}[Hardware Error]: event severity: recoverable
[  123.584670] {1}[Hardware Error]:  Error 0, type: recoverable
[  123.590317] {1}[Hardware Error]:   section_type: PCIe error
[  123.595877] {1}[Hardware Error]:   version: 4.0
[  123.600395] {1}[Hardware Error]:   command: 0x0006, status: 0x0010
[  123.606562] {1}[Hardware Error]:   device_id: 0000:7d:00.0
[  123.612034] {1}[Hardware Error]:   slot: 0
[  123.616120] {1}[Hardware Error]:   secondary_bus: 0x00
[  123.621245] {1}[Hardware Error]:   vendor_id: 0x19e5, device_id: 0xa222
[  123.627847] {1}[Hardware Error]:   class_code: 000002
[  123.632977] hns3 0000:7d:00.0: aer_status: 0x00000000, aer_mask: 0x00000000
[  123.639928] hns3 0000:7d:00.0: aer_layer=Transaction Layer, aer_agent=Receiver ID
[  123.647400] hns3 0000:7d:00.0: aer_uncor_severity: 0x00000000
[  123.653136] hns3 0000:7d:00.0: PCI error detected, state(=1)!!
[  123.658959] hns3 0000:7d:00.0: ROCEE uncorrected RAS error identified
[  123.665395] hns3 0000:7d:00.0: ROCEE RAS AXI rresp error
[  123.670713] hns3 0000:7d:00.0: requesting reset due to PCI error
[  123.676715] hns3 0000:7d:00.0: received reset event , reset type is 5
[  123.683147] hns3 0000:7d:00.0: AER: Device recovery successful
[  123.688978] hns3 0000:7d:00.0: PF Reset requested
[  123.693684] hns3 0000:7d:00.0: PF failed(=-5) to send mailbox message to VF
[  123.700633] hns3 0000:7d:00.0: inform reset to vf(1) failded -5!

Fixes: 6a157f7d1b ("RDMA/hns: Add SCC context allocation support for hip08")
Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
Reviewed-by: Yixian Liu <liuyixian@huawei.com>
Reviewed-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-19 20:52:19 -07:00
Lijun Ou
3e394f9413 RDMA/hns: Modify qp&cq&pd specification according to UM
Accroding to hip08's limitation, qp&cq specification is 1M, mtpt
specification 1M in kernel space.

Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-19 20:52:19 -07:00
Parvi Kaustubhi
5bb3c1e9d4 IB/usnic: Fix deadlock
There is a dead lock in usnic ib_register and netdev_notify path.

	usnic_ib_discover_pf()
	| mutex_lock(&usnic_ib_ibdev_list_lock);
	 | usnic_ib_device_add();
	  | ib_register_device()
	   | usnic_ib_query_port()
	    | mutex_lock(&us_ibdev->usdev_lock);
	     | ib_get_eth_speed()
	      | rtnl_lock()

order of lock: &usnic_ib_ibdev_list_lock -> usdev_lock -> rtnl_lock

	rtnl_lock()
	 | usnic_ib_netdevice_event()
	  | mutex_lock(&usnic_ib_ibdev_list_lock);

order of lock: rtnl_lock -> &usnic_ib_ibdev_list_lock

Solution is to use the core's lock-free ib_device_get_by_netdev() scheme
to lookup ib_dev while handling netdev & inet events.

Signed-off-by: Parvi Kaustubhi <pkaustub@cisco.com>
Reviewed-by: Govindarajulu Varadarajan <gvaradar@cisco.com>
Reviewed-by: Tanmay Inamdar <tinamdar@cisco.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-19 20:52:19 -07:00
Leon Romanovsky
cfe876d8e6 RDMA/cxgb4: Remove kref accounting for sync operation
Ucontext allocation and release aren't async events and don't need kref
accounting. The common layer of RDMA subsystem ensures that dealloc
ucontext will be called after all other objects are released.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Raju Rangoju <rajur@chelsio.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-15 21:39:15 -07:00
Leon Romanovsky
be56b07b4f RDMA/nes: Remove useless usecnt variable and redundant memset
The internal design of RDMA/core ensures that there dealloc ucontext will
be called only if alloc_ucontext succeeded, hence there is no need to
manage internal variable to mark validity of ucontext.

As part of this change, remove redundant memeset too.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-15 21:39:15 -07:00
Bodong Wang
f8e8fa0262 net/mlx5: E-Switch, Centralize repersentor reg/unreg to eswitch driver
Eswitch has two users: IB and ETH. They both register repersentors
when mlx5 interface is added, and unregister the repersentors when
mlx5 interface is removed. Ideally, each driver should only deal with
the entities which are unique to itself. However, current IB and ETH
drivers have to perform the following eswitch operations:

1. When registering, specify how many vports to register. This number
   is the same for both drivers which is the total available vport
   numbers.
2. When unregistering, specify the number of registered vports to do
   unregister. Also, unload the repersentors which are already loaded.

It's unnecessary for eswitch driver to hands out the control of above
operations to individual driver users, as they're not unique to each
driver. Instead, such operations should be centralized to eswitch
driver. This consolidates eswitch control flow, and simplified IB and
ETH driver.

This patch doesn't change any functionality.

Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-02-15 17:25:58 -08:00
Saeed Mahameed
259fae5a2c Merge branch 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
Merge mlx5-next shared branched into net-next,

From Bodong Wang:
1) Introduction of ECPF (Embedded CPU Physical Function), and low level
bits for mlx5 SmartNic capabilities support.
2) Vport enumeration refactoring that affect mlx5_ib and mlx5_core

From Aya Levin,
3) Add support for 50Gbps per lane link modes in the Port Type and Speed
register (PTYS)
4) Refactor low level query functions for PTYS register
5) Add support for 50Gbps per lane link modes to mlx5_ib

Note: due to a change in API in mlx5/core and a later patch from net-next,
a fixup was squashed with this merge commit that replaces FDB_UPLINK_VPORT
with MLX5_VPORT_UPLINK which exists only in upstream net-next.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-02-15 16:45:31 -08:00
Kaike Wan
e50838c27f IB/hfi1: Fix a build warning for TID RDMA READ
The following build warning was produced for the TID RDMA READ
patch ("IB/hfi1: Enable TID RDMA READ protocol"):

drivers/infiniband/hw/hfi1/qp.c: In function 'hfi1_setup_wqe':
drivers/infiniband/hw/hfi1/qp.c:328:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
   hfi1_setup_tid_rdma_wqe(qp, wqe);
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/infiniband/hw/hfi1/qp.c:329:2: note: here
  case IB_QPT_UC:
  ^~~~

This patch will fix the issue by adding the "fall through" comment.

Fixes: f1ab4efa6d ("IB/hfi1: Enable TID RDMA READ protocol")
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-15 15:55:50 -07:00
Shamir Rabinovitch
8994445054 IB/{hw,sw}: Remove 'uobject->context' dependency in object creation APIs
Now when we have the udata passed to all the ib_xxx object creation APIs
and the additional macro 'rdma_udata_to_drv_context' to get the
ib_ucontext from ib_udata stored in uverbs_attr_bundle, we can finally
start to remove the dependency of the drivers in the
ib_xxx->uobject->context.

Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-15 15:38:38 -07:00
Raju Rangoju
f09ef134a7 iw_cxgb4: cq/qp mask depends on bar2 pages in a host page
Adjust the cq/qp mask based on the number of bar2 pages in a host page.

For user-mode rdma, the granularity of the BAR2 memory mapped to a user
rdma process during queue allocation must be based on the host page
size. The lld attributes udb_density and ucq_density are used to figure
out how many sge contexts are in a bar2 page. So the rdev->qpmask and
rdev->cqmask in iw_cxgb4 need to now be adjusted based on how many sge
bar2 pages are in a host page.

Otherwise the device fails to work on non 4k page size systems.

Fixes: 2391b0030e ("cxgb4: Remove SGE_HOST_PAGE_SIZE dependency on page size")
Signed-off-by: Raju Rangoju <rajur@chelsio.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-15 09:39:39 -07:00
Lijun Ou
dad1f9802e RDMA/hns: Configure capacity of hns device
This patch adds new device capability for IB_DEVICE_MEM_MGT_EXTENSIONS to
indicate device support for the following features:

1. Fast register memory region.
2. send with remote invalidate by frmr
3. local invalidate memory regsion

As well as adds the max depth of frmr page list len.

Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-14 13:20:19 -07:00
Yixian Liu
e95c716c7f RDMA/hns: Delete useful prints for aeq subtype event
Current all messages printed for aeq subtype event are wrong.  Thus,
delete them and only the value of subtype event is printed.

Signed-off-by: Yixian Liu <liuyixian@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-14 13:20:19 -07:00
Yixian Liu
f7f27a5f03 RDMA/hns: Set allocated memory to zero for wrid
The memory allocated for wrid should be initialized to zero.

Signed-off-by: Yixian Liu <liuyixian@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-14 13:20:19 -07:00
Yixian Liu
ab22bf0521 RDMA/hns: Fix the state of rereg mr
The state of mr after reregister operation should be set to valid
state. Otherwise, it will keep the same as the state before reregistered.

Signed-off-by: Yixian Liu <liuyixian@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-14 13:20:18 -07:00
chenglang
704e0e613a RDMA/hns: Limit minimum ROCE CQ depth to 64
This patch modifies the minimum CQ depth specification of hip08 and is
consistent with the processing of hip06.

Signed-off-by: chenglang <chenglang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-14 13:20:18 -07:00
Aya Levin
08e8676f16 IB/mlx5: Add support for 50Gbps per lane link modes
Driver now supports new link modes: 50Gbps per lane support for
50G/100G/200G. This patch reads the correct field (legacy vs. extended)
based on a FW indication bit, and adds a translation function (link
modes to IB width and speed) to the new link modes.

Signed-off-by: Aya Levin <ayal@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-02-14 12:14:42 -08:00
Aya Levin
a08b4ed137 net/mlx5: Add support to ext_* fields introduced in Port Type and Speed register
This patch exposes new link modes (including 50Gbps per lane), and ext_*
fields which describes the new link modes in Port Type and Speed
register (PTYS).
Access functions, translation functions (speed <-> HW bits) and
link max speed function were modified.

Signed-off-by: Aya Levin <ayal@mellanox.com>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-02-14 12:14:42 -08:00
Aya Levin
bc4e12ffef net/mlx5: Refactor queries to speed fields in Port Type and Speed register
This patch fascicles queries to speed related fields in Port Type and
Speed register (PTYS) into a single API. I addition, this patch
refactors functions which serves only Ethernet driver: remove the
protocol type as an input parameter, move code from 'core' directory
into 'en' directory and add 'eth' prefix to the function's name. The
patch also encapsulates functions that are not used outside the Ethernet
driver removes redundant include files.

Signed-off-by: Aya Levin <ayal@mellanox.com>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-02-14 12:14:42 -08:00
Bodong Wang
b05af6aacd net/mlx5: E-Switch, Normalize the name of uplink vport number
Driver used to name uplink vport as FDB_UPLINK_VPORT, it's hard to
comply with the same naming convention along with the introduction of
other vports. Use MLX5_VPORT as the prefix for such vports and
relocate the uplink vport definition to public header file for the
benefits of both net and IB drivers.

This patch doesn't change any functionality.

Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-02-14 12:14:42 -08:00
Bodong Wang
f0666f1f22 IB/mlx5: Use unified register/load function for uplink and VF vports
IB driver maintains different registration and load function calls
for uplink and VF vports. This is not necessary as they only differ
with each other on their profiles.

This patch doesn't change any functionality.

Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-02-14 12:14:41 -08:00
Shiraz, Saleem
52a572e9f7 RDMA/nes: Use for_each_sg_dma_page iterator for umem SGL
Use the for_each_sg_dma_page iterator variant to walk the umem DMA-mapped
SGL and get the page DMA address. This avoids the extra loop to iterate
pages in the SGE when for_each_sg iterator is used.

Additionally, purge umem->page_shift usage in the driver as its only
relevant for ODP MRs. Use system page size and shift instead.

Signed-off-by: Shiraz, Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-13 09:00:43 -07:00
Colin Ian King
e8ac9389f0 RDMA: Fix allocation failure on pointer pd
The null check on an allocation failure on pd is currently checking
if pd is non-null rather than null. Fix this by adding the missing !
operator.

Fixes: 21a428a019 ("RDMA: Handle PD allocations by IB/core")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-13 09:00:43 -07:00
Doug Ledford
d892273bb5 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma into for-next
I had merged the hfi1-tid code into my local copy of for-next, but was
waiting on 0day testing before pushing it (I pushed it to my wip
branch).  Having waited several days for 0day testing to show up, I'm
finally just going to push it out.  In the meantime, though, Jason
pushed other stuff to for-next, so I needed to merge up the branches
before pushing.

Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-13 09:35:39 -05:00
Colin Ian King
a87145957e RDMA/bnxt_re: fix or'ing of data into an uninitialized struct member
The struct member comp_mask has not been initialized however a bit
pattern is being bitwise or'd into the member and hence other bit
fields in comp_mask may contain any garbage from the stack. Fix this
by making the bitwise or into an assignment.

Fixes: 95b86d1c91 ("RDMA/bnxt_re: Update kernel user abi to pass chip context")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-11 15:34:54 -07:00
Mark Bloch
fc9e4477f9 RDMA/mlx5: Fix memory leak in case we fail to add an IB device
Make sure the IB device is freed on failure.

Fixes: b5ca15ad7e ("IB/mlx5: Add proper representors support")
Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-11 15:32:29 -07:00
Yishai Hadas
0da4d48d99 IB/mlx5: Fix bad flow upon DEVX mkey creation
Fix bad flow upon DEVX mkey creation to prevent deleting the indirect mkey
from the radix tree in case there was a previous failure to insert it.

Fixes: 534fd7aac5 ("IB/mlx5: Manage indirection mkey upon DEVX flow for ODP")
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-11 15:30:40 -07:00
Shiraz, Saleem
be8c456abf RDMA/ocrdma: Use for_each_sg_dma_page iterator on umem SGL
Use the for_each_sg_dma_page iterator variant to walk the umem DMA-mapped
SGL and get the page DMA address. This avoids the extra loop to iterate
pages in the SGE when for_each_sg iterator is used.

Additionally, purge umem->page_shift usage in the driver as its only
relevant for ODP MRs. Use system page size and shift instead.

Signed-off-by: Shiraz, Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-11 15:24:55 -07:00
Shiraz, Saleem
95ad233ffb RDMA/qedr: Use for_each_sg_dma_page iterator on umem SGL
Use the for_each_sg_dma_page iterator variant to walk the umem DMA-mapped
SGL and get the page DMA address. This avoids the extra loop to iterate
pages in the SGE when for_each_sg iterator is used.

Additionally, purge umem->page_shift usage in the driver as its only
relevant for ODP MRs. Use system page size and shift instead.

Signed-off-by: Shiraz, Saleem <shiraz.saleem@intel.com>
Acked-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-11 15:24:55 -07:00
Shiraz, Saleem
f3e6d31179 RDMA/vmw_pvrdma: Use for_each_sg_dma_page iterator on umem SGL
Use the for_each_sg_dma_page iterator variant to walk the umem DMA-mapped
SGL and get the page DMA address. This avoids the extra loop to iterate
pages in the SGE when for_each_sg iterator is used.

Additionally, purge umem->page_shift usage in the driver as its only
relevant for ODP MRs. Use system page size and shift instead.

Signed-off-by: Shiraz, Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-11 15:24:55 -07:00
Shiraz, Saleem
b44e47eb06 RDMA/cxgb3: Use for_each_sg_dma_page iterator on umem SGL
Use the for_each_sg_dma_page iterator variant to walk the umem DMA-mapped
SGL and get the page DMA address. This avoids the extra loop to iterate
pages in the SGE when for_each_sg iterator is used.

Additionally, purge umem->page_shift usage in the driver as its only
relevant for ODP MRs. Use system page size and shift instead.

Signed-off-by: Shiraz, Saleem <shiraz.saleem@intel.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-11 15:24:55 -07:00
Shiraz, Saleem
48b586ac36 RDMA/cxgb4: Use for_each_sg_dma_page iterator on umem SGL
Use the for_each_sg_dma_page iterator variant to walk the umem DMA-mapped
SGL and get the page DMA address. This avoids the extra loop to iterate
pages in the SGE when for_each_sg iterator is used.

Additionally, purge umem->page_shift usage in the driver as its only
relevant for ODP MRs. Use system page size and shift instead.

Signed-off-by: Shiraz, Saleem <shiraz.saleem@intel.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-11 15:24:55 -07:00
Shiraz, Saleem
3856ec5527 RDMA/hns: Use for_each_sg_dma_page iterator on umem SGL
Use the for_each_sg_dma_page iterator variant to walk the umem DMA-mapped
SGL and get the page DMA address. This avoids the extra loop to iterate
pages in the SGE when for_each_sg iterator is used.

Additionally, purge umem->page_shift usage in the driver as its only
relevant for ODP MRs. Use system page size and shift instead.

Signed-off-by: Shiraz, Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-11 15:02:33 -07:00
Shiraz, Saleem
43fae91276 RDMA/i40iw: Use for_each_sg_dma_page iterator on umem SGL
Use the for_each_sg_dma_page iterator variant to walk the umem DMA-mapped
SGL and get the page DMA address. This avoids the extra loop to iterate
pages in the SGE when for_each_sg iterator is used.

Additionally, purge umem->page_shift usage in the driver as its only
relevant for ODP MRs. Use system page size and shift instead.

Signed-off-by: Shiraz, Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-11 15:02:33 -07:00
Shiraz, Saleem
8d249af3e6 RDMA/mthca: Use for_each_sg_dma_page iterator on umem SGL
Use the for_each_sg_dma_page iterator variant to walk the umem DMA-mapped
SGL and get the page DMA address. This avoids the extra loop to iterate
pages in the SGE when for_each_sg iterator is used.

Additionally, purge umem->page_shift usage in the driver as its only
relevant for ODP MRs. Use system page size and shift instead.

Signed-off-by: Shiraz, Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-11 15:02:33 -07:00
Shiraz, Saleem
161ebe2498 RDMA/bnxt_re: Use for_each_sg_dma_page iterator on umem SGL
Use the for_each_sg_dma_page iterator variant to walk the umem DMA-mapped
SGL and get the page DMA address. This avoids the extra loop to iterate
pages in the SGE when for_each_sg iterator is used.

Additionally, purge umem->page_shift usage in the driver as its only
relevant for ODP MRs. Use system page size and shift instead.

Signed-off-by: Shiraz, Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-11 15:02:33 -07:00
Doug Ledford
82771f2033 Merge branch 'wip/dl-for-next' into for-next
Due to concurrent work by myself and Jason, a normal fast forward merge
was not possible.  This brings in a number of hfi1 changes, mainly the
hfi1 TID RDMA support (roughly 10,000 LOC change), which was reviewed
and integrated over a period of days.

Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-09 12:54:04 -05:00
Raju Rangoju
f368ff188a iw_cxgb4: fix srqidx leak during connection abort
When an application aborts the connection by moving QP from RTS to ERROR,
then iw_cxgb4's modify_rc_qp() RTS->ERROR logic sets the
*srqidxp to 0 via t4_set_wq_in_error(&qhp->wq, 0), and aborts the
connection by calling c4iw_ep_disconnect().

c4iw_ep_disconnect() does the following:
 1. sends up a close_complete_upcall(ep, -ECONNRESET) to libcxgb4.
 2. sends abort request CPL to hw.

But, since the close_complete_upcall() is sent before sending the
ABORT_REQ to hw, libcxgb4 would fail to release the srqidx if the
connection holds one. Because, the srqidx is passed up to libcxgb4 only
after corresponding ABORT_RPL is processed by kernel in abort_rpl().

This patch handle the corner-case by moving the call to
close_complete_upcall() from c4iw_ep_disconnect() to abort_rpl().  So that
libcxgb4 is notified about the -ECONNRESET only after abort_rpl(), and
libcxgb4 can relinquish the srqidx properly.

Signed-off-by: Raju Rangoju <rajur@chelsio.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-08 17:02:05 -07:00
Raju Rangoju
11a27e2121 iw_cxgb4: complete the cached SRQ buffers
If TP fetches an SRQ buffer but ends up not using it before the connection
is aborted, then it passes the index of that SRQ buffer to the host in
ABORT_REQ_RSS or ABORT_RPL CPL message.

But, if the srqidx field is zero in the received ABORT_RPL or
ABORT_REQ_RSS CPL, then we need to read the tcb.rq_start field to see if
it really did have an RQE cached. This works around a case where HW does
not include the srqidx in the ABORT_RPL/ABORT_REQ_RSS CPL.

The final value of rq_start is the one present in TCB with the
TF_RX_PDU_OUT bit cleared. So, we need to read the TCB, examine the
TF_RX_PDU_OUT (bit 49 of t_flags) in order to determine if there's a rx
PDU feedback event pending.

Signed-off-by: Raju Rangoju <rajur@chelsio.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-08 17:02:05 -07:00
Leon Romanovsky
21a428a019 RDMA: Handle PD allocations by IB/core
The PD allocations in IB/core allows us to simplify drivers and their
error flows in their .alloc_pd() paths. The changes in .alloc_pd() go hand
in had with relevant update in .dealloc_pd().

We will use this opportunity and convert .dealloc_pd() to don't fail, as
it was suggested a long time ago, failures are not happening as we have
never seen a WARN_ON print.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-08 16:51:04 -07:00
Parvi Kaustubhi
0c23660649 IB/usnic: Fix locking when unregistering
Move the call to usnic_ib_device_remove after usnic_ib_ibdev_list_lock has
been released.

Signed-off-by: Parvi Kaustubhi <pkaustub@cisco.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-08 16:21:59 -07:00
Steve Wise
c8a7eb554a iw_cxgb4: use tos when finding ipv6 routes
When IPv6 support was added, the correct tos was not passed to
cxgb_find_route6(). This potentially results in the wrong route entry.

Fixes: 830662f6f0 ("RDMA/cxgb4: Add support for active and passive open connection with IPv6 address")
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-08 16:18:06 -07:00
Steve Wise
cb3ba0bde8 iw_cxgb4: use tos when importing the endpoint
import_ep() is passed the correct tos, but doesn't use it correctly.

Fixes: ac8e4c69a0 ("cxgb4/iw_cxgb4: TOS support")
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-08 16:18:06 -07:00
Steve Wise
7235ea227e iw_cxgb4: use listening ep tos when accepting new connections
If the parent listening endpoint has a service type set, then use that
when setting up the connection.  This allows server-side applications to
mandate the tos for passive side connections via rdma_set_service_type()
on the listening endpoints.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-08 16:18:06 -07:00
Devesh Sharma
95b86d1c91 RDMA/bnxt_re: Update kernel user abi to pass chip context
User space verbs provider library would need chip context.  Changing the
ABI to add chip version details in structure.  Furthermore, changing the
kernel driver ucontext allocation code to initialize the abi structure
with appropriate values.

As suggested by community, appended the new fields at the bottom of the
ABI structure and retaining to older fields as those were in the older
versions.

Keeping the ABI version at 1 and adding a new field in the ucontext
response structure to hold the component mask.  The user space library
should check pre-defined flags to figure out if a certain feature is
supported on not.

Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-07 13:24:49 -07:00
Devesh Sharma
37f91cff2d RDMA/bnxt_re: Add extended psn structure for 57500 adapters
The new 57500 series of adapter has bigger psn search structure.  The size
of new structure is 16B. Changing the control path memory allocation and
fast path code to accommodate the new psn structure while maintaining the
backward compatibility.

There are few additional changes listed below:
 - For 57500 chip max-sge are limited to 6 for now.
 - For 57500 chip max-receive-sge should be set to 6 for now.
 - Add driver/hardware interface structure for new chip.

Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-07 13:24:48 -07:00
Devesh Sharma
374c5285ab RDMA/bnxt_re: Enable GSI QP support for 57500 series
In the new 57500 series of adapters the GSI qp is a UD type QP unlike the
previous generation where it was a Raw Eth QP. Changing the control and
data path to support the same. Listing all the significant diffs:

 - AH creation resolve network type unconditionally
 - Add check at relevant places to distinguish from Raw Eth
   processing flow.
 - bnxt_re_process_res_ud_wc report completion with GRH flag
   when qp is GSI.
 - Change length, cfa_meta and smac to match new driver/hardware
   interface.
 - Add new driver/hardware interface.

Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-07 13:24:48 -07:00
Devesh Sharma
e0387e1dd4 RDMA/bnxt_re: Skip backing store allocation for 57500 series
The backing store to keep HW context data structures is allocated and
initialized by L2 driver. For 57500 chip RoCE driver do not require to
allocate and initialize additional memory. Changing to skip duplicate
allocation and initialization for 57500 adapters. Driver continues as
before for older chips.

This patch also takes care of stats context memory alignment to 128
boundary, a requirement for 57500 series of chip. Older chips do not care
of alignment, thus the change is unconditional.

Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-07 13:24:48 -07:00
Devesh Sharma
b353ce556d RDMA/bnxt_re: Add 64bit doorbells for 57500 series
The new chip series has 64 bit doorbell for notification queues. Thus,
both control and data path event queues need new routines to write 64 bit
doorbell. Adding the same. There is new doorbell interface between the
chip and driver. Changing the chip specific data structure definitions.

Additional significant changes are listed below
- bnxt_re_net_ring_free/alloc takes a new argument
- bnxt_qplib_enable_nq and enable_rcfw uses new doorbell offset
  for new chip.
- DB mapping for NQ and CREQ now maps 8 bytes.
- DBR_DBR_* macros renames to DBC_DBC_*
- store nq_db_offset in a 32bit data type.
- got rid of __iowrite64_copy, used writeq instead.
- changed the DB header initialization to simpler scheme.

Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-07 13:24:48 -07:00
Devesh Sharma
ae8637e131 RDMA/bnxt_re: Add chip context to identify 57500 series
Adding setup and destroy routines for chip-context. The chip context would
be used frequently in control and data path to take execution flow
depending on the chip type.  chip context structure pointer is added to
the relevant data structures.

Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-07 13:24:48 -07:00
Gal Pressman
af8b38ed0b IB/mlx5: Simplify WQE count power of two check
Use is_power_of_2() instead of hard coding it in the driver. While at it,
fix the meaningless error print.

Signed-off-by: Gal Pressman <galpress@amazon.com>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-07 13:14:55 -07:00
Davidlohr Bueso
8ea1f989aa drivers/IB,usnic: reduce scope of mmap_sem
usnic_uiom_get_pages() uses gup_longterm() so we cannot really get rid of
mmap_sem altogether in the driver, but we can get rid of some complexity
that mmap_sem brings with only pinned_vm.  We can get rid of the wq
altogether as we no longer need to defer work to unpin pages as the
counter is now atomic. We also share the lock.

Acked-by: Parvi Kaustubhi <pkaustub@cisco.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-07 12:54:02 -07:00
Davidlohr Bueso
0e15c25336 drivers/IB,hfi1: do not se mmap_sem
This driver already uses gup_fast() and thus we can just drop the mmap_sem
protection around the pinned_vm counter. Note that the window between when
hfi1_can_pin_pages() is called and the actual counter is incremented
remains the same as mmap_sem was _only_ used for when ->pinned_vm was
touched.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Davidlohr Bueso <dbueso@suse.det>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-07 12:54:02 -07:00
Davidlohr Bueso
3a2a1e9056 drivers/IB,qib: optimize mmap_sem usage
The driver uses mmap_sem for both pinned_vm accounting and
get_user_pages(). Because rdma drivers might want to use gup_longterm() in
the future we still need some sort of mmap_sem serialization (as opposed
to removing it entirely by using gup_fast()). Now that pinned_vm is atomic
the writer lock can therefore be converted to reader.

This also fixes a bug that __qib_get_user_pages was not taking into
account the current value of pinned_vm.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-07 12:54:02 -07:00
Davidlohr Bueso
70f8a3ca68 mm: make mm->pinned_vm an atomic64 counter
Taking a sleeping lock to _only_ increment a variable is quite the
overkill, and pretty much all users do this. Furthermore, some drivers
(ie: infiniband and scif) that need pinned semantics can go to quite
some trouble to actually delay via workqueue (un)accounting for pinned
pages when not possible to acquire it.

By making the counter atomic we no longer need to hold the mmap_sem and
can simply some code around it for pinned_vm users. The counter is 64-bit
such that we need not worry about overflows such as rdma user input
controlled from userspace.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-07 12:54:02 -07:00
Kaike Wan
34025fb0c4 IB/hfi1: Prioritize the sending of ACK packets
ACK packets are generally associated with request completion and resource
release and therefore should be sent first. This patch optimizes the
send engine by using the following policies:
(1) QPs with RVT_S_ACK_PENDING bit set in qp->s_flags or qpriv->s_flags
should have their priority incremented;
(2) QPs with ACK or TID-ACK packet queued should have their priority
incremented;
(3) When a QP is queued to the wait list due to resource constraints, it
will be queued to the head if it has ACK packet to send;
(4) When selecting qps to run from the wait list, the one with the highest
priority and starve_cnt will be selected; each priority will be equivalent
to a fixed number of starve_cnt (16).

Reviewed-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05 18:07:44 -05:00
Kaike Wan
a05c9bdcfd IB/hfi1: Add static trace for TID RDMA WRITE protocol
This patch makes the following changes to the static trace:
1. Adds the decoding of TID RDMA WRITE packets in IB header trace;
2. Adds trace events for various stages of the TID RDMA WRITE
protocol. These events provide a fine-grained control for monitoring
and debugging the hfi1 driver in the filed.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05 18:07:44 -05:00
Kaike Wan
ad00889e7c IB/hfi1: Enable TID RDMA WRITE protocol
This patch enables TID RDMA WRITE protocol by converting a qualified
RDMA WRITE request into a TID RDMA WRITE request internally:
(1) The TID RDMA cability must be enabled;
(2) The request must start on a 4K page boundary;
(3) The request length must be a multiple of 4K and must be larger or
equal to 256K.

Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05 18:07:44 -05:00
Kaike Wan
c6c231175c IB/hfi1: Add interlock between TID RDMA WRITE and other requests
This locking mechanism is designed to provent vavious memory corruption
scenarios from occurring when requests are pipelined, especially when
RDMA WRITE requests are interleaved with TID RDMA READ requests:
1. READ-AFTER-READ;
2. READ-AFTER-WRITE;
3. WRITE-AFTER-READ;
4. WRITE-AFTER-WRITE.
When memory corruption is likely, a request will be held back until
previous requests have been completed.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05 18:07:44 -05:00
Kaike Wan
3c6cb20a0d IB/hfi1: Add TID RDMA WRITE functionality into RDMA verbs
This patch integrates TID RDMA WRITE protocol into normal RDMA verbs
framework. The TID RDMA WRITE protocol is an end-to-end protocol
between the hfi1 drivers on two OPA nodes that converts a qualified
RDMA WRITE request into a TID RDMA WRITE request to avoid data copying
on the responder side.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05 18:07:44 -05:00
Kaike Wan
572f0c3301 IB/hfi1: Add the dual leg code
The "Second Leg" of the TID RDMA WRITE protocol deals with
the transfer of data and ack packets, which are in the KDETH
PSN space, as opposed to the IB PSN space.

Therefore, the Second Leg could be considered as a separate
state machine. As such, it is handled by a different work
queue item which is scheduled along with the normal IB state
machine work item.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05 18:07:44 -05:00
Kaike Wan
24c5bfeaf1 IB/hfi1: Add the TID second leg ACK packet builder
This patch adds the TID packet builder for the responder side, which
contains the state machine to build TID RDMA ACK packet for either
TID RDMA WRITE DATA or TID RDMA RESYNC packets.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05 18:07:44 -05:00
Kaike Wan
70dcb2e3dc IB/hfi1: Add the TID second leg send packet builder
To improve performance, the TID RDMA WRITE protocol is designed to
own a second leg to send data and ack packets in the KDETH PSN space.
This patch adds the packet builder for the requester side, which
contains the state machine to build TID RDMA WRITE DATA and TID
RDMA RESYNC packet.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05 18:07:44 -05:00
Kaike Wan
6e38fca6b1 IB/hfi1: Resend the TID RDMA WRITE DATA packets
This patch adds the logic to resend TID RDMA WRITE DATA packets.
The tracking indices will be reset properly so that the correct
TID entries will be used.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05 18:07:44 -05:00
Kaike Wan
7cf0ad679d IB/hfi1: Add a function to receive TID RDMA RESYNC packet
This patch adds a function to receive TID RDMA RESYNC packet on the
responder side. The QP's hardware flow will be updated and all
allocated software flows will be updated accordingly in order to
drop all stale packets.

Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05 18:07:44 -05:00
Kaike Wan
6e391c6a4a IB/hfi1: Add a function to build TID RDMA RESYNC packet
This patch adds a function to build TID RDMA RESYNC packet, which is
sent by the requester to notify the responder that no TID RDMA ACK
packet has been received for a given KDETH PSN.

Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05 18:07:44 -05:00
Kaike Wan
829eaee5d0 IB/hfi1: Add TID RDMA retry timer
This patch adds the TID RDMA retry timer to make sure that TID RDMA
WRITE DATA packets for a segment are received successfully by the
responder. This timer is generally armed when the last TID RDMA
WRITE DATA packet for a segment is sent out and stopped when all
TID RDMA DATA packets are acknowledged.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05 18:07:43 -05:00
Kaike Wan
9e93e967f7 IB/hfi1: Add a function to receive TID RDMA ACK packet
This patch adds a function to receive TID RDMA ACK packet, which could
be an acknowledge to either a TID RDMA WRITE DATA packet or an TID
RDMA RESYNC packet. For an ACK to TID RDMA WRITE DATA packet, the
request segments are completed appropriately. For an ACK to a TID
RDMA RESYNC packet, any pending segment flow information is updated
accordingly.

Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05 18:07:43 -05:00
Kaike Wan
0f75e325aa IB/hfi1: Add a function to build TID RDMA ACK packet
This patch adds a function to build TID RDMA ACJ packet, which is also
in the KDETH PSN space for packet ordering. This packet is used to
acknowledge the receiving of all the TID RDMA WRITE DATA packets
before the given KDETH PSN. Similar to RC ACK packets, TID RDMA ACK
packets could also be coalesced.

Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05 18:07:43 -05:00
Kaike Wan
d72fe7d500 IB/hfi1: Add a function to receive TID RDMA WRITE DATA packet
This patch adds a function to receive TID RDMA WRITE DATA packet,
which is in the KDETH PSN space in packet ordering. Due to the use
of header suppression, software is generally only notified when
the last data packet for a segment is received. This patch also
adds code to handle KDETH EFLAGS errors for ingress TID RDMA WRITE
DATA packets.

Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05 18:07:43 -05:00
Kaike Wan
539e1908e4 IB/hfi1: Add a function to build TID RDMA WRITE DATA packet
This patch adds a function to build TID RDMA WRITE DATA packet.

Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05 18:07:43 -05:00
Kaike Wan
72a0ea99ec IB/hfi1: Add a function to receive TID RDMA WRITE response
This patch adds a function to receive TID RDMA WRITE response.
The TID entries will be stored for encoding TID RDMA WRITE DATA
packet later.

Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05 18:07:43 -05:00
Kaike Wan
3c759e003a IB/hfi1: Add TID resource timer
This patch adds the TID resource timer, which is used by the responder
to free any TID resources that are allocated for TID RDMA WRITE request
and not returned by the requester after a reasonable time.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05 18:07:43 -05:00
Kaike Wan
38d46d3676 IB/hfi1: Add a function to build TID RDMA WRITE response
This patch adds the function to build TID RDMA WRITE response. The
main role of the TID RDMA WRITE RESP packet is to send TID entries
to the requester so that they can be used to encode TID RDMA WRITE
DATA packet.

Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05 18:07:43 -05:00
Kaike Wan
07b923701e IB/hfi1: Add functions to receive TID RDMA WRITE request
This patch adds the functions to receive TID RDMA WRITE request. The
request will be stored in the QP's s_ack_queue. This patch also adds
code to handle duplicate TID RDMA WRITE request and a function to
allocate TID resources for data receiving on the responder side.

Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05 18:07:43 -05:00
Kaike Wan
4f9264d156 IB/hfi1: Add an s_acked_ack_queue pointer
The s_ack_queue is managed by two pointers into the ring:
r_head_ack_queue and s_tail_ack_queue. r_head_ack_queue is the index of
where the next received request is going to be placed and s_tail_ack_queue
is the entry of the request currently being processed. This works
perfectly fine for normal Verbs as the requests are processed one at a
time and the s_tail_ack_queue is not moved until the request that it
points to is fully completed.

In this fashion, s_tail_ack_queue constantly chases r_head_ack_queue and
the two pointers can easily be used to determine "queue full" and "queue
empty" conditions.

The detection of these two conditions are imported in determining when an
old entry can safely be overwritten with a new received request and the
resources associated with the old request be safely released.

When pipelined TID RDMA WRITE is introduced into this mix, things look
very different. r_head_ack_queue is still the point at which a newly
received request will be inserted, s_tail_ack_queue is still the
currently processed request. However, with pipelined TID RDMA WRITE
requests, s_tail_ack_queue moves to the next request once all TID RDMA
WRITE responses for that request have been sent. The rest of the protocol
for a particular request is managed by other pointers specific to TID RDMA
- r_tid_tail and r_tid_ack - which point to the entries for which the next
TID RDMA DATA packets are going to arrive and the request for which
the next TID RDMA ACK packets are to be generated, respectively.

What this means is that entries in the ring, which are "behind"
s_tail_ack_queue (entries which s_tail_ack_queue has gone past) are no
longer considered complete. This is where the problem is - a newly
received request could potentially overwrite a still active TID RDMA WRITE
request.

The reason why the TID RDMA pointers trail s_tail_ack_queue is that the
normal Verbs send engine uses s_tail_ack_queue as the pointer for the next
response. Since TID RDMA WRITE responses are processed by the normal Verbs
send engine, s_tail_ack_queue had to be moved to the next entry once all
TID RDMA WRITE response packets were sent to get the desired pipelining
between requests. Doing otherwise would mean that the normal Verbs send
engine would not be able to send the TID RDMA WRITE responses for the next
TID RDMA request until the current one is fully completed.

This patch introduces the s_acked_ack_queue index to point to the next
request to complete on the responder side. For requests other than TID
RDMA WRITE, s_acked_ack_queue should always be kept in sync with
s_tail_ack_queue. For TID RDMA WRITE request, it may fall behind
s_tail_ack_queue.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05 18:07:43 -05:00
Kaike Wan
f5a4a95f4d IB/hfi1: Allow for extra entries in QP's s_ack_queue
The TID RDMA WRITE protocol differs from normal IB RDMA WRITE
in that TID RDMA WRITE requests do require responses, not just
ACKs.

Therefore, TID RDMA WRITE requests need to be treated as RDMA
READ requests from the point of view of the QPs' s_ack_queue.
In other words, the QPs' need to allow for TID RDMA WRITE
requests to be stored in their s_ack_queue.

However, because the user does not know anything about the TID
RDMA capability and/or protocols, these extra entries in the
queue cannot be advertized to the user.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05 18:07:43 -05:00
Kaike Wan
c098bbb00c IB/hfi1: Build TID RDMA WRITE request
This patch adds the functions to build TID RDMA WRITE request.
The work request opcode, packet opcode, and packet formats for TID
RDMA WRITE protocol are also defined in this patch.

Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05 18:07:43 -05:00
Kaike Wan
3ce5daa2c1 IB/hfi1: Add static trace for TID RDMA READ protocol
This patch makes the following changes to the static trace:
1. Adds the decoding of TID RDMA READ packets in IB header trace;
2. Tracks qpriv->s_flags and iow_flags in qpsleepwakeup trace;
3. Adds a new event to track RC ACK receiving;
4. Adds trace events for various stages of the TID RDMA READ
protocol. These events provide a fine-grained control for monitoring
and debugging the hfi1 driver in the filed.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05 17:53:56 -05:00
Kaike Wan
f1ab4efa6d IB/hfi1: Enable TID RDMA READ protocol
This patch enables TID RDMA READ protocol by converting a qualified
RDMA READ request into a TID RDMA READ request internally:
(1) The TID RDMA capability must be enabled;
(2) The request must start on a 4K page boundary and all receiving
 buffers must start on 4K page boundaries;
(3) The request length must be a multiple of 4K and must be larger or
equal to 256K. Each receiving buffer length must be a multiple of 4K.

Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05 17:53:55 -05:00
Kaike Wan
a0b34f75ec IB/hfi1: Add interlock between a TID RDMA request and other requests
This locking mechanism is designed to provent vavious memory corruption
scenarios from occurring when requests are pipelined, especially when
RDMA READ/WRITE requests are interleaved with TID RDMA READ/WRITE
requests:
1. READ-AFTER-READ;
2. READ-AFTER-WRITE;
3. WRITE-AFTER-READ;
When memory corruption is likely, a request will be held back until
previous requests have been completed.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05 17:53:55 -05:00
Kaike Wan
24b11923da IB/hfi1: Integrate TID RDMA READ protocol into RC protocol
This patch integrates the TID RDMA READ protocol into the IB RC protocol.
This protocol is an end-to-end protocol between the hfi1 drivers on two
OPA nodes that converts a qualified RDMA READ request into a TID RDMA
READ request to avoid data copying on the requester side. The following
codes are added in this patch:
- Send the TID RDMA READ request;
- Complete the TID RDMA READ send request;
- Send the TID RDMA READ response;
- Complete the TID RDMA READ request;

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05 17:53:55 -05:00
Kaike Wan
b126078e89 IB/hfi1: Add functions for restarting TID RDMA READ request
This patch adds functions to retry TID RDMA READ request. Since TID RDMA
READ request could be retried from any segment boundary, it requires
a number of tracking fields in various structures and those fields
should be reset properly. The qp->s_num_rd_atomic field is reset before
retry and therefore should be incremented for each new or retried
RDMA READ or atomic request.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05 17:53:55 -05:00
Kaike Wan
22d136d756 IB/hfi1: Add TID RDMA handlers
This commit adds the TID RDMA READ pointers to the receiving opcode
handlers. It also adds TID RDMA READ header sizes to header size table.
A function to print the RHF EFLAGS errors is created so that it can be
shared by both IB and TID RDMA receiving functions.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05 17:53:55 -05:00
Kaike Wan
9905bf06e8 IB/hfi1: Add functions to receive TID RDMA READ response
This patch adds the functions to receive TID RDMA READ response. The TID
resource information in the KDETH packet header will direct the hardware
to deliver the packet payload to the user buffer automatically and the
software will handle the packet header for the last packet of a segment
as all other packet headers are suppressed by default. The TID entries
will be freed when all packets for a segment have been received. This
patch also adds the functions to handle KDETH eflag errors, including
flow sequence and generation errors, when a TID RDMA READ response
packet is received . The flow sequence error can be recovered by software
checking of the flow sequence and will disappear when the hardware flow
is programmed with a new generation number.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05 17:53:55 -05:00
Kaike Wan
1db21b5050 IB/hfi1: Add a function to build TID RDMA READ response
This patch adds the function to build TID RDMA READ response packet.
The previously received TID resource information will be used to
build the KDETH packet, which will direct the delivery of packet payload
by hardware.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05 17:53:55 -05:00
Kaike Wan
d0d564a1ca IB/hfi1: Add functions to receive TID RDMA READ request
This patch adds the functions to receive TID RDMA READ request. The TID
resource information will be stored and tracked. Duplicate request
will also be handled properly.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05 17:53:55 -05:00
Kaike Wan
6b6cf9357f IB/hfi1: Set PbcInsertHcrc for TID RDMA packets
All TID RDMA packets are in KDETH packet format and therefore the
PbcInsertHcrc must be set properly before sending the packet to
hardware. Otherwise, the packets will be dropped by the receiver.
By default, HCRC is not inserted for 9B packets without KDETH, and
this patch adds that back for TID RDMA packets.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05 17:53:55 -05:00
Kaike Wan
742a3826cf IB/hfi1: Add functions to build TID RDMA READ request
This patch adds the helper functions to build the TID RDMA READ request
on the requester side. The key is to allocate TID resources (TID flow
and TID entries) and send the resource information to the responder side
along with the read request. Since the TID resources are limited, each
TID RDMA READ request has to be split into segments with a default
segment size of 256K. A software flow is allocated to track the data
transaction for each segment. The work request opcode, packet opcode, and
packet formats for TID RDMA READ protocol are also defined in this patch.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05 17:53:55 -05:00
Kaike Wan
84f4a40d46 IB/hfi1: Add static trace for flow and TID management functions
This patch adds the static trace for the flow and TID management
functions to help debugging in the filed.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05 17:53:55 -05:00
Kaike Wan
2f16a696a0 IB/hfi1: Add the counter n_tidwait
This patch adds the counter n_tidwait to count the number of times the
TID resource allocator has to wait for TID resources.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05 17:53:55 -05:00
Kaike Wan
838b6fd2d9 IB/hfi1: TID RDMA RcvArray programming and TID allocation
TID entries are used by hfi1 hardware to receive data payload from
incoming packets directly into a user buffer and thus avoid data copying
by software. This patch implements the functions for TID allocation,
freeing, and programming TID RcvArray entries in hardware for kernel
clients. TID entries are managed via lists of TID groups similar to PSM.
Furthermore, to track TID resource allocation for each request, software
flows are also allocated and freed as needed. Since software flows
consume large amount of memory for tracking TID allocation and freeing,
it is generally desirable to allocate them dynamically in the send queue
and only for TID RDMA requests, but pre-allocate them for receive queue
because the send queue could have thousands of entries while the receive
queue has only a limited number of entries.

Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05 17:53:55 -05:00
Kaike Wan
37356e7832 IB/hfi1: TID RDMA flow allocation
The hfi1 hardware flow is a hardware flow-control mechanism for a KDETH
data packet that is received on a hfi1 port. It validates the packet by
checking both the generation and sequence. Each QP that uses the TID RDMA
mechanism will allocate a hardware flow from its receiving context for
any incoming KDETH data packets.

This patch implements:
(1) a function to allocate hardware flow
(2) a function to free hardware flow
(3) a function to initialize hardware flow generation for a receiving
    context
(4) a wait mechanism if the hardware flow is not available
(4) a function to remove the qp from the wait queue for hardware flow
    when the qp is reset or destroyed.

Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05 17:53:54 -05:00
Kaike Wan
385156c5f2 IB/hfi: Move RC functions into a header file
This patch moves some RC helper functions into a header file so that
they can be called from both RC and  TID RDMA functions. In addition,
a common function for rewinding a request is created in rdmavt so that
it can be shared between qib and hfi1 driver.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05 17:51:09 -05:00
Bart Van Assche
bf3b4f066d IB/mlx5: Do not use hw_access_flags for be and CPU data
Avoid that sparse reports the following for the mlx5 driver:

drivers/infiniband/hw/mlx5/qp.c:2671:34: warning: invalid assignment: |=
drivers/infiniband/hw/mlx5/qp.c:2671:34:    left side has type restricted __be32
drivers/infiniband/hw/mlx5/qp.c:2671:34:    right side has type int
drivers/infiniband/hw/mlx5/qp.c:2679:34: warning: invalid assignment: |=
drivers/infiniband/hw/mlx5/qp.c:2679:34:    left side has type restricted __be32
drivers/infiniband/hw/mlx5/qp.c:2679:34:    right side has type int
drivers/infiniband/hw/mlx5/qp.c:2680:34: warning: invalid assignment: |=
drivers/infiniband/hw/mlx5/qp.c:2680:34:    left side has type restricted __be32
drivers/infiniband/hw/mlx5/qp.c:2680:34:    right side has type int
drivers/infiniband/hw/mlx5/qp.c:2684:34: warning: invalid assignment: |=
drivers/infiniband/hw/mlx5/qp.c:2684:34:    left side has type restricted __be32
drivers/infiniband/hw/mlx5/qp.c:2684:34:    right side has type int
drivers/infiniband/hw/mlx5/qp.c:2686:28: warning: cast from restricted __be32
drivers/infiniband/hw/mlx5/qp.c:2686:28: warning: incorrect type in argument 1 (different base types)
drivers/infiniband/hw/mlx5/qp.c:2686:28:    expected unsigned int [usertype] val
drivers/infiniband/hw/mlx5/qp.c:2686:28:    got restricted __be32 [usertype]
drivers/infiniband/hw/mlx5/qp.c:2686:28: warning: cast from restricted __be32
drivers/infiniband/hw/mlx5/qp.c:2686:28: warning: cast from restricted __be32
drivers/infiniband/hw/mlx5/qp.c:2686:28: warning: cast from restricted __be32
drivers/infiniband/hw/mlx5/qp.c:2686:28: warning: cast from restricted __be32

This patch does not change any functionality.

Fixes: a60109dc9a ("IB/mlx5: Add support for extended atomic operations")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-05 15:22:45 -07:00
Steve Wise
95b8e384d8 iw_cxgb*: kzalloc the iwcm verbs struct
So future additions to that struct get initialized by default.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-04 16:26:02 -07:00
Wei Hu (Xavier)
d3743fa94c RDMA/hns: Fix the chip hanging caused by sending doorbell during reset
On hi08 chip, There is a possibility of chip hanging when sending doorbell
during reset. We can fix it by prohibiting doorbell during reset.

Fixes: 2d40788825 ("RDMA/hns: Add support for processing send wr and receive wr")
Signed-off-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-04 16:13:50 -07:00
Wei Hu (Xavier)
6a04aed6af RDMA/hns: Fix the chip hanging caused by sending mailbox&CMQ during reset
On hi08 chip, There is a possibility of chip hanging and some errors when
sending mailbox & doorbell during reset.  We can fix it by prohibiting
mailbox and doorbell during reset and reset occurred to ensure that
hardware can work normally.

Fixes: a04ff739f2 ("RDMA/hns: Add command queue support for hip08 RoCE driver")
Signed-off-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-04 16:13:50 -07:00
Wei Hu (Xavier)
d061effc36 RDMA/hns: Fix the Oops during rmmod or insmod ko when reset occurs
In the reset process, the hns3 NIC driver notifies the RoCE driver to
perform reset related processing by calling the .reset_notify() interface
registered by the RoCE driver in hip08 SoC.

In the current version, if a reset occurs simultaneously during the
execution of rmmod or insmod ko, there may be Oops error as below:

 Internal error: Oops: 86000007 [#1] PREEMPT SMP
 Modules linked in: hns_roce(O) hns3(O) hclge(O) hnae3(O) [last unloaded: hns_roce_hw_v2]
 CPU: 0 PID: 14 Comm: kworker/0:1 Tainted: G           O      4.19.0-ge00d540 #1
 Hardware name: Huawei Technologies Co., Ltd.
 Workqueue: events hclge_reset_service_task [hclge]
 pstate: 60c00009 (nZCv daif +PAN +UAO)
 pc : 0xffff00000100b0b8
 lr : 0xffff00000100aea0
 sp : ffff000009afbab0
 x29: ffff000009afbab0 x28: 0000000000000800
 x27: 0000000000007ff0 x26: ffff80002f90c004
 x25: 00000000000007ff x24: ffff000008f97000
 x23: ffff80003efee0a8 x22: 0000000000001000
 x21: ffff80002f917ff0 x20: ffff8000286ea070
 x19: 0000000000000800 x18: 0000000000000400
 x17: 00000000c4d3225d x16: 00000000000021b8
 x15: 0000000000000400 x14: 0000000000000400
 x13: 0000000000000000 x12: ffff80003fac6e30
 x11: 0000800036303000 x10: 0000000000000001
 x9 : 0000000000000000 x8 : ffff80003016d000
 x7 : 0000000000000000 x6 : 000000000000003f
 x5 : 0000000000000040 x4 : 0000000000000000
 x3 : 0000000000000004 x2 : 00000000000007ff
 x1 : 0000000000000000 x0 : 0000000000000000
 Process kworker/0:1 (pid: 14, stack limit = 0x00000000af8f0ad9)
 Call trace:
  0xffff00000100b0b8
  0xffff00000100b3a0
  hns_roce_init+0x624/0xc88 [hns_roce]
  0xffff000001002df8
  0xffff000001006960
  hclge_notify_roce_client+0x74/0xe0 [hclge]
  hclge_reset_service_task+0xa58/0xbc0 [hclge]
  process_one_work+0x1e4/0x458
  worker_thread+0x40/0x450
  kthread+0x12c/0x130
  ret_from_fork+0x10/0x18
 Code: bad PC value

In the reset process, we will release the resources firstly, and after the
hardware reset is completed, we will reapply resources and reconfigure the
hardware.

We can solve this problem by modifying both the NIC and the RoCE
driver. We can modify the concurrent processing in the NIC driver to avoid
calling the .reset_notify and .uninit_instance ops at the same time. And
we need to modify the RoCE driver to record the reset stage and the
driver's init/uninit state, and check the state in the .reset_notify,
.init_instance. and uninit_instance functions to avoid NULL pointer
operation.

Fixes: cb7a94c9c8 ("RDMA/hns: Add reset process for RoCE in hip08")
Signed-off-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-04 16:13:50 -07:00
YueHaibing
c3c668e742 RDMA/hns: Make some function static
Fixes the following sparse warnings:

drivers/infiniband/hw/hns/hns_roce_hw_v2.c:5822:5: warning:
 symbol 'hns_roce_v2_query_srq' was not declared. Should it be static?
drivers/infiniband/hw/hns/hns_roce_srq.c:158:6: warning:
 symbol 'hns_roce_srq_free' was not declared. Should it be static?
drivers/infiniband/hw/hns/hns_roce_srq.c:81:5: warning:
 symbol 'hns_roce_srq_alloc' was not declared. Should it be static?

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-04 15:33:58 -07:00
Jason Gunthorpe
6a8a2aa62d Linux 5.0-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlxXYaEeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGkSQH/2yrfnviNPFYpZOR
 QQdc71Bfhkd8m85SmWIsSebkxmi3hKFVj15sGbWXd6+0/VxjEEGvQCZpvVwJceke
 LwDxtkKGg/74wAqJvlSAWxFNZ+Had4jDeoSoeQChddsBVXBBCxQx2v6ECg3o2x7W
 k8Z8t4+3RijDf8fYXY9ETyO2zW8R/wgT+dnl+DPgUH7u4dxh7FzAUfc4bgZIDg+i
 FzBQfbTJuz4BU7uRZ9IJiwhWKv0Iyi2DR3BY8Z1pqEpRaUMJMrCs2WGytHbTgt9e
 0EtO1airbVneU4eumU/ZaF9cyEbah9HousEPnP7J09WG4s/Odxc4zE+uK1QqS2im
 5Xv88is=
 =dVd1
 -----END PGP SIGNATURE-----

Merge tag 'v5.0-rc5' into rdma.git for-next

Linux 5.0-rc5

Needed to merge the include/uapi changes so we have an up to date
single-tree for these files. Patches already posted are also expected to
need this for dependencies.
2019-02-04 14:53:42 -07:00
Moni Shoua
6141f8fa5b IB/mlx5: Advertise XRC ODP support
Query all per transport caps for XRC and set the appropriate bits in the
per transport field of the advertised struct.

Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-04 14:34:07 -07:00
Moni Shoua
2e68daceac IB/mlx5: Advertise SRQ ODP support for supported transports
ODP support in SRQ is per transport capability. Based on device
capabilities set this flag in device structure for future queries.

Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-04 14:34:07 -07:00
Moni Shoua
08100fad5c IB/mlx5: Add ODP SRQ support
Add changes to the WQE page-fault handler to

1. Identify that the event is for a SRQ WQE
2. Pass SRQ object instead of a QP to the function that reads the WQE
3. Parse the SRQ WQE with respect to its structure

The rest is handled as for regular RQ WQE.

Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-04 14:34:07 -07:00
Moni Shoua
fbeb4075c6 IB/mlx5: Let read user wqe also from SRQ buffer
Reading a WQE from SRQ is almost identical to reading from regular RQ.
The differences are the size of the queue, the size of a WQE and buffer
location.

Make necessary changes to mlx5_ib_read_user_wqe() to let it read a WQE
from a SRQ or RQ by caller choice.

Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-04 14:34:07 -07:00
Moni Shoua
29917f4750 IB/mlx5: Add XRC initiator ODP support
Skip XRC segment in the beginning of a send WQE and fetch ODP XRC
capabilities when QP type is IB_QPT_XRC_INI. The rest of the handling is
the same as in RC QP.

Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-04 14:34:07 -07:00
Moni Shoua
6ff7414a17 IB/mlx5: Clean mlx5_ib_mr_responder_pfault_handler() signature
In the function mlx5_ib_mr_responder_pfault_handler()

1. The parameter wqe is used as read-only so there is no need to pass it
   by reference.
2. Remove the unused argument pfault from list of arguments.

Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-04 14:34:06 -07:00
Moni Shoua
586f4e95c7 IB/mlx5: Remove useless check in ODP handler
When handling an ODP event for a receive WQE in SRQ the target QP is
unknown. Therefore, it is wrong to ask if QP has a SRQ in the page-fault
handler.

Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-04 14:34:06 -07:00
Moni Shoua
10f56242e3 IB/mlx5: Fix the locking of SRQ objects in ODP events
QP and SRQ objects are stored in different containers so the action to get
and lock a common resource during ODP event needs to address that.

While here get rid of 'refcount' and 'free' fields in mlx5_core_srq struct
and use the fields with same semantics in common structure.

Fixes: 032080ab43 ("IB/mlx5: Lock QP during page fault handling")
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-04 14:34:06 -07:00
YueHaibing
da91ddfdc7 RDMA/hns: Remove set but not used variable 'rst'
Fixes gcc '-Wunused-but-set-variable' warning:

drivers/infiniband/hw/hns/hns_roce_hw_v2.c: In function 'hns_roce_v2_qp_flow_control_init':
drivers/infiniband/hw/hns/hns_roce_hw_v2.c:4384:33: warning:
 variable 'rst' set but not used [-Wunused-but-set-variable]

It never used since introduction.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-31 15:41:07 -07:00
Kaike Wan
a131d16460 IB/hfi1: Add static trace for OPFN
This patch adds the static trace to the OPFN code and moves tid related
static trace code into a new header file.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-01-31 11:37:40 -05:00
Kaike Wan
48a615dc00 IB/hfi1: Integrate OPFN into RC transactions
OPFN parameter negotiation allows a pair of connected RC QPs to exchange
a set of parameters in succession. This negotiation does not commence
till the first ULP request. Because OPFN operations are operations
private to the driver, they do not generate user completions or put the
QP into error when they run out of retries. This patch integrates the
OPFN protocol into the transactions of an RC QP.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-01-31 11:37:34 -05:00
Kaike Wan
ddf922c31f IB/hfi1, IB/rdmavt: Allow for extending of QP's s_ack_queue
The OPFN protocol uses the COMPARE_SWAP request to exchange data
between the requester and the responder and therefore needs to
be stored in the QP's s_ack_queue when the request is received
on the responder side. However, because the user does not know
anything about the OPFN protocol, this extra entry in the
queue cannot be advertised to the user. This patch adds an extra
entry in a QP's s_ack_queue.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-01-31 11:36:05 -05:00
Kaike Wan
f01b4d5a43 IB/hfi1: OPFN interface
OPFN allows a pair of connected RC QPs to exchange a set of parameters
in succession. The parameter exchange itself is done using the IB compare
and swap request with a special virtual address. The request is triggered
using a reserved IB work request opcode. This patch implements the OPFN
interface to initialize, start, process, and reset the OPFN request.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-01-31 11:36:05 -05:00
Kaike Wan
d22a207d74 IB/hfi1: Add OPFN helper functions for TID RDMA feature
This patch adds the OPFN helper functions to initialize, encode, decode,
and reset OPFN parameters for the TID RDMA feature.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-01-31 11:36:05 -05:00
Mitko Haralanov
44e43d91ad IB/hfi1: OPFN support discovery
OPFN (Omni Path Feature Negotiation) support discovery allows a RC QP to
announce that it supports OPFN and also discover if OPFN is supported by
the peer QP. OPFN parameter negotiation is skipped unless OPFN support is
first discovered. OPFN support is announced by claiming what was
the reserved bit in dword 1 of OmniPath modified base transport header
in requests and responses.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-01-31 11:36:04 -05:00
Leon Romanovsky
02da375097 RDMA/core: Use the ops infrastructure to keep all callbacks in one place
As preparation to hide rdma_restrack_root, refactor the code to use the
ops structure instead of a special callback which is hidden in
rdma_restrack_root.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-30 21:34:21 -07:00
Parav Pandit
cf34e1fe52 IB/mlx5: Consider vlan of lower netdev for macvlan GID entries
When a given netdev of the GID entry is macvlan netdevice, and if the
lower netdevice is vlan device, GID entry for macvlan based IP address
needs to inherit the vlan of the lower netdevice.

Therefore, attempt to find out if the lower device exist and if so, if
it is vlan device and setup the vlan tag correctly.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Yuval Avnery <yuvalav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-30 20:43:46 -07:00
Gal Pressman
cfc30ad3d0 IB/usnic: Remove stub functions
Lack of mandatory verbs no longer fail device registration, the device
will be marked as a non-kverbs provider.

Signed-off-by: Gal Pressman <galpress@amazon.com>
Tested-by: Parvi Kaustubhi <pkaustub@cisco.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-30 20:32:25 -07:00
Leon Romanovsky
459cc69fa4 RDMA: Provide safe ib_alloc_device() function
All callers to ib_alloc_device() provide a larger size than struct
ib_device and rely on the fact that struct ib_device is embedded in their
driver specific structure as the first member.

Provide a safer variant of ib_alloc_device() that checks and enforces this
approach to make sure the drivers are using it right.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-30 15:52:30 -07:00
Kamal Heib
e5c1bb47cc IB/mlx5: Remove set but not used variable
Remove 'del_mkey' variable that is set but not used.

Fixes: 534fd7aac5 ("IB/mlx5: Manage indirection mkey upon DEVX flow for ODP")
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-30 15:17:17 -07:00
Kamal Heib
f3ffed0ce4 IB/mlx5: Make mlx5_ib_stage_odp_cleanup() static
The function mlx5_ib_stage_odp_cleanup() is only used in main.c

Fixes: d5d284b829 ("{net,IB}/mlx5: Move Page fault EQ and ODP logic to RDMA")
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-30 15:17:17 -07:00
Michael J. Ruhl
db421a5499 IB/{hfi1, qib, rvt} Cleanup open coded sge usage
Several locations for manipulating sges use an open coded sequence
that is covered by helper functions.

Use the appropriate helper functions.

Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-01-30 14:22:32 -05:00
Michael J. Ruhl
87fc34b575 IB/{hfi1,qib}: Cleanup open coded sge sizing
Sge sizing is done in several places using an open coded method.

This can cause maintenance issues.  The open coded method is
encapsulated in a helper routine.  The helper was introduced with
commit:

1198fcea8a ("IB/hfi1, rdmavt: Move SGE state helper routines into
rdmavt")

Update all call sites that have the open coded path with the helper
routine.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-01-30 14:22:32 -05:00
Adit Ranadive
8aa04ad3b3 RDMA/vmw_pvrdma: Support upto 64-bit PFNs
Update the driver to use the new device capability to report 64-bit UAR
PFNs.

Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: Adit Ranadive <aditr@vmware.com>
Reviewed-by: Vishnu Dasa <vdasa@vmware.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-29 21:40:28 -07:00
Jason Gunthorpe
55c293c38e Merge branch 'devx-async' into k.o/for-next
Yishai Hadas says:

Enable DEVX asynchronous query commands

This series enables querying a DEVX object in an asynchronous mode.

The userspace application won't block when calling the firmware and it will be
able to get the response back once that it will be ready.

To enable the above functionality:

- DEVX asynchronous command completion FD object was introduced.
- The applicable file operations were implemented to enable using it by
  the user application.
- Query asynchronous method was added to the DEVX object, it will call the
  firmware asynchronously and manages the response on the given input FD.
- Hot unplug support was added for the FD to work properly upon
  unbind/disassociate.
- mlx5 core fence for asynchronous commands was implemented and used to
  prevent racing upon unbind/disassociate.

This branch is based on mlx5-next & v5.0-rc2 due to dependencies, from
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux

* branch 'devx-async':
  IB/mlx5: Implement DEVX hot unplug for async command FD
  IB/mlx5: Implement the file ops of DEVX async command FD
  IB/mlx5: Introduce async DEVX obj query API
  IB/mlx5: Introduce MLX5_IB_OBJECT_DEVX_ASYNC_CMD_FD

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-29 13:49:31 -07:00
Yishai Hadas
eaebaf77e7 IB/mlx5: Implement DEVX hot unplug for async command FD
Implement DEVX hot unplug for the async command FD.

This is done by managing a list of the inflight commands and wait until
all launched work is completed as part of
devx_hot_unplug_async_cmd_event_file.

Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-29 13:33:00 -07:00
Yishai Hadas
4accbb3fd2 IB/mlx5: Implement the file ops of DEVX async command FD
Implement the file ops of the DEVX async command FD, this enables using
the FD for reading the events and manage other options on the FD.

Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-29 13:33:00 -07:00
Yishai Hadas
a124edba26 IB/mlx5: Introduce async DEVX obj query API
Introduce async DEVX obj query API to get the command response back to
user space once it's ready without blocking when calling the firmware.

The event's data includes a header with some meta data then the firmware
output command data.

The header includes:
- The input 'wr_id' to let application recognizing the response.

The input FD attribute is used to have the event data ready on.
Downstream patches from this series will implement the file ops to let
application read it.

Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-29 13:33:00 -07:00
Yishai Hadas
6bf8f22aea IB/mlx5: Introduce MLX5_IB_OBJECT_DEVX_ASYNC_CMD_FD
Introduce MLX5_IB_OBJECT_DEVX_ASYNC_CMD_FD and its initial implementation.

This object is from type class FD and will be used to read DEVX async
commands completion.

The core layer should allow the driver to set object from type FD in a
safe mode, this option was added with a matching comment in place.

Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-29 13:32:43 -07:00
Masahiro Yamada
b360ce3b2b infiniband: prefix header search paths with $(srctree)/
Currently, the Kbuild core manipulates header search paths in a crazy
way [1].

To fix this mess, I want all Makefiles to add explicit $(srctree)/ to
the search paths in the srctree. Some Makefiles are already written in
that way, but not all. The goal of this work is to make the notation
consistent, and finally get rid of the gross hacks.

Having whitespaces after -I does not matter since commit 48f6e3cf5b
("kbuild: do not drop -I without parameter").

[1]: https://patchwork.kernel.org/patch/9632347/

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Parvi Kaustubhi <pkaustub@cisco.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-25 15:28:50 -07:00
Mark Bloch
c1b03c25f5 RDMA/mlx5: Fix flow creation on representors
The intention of the flow_is_supported was to disable the entire tree and
methods that allow raw flow creation, but the grammar syntax has this
disable the entire UVERBS_FLOW object. Since the method requires a
MLX5_IB_OBJECT_FLOW_MATCHER there is no need to do anything, as it is
automatically disabled when matchers are disabled.

This restores the ability to create flow steering rules on representors
via regular verbs.

Fixes: a1462351b5 ("RDMA/mlx5: Fail early if user tries to create flows on IB representors")
Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-25 11:58:06 -07:00
Lijun Ou
9d9d4ff788 RDMA/hns: Update the kernel header file of hns
The hns_roce_ib_create_srq_resp is used to interact with the user for
data, this was open coded to use a u32 directly, instead use a properly
sized structure.

Fixes: c7bcb13442 ("RDMA/hns: Add SRQ support for hip08 kernel mode")
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-25 09:55:48 -07:00
Ira Weiny
ff0244bb59 RDMA/qib: Use GUP longterm for PSM page pining
Similar to the core change commit 5f1d43de54 ("IB/core: disable memory
registration of filesystem-dax vmas")

PSM should be prevented from using filesystem DAX pages.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-24 09:22:30 -07:00
Yangyang Li
0e40dc2f70 RDMA/hns: Add timer allocation support for hip08
This patch adds qpc timer and cqc timer allocation support for hardware
timeout retransmission in kernel space driver.

Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-24 09:22:30 -07:00
Yangyang Li
aa84fa1874 RDMA/hns: Add SCC context clr support for hip08
This patch adds SCC context clear support for DCQCN in kernel space
driver.

Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-24 09:22:30 -07:00
Yangyang Li
6a157f7d1b RDMA/hns: Add SCC context allocation support for hip08
This patch adds SCC context allocation and initialization support for
DCQCN in kernel space driver.

Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-24 09:22:30 -07:00