Commit Graph

2744 Commits

Author SHA1 Message Date
Avihai Horon
3ff4de8f60 RDMA/core: Change rdma_get_gid_attr returned error code
Change the error code returned from rdma_get_gid_attr when the GID entry
is invalid but the GID index is in the gid table size range to -ENODATA
instead of -EINVAL.

This change is done in order to provide more accurate error reporting to
be used by the new GID query API in user space. Nevertheless, -EINVAL is
still returned from sysfs in the aforementioned case to maintain
compatibility with user space that expects -EINVAL.
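
As an illustration only (this is not the kernel's code), the split between
the two error codes could be sketched in user-space C as follows. The table
layout and the names gid_lookup() and sysfs_gid_lookup() are hypothetical
stand-ins chosen for this sketch:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified GID table: each slot is either valid or not. */
struct gid_table {
	size_t size;
	const bool *valid;
};

/* Sample data for the sketch: slot 1 is in range but holds no valid GID. */
static const bool sample_valid[] = { true, false, true };
static const struct gid_table sample = { 3, sample_valid };

/* New-style lookup: distinguish "out of range" from "in range but empty". */
static int gid_lookup(const struct gid_table *t, size_t index)
{
	assert(t != NULL);
	if (index >= t->size)
		return -EINVAL;		/* index outside the table */
	if (!t->valid[index])
		return -ENODATA;	/* index fine, entry invalid */
	return 0;
}

/* Legacy-style wrapper: keep reporting -EINVAL for the invalid-entry case. */
static int sysfs_gid_lookup(const struct gid_table *t, size_t index)
{
	int ret = gid_lookup(t, index);

	return ret == -ENODATA ? -EINVAL : ret;
}
```

With the sample table, slot 1 yields -ENODATA through gid_lookup() and
-EINVAL through the compatibility wrapper.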

Link: https://lore.kernel.org/r/20200923165015.2491894-2-leon@kernel.org
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-10-01 21:20:11 -03:00
Rikard Falkeborn
42d5179c89 RDMA/core: Constify struct attribute_group
The only usage of the pma_table field in the ib_port struct is to pass its
address to sysfs_create_group() and sysfs_remove_group(). Make it const to
make it possible to constify a couple of static struct
attribute_group. This allows the compiler to put them in read-only memory.

Link: https://lore.kernel.org/r/20200930224004.24279-2-rikard.falkeborn@gmail.com
Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-10-01 20:44:51 -03:00
Yishai Hadas
8bfafde086 IB/core: Enable ODP sync without faulting
Enable ODP sync without faulting. This improves performance by reducing
the number of page faults in the system.

The gain from this option is that the device page table can be aligned
with the presented pages in the CPU page table without causing page
faults.

As a result, the data path overhead of the hardware triggering a fault,
which ends up calling the driver to bring in the pages, is dropped.

Link: https://lore.kernel.org/r/20200930163828.1336747-3-leon@kernel.org
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-10-01 16:44:05 -03:00
Yishai Hadas
36f30e486d IB/core: Improve ODP to use hmm_range_fault()
Use hmm_range_fault() instead of get_user_pages_remote() to improve
performance in a few aspects:

- Dropping the need to allocate and free memory to hold its output

- No need any more to use put_page() to unpin the pages

- The logic to detect contiguous pages is done based on the returned
  order, no need to run per page and evaluate.

In addition, moving to hmm_range_fault() makes it possible to reduce page
faults in the system with its snapshot mode, which will be introduced in
the next patches of this series.

As part of this, clean up some flows and use the required data structures
to work with hmm_range_fault().

Link: https://lore.kernel.org/r/20200930163828.1336747-2-leon@kernel.org
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-10-01 16:39:54 -03:00
Jason Gunthorpe
2ee9bf346f RDMA/addr: Fix race with netevent_callback()/rdma_addr_cancel()
This three thread race can result in the work being run once the callback
becomes NULL:

       CPU1                 CPU2                   CPU3
 netevent_callback()
                     process_one_req()       rdma_addr_cancel()
                      [..]
     spin_lock_bh()
  	set_timeout()
     spin_unlock_bh()

						spin_lock_bh()
						list_del_init(&req->list);
						spin_unlock_bh()

		     req->callback = NULL
		     spin_lock_bh()
		       if (!list_empty(&req->list))
                         // Skipped!
		         // cancel_delayed_work(&req->work);
		     spin_unlock_bh()

		    process_one_req() // again
		     req->callback() // BOOM
						cancel_delayed_work_sync()

The solution is to always cancel the work once it is completed so any
in between set_timeout() does not result in it running again.

Cc: stable@vger.kernel.org
Fixes: 44e75052bc ("RDMA/rdma_cm: Make rdma_addr_cancel into a fence")
Link: https://lore.kernel.org/r/20200930072007.1009692-1-leon@kernel.org
Reported-by: Dan Aloni <dan@kernelim.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-30 15:29:05 -03:00
Jason Gunthorpe
a6f0b08dba RDMA/core: Remove ucontext->closing
Nothing reads this any more, and the reason for its existence has passed
due to the deferred fput() scheme.

Fixes: 8ea1f989aa ("drivers/IB,usnic: reduce scope of mmap_sem")
Link: https://lore.kernel.org/r/0-v1-df64ff042436+42-uctx_closing_jgg@nvidia.com
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-30 15:27:19 -03:00
Leon Romanovsky
5807bb3205 RDMA/core: Align write and ioctl checks of QP types
The ioctl flow checks that the user provides only a supported list of QP
types, while write flow didn't do it and relied on the driver to check
it. Align those flows to fail as early as possible.

Link: https://lore.kernel.org/r/20200926102450.2966017-8-leon@kernel.org
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-29 13:11:06 -03:00
Leon Romanovsky
b09c4d7012 RDMA/restrack: Improve readability in task name management
Use rdma_restrack_set_name() and rdma_restrack_parent_name() instead of
tricky uses of rdma_restrack_attach_task()/rdma_restrack_uadd().

This uniformly makes all restracks add'd using rdma_restrack_add().

Link: https://lore.kernel.org/r/20200922091106.2152715-6-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-22 19:47:35 -03:00
Leon Romanovsky
c34a23c28c RDMA/restrack: Simplify restrack tracking in kernel flows
Have a single rdma_restrack_add() that adds an entry, there is no reason
to split the user/kernel here, the rdma_restrack_set_task() is responsible
for this difference.

This patch prepares the code for the future requirement of making restrack
mandatory for managing ib objects.

Link: https://lore.kernel.org/r/20200922091106.2152715-5-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-22 19:47:35 -03:00
Leon Romanovsky
13ef5539de RDMA/restrack: Count references to the verbs objects
Refactor the restrack code to make sure the kref inside the restrack entry
properly kref's the object in which it is embedded. This slight change is
needed for future conversions of MR and QP which are refcounted before the
release and kfree.

The ideal flow from the ib_core perspective is as follows:
* Allocate ib_* structure with rdma_zalloc_*.
* Set everything that is known to ib_core to that newly created object.
* Initialize kref with restrack help
* Call to driver specific allocation functions.
* Insert into restrack DB
....
* Return and release restrack with restrack_put.

Largely this means a rdma_restrack_new() should be called near allocating
the containing structure.

Link: https://lore.kernel.org/r/20200922091106.2152715-4-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-22 19:47:35 -03:00
Leon Romanovsky
60aaeffa36 RDMA/cma: Delete from restrack DB after successful destroy
Update the code to have a destroy pattern similar to other IB objects.

This change creates asymmetry with the rdma_id_private create flow to make
sure that memory is managed by restrack.

Link: https://lore.kernel.org/r/20200922091106.2152715-2-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-22 19:47:34 -03:00
Jason Gunthorpe
f5449e7480 RDMA/ucma: Rework ucma_migrate_id() to avoid races with destroy
ucma_destroy_id() assumes that all things accessing the ctx will do so via
the xarray. This assumption is violated only when the FD is being closed,
in which case the ctx is reached via the ctx_list. Normally this is OK
since ucma_destroy_id() cannot run concurrently with release(), however
when ucma_migrate_id() is involved this can be violated as the close of
the 2nd FD can run concurrently with destroy on the first:

                CPU0                      CPU1
        ucma_destroy_id(fda)
                                  ucma_migrate_id(fda -> fdb)
                                       ucma_get_ctx()
        xa_lock()
         _ucma_find_context()
         xa_erase()
        xa_unlock()
                                       xa_lock()
                                        ctx->file = new_file
                                        list_move()
                                       xa_unlock()
                                      ucma_put_ctx()

                                   ucma_close(fdb)
                                      _destroy_id()
                                      kfree(ctx)

        _destroy_id()
          wait_for_completion()
          // boom, ctx was freed

The ctx->file must be modified under the handler and xa_lock, and prior to
modification the ID must be rechecked that it is still reachable from
cur_file, ie there is no parallel destroy or migrate.

To make this work remove the double locking and streamline the control
flow. The double locking was obsoleted by the handler lock now directly
preventing new uevents from being created, and the ctx_list cannot be read
while holding fgets on both files. Removing the double locking also
removes the need to check for the same file.

Fixes: 88314e4dda ("RDMA/cma: add support for rdma_migrate_id()")
Link: https://lore.kernel.org/r/0-v1-05c5a4090305+3a872-ucma_syz_migrate_jgg@nvidia.com
Reported-and-tested-by: syzbot+cc6fc752b3819e082d0c@syzkaller.appspotmail.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-18 20:54:01 -03:00
Jason Gunthorpe
5dee5872f8 Merge branch 'mlx5_active_speed' into rdma.git for-next
Leon Romanovsky says:

====================
IBTA declares speed as 16 bits, but kernel stores it in u8. This series
fixes in-kernel declaration while keeping external interface intact.
====================

Based on the mlx5-next branch at
     git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
due to dependencies.

* branch 'mlx5_active_speed':
  RDMA: Fix link active_speed size
  RDMA/mlx5: Delete duplicated mlx5_ptys_width enum
  net/mlx5: Refactor query port speed functions
2020-09-18 10:31:45 -03:00
Aharon Landau
376ceb31ff RDMA: Fix link active_speed size
According to the IB spec, active_speed size should be u16 and not u8 as
before. Change it to allow further extensions in offered speeds.
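
As a hedged illustration of why the wider type matters (this is not code
from the patch), the sketch below stores a speed flag in 8 and in 16 bits.
EXAMPLE_SPEED_BIT8 and both helper names are hypothetical:

```c
#include <stdint.h>

/* Hypothetical speed flag one bit beyond what an 8-bit field can hold. */
#define EXAMPLE_SPEED_BIT8 (1u << 8)

/* Old-style storage: the value is truncated to its low 8 bits. */
static uint8_t store_speed_u8(unsigned int speed)
{
	return (uint8_t)speed;
}

/* New-style storage: 16 bits keep the flag intact. */
static uint16_t store_speed_u16(unsigned int speed)
{
	return (uint16_t)speed;
}
```

Storing the example flag through the 8-bit helper loses it entirely, while
the 16-bit helper preserves it.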

Link: https://lore.kernel.org/r/20200917090223.1018224-4-leon@kernel.org
Signed-off-by: Aharon Landau <aharonl@mellanox.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-18 10:31:24 -03:00
Leon Romanovsky
c0a6b5ecc5 RDMA: Convert RWQ table logic to ib_core allocation scheme
Move struct ib_rwq_ind_table allocation to ib_core.

Link: https://lore.kernel.org/r/20200902081623.746359-3-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-17 14:04:33 -03:00
Leon Romanovsky
d18bb3e152 RDMA: Clean MW allocation and free flows
Move allocation and destruction of memory windows under ib_core
responsibility and clean drivers to ensure that no updates to MW
ib_core structures are done in driver layer.

Link: https://lore.kernel.org/r/20200902081623.746359-2-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-17 14:04:32 -03:00
Jason Gunthorpe
b5de0c60cc RDMA/cma: Fix use after free race in roce multicast join
The roce path triggers a work queue that continues to touch the id_priv
but doesn't hold any reference on it. Further, unlike in the IB case, the
work queue is not fenced during rdma_destroy_id().

This can trigger a use after free if a destroy is triggered in the
incredibly narrow window after the queue_work and the work starting and
obtaining the handler_mutex.

The only purpose of this work queue is to run the ULP event callback from
the standard context, so switch the design to use the existing
cma_work_handler() scheme. This simplifies quite a lot of the flow:

- Use the cma_work_handler() callback to launch the work for roce. This
  requires generating the event synchronously inside the
  rdma_join_multicast(), which in turn means the dummy struct
  ib_sa_multicast can become a simple stack variable.

- cma_work_handler() uses the id_priv kref, so we can entirely eliminate
  the kref inside struct cma_multicast. Since the cma_multicast never
  leaks into an unprotected work queue the kfree can be done at the same
  time as for IB.

- Eliminating the general multicast.ib requires using cma_set_mgid() in a
  few places to recompute the mgid.

Fixes: 3c86aa70bf ("RDMA/cm: Add RDMA CM support for IBoE devices")
Link: https://lore.kernel.org/r/20200902081122.745412-9-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-17 09:09:25 -03:00
Jason Gunthorpe
3788d2997b RDMA/cma: Consolidate the destruction of a cma_multicast in one place
Two places were open coding this sequence, and also pull in
cma_leave_roce_mc_group() which was called only once.

Link: https://lore.kernel.org/r/20200902081122.745412-8-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-17 09:09:24 -03:00
Jason Gunthorpe
1bb5091def RDMA/cma: Remove dead code for kernel rdmacm multicast
There is no kernel user of RDMA CM multicast so this code managing the
multicast subscription of the kernel-only internal QP is dead. Remove it.

This makes the bug fixes in the next patches much simpler.

Link: https://lore.kernel.org/r/20200902081122.745412-7-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-17 09:09:24 -03:00
Jason Gunthorpe
7e85bcda8b RDMA/cma: Combine cma_ndev_work with cma_work
These are the same thing, except that cma_ndev_work doesn't have a state
transition. Signal no state transition by setting old_state and new_state
== 0.

In all cases the handler function should not be called once
rdma_destroy_id() has progressed past setting the state.

Link: https://lore.kernel.org/r/20200902081122.745412-6-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-17 09:09:24 -03:00
Jason Gunthorpe
5cfbf9291e RDMA/cma: Remove cma_comp()
The only place that still uses it is rdma_join_multicast() which is only
doing a sanity check that the caller hasn't done something wrong and
doesn't need the spinlock.

At least in the case of rdma_join_multicast() the information it needs
will remain until the ID is destroyed once it enters these
states. Similarly there is no reason to check for these specific states in
the handler callback, instead use the usual check for a destroyed id under
the handler_mutex.

Link: https://lore.kernel.org/r/20200902081122.745412-5-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-17 09:09:24 -03:00
Jason Gunthorpe
d490ee52f0 RDMA/cma: Fix locking for the RDMA_CM_LISTEN state
There is a strange unlocked read of the ID state when checking for
reuseaddr. This is because an ID cannot be reusable once it becomes a
listening ID. Instead of using the state to exclude reuse, just clear it
as part of rdma_listen()'s flow to convert reusable into not reusable.

Once an ID goes to listen there is no way back out, and the only use of
reusable is on the bind_list check.

Finally, update the checks under handler_mutex to use READ_ONCE and audit
that once RDMA_CM_LISTEN is observed in a req callback it is stable under
the handler_mutex.

Link: https://lore.kernel.org/r/20200902081122.745412-4-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-17 09:09:24 -03:00
Jason Gunthorpe
732d41c545 RDMA/cma: Make the locking for automatic state transition more clear
Re-organize things so the state variable is not read unlocked. The first
attempt to go directly from ADDR_BOUND immediately tells us if the ID is
already bound, if we can't do that then the attempt inside
rdma_bind_addr() to go from IDLE to ADDR_BOUND confirms the ID needs
binding.

Link: https://lore.kernel.org/r/20200902081122.745412-3-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-17 09:09:23 -03:00
Jason Gunthorpe
2a7cec5381 RDMA/cma: Fix locking for the RDMA_CM_CONNECT state
It is currently a bit confusing, but the design is if the handler_mutex
is held, and the state is in RDMA_CM_CONNECT, then the state cannot leave
RDMA_CM_CONNECT without also serializing with the handler_mutex.

Make this clearer by adding a direct assertion, fixing the usage in
rdma_connect and generally using READ_ONCE to read the state value.

Link: https://lore.kernel.org/r/20200902081122.745412-2-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-17 09:09:23 -03:00
Linus Torvalds
b1df2a0783 RDMA second 5.9-rc pull request
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma fixes from Jason Gunthorpe:
 "A number of driver bug fixes and a few recent regressions:

   - Several bug fixes for bnxt_re. Crashing, incorrect data reported,
     and corruption on new HW

   - Memory leak and crash in rxe

   - Fix sysfs corruption in rxe if the netdev name is too long

   - Fix a crash on error unwind in the new cq_pool code

   - Fix kobject panics in rtrs by working device lifetime properly

   - Fix a data corruption bug in iser target related to misaligned
     buffers"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
  IB/isert: Fix unaligned immediate-data handling
  RDMA/rtrs-srv: Set .release function for rtrs srv device during device init
  RDMA/bnxt_re: Remove set but not used variable 'qplib_ctx'
  RDMA/core: Fix reported speed and width
  RDMA/core: Fix unsafe linked list traversal after failing to allocate CQ
  RDMA/bnxt_re: Remove the qp from list only if the qp destroy succeeds
  RDMA/bnxt_re: Fix driver crash on unaligned PSN entry address
  RDMA/bnxt_re: Restrict the max_gids to 256
  RDMA/bnxt_re: Static NQ depth allocation
  RDMA/bnxt_re: Fix the qp table indexing
  RDMA/bnxt_re: Do not report transparent vlan from QP1
  RDMA/mlx4: Read pkey table length instead of hardcoded value
  RDMA/rxe: Fix panic when calling kmem_cache_create()
  RDMA/rxe: Fix memleak in rxe_mem_init_user
  RDMA/rxe: Fix the parent sysfs read when the interface has 15 chars
  RDMA/rtrs-srv: Replace device_register with device_initialize and device_add
2020-09-11 10:02:36 -07:00
Jason Gunthorpe
81655d3c4a RDMA/mlx4: Use ib_umem_num_dma_blocks()
For the calls linked to mlx4_ib_umem_calc_optimal_mtt_size() use
ib_umem_num_dma_blocks() inside the function; it is just some weird static
default.

All other places are just using it with PAGE_SIZE, switch to
ib_umem_num_dma_blocks().

As this is the last call site, remove ib_umem_num_count().

Link: https://lore.kernel.org/r/15-v2-270386b7e60b+28f4-umem_1_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-11 10:24:54 -03:00
Jason Gunthorpe
a665aca89a RDMA/umem: Split ib_umem_num_pages() into ib_umem_num_dma_blocks()
ib_umem_num_pages() should only be used by things working with the SGL in
CPU pages directly.

Drivers building DMA lists should use the new ib_umem_num_dma_blocks() which
returns the number of blocks rdma_umem_for_each_block() will return.

To make this general for DMA drivers requires a different implementation.
Computing DMA block count based on umem->address only works if the
requested page size is < PAGE_SIZE and/or the IOVA == umem->address.

Instead the number of DMA pages should be computed in the IOVA address
space, not umem->address. Thus the IOVA has to be stored inside the umem
so it can be used for these calculations.

For now set it to umem->address by default and fix it up if
ib_umem_find_best_pgsz() was called. This allows drivers to be converted
to ib_umem_num_dma_blocks() safely.
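
The IOVA-space computation described above can be sketched in plain C. This
is a simplified illustration under the assumption that the block size is a
power of two; num_dma_blocks() is a hypothetical helper name, not the
kernel function itself:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Count the aligned blocks of size pgsz needed to cover the range
 * [iova, iova + length). The start is rounded down and the end is
 * rounded up to a block boundary, both in IOVA space.
 */
static size_t num_dma_blocks(uint64_t iova, uint64_t length, uint64_t pgsz)
{
	uint64_t first, last;

	/* pgsz must be a non-zero power of two for the masks to work. */
	assert(pgsz != 0 && (pgsz & (pgsz - 1)) == 0);

	first = iova & ~(pgsz - 1);
	last = (iova + length + pgsz - 1) & ~(pgsz - 1);
	return (size_t)((last - first) / pgsz);
}
```

For example, a 2 byte range starting at IOVA 4095 straddles a 4096 byte
boundary and therefore needs two blocks, even though it is far smaller than
one block.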

Link: https://lore.kernel.org/r/6-v2-270386b7e60b+28f4-umem_1_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-11 10:24:53 -03:00
Jason Gunthorpe
3361c29e92 RDMA/umem: Use simpler logic for ib_umem_find_best_pgsz()
The calculation in rdma_find_pg_bit() is fairly complicated, and the
function is never called anywhere else. Inline a simpler version into
ib_umem_find_best_pgsz()

Link: https://lore.kernel.org/r/3-v2-270386b7e60b+28f4-umem_1_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-09 15:33:17 -03:00
Jason Gunthorpe
10c75ccb54 RDMA/umem: Prevent small pages from being returned by ib_umem_find_best_pgsz()
rdma_for_each_block() makes assumptions about how the SGL is constructed
that don't work if the block size is below the page size used to build
the SGL.

The rules for umem SGL construction require that the SG's all be PAGE_SIZE
aligned and we don't encode the actual byte offset of the VA range inside
the SGL using offset and length. So rdma_for_each_block() has no idea
where the actual starting/ending point is to compute the first/last block
boundary if the starting address should be within a SGL.

Fixing the SGL construction turns out to be really hard, and will be the
subject of other patches. For now block smaller pages.

Fixes: 4a35339958 ("RDMA/umem: Add API to find best driver supported page size in an MR")
Link: https://lore.kernel.org/r/2-v2-270386b7e60b+28f4-umem_1_jgg@nvidia.com
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-09 15:33:17 -03:00
Jason Gunthorpe
a40c20dabd RDMA/umem: Fix ib_umem_find_best_pgsz() for mappings that cross a page boundary
It is possible for a single SGL to span an aligned boundary, eg if the SGL
is

  61440 -> 90112

Then the length is 28672, which currently limits the block size to
32k. With a 32k page size the two covering blocks will be:

  32768->65536 and 65536->98304

However, the correct answer is a 128K block size which will span the whole
28672 bytes in a single block.

Instead of limiting based on length figure out which high IOVA bits don't
change between the start and end addresses. That is the highest useful
page size.
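
The numbers in the example above can be reproduced with a small sketch. It
only models the "high bits that don't change" step and ignores the physical
alignment constraints of the real function; single_block_pgsz() is a
hypothetical name:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Smallest power-of-two block size for which the first and last byte of
 * [iova, iova + length) fall inside the same aligned block.
 */
static uint64_t single_block_pgsz(uint64_t iova, uint64_t length)
{
	uint64_t diff;
	uint64_t pgsz = 1;

	assert(length > 0);
	/* Bits that differ between the first and the last byte address. */
	diff = iova ^ (iova + length - 1);
	/* Keep the shift below from overflowing. */
	assert((diff >> 63) == 0);

	/* Grow until every differing bit lies inside the block offset. */
	while (pgsz <= diff)
		pgsz <<= 1;
	return pgsz;
}
```

For the range 61440 -> 90112 the differing bits reach bit 16, so the result
is 131072 (128K), matching the commit message rather than the 32k derived
from the length alone.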

Fixes: 4a35339958 ("RDMA/umem: Add API to find best driver supported page size in an MR")
Link: https://lore.kernel.org/r/1-v2-270386b7e60b+28f4-umem_1_jgg@nvidia.com
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-09 15:33:17 -03:00
Leon Romanovsky
71ff3f6268 RDMA: Make counters destroy symmetrical
Change counters to return failure like any other verbs destroy, however
this flow shouldn't return error at all.

Link: https://lore.kernel.org/r/20200907120921.476363-10-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-09 14:14:29 -03:00
Leon Romanovsky
add53535fb RDMA: Restore ability to return error for destroy WQ
Make this interface symmetrical to other destroy paths.

Fixes: a49b1dc7ae ("RDMA: Convert destroy_wq to be void")
Link: https://lore.kernel.org/r/20200907120921.476363-9-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-09 14:14:29 -03:00
Leon Romanovsky
d0c45c8556 RDMA: Change XRCD destroy return value
Update XRCD destroy flow to allow command failure.

Fixes: 28ad5f65c3 ("RDMA: Move XRCD to be under ib_core responsibility")
Link: https://lore.kernel.org/r/20200907120921.476363-8-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-09 14:14:29 -03:00
Leon Romanovsky
43d781b9fa RDMA: Allow fail of destroy CQ
Like any other verbs objects, CQ shouldn't fail during destroy, but
mlx5_ib didn't follow this contract when mixing IB verbs objects with
DEVX. Such a mix leads to a situation where FW and kernel are fully
interdependent on the reference counting of each side.

Kernel verbs and drivers that don't have DEVX flows shouldn't fail.

Fixes: e39afe3d6d ("RDMA: Convert CQ allocations to be under core responsibility")
Link: https://lore.kernel.org/r/20200907120921.476363-7-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-09 14:14:29 -03:00
Leon Romanovsky
7e3c66c9a9 RDMA/core: Delete function indirection for alloc/free kernel CQ
The ib_alloc_cq*() and ib_free_cq*() are solely kernel verbs to manage CQs
and don't need extra indirection just to call the same functions with
constant parameter NULL as udata.

Link: https://lore.kernel.org/r/20200907120921.476363-6-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-09 14:14:28 -03:00
Leon Romanovsky
119181d1d4 RDMA: Restore ability to fail on SRQ destroy
In a similar way to other IB objects, restore the ability to return error
on SRQ destroy. Strictly speaking, this change is not necessary, and is
provided here to ensure a symmetrical interface like other destroy
functions.

Fixes: 68e326dea1 ("RDMA: Handle SRQ allocations by IB/core")
Link: https://lore.kernel.org/r/20200907120921.476363-5-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-09 14:14:24 -03:00
Leon Romanovsky
9a9ebf8cd7 RDMA: Restore ability to fail on AH destroy
Like any other IB verbs objects, AH are refcounted by ib_core. The release
of those objects is controlled by ib_core with the promise that AH destroy
can't fail.

As AH is a SW object for now, this change makes dealloc_ah() behave like
any other IB destroy flow.

Fixes: d345691471 ("RDMA: Handle AH allocations by IB/core")
Link: https://lore.kernel.org/r/20200907120921.476363-3-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-09 13:57:22 -03:00
Leon Romanovsky
91a7c58fce RDMA: Restore ability to fail on PD deallocate
The IB verbs objects are counted by the kernel and ib_core ensures that
deallocate PD will succeed, so it is called only once all other objects
that depend on the PD have been released. This is achieved by managing
various reference counters on such objects.

The mlx5 driver didn't follow this standard flow when it allowed DEVX
objects that are not managed by ib_core to be interleaved with the ones
under ib_core responsibility.

In such interleaved scenarios the deallocate command can fail and ib_core
will leave the uobject in its internal DB and attempt to clean it later to
free resources anyway.

This change partially restores returned value from dealloc_pd() for all
drivers, but keeping in mind that non-DEVX devices and kernel verbs paths
shouldn't fail.

Fixes: 21a428a019 ("RDMA: Handle PD allocations by IB/core")
Link: https://lore.kernel.org/r/20200907120921.476363-2-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-09 13:57:22 -03:00
Jason Gunthorpe
f553246f7f RDMA/core: Change how failing destroy is handled during uobj abort
Currently it triggers a WARN_ON and then goes ahead and destroys the
uobject anyhow, leaking any driver memory.

The only place that leaks driver memory should be during FD close() in
uverbs_destroy_ufile_hw().

Drivers are only allowed to fail destroy uobjects if they guarantee
destroy will eventually succeed. uverbs_destroy_ufile_hw() provides the
loop to give the driver that chance.

Link: https://lore.kernel.org/r/20200902081708.746631-1-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-09 13:16:48 -03:00
Alex Dewar
4f680cb9f1 RDMA/ucma: Fix resource leak on error path
In ucma_process_join(), if the call to xa_alloc() fails, the function will
return without freeing mc. Fix this by jumping to the correct line.

In the process I renamed the jump labels to something more memorable for
extra clarity.

Link: https://lore.kernel.org/r/20200902162454.332828-1-alex.dewar90@gmail.com
Addresses-Coverity-ID: 1496814 ("Resource leak")
Fixes: 95fe51096b ("RDMA/ucma: Remove mc_list and rely on xarray")
Signed-off-by: Alex Dewar <alex.dewar90@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-02 17:06:48 -03:00
Kamal Heib
28b0865714 RDMA/core: Fix reported speed and width
When the returned speed from __ethtool_get_link_ksettings() is
SPEED_UNKNOWN this will lead to reporting a wrong speed and width for
providers that use ib_get_eth_speed(). Fix that by defaulting the
netdev_speed to SPEED_1000 in case the returned value from
__ethtool_get_link_ksettings() is SPEED_UNKNOWN.
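
A minimal sketch of the defaulting rule, outside the kernel, could look like
this. The two constants are local stand-ins mirroring the ethtool values
(-1 and 1000), and effective_netdev_speed() is a hypothetical helper name:

```c
/* Local stand-ins for the ethtool constants named in the commit message. */
#define EX_SPEED_UNKNOWN (-1)
#define EX_SPEED_1000    1000

/* Fall back to 1000 Mb/s when the netdev cannot report a link speed. */
static int effective_netdev_speed(int reported)
{
	return reported == EX_SPEED_UNKNOWN ? EX_SPEED_1000 : reported;
}
```

Any speed that is actually reported passes through unchanged; only the
unknown marker is replaced by the default.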

Fixes: d41861942f ("IB/core: Add generic function to extract IB speed from netdev")
Link: https://lore.kernel.org/r/20200902124304.170912-1-kamalheib1@gmail.com
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-02 16:58:54 -03:00
Xi Wang
8aa64be019 RDMA/core: Fix unsafe linked list traversal after failing to allocate CQ
It's not safe to access the next CQ in list_for_each_entry() after
invoking ib_free_cq(), because the CQ has already been freed in the
current iteration. It should be replaced by list_for_each_entry_safe().
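
The same hazard exists for any linked list whose nodes are freed while
walking it. The user-space sketch below uses a hypothetical singly linked
struct node (not the kernel list API) to show the "save next before
freeing" pattern that the _safe iterator provides:

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical list node used only for this illustration. */
struct node {
	struct node *next;
	int id;
};

/* Prepend a node; on allocation failure the list is returned unchanged. */
static struct node *push(struct node *head, int id)
{
	struct node *n = malloc(sizeof(*n));

	if (!n)
		return head;
	n->id = id;
	n->next = head;
	return n;
}

/* Free every node and return how many were freed. */
static size_t free_all(struct node *head)
{
	struct node *cur = head;
	struct node *next;
	size_t freed = 0;

	while (cur) {
		next = cur->next;	/* read the link before the node is gone */
		free(cur);
		cur = next;
		freed++;
	}
	return freed;
}
```

Reading cur->next after free(cur) would be a use after free, which is the
bug class this commit removes.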

Fixes: c7ff819aef ("RDMA/core: Introduce shared CQ pool API")
Link: https://lore.kernel.org/r/1598963935-32335-1-git-send-email-liweihang@huawei.com
Signed-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-02 15:56:40 -03:00
Jason Gunthorpe
6989aa62d3 Linux 5.9-rc3
Merge tag 'v5.9-rc3' into rdma.git for-next

Required due to dependencies in following patches.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-08-31 12:28:12 -03:00
Jason Gunthorpe
5d985d724b RDMA/core: Trigger a WARN_ON if the driver causes uobjects to become leaked
Drivers that fail destroy can cause uverbs to leak uobjects. Drivers are
required to always eventually destroy their uobjects, so trigger a WARN_ON
to detect this driver bug.

Link: https://lore.kernel.org/r/0-v1-b1e0ed400ba9+f7-warn_destroy_ufile_hw_jgg@nvidia.com
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-08-31 12:10:11 -03:00
Jason Gunthorpe
657360d6c7 RDMA/ucma: Remove closing and the close_wq
Use cancel_work_sync() to ensure that the wq is not running and simply
assign NULL to ctx->cm_id to indicate if the work ran or not. Delete the
close_wq since flush_workqueue() is no longer needed.

Link: https://lore.kernel.org/r/20200818120526.702120-15-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-08-27 08:38:16 -03:00
Jason Gunthorpe
a1d33b70db RDMA/ucma: Rework how new connections are passed through event delivery
When a new connection is established the RDMA CM creates a new cm_id and
passes it through to the event handler. However inside the UCMA the new ID
is not assigned a ucma_context until the user retrieves the event from a
syscall.

This creates a weird edge condition where a cm_id's context can continue
to point at the listening_id that created it, and a number of additional
edge conditions on event list clean up related to destroying half created
IDs.

There is also a race condition in ucma_get_events() where the
cm_id->context is being assigned without holding the handler_mutex.

Simplify all of this by creating the ucma_context inside the event handler
itself and eliminating the edge case of a half created cm_id. All cm_id's
can be uniformly destroyed via __destroy_id() or via the close_work.

Link: https://lore.kernel.org/r/20200818120526.702120-14-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-08-27 08:38:16 -03:00
Jason Gunthorpe
310ca1a7dc RDMA/ucma: Narrow file->mut in ucma_event_handler()
Since the backlog is now an atomic, the file->mut now only protects
the event_list and ctx_list. Narrow its scope to make that clear.

Link: https://lore.kernel.org/r/20200818120526.702120-13-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-08-27 08:38:16 -03:00
Jason Gunthorpe
26c15dec49 RDMA/ucma: Change backlog into an atomic
There is no reason to grab the file->mut just to do this inc/dec work. Use
an atomic.

Link: https://lore.kernel.org/r/20200818120526.702120-12-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-08-27 08:38:16 -03:00
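The idea in 26c15dec49 of counting the backlog without a mutex can be sketched with C11 atomics. This is a userspace illustration under stated assumptions, not the ucma code: struct listener, backlog_take() and backlog_release() are hypothetical names, and the kernel uses its own atomic_t helpers rather than <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical listener that tracks how many more connections it accepts. */
struct listener {
	atomic_int backlog;
};

/*
 * Take one backlog slot if any are left. The compare-and-swap loop
 * decrements only while the counter is positive, so no lock is needed
 * and the counter never goes below zero.
 */
static bool backlog_take(struct listener *l)
{
	int cur = atomic_load(&l->backlog);

	while (cur > 0) {
		/* On failure, cur is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&l->backlog, &cur, cur - 1))
			return true;
	}
	return false;
}

/* Give a slot back, for example once the event has been consumed. */
static void backlog_release(struct listener *l)
{
	atomic_fetch_add(&l->backlog, 1);
}
```

Because the counter no longer needs the mutex, the lock can be reserved for the lists it actually protects.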
Jason Gunthorpe
38e03d0926 RDMA/ucma: Add missing locking around rdma_leave_multicast()
All entry points to the rdma_cm from a ULP must be single threaded,
even this error unwind. Add the missing locking.

Fixes: 7c11910783 ("RDMA/ucma: Put a lock around every call to the rdma_cm layer")
Link: https://lore.kernel.org/r/20200818120526.702120-11-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-08-27 08:38:15 -03:00
Jason Gunthorpe
98837c6c3d RDMA/ucma: Fix locking for ctx->events_reported
This value is locked under the file->mut, so ensure it is held whenever
touching it.

The case in ucma_migrate_id() is a race. In ucma_free_uctx() it is
already not possible for the write side to run, so the movement there is
just for clarity.

Fixes: 88314e4dda ("RDMA/cma: add support for rdma_migrate_id()")
Link: https://lore.kernel.org/r/20200818120526.702120-10-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-08-27 08:38:15 -03:00