This three thread race can result in the work being run once the callback
becomes NULL:
CPU1 CPU2 CPU3
netevent_callback()
process_one_req() rdma_addr_cancel()
[..]
spin_lock_bh()
set_timeout()
spin_unlock_bh()
spin_lock_bh()
list_del_init(&req->list);
spin_unlock_bh()
req->callback = NULL
spin_lock_bh()
if (!list_empty(&req->list))
// Skipped!
// cancel_delayed_work(&req->work);
spin_unlock_bh()
process_one_req() // again
req->callback() // BOOM
cancel_delayed_work_sync()
The solution is to always cancel the work once it is completed so any
in between set_timeout() does not result in it running again.
Cc: stable@vger.kernel.org
Fixes: 44e75052bc ("RDMA/rdma_cm: Make rdma_addr_cancel into a fence")
Link: https://lore.kernel.org/r/20200930072007.1009692-1-leon@kernel.org
Reported-by: Dan Aloni <dan@kernelim.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Functions related to nested interface infrastructure such as
netdev_walk_all_{ upper | lower }_dev() pass both private functions
and "data" pointer to handle their own things.
At this point, the data pointer type is void *.
In order to make it easier to expand common variables and functions,
this new netdev_nested_priv structure is added.
In the following patch, a new member variable will be added into this
struct to fix the lockdep issue.
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Have a single rdma_restrack_add() that adds an entry, there is no reason
to split the user/kernel here, the rdma_restrack_set_task() is responsible
for this difference.
This patch prepares the code to the future requirement of making restrack
is mandatory for managing ib objects.
Link: https://lore.kernel.org/r/20200922091106.2152715-5-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Refactor the restrack code to make sure the kref inside the restrack entry
properly kref's the object in which it is embedded. This slight change is
needed for future conversions of MR and QP which are refcounted before the
release and kfree.
The ideal flow from ib_core perspective as follows:
* Allocate ib_* structure with rdma_zalloc_*.
* Set everything that is known to ib_core to that newly created object.
* Initialize kref with restrack help
* Call to driver specific allocation functions.
* Insert into restrack DB
....
* Return and release restrack with restrack_put.
Largely this means a rdma_restrack_new() should be called near allocating
the containing structure.
Link: https://lore.kernel.org/r/20200922091106.2152715-4-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
ucma_destroy_id() assumes that all things accessing the ctx will do so via
the xarray. This assumption violated only in the case the FD is being
closed, then the ctx is reached via the ctx_list. Normally this is OK
since ucma_destroy_id() cannot run concurrenty with release(), however
with ucma_migrate_id() is involved this can violated as the close of the
2nd FD can run concurrently with destroy on the first:
CPU0 CPU1
ucma_destroy_id(fda)
ucma_migrate_id(fda -> fdb)
ucma_get_ctx()
xa_lock()
_ucma_find_context()
xa_erase()
xa_unlock()
xa_lock()
ctx->file = new_file
list_move()
xa_unlock()
ucma_put_ctx()
ucma_close(fdb)
_destroy_id()
kfree(ctx)
_destroy_id()
wait_for_completion()
// boom, ctx was freed
The ctx->file must be modified under the handler and xa_lock, and prior to
modification the ID must be rechecked that it is still reachable from
cur_file, ie there is no parallel destroy or migrate.
To make this work remove the double locking and streamline the control
flow. The double locking was obsoleted by the handler lock now directly
preventing new uevents from being created, and the ctx_list cannot be read
while holding fgets on both files. Removing the double locking also
removes the need to check for the same file.
Fixes: 88314e4dda ("RDMA/cma: add support for rdma_migrate_id()")
Link: https://lore.kernel.org/r/0-v1-05c5a4090305+3a872-ucma_syz_migrate_jgg@nvidia.com
Reported-and-tested-by: syzbot+cc6fc752b3819e082d0c@syzkaller.appspotmail.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Leon Romanovsky says:
====================
IBTA declares speed as 16 bits, but kernel stores it in u8. This series
fixes in-kernel declaration while keeping external interface intact.
====================
Based on the mlx5-next branch at
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
due to dependencies.
* branch 'mlx5_active_speed':
RDMA: Fix link active_speed size
RDMA/mlx5: Delete duplicated mlx5_ptys_width enum
net/mlx5: Refactor query port speed functions
The roce path triggers a work queue that continues to touch the id_priv
but doesn't hold any reference on it. Futher, unlike in the IB case, the
work queue is not fenced during rdma_destroy_id().
This can trigger a use after free if a destroy is triggered in the
incredibly narrow window after the queue_work and the work starting and
obtaining the handler_mutex.
The only purpose of this work queue is to run the ULP event callback from
the standard context, so switch the design to use the existing
cma_work_handler() scheme. This simplifies quite a lot of the flow:
- Use the cma_work_handler() callback to launch the work for roce. This
requires generating the event synchronously inside the
rdma_join_multicast(), which in turn means the dummy struct
ib_sa_multicast can become a simple stack variable.
- cm_work_handler() used the id_priv kref, so we can entirely eliminate
the kref inside struct cma_multicast. Since the cma_multicast never
leaks into an unprotected work queue the kfree can be done at the same
time as for IB.
- Eliminating the general multicast.ib requires using cma_set_mgid() in a
few places to recompute the mgid.
Fixes: 3c86aa70bf ("RDMA/cm: Add RDMA CM support for IBoE devices")
Link: https://lore.kernel.org/r/20200902081122.745412-9-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
These are the same thing, except that cma_ndev_work doesn't have a state
transition. Signal no state transition by setting old_state and new_state
== 0.
In all cases the handler function should not be called once
rdma_destroy_id() has progressed passed setting the state.
Link: https://lore.kernel.org/r/20200902081122.745412-6-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The only place that still uses it is rdma_join_multicast() which is only
doing a sanity check that the caller hasn't done something wrong and
doesn't need the spinlock.
At least in the case of rdma_join_multicast() the information it needs
will remain until the ID is destroyed once it enters these
states. Similarly there is no reason to check for these specific states in
the handler callback, instead use the usual check for a destroyed id under
the handler_mutex.
Link: https://lore.kernel.org/r/20200902081122.745412-5-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
There is a strange unlocked read of the ID state when checking for
reuseaddr. This is because an ID cannot be reusable once it becomes a
listening ID. Instead of using the state to exclude reuse, just clear it
as part of rdma_listen()'s flow to convert reusable into not reusable.
Once a ID goes to listen there is no way back out, and the only use of
reusable is on the bind_list check.
Finally, update the checks under handler_mutex to use READ_ONCE and audit
that once RDMA_CM_LISTEN is observed in a req callback it is stable under
the handler_mutex.
Link: https://lore.kernel.org/r/20200902081122.745412-4-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Re-organize things so the state variable is not read unlocked. The first
attempt to go directly from ADDR_BOUND immediately tells us if the ID is
already bound, if we can't do that then the attempt inside
rdma_bind_addr() to go from IDLE to ADDR_BOUND confirms the ID needs
binding.
Link: https://lore.kernel.org/r/20200902081122.745412-3-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
It is currently a bit confusing, but the design is if the handler_mutex
is held, and the state is in RDMA_CM_CONNECT, then the state cannot leave
RDMA_CM_CONNECT without also serializing with the handler_mutex.
Make this clearer by adding a direct assertion, fixing the usage in
rdma_connect and generally using READ_ONCE to read the state value.
Link: https://lore.kernel.org/r/20200902081122.745412-2-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
rxe will hold a refcount on the IB device as long as CQ objects exist,
this causes destruction of a rxe device to hang if the CQ pool has any
cached CQs since they are being destroyed after the refcount must go to
zero.
Treat the CQ pool like a client and create/destroy it before/after all
other clients. No users of CQ pool can exist past a client remove call.
Link: https://lore.kernel.org/r/e8a240aa-9e9b-3dca-062f-9130b787f29b@acm.org
Fixes: c7ff819aef ("RDMA/core: Introduce shared CQ pool API")
Tested-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Pull rdma fixes from Jason Gunthorpe:
"A number of driver bug fixes and a few recent regressions:
- Several bug fixes for bnxt_re. Crashing, incorrect data reported,
and corruption on new HW
- Memory leak and crash in rxe
- Fix sysfs corruption in rxe if the netdev name is too long
- Fix a crash on error unwind in the new cq_pool code
- Fix kobject panics in rtrs by working device lifetime properly
- Fix a data corruption bug in iser target related to misaligned
buffers"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
IB/isert: Fix unaligned immediate-data handling
RDMA/rtrs-srv: Set .release function for rtrs srv device during device init
RDMA/bnxt_re: Remove set but not used variable 'qplib_ctx'
RDMA/core: Fix reported speed and width
RDMA/core: Fix unsafe linked list traversal after failing to allocate CQ
RDMA/bnxt_re: Remove the qp from list only if the qp destroy succeeds
RDMA/bnxt_re: Fix driver crash on unaligned PSN entry address
RDMA/bnxt_re: Restrict the max_gids to 256
RDMA/bnxt_re: Static NQ depth allocation
RDMA/bnxt_re: Fix the qp table indexing
RDMA/bnxt_re: Do not report transparent vlan from QP1
RDMA/mlx4: Read pkey table length instead of hardcoded value
RDMA/rxe: Fix panic when calling kmem_cache_create()
RDMA/rxe: Fix memleak in rxe_mem_init_user
RDMA/rxe: Fix the parent sysfs read when the interface has 15 chars
RDMA/rtrs-srv: Replace device_register with device_initialize and device_add
For the calls linked to mlx4_ib_umem_calc_optimal_mtt_size() use
ib_umem_num_dma_blocks() inside the function, it is just some weird static
default.
All other places are just using it with PAGE_SIZE, switch to
ib_umem_num_dma_blocks().
As this is the last call site, remove ib_umem_num_count().
Link: https://lore.kernel.org/r/15-v2-270386b7e60b+28f4-umem_1_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
ib_umem_num_pages() should only be used by things working with the SGL in
CPU pages directly.
Drivers building DMA lists should use the new ib_num_dma_blocks() which
returns the number of blocks rdma_umem_for_each_block() will return.
To make this general for DMA drivers requires a different implementation.
Computing DMA block count based on umem->address only works if the
requested page size is < PAGE_SIZE and/or the IOVA == umem->address.
Instead the number of DMA pages should be computed in the IOVA address
space, not umem->address. Thus the IOVA has to be stored inside the umem
so it can be used for these calculations.
For now set it to umem->address by default and fix it up if
ib_umem_find_best_pgsz() was called. This allows drivers to be converted
to ib_umem_num_dma_blocks() safely.
Link: https://lore.kernel.org/r/6-v2-270386b7e60b+28f4-umem_1_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
rdma_for_each_block() makes assumptions about how the SGL is constructed
that don't work if the block size is below the page size used to to build
the SGL.
The rules for umem SGL construction require that the SG's all be PAGE_SIZE
aligned and we don't encode the actual byte offset of the VA range inside
the SGL using offset and length. So rdma_for_each_block() has no idea
where the actual starting/ending point is to compute the first/last block
boundary if the starting address should be within a SGL.
Fixing the SGL construction turns out to be really hard, and will be the
subject of other patches. For now block smaller pages.
Fixes: 4a35339958 ("RDMA/umem: Add API to find best driver supported page size in an MR")
Link: https://lore.kernel.org/r/2-v2-270386b7e60b+28f4-umem_1_jgg@nvidia.com
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
It is possible for a single SGL to span an aligned boundary, eg if the SGL
is
61440 -> 90112
Then the length is 28672, which currently limits the block size to
32k. With a 32k page size the two covering blocks will be:
32768->65536 and 65536->98304
However, the correct answer is a 128K block size which will span the whole
28672 bytes in a single block.
Instead of limiting based on length figure out which high IOVA bits don't
change between the start and end addresses. That is the highest useful
page size.
Fixes: 4a35339958 ("RDMA/umem: Add API to find best driver supported page size in an MR")
Link: https://lore.kernel.org/r/1-v2-270386b7e60b+28f4-umem_1_jgg@nvidia.com
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Like any other verbs objects, CQ shouldn't fail during destroy, but
mlx5_ib didn't follow this contract with mixed IB verbs objects with
DEVX. Such mix causes to the situation where FW and kernel are fully
interdependent on the reference counting of each side.
Kernel verbs and drivers that don't have DEVX flows shouldn't fail.
Fixes: e39afe3d6d ("RDMA: Convert CQ allocations to be under core responsibility")
Link: https://lore.kernel.org/r/20200907120921.476363-7-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
In similar way to other IB objects, restore the ability to return error on
SRQ destroy. Strictly speaking, this change is not necessary, and provided
here to ensure a symmetrical interface like other destroy functions.
Fixes: 68e326dea1 ("RDMA: Handle SRQ allocations by IB/core")
Link: https://lore.kernel.org/r/20200907120921.476363-5-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Like any other IB verbs objects, AH are refcounted by ib_core. The release
of those objects are controlled by ib_core with promise that AH destroy
can't fail.
Being SW object for now, this change makes dealloc_ah() to behave like any
other destroy IB flows.
Fixes: d345691471 ("RDMA: Handle AH allocations by IB/core")
Link: https://lore.kernel.org/r/20200907120921.476363-3-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The IB verbs objects are counted by the kernel and ib_core ensures that
deallocate PD will success so it will be called once all other objects
that depends on PD will be released. This is achieved by managing various
reference counters on such objects.
The mlx5 driver didn't follow this standard flow when allowed DEVX objects
that are not managed by ib_core to be interleaved with the ones under
ib_core responsibility.
In such interleaved scenarios deallocate command can fail and ib_core will
leave uobject in internal DB and attempt to clean it later to free
resources anyway.
This change partially restores returned value from dealloc_pd() for all
drivers, but keeping in mind that non-DEVX devices and kernel verbs paths
shouldn't fail.
Fixes: 21a428a019 ("RDMA: Handle PD allocations by IB/core")
Link: https://lore.kernel.org/r/20200907120921.476363-2-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Currently it triggers a WARN_ON and then goes ahead and destroys the
uobject anyhow, leaking any driver memory.
The only place that leaks driver memory should be during FD close() in
uverbs_destroy_ufile_hw().
Drivers are only allowed to fail destroy uobjects if they guarantee
destroy will eventually succeed. uverbs_destroy_ufile_hw() provides the
loop to give the driver that chance.
Link: https://lore.kernel.org/r/20200902081708.746631-1-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
In ucma_process_join(), if the call to xa_alloc() fails, the function will
return without freeing mc. Fix this by jumping to the correct line.
In the process I renamed the jump labels to something more memorable for
extra clarity.
Link: https://lore.kernel.org/r/20200902162454.332828-1-alex.dewar90@gmail.com
Addresses-Coverity-ID: 1496814 ("Resource leak")
Fixes: 95fe51096b ("RDMA/ucma: Remove mc_list and rely on xarray")
Signed-off-by: Alex Dewar <alex.dewar90@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
When the returned speed from __ethtool_get_link_ksettings() is
SPEED_UNKNOWN this will lead to reporting a wrong speed and width for
providers that uses ib_get_eth_speed(), fix that by defaulting the
netdev_speed to SPEED_1000 in case the returned value from
__ethtool_get_link_ksettings() is SPEED_UNKNOWN.
Fixes: d41861942f ("IB/core: Add generic function to extract IB speed from netdev")
Link: https://lore.kernel.org/r/20200902124304.170912-1-kamalheib1@gmail.com
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
When a new connection is established the RDMA CM creates a new cm_id and
passes it through to the event handler. However inside the UCMA the new ID
is not assigned a ucma_context until the user retrieves the event from a
syscall.
This creates a weird edge condition where a cm_id's context can continue
to point at the listening_id that created it, and a number of additional
edge conditions on event list clean up related to destroying half created
IDs.
There is also a race condition in ucma_get_events() where the
cm_id->context is being assigned without holding the handler_mutex.
Simplify all of this by creating the ucma_context inside the event handler
itself and eliminating the edge case of a half created cm_id. All cm_id's
can be uniformly destroyed via __destroy_id() or via the close_work.
Link: https://lore.kernel.org/r/20200818120526.702120-14-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
This value is locked under the file->mut, ensure it is held whenever
touching it.
The case in ucma_migrate_id() is a race, while in ucma_free_uctx() it is
already not possible for the write side to run, the movement is just for
clarity.
Fixes: 88314e4dda ("RDMA/cma: add support for rdma_migrate_id()")
Link: https://lore.kernel.org/r/20200818120526.702120-10-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
ctx->file is changed under the file->mut lock by ucma_migrate_id(), which
is impossible to lock correctly. Instead change ctx->file under the
handler_lock and ctx_table lock and revise all places touching ctx->file
to use this locking when reading ctx->file.
Link: https://lore.kernel.org/r/20200818120526.702120-9-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The only reader of destroying is inside a handler under the handler_mutex,
so directly use the handler_mutex when setting it instead of the larger
file->mut.
As the refcount could be zero here, and the cm_id already freed, and
additional refcount grab around the locking is required to touch the
cm_id.
Link: https://lore.kernel.org/r/20200818120526.702120-8-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>