add_client_context(), ib_unregister_device() and ib_unregister_client()
are designed to be called from blocking context. There is no need to save
and restore the interrupt state when they are called from such a context.
Even though this is not a performance path, using the right spin lock API
is desirable for code clarity.
To avoid a checkpatch warning while removing flags, sizeof() is used.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
While unregistering a device, remove the context elements from the list so
that no stale entries remain. With that, any errors/bugs can be checked
when the device is freed.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
While traversing client_data_list in the following conditions, the linked
list is only read; no elements of the list are removed. Therefore, use
list_for_each_entry() instead of list_for_each_safe().
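The difference between the two iteration styles can be sketched in plain userspace C. This is only an illustration under stated assumptions: struct node, sum_values(), remove_negative() and push() are hypothetical stand-ins, not the kernel list API. A read-only walk needs no saved next pointer, while a walk that frees entries must save it first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical singly linked node standing in for a kernel list entry. */
struct node {
	int value;
	struct node *next;
};

/* Read-only traversal: nothing is unlinked, so plain iteration is enough
 * (the analogue of list_for_each_entry()). */
static int sum_values(const struct node *head)
{
	int sum = 0;

	for (const struct node *n = head; n; n = n->next)
		sum += n->value;
	return sum;
}

/* Traversal that frees entries: the next pointer is saved before the
 * current node can be freed (the analogue of the _safe() iterators). */
static struct node *remove_negative(struct node *head)
{
	struct node **link = &head;

	while (*link) {
		struct node *cur = *link;
		struct node *next = cur->next; /* saved before cur is freed */

		if (cur->value < 0) {
			*link = next;
			free(cur);
		} else {
			link = &cur->next;
		}
	}
	return head;
}

/* Helper that prepends a node; aborts on allocation failure. */
static struct node *push(struct node *head, int value)
{
	struct node *n = malloc(sizeof(*n));

	assert(n);
	n->value = value;
	n->next = head;
	return n;
}
```

Using the saved-next form for a read-only walk is harmless but misleading, since it suggests to the reader that entries may be removed inside the loop.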
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
While unregistering a client, only context removal should be protected
with the lock. There is no need to protect the freeing of a context that
has already been removed from the list.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Currently rdma_addr_cancel() is an async operation, which notifies that
the cancel is done by executing the callback function given during
rdma_resolve_ip(). If the resolve_ip request has already completed, then
the callback is not executed.
Instead, rdma_resolve_addr() and rdma_addr_cancel() are now simplified in
the following ways.
1. rdma_addr_cancel() is now a synchronous method. If a request was
pending, no callback is invoked after it is cancelled.
2. rdma_resolve_addr() and the respective addr_handler() callback don't
need to hold a reference to the cm_id.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
While registering a mad agent, user space can trigger various errors
and flood the logs.
Therefore, decrease the verbosity and rate limit such error messages.
While we are at it, use __func__ to print the function name.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When the requested QP type is not supported for a {device, port}, return
the error right away, before validating all the other parameters during
mad agent registration.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When AF_IB addresses are used during rdma_resolve_addr(), a lock is not
held. A cma device can get removed while the list traversal is in
progress, which may lead to a crash, i.e.
CPU0                            CPU1
====                            ====
rdma_resolve_addr()
 cma_resolve_ib_dev()
  list_for_each()               cma_remove_one()
   cur_dev->device               mutex_lock(&lock)
                                 list_del();
                                 mutex_unlock(&lock);
                                 cma_process_remove();
Therefore, hold the lock while traversing the list, which avoids this
situation.
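As an illustration only, the locking rule can be sketched in userspace C with a pthread mutex. All names here (dev_entry, dev_lock, dev_add(), dev_remove(), dev_find()) are hypothetical stand-ins for the cma device list and its mutex; the point is that the reader takes the same lock as the remover for the whole walk.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical device entry; stands in for a cma device on the list. */
struct dev_entry {
	int id;
	struct dev_entry *next;
};

static pthread_mutex_t dev_lock = PTHREAD_MUTEX_INITIALIZER;
static struct dev_entry *dev_list;

static void dev_add(struct dev_entry *e)
{
	pthread_mutex_lock(&dev_lock);
	e->next = dev_list;
	dev_list = e;
	pthread_mutex_unlock(&dev_lock);
}

/* Removal unlinks the entry under the lock, like cma_remove_one(). */
static void dev_remove(struct dev_entry *e)
{
	pthread_mutex_lock(&dev_lock);
	for (struct dev_entry **link = &dev_list; *link; link = &(*link)->next) {
		if (*link == e) {
			*link = e->next;
			break;
		}
	}
	pthread_mutex_unlock(&dev_lock);
}

/* The lookup holds the same lock for the whole traversal, so an entry
 * cannot be unlinked while it is being inspected. Returns 1 if found. */
static int dev_find(int id)
{
	int found = 0;

	pthread_mutex_lock(&dev_lock);
	for (struct dev_entry *e = dev_list; e; e = e->next) {
		if (e->id == id) {
			found = 1;
			break;
		}
	}
	pthread_mutex_unlock(&dev_lock);
	return found;
}
```

The sketch returns only a flag. Handing out a pointer to the entry after dropping the lock would need a reference count or similar, which is outside what this example shows.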
Cc: <stable@vger.kernel.org> # 3.10
Fixes: f17df3b0de ("RDMA/cma: Add support for AF_IB to rdma_resolve_addr()")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
For dependencies, branch based on rdma.git 'for-rc' of
https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/
Pull 'uverbs_dev_cleanups' from Leon Romanovsky:
====================
Reuse the char device code interfaces to simplify ib_uverbs_device
creation and destruction. As part of this series, we are sending a fix to
the cleanup path, which was discovered during internal review.
The fix definitely can go to -rc, but it means that this series will be
dependent on rdma-rc.
====================
* branch 'uverbs_dev_cleanups':
RDMA/uverbs: Use device.groups to initialize device attributes
RDMA/uverbs: Use cdev_device_add() instead of cdev_add()
RDMA/core: Depend on device_add() to add device attributes
RDMA/uverbs: Fix error cleanup path of ib_uverbs_add_one()
Resolved conflict in ib_device_unregister_sysfs()
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Instead of explicitly adding device attribute files and handling the
resulting error conditions, depend on the device core layer to create the
device attribute files based on the NULL-terminated array of group pointers.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Instead of doing a two-step process to add the char device and create the
underlying device, use cdev_device_add(), which does both.
Currently a kobject per uverbs_device is created to keep a reference to its
holding ib_uverbs_device in addition to its underlying device 'dev'.
Instead, just use uverbs_device->dev to hold that reference.
With this change there is a single reference tracker for the
ib_uverbs_device structure.
This allows a subsequent patch to register the group attributes as well
using the single API cdev_device_add().
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Instead of adding/removing device attribute files, depend on device_add(),
which adds these device files based on the NULL-terminated array of
attribute groups.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
If ib_uverbs_create_uapi() fails, dev_num should be freed from the bitmap.
Fixes: 7d96c9b176 ("IB/uverbs: Have the core code create the uverbs_root_spec")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reuse the existing get_device() API to make it symmetric with the already
used put_device() from commit 924b8900a4 ("RDMA/core: Replace open-coded
variant of put_device")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The "closing" variable is used as a boolean and set to "true" in one
place; update the declaration of that variable and its other
assignment to the proper type.
Fixes: e951747a08 ("IB/uverbs: Rework the locking for cleaning up the ucontext")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The upstream kernel commit cited below modified the workqueue in the
new CQ API to be bound to a specific CPU (instead of being unbound).
This caused ALL users of the new CQ API to use the same bound WQ.
Specifically, MAD handling was severely delayed when the CPU bound
to the WQ was busy handling (higher priority) interrupts.
This caused a delay in the MAD "heartbeat" response handling,
which resulted in ports being incorrectly classified as "down".
To fix this, add a new "unbound" WQ type to the new CQ API, so that users
have the option to choose either a bound WQ or an unbound WQ.
For MADs, choose the new "unbound" WQ.
Fixes: b7363e67b2 ("IB/device: Convert ib-comp-wq to be CPU-bound")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.m>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Refactor the initialization of a flow action object to a common function.
Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This makes it clear and safe to access constants passed in from user
space. We define a consistent ABI of u64 for all constants, and verify
that the data passed in can be represented by the type the user supplies.
The expectation is this will always be used with an enum declaring the
constant values, and the user will use the enum type as input to the
accessor.
To retrieve the attribute value we introduce two helper calls - one
standard which may fail if attribute is not valid and one where caller can
provide a default value which will be used in case the attribute is not
valid (useful when attribute is optional).
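The accessor behaviour described above can be sketched as follows. This is a simplified userspace illustration, not the uverbs implementation: struct const_attr and get_const() are hypothetical names, only non-negative constants are handled, and the caller passes the largest value its enum type can hold as upper_bound. A NULL def_val selects the "may fail" variant; a non-NULL def_val selects the "default value" variant.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical attribute slot: user space always supplies a u64. */
struct const_attr {
	bool valid;      /* was the attribute passed in at all? */
	uint64_t value;  /* raw value in the u64 ABI */
};

/*
 * Fetch a constant. If the attribute is absent, fall back to *def_val when
 * a default is supplied, otherwise fail with -ENOENT. If it is present but
 * cannot be represented by the caller's type (value > upper_bound), fail
 * with -EINVAL instead of silently truncating.
 */
static int get_const(uint64_t *to, const struct const_attr *attr,
		     uint64_t upper_bound, const uint64_t *def_val)
{
	if (!attr->valid) {
		if (!def_val)
			return -ENOENT;
		*to = *def_val;
		return 0;
	}
	if (attr->value > upper_bound)
		return -EINVAL;
	*to = attr->value;
	return 0;
}
```

Note that in this sketch the default only covers a missing attribute. A present but out-of-range value is still rejected, so bad input is never silently replaced by the default.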
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
The object lock was supposed to always be released during destroy, but
when the destruction retry series was integrated with the destroy series
it created a failure path that missed the unlock.
Keep with convention, if destroy fails the caller must undo all locking.
Fixes: 87ad80abc7 ("IB/uverbs: Consolidate uobject destruction")
Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The current code grabs the private_data of whatever file descriptor
userspace has supplied and implicitly casts it to a `struct ucma_file *`,
potentially causing a type confusion.
This is probably fine in practice because the pointer is only used for
comparisons, it is never actually dereferenced; and even in the
comparisons, it is unlikely that a file from another filesystem would have
a ->private_data pointer that happens to also be valid in this context.
But ->private_data is not always guaranteed to be a valid pointer to an
object owned by the file's filesystem; for example, some filesystems just
cram numbers in there.
Check the type of the supplied file descriptor to be safe, analogous to how
other places in the kernel do it.
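A minimal userspace sketch of that kind of check is shown below. The struct file, struct file_ops and fops objects are simplified stand-ins invented for illustration, not the kernel definitions: the idea is to compare the file's operations table against the expected one before trusting private_data.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel's file and file_operations types. */
struct file_ops {
	const char *name;
};

struct file {
	const struct file_ops *f_op;
	void *private_data;
};

struct ucma_file {
	int id;
};

static const struct file_ops ucma_fops = { "ucma" };
static const struct file_ops other_fops = { "other" };

/*
 * Only interpret private_data as a ucma_file when the file was really
 * opened through the ucma operations table; otherwise report "not ours".
 */
static struct ucma_file *file_to_ucma(const struct file *filp)
{
	if (!filp || filp->f_op != &ucma_fops)
		return NULL;
	return filp->private_data;
}
```

With this shape, a file from another filesystem that stores a plain number in private_data is rejected before the value is ever treated as a pointer.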
Fixes: 88314e4dda ("RDMA/cma: add support for rdma_migrate_id()")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
There are several blockable mmu notifiers which might sleep in
mmu_notifier_invalidate_range_start and that is a problem for the
oom_reaper because it needs to guarantee a forward progress so it cannot
depend on any sleepable locks.
Currently we simply back off and mark an oom victim with blockable mmu
notifiers as done after a short sleep. That can result in selecting a new
oom victim prematurely because the previous one still hasn't torn its
memory down yet.
We can do much better though. Even if mmu notifiers use sleepable locks
there is no reason to automatically assume those locks are held. Moreover,
the majority of notifiers only care about a portion of the address space
and there is absolutely zero reason to fail when we are unmapping an
unrelated range. Many notifiers do really block and wait for HW, which is
harder to handle, and there we have to bail out though.
This patch handles the low hanging fruit.
__mmu_notifier_invalidate_range_start gets a blockable flag and callbacks
are not allowed to sleep if the flag is set to false. This is achieved by
using trylock instead of the sleepable lock for most callbacks and
continue as long as we do not block down the call chain.
I think we can improve that even further because there is a common pattern
to do a range lookup first and then do something about that. The first
part can be done without a sleeping lock in most cases AFAICS.
The oom_reaper end then simply retries if there is at least one notifier
which couldn't make any progress in !blockable mode. A retry loop is
already implemented to wait for the mmap_sem and this is basically the
same thing.
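The trylock-and-retry pattern can be illustrated in userspace C. This is a sketch under stated assumptions, not the notifier code: range_lock, invalidate_range_start() and reap_with_retry() are hypothetical names, and a pthread mutex stands in for whatever sleepable lock a driver uses.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t range_lock = PTHREAD_MUTEX_INITIALIZER;
static int invalidations;

/*
 * Callback sketch: in blockable mode it may sleep on the lock; in
 * !blockable mode it only tries the lock and reports -EAGAIN instead of
 * waiting for the current holder.
 */
static int invalidate_range_start(bool blockable)
{
	if (blockable) {
		pthread_mutex_lock(&range_lock);
	} else if (pthread_mutex_trylock(&range_lock) != 0) {
		return -EAGAIN;
	}
	invalidations++;   /* the actual invalidation work would go here */
	pthread_mutex_unlock(&range_lock);
	return 0;
}

/* Reaper side: retry a bounded number of times in !blockable mode. */
static int reap_with_retry(int max_tries)
{
	for (int i = 0; i < max_tries; i++) {
		if (invalidate_range_start(false) == 0)
			return 0;
	}
	return -EAGAIN;
}
```

A real retry loop would sleep or yield between attempts. The sketch omits that to stay short.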
The simplest way for driver developers to test this code path is to wrap
userspace code which uses these notifiers into a memcg and set the hard
limit to hit the oom. This can be done e.g. after the test faults in all
the mmu notifier managed memory and set the hard limit to something really
small. Then we are looking for a proper process tear down.
[akpm@linux-foundation.org: coding style fixes]
[akpm@linux-foundation.org: minor code simplification]
Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers
Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp
Reported-by: David Rientjes <rientjes@google.com>
Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
Cc: Sudeep Dutt <sudeep.dutt@intel.com>
Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
rdma.git merge resolution for the 4.19 merge window
Conflicts:
drivers/infiniband/core/rdma_core.c
- Use the rdma code and revise with the new spelling for
atomic_fetch_add_unless
drivers/nvme/host/rdma.c
- Replace max_sge with max_send_sge in new blk code
drivers/nvme/target/rdma.c
- Use the blk code and revise to use NULL for ib_post_recv when
appropriate
- Replace max_sge with max_recv_sge in new blk code
net/rds/ib_send.c
- Use the net code and revise to use NULL for ib_post_recv when
appropriate
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Merge tag 'v4.18' into rdma.git for-next
Resolve merge conflicts from the -rc cycle against the rdma.git tree:
Conflicts:
drivers/infiniband/core/uverbs_cmd.c
- New ifs added to ib_uverbs_ex_create_flow in -rc and for-next
- Merge removal of file->ucontext in for-next with new code in -rc
drivers/infiniband/core/uverbs_main.c
- for-next removed code from ib_uverbs_write() that was modified
in for-rc
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Filter functions return either 0 or 1, so change their return type from
int to bool to reflect that. Additionally, some filter functions have a
_filter suffix and some don't. Give all filter functions a consistent
_filter suffix to improve code readability.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Update all GID table entries of the netdevice whose MAC address changed.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Currently the following issues exist:
1. Default GIDs of the lower (slave) netdevice are added when the bond
netdevice is added. Instead, the default GIDs should be those of the bond
master netdevice.
2. Due to this, when a failover event occurs, the FAILOVER event handler
attempts to delete the GID of the upper device and tries to add the
default GID of the lower device. This is incorrect behavior.
To have simple and correct code:
(a) Split default GID addition out of add_netdev_ips(). This allows
easier removal in the future if RoCE default GIDs are removed.
(b) Add default GIDs of the bond master device by using the right filter
and callback function.
(c) Remove the unused function enum_netdev_default_gids().
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Now that we correctly delete the default GIDs of lower devices during
CHANGEUPPER event, add default GIDs of the bonding master device.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When a NETDEV_CHANGEUPPER event occurs, the lower device is not yet
established as a slave of the master, and when the upper device is a bond
device, the default GID entries are not deleted.
Due to this, when the bond device is fully configured, the default GID
entries of the bond device cannot be added because the default GID entries
are occupied by the lower netdevice. This is incorrect.
Default GID entries should really be those of the bond netdevice because
in all RoCE GIDs (default or IP), the MAC address of the bond device will
be used. It is confusing to have a default GID of a netdevice which is not
really used for any purpose.
Therefore, as a first step, implement
(a) a filter function which checks whether, for a CHANGEUPPER event
netdevice, the associated upper device is a master device or not.
(b) a callback function which deletes the default GIDs of the lower
(event) netdevice.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Currently bond_delete_netdev_default_gids() is called by two callers.
(a) del_netdev_default_ips_join()
(b) del_netdev_default_ips()
Both of the above functions change the argument order while calling
bond_delete_netdev_default_gids(). This required a silly
del_netdev_default_ips() wrapper.
Additionally, del_netdev_default_ips() deletes default GIDs, not IP based
GIDs, so its _ips suffix is confusing.
Therefore, get rid of the confusing del_netdev_default_ips() and simplify
bond_delete_netdev_default_gids() to follow the same argument order as its
caller.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Add a comment for the CHANGEUPPER netevent handling.
To improve code readability,
(a) move the cmd definitions into their respective if-else branches,
(b) avoid single line structure definitions.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Even though this interface is marked CONFIG_BROKEN we still expect it to
compile, at least until we delete it completely.
Also mark INFINIBAND_USER_ACCESS_UCM with COMPILE_TEST so these situations
can be detected.
Fixes: e7ff98aefc ("RDMA/cma: Constify path record, ib_cm_event, listen_id pointers")
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Pull locking/atomics update from Thomas Gleixner:
"The locking, atomics and memory model brains delivered:
- A larger update to the atomics code which reworks the ordering
barriers, consolidates the atomic primitives, provides the new
atomic64_fetch_add_unless() primitive and cleans up the include
hell.
- Simplify cmpxchg() instrumentation and add instrumentation for
xchg() and cmpxchg_double().
- Updates to the memory model and documentation"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
locking/atomics: Rework ordering barriers
locking/atomics: Instrument cmpxchg_double*()
locking/atomics: Instrument xchg()
locking/atomics: Simplify cmpxchg() instrumentation
locking/atomics/x86: Reduce arch_cmpxchg64*() instrumentation
tools/memory-model: Rename litmus tests to comply to norm7
tools/memory-model/Documentation: Fix typo, smb->smp
sched/Documentation: Update wake_up() & co. memory-barrier guarantees
locking/spinlock, sched/core: Clarify requirements for smp_mb__after_spinlock()
sched/core: Use smp_mb() in wake_woken_function()
tools/memory-model: Add informal LKMM documentation to MAINTAINERS
locking/atomics/Documentation: Describe atomic_set() as a write operation
tools/memory-model: Make scripts executable
tools/memory-model: Remove ACCESS_ONCE() from model
tools/memory-model: Remove ACCESS_ONCE() from recipes
locking/memory-barriers.txt/kokr: Update Korean translation to fix broken DMA vs. MMIO ordering example
MAINTAINERS: Add Daniel Lustig as an LKMM reviewer
tools/memory-model: Fix ISA2+pooncelock+pooncelock+pombonce name
tools/memory-model: Add litmus test for full multicopy atomicity
locking/refcount: Always allow checked forms
...
Now that the ioctl path and uobjects are converted to use uverbs_api, it
is now safe to remove the disassociation protection from the common ioctl
code.
This completes the work to make destroy functions continue to work even
after device disassociation.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Convert the ioctl method syscall path to use the uverbs_api data
structures. The new uapi structure includes all the same information, just
in a different and more optimal way.
- Use attr_bkey instead of 2 level radix trees for everything related to
attributes. This includes the attribute storage, presence, and
detection of missing mandatory attributes.
- Avoid iterating over all attribute storage at finish, instead use
find_first_bit with the attr_bkey to locate only those attrs that need
cleanup.
- Organize things to always run, and always rely on, cleanup. This
avoids a bunch of tricky error unwind cases.
- Locate the method using the radix tree, and locate the attributes
using a very efficient incremental radix tree lookup
- Use the precomputed destroy_bkey to handle uobject destruction
- Use the precomputed allocation sizes and precomputed 'need_stack'
to avoid maths in the fast path. This is optimal if userspace
does not pass (many) unsupported attributes.
Overall this results in much better codegen for the attribute accessors,
everything is now stored in bitmaps or linear arrays indexed by attr_bkey.
The compiler can compute attr_bkey values at compile time for all method
attributes, meaning things like uverbs_attr_is_valid() now compile into
single instruction bit tests.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Several handlers need temporary allocations for the life of the method,
switch them to use the uverbs_alloc allocator.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
This is similar in spirit to devm: it keeps track of any allocations
linked to this method call and ensures they are all freed when the method
exits. Further, if there is space in the internal/onstack buffer then the
allocator will hand out that memory and avoid an expensive call to
kmalloc/kfree in the syscall path.
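The idea can be sketched as a small userspace allocator. This is an illustration under stated assumptions, not the uverbs_alloc code: struct call_alloc, call_alloc_get() and call_alloc_release() are hypothetical names, malloc/free stand in for the kernel allocator, and the caller-provided buffer is assumed to be suitably aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Header for heap fallbacks; the union keeps the payload max-aligned. */
union chunk_hdr {
	union chunk_hdr *next;
	max_align_t align;
};

/* Per-call allocator state (hypothetical). */
struct call_alloc {
	unsigned char *buf;      /* caller-provided, e.g. on-stack, buffer */
	size_t buf_size;
	size_t used;             /* bytes of buf already handed out */
	union chunk_hdr *heap;   /* heap chunks to free when the call exits */
};

/* Hand out memory from the buffer if it fits, else fall back to malloc. */
static void *call_alloc_get(struct call_alloc *a, size_t size)
{
	const size_t align = _Alignof(max_align_t);
	union chunk_hdr *hdr;
	size_t need;

	if (size > SIZE_MAX - align - sizeof(*hdr))
		return NULL;
	need = (size + align - 1) / align * align;
	if (need <= a->buf_size - a->used) {
		void *p = a->buf + a->used;

		a->used += need;
		return p;
	}
	hdr = malloc(sizeof(*hdr) + size);
	if (!hdr)
		return NULL;
	hdr->next = a->heap;     /* remember the chunk for later cleanup */
	a->heap = hdr;
	return hdr + 1;
}

/* Called once when the method exits: frees every tracked heap chunk. */
static void call_alloc_release(struct call_alloc *a)
{
	while (a->heap) {
		union chunk_hdr *next = a->heap->next;

		free(a->heap);
		a->heap = next;
	}
	a->used = 0;
}
```

Callers never free individual allocations. A single release at the end of the call covers both the buffer-backed and the heap-backed ones, which is what makes the error paths simple.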
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Memory in the bundle is valuable, do not waste it holding an 8 byte
pointer for the rare case of writing to a PTR_OUT. We can compute the
pointer by storing a small 1 byte array offset and the base address of the
uattr memory in the bundle private memory.
This also means we can access the kernel's copy of the ib_uverbs_attr, so
drop the copy of flags as well.
Since the uattr base should be private bundle information this also
de-inlines the already too big uverbs_copy_to inline and moves
create_udata into uverbs_ioctl.c so they can see the private struct
definition.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
This already existed as the anonymous 'ctx' structure, but this was not
really a useful form. Hoist this struct into bundle_priv and rework the
internal things to use it instead.
Move a bunch of the processing internal state into the priv and reduce the
excessive use of function arguments.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Currently the struct uverbs_obj_type stored in the ib_uobject is part of
the .rodata segment of the module that defines the object. This is a
problem if drivers define new uapi objects as we will be left with a
dangling pointer after device disassociation.
Switch the uverbs_obj_type for struct uverbs_api_object, which is
allocated memory that is part of the uverbs_api and is guaranteed to
always exist. Further this moves the 'type_class' into this memory which
means access to the IDR/FD function pointers is also guaranteed. Drivers
cannot define new types.
This makes it safe to continue to use all uobjects, including driver
defined ones, after disassociation.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This radix tree datastructure is intended to replace the 'hash' structure
used today for parsing ioctl methods during system calls. This first
commit introduces the structure and builds it from the existing .rodata
descriptions.
The so-called hash arrangement is actually a 5 level open coded radix tree.
This new version uses a 3 level radix tree built using the radix tree
library.
Overall this is much less code and much easier to build as the radix tree
API allows for dynamic modification during the building. There is a small
memory penalty to pay for this, but since the radix tree is allocated on
a per device basis, a few kb of RAM seems immaterial considering the
gained simplicity.
The radix tree is similar to the existing tree, but also has an 'attr_bkey'
concept, which is a small-valued index for each method attribute. This is
used to simplify and improve the performance of everything in the next
patches.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
There is no reason for drivers to do this; the core code should take care
of everything. The drivers will provide their information from rodata to
describe their modifications to the core's base uapi specification.
The core uses this to build up the runtime uapi for each device.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
This is missing a zeroing of the high bits of flags, and is also not
correct for big endian machines. Properly zero extend the 32 bit flags
into the 64 bit stack variable.
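The zero-extension issue can be illustrated with a small userspace sketch. read_flags() is a hypothetical helper, and the assumption that the attribute may carry either 4 or 8 bytes is made for illustration only. Copying 4 bytes straight into a 64-bit variable leaves the other 4 bytes untouched and, on big endian machines, lands in the high half. Reading into a 32-bit temporary and assigning it avoids both problems.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Read user-supplied flags of 4 or 8 bytes into a 64-bit result.
 * The 32-bit case goes through a uint32_t temporary, so the assignment
 * zero-extends the value regardless of host endianness.
 */
static int read_flags(uint64_t *to, const void *data, size_t len)
{
	if (len == sizeof(uint32_t)) {
		uint32_t v32;

		memcpy(&v32, data, sizeof(v32));
		*to = v32;           /* high 32 bits become zero */
		return 0;
	}
	if (len == sizeof(uint64_t)) {
		memcpy(to, data, sizeof(*to));
		return 0;
	}
	return -EINVAL;
}
```

Because the result is always fully written on success, the caller's 64-bit variable does not need to be pre-initialised.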
Reported-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Fixes: bccd06223f ("IB/uverbs: Add UVERBS_ATTR_FLAGS_IN to the specs language")
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
sgid_attr is uninitialized on the stack, initialize it to NULL.
Fixes: 398391071f ("IB/cm: Replace members of sa_path_rec with 'struct sgid_attr *'")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Yossi Itigin <yosefe@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The disassociate function was broken by design because it failed all
commands. This prevents userspace from calling destroy on a uobject after
it has detected a device fatal error and thus reclaiming the resources in
userspace is prevented.
This fix is now straightforward, when anything destroys a uobject that is
not the user the object remains on the IDR with a NULL context and object
pointer. All lookup locking modes other than DESTROY will fail. When the
user ultimately calls the destroy function it is simply dropped from the
IDR while any related information is returned.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Now that all the callbacks are safe to run concurrently with
disassociation this test can be eliminated. The ufile core infrastructure
becomes entirely self contained and is not sensitive to disassociation.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This does the same as the patch before, except for ioctl. The rules are
the same, but for the ioctl methods the core code handles setting up the
uobject.
- Retrieve the ib_dev from the uobject->context->device. This is
safe under ioctl as the core has already done rdma_alloc_begin_uobject
and so CREATE calls are entirely protected by the rwsem.
- Retrieve the ib_dev from uobject->object
- Call ib_uverbs_get_ucontext()
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This is a step to get rid of the global check for disassociation. In this
model, the ib_dev is not proven to be valid by the core code and cannot be
provided to the method. Instead, every method decides if it is able to
run after disassociation and obtains the ib_dev using one of three
different approaches:
- Call srcu_dereference on the udevice's ib_dev. As before, this means
the method cannot be called after disassociation begins.
(eg alloc ucontext)
- Retrieve the ib_dev from the ucontext, via ib_uverbs_get_ucontext()
- Retrieve the ib_dev from the uobject->object after checking
under SRCU if disassociation has started (eg uobj_get)
Largely, the code is all ready for this; the main work is to provide an
ib_dev after calling uobj_alloc(). The few other places simply use
ib_uverbs_get_ucontext() to get the ib_dev.
This flexibility will let the next patches allow destroy to operate
after disassociation.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Commands that are reading/writing to objects can test for an ongoing
disassociation during their initial call to rdma_lookup_get_uobject. This
directly prevents all of these commands from conflicting with an ongoing
disassociation.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
After all the recent structural changes this is now straightforward, hold
the hw_destroy_rwsem across the entire uobject creation. We already take
this semaphore on the success path, so holding it a bit longer is not
going to change the performance.
After this change none of the create callbacks require the
disassociate_srcu lock to be correct.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
After all the recent structural changes this is now straightforward: hoist
the hw_destroy_rwsem up out of rdma_destroy_explicit and wrap it around
the uobject write lock as well as the destroy.
This is necessary as obtaining a write lock concurrently with
uverbs_destroy_ufile_hw() will cause malfunction.
After this change none of the destroy callbacks require the
disassociate_srcu lock to be correct.
This requires introducing a new lookup mode, UVERBS_LOOKUP_DESTROY as the
IOCTL interface needs to hold an unlocked kref until all command
verification is completed.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
There are several flows that can destroy a uobject and each one is
minimized and sprinkled throughout the code base, making it difficult to
understand and very hard to modify the destroy path.
Consolidate all of these into uverbs_destroy_uobject() and call it in all
cases where a uobject has to be destroyed.
This makes one change to the lifecycle, during any abort (eg when
alloc_commit is not called) we always call out to alloc_abort, even if
remove_commit needs to be called to delete a HW object.
This also renames RDMA_REMOVE_DURING_CLEANUP to RDMA_REMOVE_ABORT to
clarify its actual usage and revises some of the comments to reflect what
the life cycle is for the type implementation.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The ridiculous dance with uobj_remove_commit() is not needed, the write
path can follow the same flow as ioctl - lock and destroy the HW object
then use the data left over in the uobject to form the response to
userspace.
Two helpers are introduced to make this flow straightforward for the
caller.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The core code will destroy the HW object on behalf of the method, if the
method provides an implementation it must simply copy data from the stub
uobj into the response. Destroy methods cannot touch the HW object.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
In rdma cm module, functions which are common between IB and iWarp
are named with cma_.
iWarp specific functions are prefixed with cma_iw.
IB specific functions are prefixed with cma_ib.
However some functions in request processing path didn't follow
cma_ib notion. Prefix them with _ib for better code clarity.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
cma_add_one() initializes the default GID regardless of device type.
listen_id is bound to a device and an IP address, its GID type is
initialized by cma_acquire_dev().
Therefore a valid default GID type is always available, and there is no
need to check the port type during cma_acquire_dev().
Initialize gid type of a cm id when the cm_id is created instead of
doing conditional checks during cma_acquire_dev() and trying to
initialize to 0 during _cma_attach_to_dev().
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
In various functions rdma_cm_event is zero initialized on stack using
memset() while holding lock which is not necessary.
Therefore, don't hold the lock while initializing on stack.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Return bool for the following internal and inline functions as their
underlying APIs return bool too.
1. cma_zero_addr()
2. cma_loopback_addr()
3. cma_any_addr()
4. ib_addr_any()
5. ib_addr_loopback()
While we are touching cma_loopback_addr(), remove extra white spaces
in it.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Arrange fields of cma_req_info structure for efficiency on
stack and get rid of one bit boolean field.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Constify several pointers such as path_rec, ib_cm_event and listen_id
pointers in several functions.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The following APIs are not supposed to modify addr or dest_addr contents.
Therefore make those function argument const for better code
readability.
1. rdma_resolve_ip()
2. rdma_addr_size()
3. rdma_resolve_addr()
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Currently the dst address is set first and later cleared if any of the
3 error conditions is met.
However none of the APIs or checks are supposed to refer to the
destination address of the cm_id.
Therefore, set the destination address after necessary checks pass which
simplifies the error flow.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Currently rdma_cm_id's resource tracking fields such as owner task and
kern_name and other non resource tracking fields are initialized in
a single function, __rdma_create_id().
Therefore, initialize rdma_cm_id's resource type also in same init
function.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Make sure the providers implement the verbs callbacks before calling
them, otherwise return -EOPNOTSUPP.
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
{create,destroy}_ah aren't mandatory verbs, because not all providers
are implementing them.
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This clearly indicates that the input is a bitwise combination of values
in an enum, and identifies which enum contains the definition of the bits.
Special accessors are provided that handle the mandatory validation of the
allowed bits and enforce the correct type for bitwise flags.
If we had introduced this at the start then the kabi would have uniformly
used u64 data to pass flags, however today there is a mixture of u64 and
u32 flags. All places are converted to accept both sizes and the accessor
fixes it. This allows all existing flags to grow to u64 in future without
any hassle.
Finally all flags are, by definition, optional. If flags are not passed
the accessor does not fail, but provides a value of zero.
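The accessor behaviour described above can be sketched in userspace C. This
is only an illustration, not the kernel implementation: struct flags_attr,
get_flags64() and the chosen error codes are hypothetical names and
assumptions made for the example.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a flags attribute copied from userspace. */
struct flags_attr {
	const void *data;	/* NULL when the attribute was not passed */
	size_t len;		/* 4 for u32 flags, 8 for u64 flags */
};

/*
 * Read a flags attribute that may be 32 or 64 bits wide, widen it to 64
 * bits and reject any bit outside of allowed_bits. A missing attribute is
 * not an error and yields zero. The error codes are assumptions.
 */
static int get_flags64(uint64_t *to, const struct flags_attr *attr,
		       uint64_t allowed_bits)
{
	uint64_t flags;

	if (!attr || !attr->data) {
		*to = 0;
		return 0;
	}

	if (attr->len == sizeof(uint32_t)) {
		uint32_t v32;

		memcpy(&v32, attr->data, sizeof(v32));
		flags = v32;
	} else if (attr->len == sizeof(uint64_t)) {
		memcpy(&flags, attr->data, sizeof(flags));
	} else {
		return -EINVAL;
	}

	if (flags & ~allowed_bits)
		return -EOPNOTSUPP;

	*to = flags;
	return 0;
}
```

Because the caller always receives a u64, a flag field that is u32 today can
grow to u64 later without changing the callers.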
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Since neither ib_post_send() nor ib_post_recv() modify the data structure
their second argument points at, declare that argument const. This change
makes it necessary to declare the 'bad_wr' argument const too and also to
modify all ULPs that call ib_post_send(), ib_post_recv() or
ib_post_srq_recv(). This patch does not change any functionality but makes
it possible for the compiler to verify whether the
ib_post_(send|recv|srq_recv) really do not modify the posted work request.
To make this possible, only one cast had to be introduced that casts away
constness, namely in rpcrdma_post_recvs(). The only way I can think of to
avoid that cast is to introduce an additional loop in that function or to
change the data type of bad_wr from struct ib_recv_wr ** into int
(an index that refers to an element in the work request list). However,
both approaches would require even more extensive changes than this
patch.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Currently if the cm_id is not bound to any netdevice, then for such cm_id,
net namespace is ignored; which is incorrect.
Regardless of cm_id bound to a netdevice or not, net namespace must
match. When a cm_id is bound to a netdevice, in such case net namespace
and netdevice both must match.
Fixes: 4c21b5bcef ("IB/cma: Add net_dev and private data checks to RDMA CM")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When netdevice is not found for a request, and if it is for a RoCE port,
currently it allows matching the listener as long as port number matches
by ignoring the netdevice.
Now that we always prefer to have netdevice associated with RoCE, when
netdevice is not found, don't consider RoCE ports.
In other words, a NULL netdevice with RoCE is not acceptable. Therefore,
remove this confusing RoCE port ignorance check.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
For RoCE, when CM requests are received for RC and UD connections,
netdevice of the incoming request is unavailable. Because of that CM
requests are always forwarded to init_net namespace.
Now that we have the GID attribute available, introduce SGID attribute in
incoming CM requests and refer to the netdevice of it. This is similar to
existing SGID attribute field in outgoing CM requests for RC and UD
transports.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Remove unnecessary parentheses to fix the clang warning of extraneous
parentheses.
Signed-off-by: Varsha Rao <rvarsha016@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
We have a parallel unlocked reader and writer with ib_uverbs_get_context()
vs everything else, and nothing guarantees this works properly.
Audit and fix all of the places that access ucontext to use one of the
following locking schemes:
- Call ib_uverbs_get_ucontext() under SRCU and check for failure
- Access the ucontext through a struct ib_uobject context member
while holding a READ or WRITE lock on the uobject.
This value cannot be NULL and has no race.
- Hold the ucontext_lock and check for ufile->ucontext != NULL
This also re-implements ib_uverbs_get_ucontext() in a way that is safe
against concurrent ib_uverbs_get_context() and disassociation.
As a side effect, every access to ucontext in the commands is via
ib_uverbs_get_ucontext() with an error check, or via the uobject, so there
is no longer any need for the core code to check ucontext on every command
call. These checks are also removed.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Allocating the struct file during alloc_begin creates this strange
asymmetry with IDR, where the FD has two krefs pointing at it during the
pre-commit phase. In particular this makes the abort process for FD very
strange and confusing.
For instance abort currently calls the type's destroy_object twice, and
the fops release once if abort is done. This is very counter intuitive. No
fops should be called until alloc_commit succeeds, and destroy_object
should only ever be called once.
Moving the struct file allocation to the alloc_commit is now simple, as we
already support failure of rdma_alloc_commit_uobject, with all the
required rollback pieces.
This creates an understandable symmetry with IDR and simplifies/fixes the
abort handling for FD types.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The ioctl framework already does this correctly, but the write path did
not. This is trivially fixed by simply using a standard pattern to return
uobj_alloc_commit() as the last statement in every function.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The locking here has always been a bit crazy and spread out, upon some
careful analysis we can simplify things.
Create a single function uverbs_destroy_ufile_hw() that internally handles
all locking. This pulls together pieces of this process that were
sprinkled all over the places into one place, and covers them with one
lock.
This eliminates several duplicate/confusing locks and makes the control
flow in ib_uverbs_close() and ib_uverbs_free_hw_resources() extremely
simple.
Unfortunately we have to keep an extra mutex, ucontext_lock. This lock is
logically part of the rwsem and provides the 'down write, fail if write
locked, wait if read locked' semantic we require.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Rename 'cleanup_rwsem' to 'hw_destroy_rwsem' which is held across any call
to the type destroy function (aka 'hw' destroy). The main purpose of this
lock is to prevent normal add and destroy from running concurrently with
uverbs_cleanup_ufile().
Since the uobjects list is always manipulated under the 'hw_destroy_rwsem'
we can eliminate the uobjects_lock in the cleanup function. This allows
converting that lock to a very simple spinlock with a narrow critical
section.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The locking requirements here have changed slightly now that we can rely
on the ib_uverbs_file always existing and containing all the necessary
locking infrastructure.
That means we can get rid of the cleanup_mutex usage (this was protecting
the check on !uobj->context).
Otherwise, follow the same pattern that IDR uses for destroy, acquire
exclusive write access, then call destroy and then undo the 'lookup'.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This wasn't wrong, but the placement of two krefs didn't make any
sense. Follow some simple rules.
- A kref is held inside uobjects_list
- A kref is held inside the IDR
- A kref is held inside file->private
- A stack based kref is passed between alloc_begin and
alloc_abort/alloc_commit
Any place we destroy one of the above pointers, we stick a put,
or 'move' the kref into another pointer.
The key functions have sensible semantics:
- alloc_uobj fully initializes the common members in uobj, including
the list
- Get rid of the uverbs_idr_remove_uobj helper since IDR remove
does require put, but it depends on the situation. Later
patches will re-consolidate this differently.
- alloc_abort always consumes the passed kref, done in the type
- alloc_commit always consumes the passed kref, done in the type
- rdma_remove_commit_uobject always pairs with a lookup_get
After it is all done the only control flow change is to:
- move a get from alloc_commit_fd_uobject to rdma_alloc_commit_uobject
- add a put to remove_commit_idr_uobject
- Consistently use rdma_lookup_put in rdma_remove_commit_uobject at
the right place
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The alloc_commit callback makes the uobj visible to other threads,
and it does so using a 'move' semantic of the uobj kref on the stack
into the public storage (eg the IDR, uobject list and file_private_data).
Once this is done another thread could start up and trigger deletion
of the kref. Fortunately cleanup_rwsem happens to prevent this from
being a bug, but that is a fantastically unclear side effect.
Re-organize things so that alloc_commit is the last thing to touch
the uobj, get rid of the sneaky implicit dependency on cleanup_rwsem,
and add a comment reminding that uobj is no longer kref'd after
alloc_commit.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Our ABI for write() uses a s32 for FDs and a u32 for IDRs, but internally
we ended up implicitly casting these ABI values into an 'int'. For ioctl()
we use a s64 for FDs and a u64 for IDRs, again casting to an int.
The various casts to int are all missing range checks which can cause
userspace values that should be considered invalid to be accepted.
Fix this by making the generic lookup routine accept a s64, which does not
truncate the write API's u32/s32 or the ioctl API's s64. Then push the
detailed range checking down to the actual type implementations to be
shared by both interfaces.
Finally, change the copy of the uobj->id to sign extend into a s64, so eg,
if we ever wish to return a negative value for a FD it is carried
properly.
This ensures that userspace values are never weirdly interpreted due to
the various truncations and everything that is really out of range gets
an EINVAL.
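The idea of carrying the handle as a s64 and range checking it in the type
implementation can be sketched in userspace C as below. This is a sketch
under assumptions, not the kernel code: check_idr_handle() and
check_fd_handle() are hypothetical names, and the upper bound used for IDR
handles is an assumption based on the u32 ABI type.

```c
#include <errno.h>
#include <limits.h>
#include <stdint.h>

/*
 * IDR handles are unsigned 32-bit values in the ABI, so a negative value or
 * one that does not fit in 32 bits is rejected before narrowing.
 */
static int check_idr_handle(int64_t id, uint32_t *out)
{
	if (id < 0 || id > (int64_t)UINT32_MAX)
		return -EINVAL;
	*out = (uint32_t)id;
	return 0;
}

/*
 * FD handles are signed. Anything outside the range of 'int' is rejected
 * instead of being silently truncated by a cast.
 */
static int check_fd_handle(int64_t id, int *out)
{
	if (id < INT_MIN || id > INT_MAX)
		return -EINVAL;
	*out = (int)id;
	return 0;
}
```

With an unchecked cast, a value such as 0x100000003 would typically be
narrowed to 3 and accepted as a valid handle; with the checks above it is
rejected.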
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
If the method fails after calling rdma_explicit_destroy (eg if
copy_to_user faults) then it will trigger a kernel oops:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
PGD 800000000548d067 P4D 800000000548d067 PUD 54a0067 PMD 0
SMP PTI
CPU: 0 PID: 359 Comm: ibv_rc_pingpong Not tainted 4.18.0-rc1+ #28
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
RIP: 0010: (null)
Code: Bad RIP value.
RSP: 0018:ffffc900001a3bf0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88000603bd00 RCX: 0000000000000003
RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff88000603bd00
RBP: 0000000000000001 R08: ffffc900001a3cf8 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffc900001a3cf0
R13: 0000000000000000 R14: ffffc900001a3cf0 R15: 0000000000000000
FS: 00007fb00dda8700(0000) GS:ffff880007c00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffffffffd6 CR3: 000000000548e004 CR4: 00000000003606b0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
? rdma_lookup_put_uobject+0x22/0x50 [ib_uverbs]
? uverbs_finalize_object+0x3b/0x60 [ib_uverbs]
? uverbs_finalize_attrs+0x128/0x140 [ib_uverbs]
? ib_uverbs_cmd_verbs+0x698/0x7c0 [ib_uverbs]
? find_held_lock+0x2d/0x90
? __might_fault+0x39/0x90
? ib_uverbs_ioctl+0x111/0x1f0 [ib_uverbs]
? do_vfs_ioctl+0xa0/0x6d0
? trace_hardirqs_on_caller+0xed/0x180
? _raw_spin_unlock_irq+0x24/0x40
? syscall_trace_enter+0x138/0x1d0
? ksys_ioctl+0x35/0x60
? __x64_sys_ioctl+0x11/0x20
? do_syscall_64+0x5b/0x1c0
? entry_SYSCALL_64_after_hwframe+0x49/0xbe
This is because the type was replaced with the null_type during explicit
destroy that cannot complete the destruction.
One of the side effects of replacing the type is to make the object
handle totally unreachable - so no other command could attempt to use
it, even though it remains on the uobject list.
We can get the same end result by just fully destroying the object inside
rdma_explicit_destroy and leaving the caller the residual kref for the
uobj with no attached HW object, and no presence in the uobjects list.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Instead of declaring and passing a dummy 'bad_wr' pointer, pass NULL
as third argument to ib_post_(send|recv|srq_recv)().
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The commit cited below checked that the port numbers provided in the
primary and alt AVs are legal.
That is sufficient to prevent a kernel panic. However, it is not
sufficient for correct operation.
In Linux, AVs (both primary and alt) must be completely self-described.
We do not accept an AV from userspace without an embedded port number.
(This has been the case since kernel 3.14 commit dbf727de74
("IB/core: Use GID table in AH creation and dmac resolution")).
For the primary AV, this embedded port number must match the port number
specified with IB_QP_PORT.
We also expect the port number embedded in the alt AV to match the
alt_port_num value passed by the userspace driver in the modify_qp command
base structure.
Add these checks to modify_qp.
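A simplified sketch of the two consistency checks, in userspace C. The
struct modify_cmd layout and check_ports() are hypothetical and only model
the rule stated above; the real modify_qp code has additional state
dependent handling that is not shown here.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, flattened view of the fields relevant to the checks. */
struct modify_cmd {
	bool has_port;		/* models the IB_QP_PORT attribute bit */
	bool has_av;		/* a primary AV was supplied */
	bool has_alt_path;	/* an alternate path was supplied */
	uint8_t port_num;	/* port given with IB_QP_PORT */
	uint8_t av_port_num;	/* port embedded in the primary AV */
	uint8_t alt_port_num;	/* alt_port_num from the command */
	uint8_t alt_av_port_num;/* port embedded in the alt AV */
};

/* Reject commands whose embedded AV ports disagree with the command. */
static int check_ports(const struct modify_cmd *cmd)
{
	if (cmd->has_av && cmd->has_port &&
	    cmd->av_port_num != cmd->port_num)
		return -EINVAL;

	if (cmd->has_alt_path &&
	    cmd->alt_av_port_num != cmd->alt_port_num)
		return -EINVAL;

	return 0;
}
```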
Cc: <stable@vger.kernel.org> # 4.16
Fixes: 5d4c05c3ee ("RDMA/uverbs: Sanitize user entered port numbers prior to access it")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This patch considers the case that ib_flow is created by some device
driver with its specific parameters using the KABI infrastructure.
In that case both QP and ib_uflow_resources might not be applicable.
Downstream patches from this series use the above functionality.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Introduce flow steering matcher object and its create and destroy methods.
This matcher object holds some mlx5 specific driver properties that
matches the underlay device specification when an mlx5 flow steering group
is created.
It will be used in downstream patches to be part of mlx5 specific create
flow method.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This variable isn't read and written to with proper locking, so it is
racy. Instead of using an unlocked bool use presence in the mc->list.
The caller could race rdma_join_multicast with rdma_leave_multicast which
would leak a mc join and cause a use after free of mc.
Instead, do not add the mc to the list until it has completed
initialization, all mcs on the list require leaving.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Simplify exit paths in ib_umem_get to use the standard goto unwind
pattern.
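For readers unfamiliar with the pattern, a minimal userspace sketch follows.
It is not the ib_umem_get code: struct buf_pair, buf_pair_init() and the
fail_step test hook are hypothetical and exist only to show how each label
undoes the steps that already succeeded.

```c
#include <errno.h>
#include <stdlib.h>

struct buf_pair {
	void *a;
	void *b;
};

/*
 * Allocate two buffers. fail_step is a hypothetical test hook: 1 or 2
 * simulates an allocation failure at that step, 0 means no forced failure.
 */
static int buf_pair_init(struct buf_pair *p, size_t len, int fail_step)
{
	int ret;

	p->a = NULL;
	p->b = NULL;

	p->a = (fail_step == 1) ? NULL : malloc(len);
	if (!p->a) {
		ret = -ENOMEM;
		goto out;
	}

	p->b = (fail_step == 2) ? NULL : malloc(len);
	if (!p->b) {
		ret = -ENOMEM;
		goto free_a;
	}

	return 0;

free_a:
	/* Undo only what was set up before the failing step. */
	free(p->a);
	p->a = NULL;
out:
	return ret;
}
```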
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
DMA mapping is a time consuming operation and doesn't need to be performed
while the mmap_sem semaphore is held.
The semaphore only needs to be held for accounting and get_user_pages
related activities.
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Enable uverbs_destroy_def_handler to be used by drivers and replace
current code to use it.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Userspace also needs to know if the port requires GRHs to properly form
the AVs it creates.
Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Extend the existing grh_required flag to check when AV's are handled that
a GRH is present.
Since we don't want to do query_port during the AV checks for performance
reasons move the flag into the immutable_data.
Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
grh_required is intended to be a global setting where all AV's will
require a GRH, not just the sm_lid. Move the special logic to the creation
of the SM AH.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
The internal flag IP_BASED_GIDS was added to a field that was being used
to hold the port Info CapabilityMask without considering the effects this
will have. Since most drivers just use the value from the HW MAD it means
IP_BASED_GIDS will also become set on any HW that sets the IBA flag
IsOtherLocalChangesNoticeSupported - which is not intended.
Fix this by keeping port_cap_flags only for the IBA CapabilityMask value
and store unrelated flags externally. Move the bit definitions for this to
ib_mad.h to make it clear what is happening.
To keep the uAPI unchanged define a new set of flags in the uapi header
that are only used by ib_uverbs_query_port_resp.port_cap_flags which match
the current flags supported in rdma-core, and the values exposed by the
current kernel.
Fixes: b4a26a2728 ("IB: Report using RoCE IP based gids in port caps")
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
roce_resolve_route_from_path() resolves the route based on the netdevice
of the GID attribute, therefore there is no point in checking again if
the resolved route matches the same interface it arrived on.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Instead we are now checking the function pointers directly. Get rid of
both cases in ioctl and drop the nonsense idea that destroy can fail.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Fixes: 0e353e34e1 ("IB/core: add RW API support for signature MRs")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Remove these two functions since all their callers have been removed.
See also commit ea8c2d8f60 ("RDMA/core: Remove unused ib cache
functions").
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
In cm_form_tid(), a two bit message sequence number is OR'ed into bit
31-30 of the lower TID value.
After commit f06d265375 ("IB/cm: Randomize starting comm ID"), the
local_id is XOR'ed with a 32-bit random value. Hence, bit 31-30 in the
lower TID now has an arbitrary value and it makes no sense to OR in
the message sequence number.
Adding to that, the evolution in use of IDR routines in cm_alloc_id()
has always had the possibility of returning a value with bit 30 set.
In addition, said bits are never checked.
Hence, remove the encoding and the corresponding enum.
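A small userspace sketch of why the encoding carries no information once
the ID is randomized. encode_seq() and decode_seq() are hypothetical helpers
that only mirror the bit layout described above, and the random mask value
is chosen for illustration.

```c
#include <stdint.h>

/* OR a two bit sequence number into bits 31-30 of the lower TID. */
static uint32_t encode_seq(uint32_t low_tid, uint32_t seq)
{
	return low_tid | ((seq & 0x3u) << 30);
}

/* Read bits 31-30 back out of the lower TID. */
static uint32_t decode_seq(uint32_t low_tid)
{
	return low_tid >> 30;
}
```

With a small, non-randomized ID such as 1, decode_seq(encode_seq(1, 1))
gives back 1. Once the ID has been XOR'ed with a random value that has bits
31-30 set, for example 1 ^ 0xC0000000, the same round trip yields 3, so the
sequence number cannot be recovered.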
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The only purpose for this structure was to hold the ib_uobject_file
pointer, but now that is part of the standard ib_uobject the structure
no longer makes any sense, so get rid of it.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Unnecessary clutter, to indirect through ucontext when the ufile would do.
Generally most of the code should only be working with ufile, except
for a few places that touch the driver interface.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
The ucontext isn't needed any more, just pass the uverbs_file directly.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
The correct handle to refer to the idr/etc is ib_uverbs_file, revise all
the core APIs to use this instead. The user API are left as wrappers
that automatically convert a ucontext to a ufile for now.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
The IDR is part of the ib_ufile so all the machinery to lock it, handle
closing and disassociation rightly belongs to the ufile not the ucontext.
This changes the lifetime of that data to match the lifetime of the file
descriptor which is always strictly longer than the lifetime of the
ucontext.
We need the entire locking machinery to continue to exist after ucontext
destruction to allow us to return the destroy data after a device has been
disassociated.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
This consolidates a bunch of repeated code patterns into a helper.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
After all the rework is done it is now possible to include single flags in
the type macros. Any user of UVERBS_ATTR_STRUCT needs to zero check data
past the end of the known struct to be correct, so make this mandatory,
and get rid of MIN_SZ_OR_ZERO as a user flag.
This changes UVERBS_ATTR_TYPE to refer to a struct of exact size with no
possibility of extension, convert the few users of UVERBS_ATTR_TYPE and
MIN_SZ_OR_ZERO to use UVERBS_ATTR_STRUCT.
The one user of UVERBS_ATTR_STRUCT without MIN_SZ_OR_ZERO is just
confused. There is some padding at the end of that struct, but userspace
always provides it with the padding. The construction doesn't test if the
padding is zero, so it is pointless. Just use UVERBS_ATTR_TYPE.
Finally, rename min_sz_or_zero to zero_trailing to better reflect what it
does and hopefully avoid such mis-uses in the future.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
This newer macro allows specifying a lower bound on the accepted size, and
has an 'unlimited' upper bound. Due to this it never checks for trailing
zeroing so it doesn't make any sense to combine it with MIN_SZ_OR_ZERO, so
drop MIN_SZ_OR_ZERO when they are used together.
There were a couple of places that open coded this pattern, switch them to
use the clearer UVERBS_ATTR_MIN_SIZE for clarity.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
This bit of boilerplate isn't really necessary, we can use bitfields
instead of a flags enum and the macros can then individually initialize
them through the __VA_ARGS__ like everything else.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Hide it inside the macros. The & is confusing and interferes with using
this as a generic DSL in later patches.
Since this also touches almost every line, also run the specs through
clang-format (with 'BinPackParameters: false') to make the maintenance
easier.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Instead of the large set of indirecting macros, define the few needed
macros to directly instantiate the struct uverbs_object_tree_def and
associated objects list.
This is a small amount of code duplication but the readability is far
better.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Instead of the large set of indirecting macros, define the few needed
macros to directly instantiate the struct uverbs_method_def and associated
attributes list.
This is a small amount of code duplication but the readability is far
better.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Instead of using a complex cascade of macros, just directly provide the
initializer list each of the declarations is trying to create.
Now that the macros are simplified this also reworks the uverbs_attr_spec
to be friendly to older compilers by eliminating any unnamed
structures/unions inside, and removing the duplication of some fields. The
structure size remains at 16 bytes which was the original motivation for
some of this oddness.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Two methods are sharing the same attribute constant, but the attribute
definitions are not the same. This should not have been done, instead
split them into two attributes with the same number.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
The specs are required to operate the uverbs file, so they belong inside
the ib_uverbs_device, not inside the ib_device. The spec passed in the
ib_device is just a communication from the driver and should not be used
during runtime.
This also changes the lifetime of the spec memory to match the
ib_uverbs_device, however at this time the spec_root can still contain
driver pointers after disassociation, so it cannot be used if ib_dev is
NULL. This is preparation for another series.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
"nents" is an unsigned int, so if ib_map_mr_sg() returns a negative
error code then it's type promoted to a high unsigned int which is
treated as success.
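The promotion problem can be reproduced in userspace C. This is a sketch
under assumptions: map_failed_buggy() and map_failed_fixed() are
hypothetical helpers, and the explicit cast in the buggy variant spells out
the conversion that C performs implicitly when an int is compared with an
unsigned int.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Buggy check: comparing an int result against an unsigned count converts
 * the result to unsigned, so a negative error code becomes a very large
 * value and is never "less than" nents.
 */
static bool map_failed_buggy(int ret, unsigned int nents)
{
	return (unsigned int)ret < nents;
}

/* Fixed check: keep the count signed so negative results compare as such. */
static bool map_failed_fixed(int ret, int nents)
{
	return ret < nents;
}
```

For ret = -ENOMEM and nents = 4 the buggy variant reports no failure, while
the fixed variant reports one.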
Fixes: a060b5629a ("IB/core: generic RDMA READ/WRITE API")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The conversion from offsetof() calculations to sizeof()
behaved wrongly when the exact size was missed and in scenarios with
more than one flow.
In such a scenario we got "create flow failed, flow 10: 8 bytes
left from uverb cmd" error, which is wrong because the size of
kern_spec is exactly 8 bytes, and we were not supposed to fail.
Cc: <stable@vger.kernel.org> # 3.12
Fixes: 4fae7f1704 ("RDMA/uverbs: Fix slab-out-of-bounds in ib_uverbs_ex_create_flow")
Reported-by: Ran Rozenstein <ranro@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Improve uverbs_cleanup_ucontext algorithm to work properly when the
topology graph of the objects cannot be determined at compile time. This
is the case with objects created via the devx interface in mlx5.
Typically uverbs objects must be created in a strict topologically sorted
order, so that LIFO ordering will generally cause them to be freed
properly. There are only a few cases (eg memory windows) where objects can
point to things out of the strict LIFO order.
Instead of using an explicit ordering scheme where the HW destroy is not
allowed to fail, go over the list multiple times and allow the destroy
function to fail. If progress halts then a final, desperate, cleanup is
done before leaking the memory. This indicates a driver bug.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
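The multi-pass idea in the uverbs_cleanup_ucontext change above can be sketched in userspace C. This is an illustration under stated assumptions, not the kernel algorithm: struct obj, try_destroy() and cleanup_all() are hypothetical, and "destroy fails" is modelled as "another live object still points at this one".

```c
#include <stdbool.h>
#include <stddef.h>

struct obj {
	int parent;	/* index of the object this one points to, or -1 */
	int users;	/* number of live objects pointing at this one */
	bool alive;
};

/* Destroy is allowed to fail while the object still has users. */
int try_destroy(struct obj *objs, size_t i)
{
	if (objs[i].users > 0)
		return -1;
	objs[i].alive = false;
	if (objs[i].parent >= 0)
		objs[objs[i].parent].users--;
	return 0;
}

/* Repeat LIFO passes until nothing is left or a pass makes no progress. */
size_t cleanup_all(struct obj *objs, size_t n)
{
	size_t remaining = 0;
	bool progress = true;

	for (size_t i = 0; i < n; i++)
		if (objs[i].alive)
			remaining++;

	while (remaining && progress) {
		progress = false;
		for (size_t i = n; i-- > 0;) {
			if (!objs[i].alive || try_destroy(objs, i))
				continue;
			remaining--;
			progress = true;
		}
	}
	return remaining;	/* non-zero: progress halted */
}
```

An object that points at a later-created object is simply retried on the next pass. A non-zero return value corresponds to the halted-progress case, where the commit falls back to a final cleanup.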
This lock used to be protecting a call to dst_get_neighbour_noref,
however the below commit changed it to dst_neigh_lookup which no longer
requires rcu.
Access to nud_state, neigh_event_send or rdma_copy_addr does not require
RCU, so delete the lock.
Fixes: 02b619555a ("infiniband: Convert dst_fetch_ha() over to dst_neigh_lookup().")
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
port_num is untrusted data from the user, so it should be checked after
calling fill_sgid_attr, which validates it.
Fixes: 8d9ec9addd ("IB/core: Add a sgid_attr pointer to struct rdma_ah_attr")
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
kern_spec->reserved is checked prior to calling
kern_spec_to_ib_spec_filter() which makes this second check redundant.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
dma_map_sg_attrs() returns 0 on error and can't return a negative number
(ensured by BUG_ON), so don't check.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The number of specs is provided by the user and in a valid case can be
equal to zero. Such an argument causes a call to kcalloc() with a
zero-length request, and ZERO_SIZE_PTR is returned. This pointer is
different from NULL and makes various if (..) checks succeed.
Fixes: b6ba4a9aa5 ("IB/uverbs: Add support for flow counters")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
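Why a zero-length allocation slips past NULL checks can be shown with a userspace stand-in. The two macros below are modelled on the kernel's slab definitions. fake_kcalloc(), looks_allocated() and usable() are hypothetical helpers for illustration only, and the sketch does not show how the commit itself handles the zero case.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Userspace stand-ins modelled on the kernel's slab definitions. */
#define ZERO_SIZE_PTR ((void *)16)
#define ZERO_OR_NULL_PTR(p) ((uintptr_t)(p) <= (uintptr_t)ZERO_SIZE_PTR)

/* Hypothetical allocator: a zero-length request yields ZERO_SIZE_PTR. */
void *fake_kcalloc(size_t n, size_t size)
{
	if (n == 0 || size == 0)
		return ZERO_SIZE_PTR;
	return calloc(n, size);
}

/* The kind of check that ZERO_SIZE_PTR defeats. */
int looks_allocated(const void *p)
{
	return p != NULL;
}

/* A check that also rejects the zero-size marker. */
int usable(const void *p)
{
	return !ZERO_OR_NULL_PTR(p);
}
```

For a request of zero elements, looks_allocated() is true although there is no memory behind the pointer, while usable() is false.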
The check of cmd.flow_attr.size should take into account the size of the
reserved field (2 bytes), otherwise the user can provide a size which will
cause a slab-out-of-bounds warning below.
==================================================================
BUG: KASAN: slab-out-of-bounds in ib_uverbs_ex_create_flow+0x1740/0x1d00
Read of size 2 at addr ffff880068dff1a6 by task syz-executor775/269
CPU: 0 PID: 269 Comm: syz-executor775 Not tainted 4.18.0-rc1+ #245
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
Call Trace:
dump_stack+0xef/0x17e
print_address_description+0x83/0x3b0
kasan_report+0x18d/0x4d0
ib_uverbs_ex_create_flow+0x1740/0x1d00
ib_uverbs_write+0x923/0x1010
__vfs_write+0x10d/0x720
vfs_write+0x1b0/0x550
ksys_write+0xc6/0x1a0
do_syscall_64+0xa7/0x590
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x433899
Code: fd ff 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89
f7 48 89 d6 48 89 ca 4d 89 c2 4d
89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 3b 91 fd ff c3 66
2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffc2724db58 EFLAGS: 00000217 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000020006880 RCX: 0000000000433899
RDX: 00000000000000e0 RSI: 0000000020002480 RDI: 0000000000000003
RBP: 00000000006d7018 R08: 00000000004002f8 R09: 00000000004002f8
R10: 00000000004002f8 R11: 0000000000000217 R12: 0000000000000000
R13: 000000000040cd20 R14: 000000000040cdb0 R15: 0000000000000006
Allocated by task 269:
kasan_kmalloc+0xa0/0xd0
__kmalloc+0x1a9/0x510
ib_uverbs_ex_create_flow+0x26c/0x1d00
ib_uverbs_write+0x923/0x1010
__vfs_write+0x10d/0x720
vfs_write+0x1b0/0x550
ksys_write+0xc6/0x1a0
do_syscall_64+0xa7/0x590
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Freed by task 0:
__kasan_slab_free+0x12e/0x180
kfree+0x159/0x630
detach_buf+0x559/0x7a0
virtqueue_get_buf_ctx+0x3cc/0xab0
virtblk_done+0x1eb/0x3d0
vring_interrupt+0x16d/0x2b0
__handle_irq_event_percpu+0x10a/0x980
handle_irq_event_percpu+0x77/0x190
handle_irq_event+0xc6/0x1a0
handle_edge_irq+0x211/0xd80
handle_irq+0x3d/0x60
do_IRQ+0x9b/0x220
The buggy address belongs to the object at ffff880068dff180
which belongs to the cache kmalloc-64 of size 64
The buggy address is located 38 bytes inside of
64-byte region [ffff880068dff180, ffff880068dff1c0)
The buggy address belongs to the page:
page:ffffea0001a37fc0 count:1 mapcount:0 mapping:ffff88006c401780
index:0x0
flags: 0x4000000000000100(slab)
raw: 4000000000000100 ffffea0001a31100 0000001100000011 ffff88006c401780
raw: 0000000000000000 00000000802a002a 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff880068dff080: fb fb fb fb fc fc fc fc fb fb fb fb fb fb fb fb
ffff880068dff100: fc fc fc fc fb fb fb fb fb fb fb fb fc fc fc fc
>ffff880068dff180: 00 00 00 00 07 fc fc fc fc fc fc fc fb fb fb fb
^
ffff880068dff200: fb fb fb fb fc fc fc fc 00 00 00 00 00 00 fc fc
ffff880068dff280: fc fc fc fc 00 00 00 00 00 00 00 00 fc fc fc fc
==================================================================
Cc: <stable@vger.kernel.org> # 3.12
Fixes: f884827438 ("IB/core: clarify overflow/underflow checks on ib_create/destroy_flow")
Cc: syzkaller <syzkaller@googlegroups.com>
Reported-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Flows can be created on UD and RAW_PACKET QP types. Attempts to provide
other QP types as input cause various unpredictable failures.
The reason is that in order to support all the various types (e.g. XRC),
we are supposed to use the real_qp handle and not the qp handle, and
expect the driver/FW to fail such (XRC) flows. The simpler and safer
variant is to ban all QP types except UD and RAW_PACKET, instead of
relying on the driver/FW.
Cc: <stable@vger.kernel.org> # 3.11
Fixes: 436f2ad05a ("IB/core: Export ib_create/destroy_flow through uverbs")
Cc: syzkaller <syzkaller@googlegroups.com>
Reported-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Following the removal of ib_create_flow(), adjust the code to get rid of
ib_destroy_flow() too.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
There are no kernel users of this interface, so let's drop it.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
In the accepted series "Refactor ib_uverbs_write path", we presented the
roadmap to get rid of uverbs_cmd_mask and uverbs_ex_cmd_mask fields in
favor of simple check of function pointer. So let's put NULL check of
create_flow function callback despite the fact that uverbs_ex_cmd_mask
still exists.
Link: https://www.spinics.net/lists/linux-rdma/msg60753.html
Suggested-by: Michael J Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Now that all users have been converted to use the version of these APIs
that returns a gid_attr pointer we can delete the old entry points.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Prior patches now ensure that the AV has a sgid_attr, if one would have
been required. Instead of querying for one, take it directly from the AH.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
While processing a path record entry in CM messages the associated GID
attribute is now also supplied.
Currently for RoCE a netdevice's net namespace pointer and ifindex are
stored in path record entry. Both of these fields of the netdev can change
anytime while processing CM messages. Additionally storing net namespace
without holding reference will lead to use-after-free crash. Therefore it
is removed. Netdevice information for RoCE is instead provided via
referenced gid attribute in ib_cm requests.
Such a design leads to a situation where the kernel can crash when the net
pointer becomes invalid. However today it is always initialized to
init_net, which cannot become invalid. In order to support processing
packets in any arbitrary namespace of the received packet, it is necessary
to avoid such conditions.
This patch removes the dependency on the net pointer and ifindex; instead
it will rely on SGID attribute which contains a pointer to netdev.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Make the sgid_attr available along with path information to the event
consumer. This allows the consumer to keep using the same GID table entry
that the event is related to.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Hold a reference to the sgid_attr which is used in a cm_id until the
cm_id is destroyed.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Use the sgid and other information from the path record to figure out the
sgid_attrs.
Store the selected table entry in the sgid_attr for everything else to
use.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
This is really just a CM support function. Normally a multicast address
does not have a specific SGID, but the RDMA CM usage model does restrict
things to the netdevice the CM id is bound to, at least for the RoCE case.
Store the selected table entry in the sgid_attr for everything else to
use.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
The work completion is inspected to determine what dgid table entry was
used to receive the packet, produce a sgid_attr that matches, and stick
it in the ah_attr.
All callers of this function are now required to release the ah_attr on
success.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
If we already hold the table->lock when doing the kref_put it means we are
in a context where it is safe to do the deletion synchronously, with no
need for the work queue.
This helps to eliminate issues when GID change is requested as part of MAC
address change or bonding event change where expectation is to replace the
GID almost immediately.
Fixes: b150c3862d ("IB/core: Introduce GID entry reference counts")
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When sending multicast leave request, consider the net ns in which this
cm_id is created.
Code was duplicated in cma_leave_mc_groups() and rdma_leave_multicast(),
which is now done using a helper function cma_leave_roce_mc_group().
Fixes: bee3c3c918 ("IB/cma: Join and leave multicast groups with IGMP")
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
In this context the uobject is not allowed to be NULL, so type is the same
as uobject->type, and at least for IDR, id is the same as uobject->id.
FD objects should never handle the FD number outside the uAPI boundary
code.
Suggested-by: Guy Levi <guyle@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
While __atomic_add_unless() was originally intended as a building-block
for atomic_add_unless(), it's now used in a number of places around the
kernel. It's the only common atomic operation named __atomic*(), rather
than atomic_*(), and for consistency it would be better named
atomic_fetch_add_unless().
This lack of consistency is slightly confusing, and gets in the way of
scripting atomics. Given that, let's clean things up and promote it to
an official part of the atomics API, in the form of
atomic_fetch_add_unless().
This patch converts definitions and invocations over to the new name,
including the instrumented version, using the following script:
----
git grep -w __atomic_add_unless | while read line; do
sed -i '{s/\<__atomic_add_unless\>/atomic_fetch_add_unless/}' "${line%%:*}";
done
git grep -w __arch_atomic_add_unless | while read line; do
sed -i '{s/\<__arch_atomic_add_unless\>/arch_atomic_fetch_add_unless/}' "${line%%:*}";
done
----
Note that we do not have atomic{64,_long}_fetch_add_unless(), which will
be introduced by later patches.
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Will Deacon <will.deacon@arm.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Palmer Dabbelt <palmer@sifive.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/20180621121321.4761-2-mark.rutland@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
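The semantics behind the atomic_fetch_add_unless() name (add unless the current value equals a given one, and return the value that was observed) can be sketched with C11 atomics. This is a userspace illustration, not the kernel implementation. fetch_add_unless() and inc_not_zero() are hypothetical names.

```c
#include <stdatomic.h>

/* Add 'a' to *v unless *v == u; return the value seen before the add. */
int fetch_add_unless(atomic_int *v, int a, int u)
{
	int c = atomic_load(v);

	while (c != u) {
		/* On failure, 'c' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(v, &c, c + a))
			break;
	}
	return c;
}

/* Boolean wrapper: take a reference only if the count is not zero. */
int inc_not_zero(atomic_int *v)
{
	return fetch_add_unless(v, 1, 0) != 0;
}
```

Returning the old value lets a boolean "did the add happen" helper be built on top, which is how the fetch variant serves as a building block.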
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma fixes from Jason Gunthorpe:
"Here are eight fairly small fixes collected over the last two weeks.
Regression and crashing bug fixes:
- mlx4/5: Fixes for issues found from various checkers
- A resource tracking and uverbs regression in the core code
- qedr: NULL pointer regression found during testing
- rxe: Various small bugs"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
IB/rxe: Fix missing completion for mem_reg work requests
RDMA/core: Save kernel caller name when creating CQ using ib_create_cq()
IB/uverbs: Fix ordering of ucontext check in ib_uverbs_write
IB/mlx4: Fix an error handling path in 'mlx4_ib_rereg_user_mr()'
RDMA/qedr: Fix NULL pointer dereference when running over iWARP without RDMA-CM
IB/mlx5: Fix return value check in flow_counters_set_data()
IB/mlx5: Fix memory leak in mlx5_ib_create_flow
IB/rxe: avoid double kfree skb
Drivers that use the IOCTL API may have the ib_uverbs_file and need a
way to get the related ib_ucontext from it, this is enabled by this
patch.
Downstream patches from this series will use it.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The ioctl parser framework wrongly assumed that each namespace is
populated. This could lead to NULL dereferences. Fix the parser to
always check that a given namespace indeed exists.
Fixes: fac9658cab ("IB/core: Add new ioctl interface")
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Sometimes the uverbs uAPI doesn't really care about the structure it gets
from user-space. All it wants to do is to allocate enough space and send
it to the hardware/provider driver. Add a UVERBS_ATTR_MIN_SIZE that
can be used for these scenarios. We use USHRT_MAX as the kernel known
size to bypass any zero validations.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Add the UVERBS_ATTR_SPEC_F_ALLOC_AND_COPY flag to PTR_IN attributes.
By using this flag, the parser automatically allocates and copies the
user-space data. This data is accessible by using the
uverbs_attr_get_len and uverbs_attr_get_alloced_ptr inline accessor
functions from the handler.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
uverbs_finalize_objects is currently used only to commit or abort
objects. Since we want to add automatic allocation/free of PTR_IN
attributes, move it to uverbs_ioctl.c and rename it to
uverbs_finalize_attrs.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Provider drivers may use the UVERBS_ATTR_FD and UVERBS_ATTR_IDR macros,
so they need to be exported.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This patch replaces the ib_device_attr.max_sge with max_send_sge and
max_recv_sge. It allows ulps to take advantage of devices that have very
different send and recv sge depths. For example cxgb4 has a max_recv_sge
of 4, yet a max_send_sge of 16. Splitting out these attributes allows
much more efficient use of the SQ for cxgb4 with ulps that use the RDMA_RW
API. Consider a large RDMA WRITE that has 16 scattergather entries.
With max_sge of 4, the ulp would send 4 WRITE WRs, but with max_sge of
16, it can be done with 1 WRITE WR.
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Selvin Xavier <selvin.xavier@broadcom.com>
Acked-by: Shiraz Saleem <shiraz.saleem@intel.com>
Acked-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
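The arithmetic in the max_send_sge example above is a rounded-up division. A minimal sketch, assuming each WR carries at most max_sge scatter/gather entries and that max_sge is non-zero; wrs_needed() is a hypothetical helper, not part of the RDMA_RW API:

```c
/* Number of WRs needed to cover 'nents' scatter/gather entries. */
unsigned int wrs_needed(unsigned int nents, unsigned int max_sge)
{
	/* Round up; the caller must pass max_sge > 0. */
	return (nents + max_sge - 1) / max_sge;
}
```

For 16 entries this gives 4 WRs with a limit of 4 and 1 WR with a limit of 16, matching the numbers in the commit message.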
Allocate agent IDs from a global IDR instead of an atomic variable.
This eliminates the possibility of reusing an ID which is already in
use after 4 billion registrations. We limit the assigned ID to be less
than 2^24 as the mlx4 driver uses the most significant byte of the agent
ID to store the slave number. Users unlucky enough to see a collision
between agent numbers and slave numbers see messages like:
mlx4_ib: egress mad has non-null tid msb:1 class:4 slave:0
and the MAD layer stops working.
We look up the agent under protection of the RCU lock, which means we
have to free the agent using kfree_rcu, and only increment the reference
counter if it is not 0.
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Reported-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
Acked-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Tested-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
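The reason for the 2^24 limit in the agent ID change above can be illustrated with a little bit arithmetic. This is a sketch under the assumption that a 32-bit value carries the slave number in its most significant byte and the agent ID in the rest. pack_tid(), tid_msb() and id_is_safe() are hypothetical helpers, not mlx4 or MAD layer functions.

```c
#include <stdint.h>

#define AGENT_ID_LIMIT (UINT32_C(1) << 24)

/* An ID below 2^24 leaves the most significant byte free. */
int id_is_safe(uint32_t id)
{
	return id < AGENT_ID_LIMIT;
}

/* Hypothetical packing: slave number in the top byte, ID below it. */
uint32_t pack_tid(uint8_t slave, uint32_t id)
{
	return ((uint32_t)slave << 24) | id;
}

uint8_t tid_msb(uint32_t tid)
{
	return (uint8_t)(tid >> 24);
}
```

Packing slave 0 with an ID of 0x01000005 yields a top byte of 1, the same kind of mismatch as in the quoted "msb:1 ... slave:0" message.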
If the AH has a GRH then hold a reference to the sgid_attr inside the
common struct.
If the QP is modified with an AV that includes a GRH then also hold a
reference to the sgid_attr inside the common struct.
This informs the cache that the sgid_index is in-use so long as the AH or
QP using it exists.
This also means that all drivers can access the sgid_attr directly from
the ah_attr instead of querying the cache during their UD post-send paths.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Introduce AH attribute copy, move and replace APIs to be used by core and
provider drivers.
In the CM code flow the ah attribute might be re-initialized twice while
processing an incoming request, or initialized once from the path record
while sending out CM requests. Therefore use the rdma_move_ah_attr API to
handle such scenarios instead of memcpy().
Provider drivers keep a copy of the ah_attr during the lifetime of the ah.
Therefore, use rdma_replace_ah_attr(), which conditionally releases the
reference to the old ah_attr and holds a reference to the new attribute,
whose reference is released when the AH is freed.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
No reason to call rdma_ah_retrieve_grh, tidy whitespace, and add a
function comment block.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
The sgid_attr will ultimately replace the sgid_index in the ah_attr.
This will allow for all layers to have a consistent view of what
gid table entry was selected as processing runs through all stages of the
stack.
This commit introduces the pointer and ensures it is set before calling
any driver callback that includes a struct ah_attr callback, allowing
future patches to adjust both the drivers and the callers to use
sgid_attr instead of sgid_index.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Since we are adding some new fields to this structure it is safest if all
users reliably initialize the struct to zero.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Document this (it's implicitly true due to sleeping operations already
in use in both registration and deregistration). Use this fact to use
spin_lock_irq instead of spin_lock_irqsave. This improves performance
slightly.
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Simplify the flow_resources_alloc() function call by reducing
number of goto statements.
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Port capability flag represents IBTA PortInfo:CapabilityMask,
but was mistakenly mixed with non-relevant fields. Return that
information for IB only.
Link: https://patchwork.kernel.org/patch/10386245/
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
If the gid_attr argument is NULL then the functions behave identically to
rdma_query_gid. ib_query_gid just calls ib_get_cached_gid, so everything
can be consolidated to one function.
Now that all callers either use rdma_query_gid() or ib_get_cached_gid(),
ib_query_gid() API is removed.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
There is no reason to restrict this function to RoCE only these days;
allow the filter function to be called on any protocol.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
These versions are functionally similar but all return gid_attrs and
related information via reference instead of via copy.
The old API is preserved, implemented as wrappers around the new, until
all callers can be converted.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
These call sites have a use of ib_query_gid with a simple lifetime for the
struct gid_attr pointer, with an easy conversion.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This patch introduces three APIs, rdma_get_gid_attr(),
rdma_put_gid_attr(), and rdma_hold_gid_attr() which expose the reference
counting for GID table entries to the entire stack. The kref counting is
based on the struct ib_gid_attr pointer.
Later patches will convert more cache query functions to return struct
ib_gid_attrs.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Now that ib_gid_attr contains the GID, make use of that in the add_gid()
callback functions for the provider drivers to simplify the add_gid()
implementations.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
In order to be able to expose pointers to the ib_gid_attrs in the GID
table we need to make it so the value of the pointer cannot be
changed. Thus each GID table entry gets a unique piece of kref'd memory
that is written only during initialization and remains constant for its
lifetime.
This eventually will allow the struct ib_gid_attrs to be returned without
copy from many of the query APIs, but it also provides a way to track when
all users of a HW table index go away.
For RoCE we no longer allow an in-use HW table index to be re-used for a
new and different entry. When a GID table entry needs to be removed it is
hidden from the find API, but remains as a valid HW index and all
ib_gid_attr pointers remain valid. The HW index is not released until all
users put the kref.
Later patches will broadly replace the use of the sgid_index integer with
the kref'd structure.
Ultimately this will prevent security problems where the OS changes the
properties of a HW GID table entry while an active user object is still
using the entry.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
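The lifetime rule described above (a removed entry is hidden from lookups, but its HW index stays reserved until the last reference is dropped) can be sketched as follows. This is a single-threaded illustration with hypothetical names (struct gid_entry, entry_get(), entry_put(), entry_del()). The kernel uses a kref and locking that are omitted here.

```c
#include <stdbool.h>
#include <stddef.h>

struct gid_entry {
	int refs;	/* the table itself holds one reference */
	bool visible;	/* false once deleted: lookups skip the entry */
	bool hw_in_use;	/* HW index stays reserved until refs reaches 0 */
};

void entry_init(struct gid_entry *e)
{
	e->refs = 1;
	e->visible = true;
	e->hw_in_use = true;
}

/* Drop one reference; the last put releases the HW index. */
void entry_put(struct gid_entry *e)
{
	if (--e->refs == 0)
		e->hw_in_use = false;
}

/* Lookup: only visible entries are returned, with a reference taken. */
struct gid_entry *entry_get(struct gid_entry *e)
{
	if (!e->visible)
		return NULL;
	e->refs++;
	return e;
}

/* Delete: hide the entry and drop the table's own reference. */
void entry_del(struct gid_entry *e)
{
	e->visible = false;
	entry_put(e);
}
```

A user that took a reference before the delete keeps a valid pointer, and the HW index only becomes reusable after that user calls entry_put().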
There are at most one or two default GIDs for RoCE. Instead of storing
a default GID property for all the GIDs, store default GID indices as
individual bits per table.
This allows a future simplification to get rid of the GID property field.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
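Tracking default GID indices as bits in a per-table mask, rather than as a property on every entry, might look like the sketch below. The struct layout and helper names are assumptions for illustration, and the index is assumed to be below 32.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical table: one bit per index that holds a default GID. */
struct gid_table {
	uint32_t default_gid_indices;
};

/* idx must be below 32 in this sketch. */
void mark_default(struct gid_table *t, unsigned int idx)
{
	t->default_gid_indices |= UINT32_C(1) << idx;
}

bool is_default(const struct gid_table *t, unsigned int idx)
{
	return (t->default_gid_indices & (UINT32_C(1) << idx)) != 0;
}
```

With only one or two default entries per table, a single word replaces a flag stored in every entry.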
When default GIDs are added, their gid type is set by
ib_cache_gid_set_default_gid(). There is no need to set the gid type of a
free GID entry during GID table initialization.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The flows were hidden from the C compiler; expose them as a zero-length
array to allow struct_size to work.
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
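What exposing the trailing array buys can be shown with a userspace analogue of a struct_size()-style computation. This is a sketch, not the kernel macro: struct flow_set, flow_set_size() and flow_set_alloc() are hypothetical, a C99 flexible array member stands in for the zero-length array, and overflow is reported by saturating to SIZE_MAX.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct flow {
	int id;
};

struct flow_set {
	size_t num;
	struct flow flows[];	/* trailing array visible to the compiler */
};

/* Header plus n elements, or SIZE_MAX if the arithmetic would overflow. */
size_t flow_set_size(size_t n)
{
	if (n > (SIZE_MAX - sizeof(struct flow_set)) / sizeof(struct flow))
		return SIZE_MAX;
	return sizeof(struct flow_set) + n * sizeof(struct flow);
}

struct flow_set *flow_set_alloc(size_t n)
{
	size_t bytes = flow_set_size(n);
	struct flow_set *s;

	if (bytes == SIZE_MAX)
		return NULL;	/* refuse requests that would overflow */
	s = calloc(1, bytes);
	if (s)
		s->num = n;
	return s;
}
```

Because the array is part of the type, the element size comes from the declaration instead of being repeated by hand at each allocation site.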
During disassociation the ucontext will become NULL, however due to how
the SRCU locking works the ucontext must only be examined after looking
at the ib_dev, which governs the RCU control flow.
With the wrong ordering userspace will see EINVAL instead of EIO for a
disassociated uverbs FD, which breaks rdma-core.
Cc: stable@vger.kernel.org
Fixes: 491d5c6a30 ("RDMA/uverbs: Move uncontext check before SRCU read lock")
Reported-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"This has been a quiet cycle for RDMA, the big bulk is the usual
smallish driver updates and bug fixes. About four new uAPI related
things. Not as many syzkaller patches this time, the bugs it finds are
getting harder to fix.
Summary:
- More work cleaning up the RDMA CM code
- Usual driver bug fixes and cleanups for qedr, qib, hfi1, hns,
i40iw, iw_cxgb4, mlx5, rxe
- Driver specific resource tracking and reporting via netlink
- Continued work for name space support from Parav
- MPLS support for the verbs flow steering uAPI
- A few tricky IPoIB fixes improving robustness
- HFI1 driver support for the '16B' management packet format
- Some auditing to not print kernel pointers via %llx or similar
- Mark the entire 'UCM' user-space interface as BROKEN with the
intent to remove it entirely. The user space side of this was long
ago replaced with RDMA-CM and syzkaller is finding bugs in the
residual UCM interface nobody wishes to fix because nobody uses it.
- Purge more bogus BUG_ON's from Leon
- 'flow counters' verbs uAPI
- T10 fixups for iser/isert, these are Acked by Martin but going
through the RDMA tree due to dependencies"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (138 commits)
RDMA/mlx5: Update SPDX tags to show proper license
RDMA/restrack: Change SPDX tag to properly reflect license
IB/hfi1: Fix comment on default hdr entry size
IB/hfi1: Rename exp_lock to exp_mutex
IB/hfi1: Add bypass register defines and replace blind constants
IB/hfi1: Remove unused variable
IB/hfi1: Ensure VL index is within bounds
IB/hfi1: Fix user context tail allocation for DMA_RTAIL
IB/hns: Use zeroing memory allocator instead of allocator/memset
infiniband: fix a possible use-after-free bug
iw_cxgb4: add INFINIBAND_ADDR_TRANS dependency
IB/isert: use T10-PI check mask definitions from core layer
IB/iser: use T10-PI check mask definitions from core layer
RDMA/core: introduce check masks for T10-PI offload
IB/isert: fix T10-pi check mask setting
IB/mlx5: Add counters read support
IB/mlx5: Add flow counters read support
IB/mlx5: Add flow counters binding support
IB/mlx5: Add counters create and destroy support
IB/uverbs: Add support for flow counters
...
- Use overflow helpers in 2-factor allocators (Kees, Rasmus)
- Introduce overflow test module (Rasmus, Kees)
- Introduce saturating size helper functions (Matthew, Kees)
- Treewide use of struct_size() for allocators (Kees)
-----BEGIN PGP SIGNATURE-----
Comment: Kees Cook <kees@outflux.net>
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAlsYJ1gWHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJlCTEACwdEeriAd2VwxknnsstojGD/3g
8TTFA19vSu4Gxa6WiDkjGoSmIlfhXTlZo1Nlmencv16ytSvIVDNLUIB3uDxUIv1J
2+dyHML9JpXYHHR7zLXXnGFJL0wazqjbsD3NYQgXqmun7EVVYnOsAlBZ7h/Lwiej
jzEJd8DaHT3TA586uD3uggiFvQU0yVyvkDCDONIytmQx+BdtGdg9TYCzkBJaXuDZ
YIthyKDvxIw5nh/UaG3L+SKo73tUr371uAWgAfqoaGQQCWe+mxnWL4HkCKsjFzZL
u9ouxxF/n6pij3E8n6rb0i2fCzlsTDdDF+aqV1rQ4I4hVXCFPpHUZgjDPvBWbj7A
m6AfRHVNnOgI8HGKqBGOfViV+2kCHlYeQh3pPW33dWzy/4d/uq9NIHKxE63LH+S4
bY3oO2ela8oxRyvEgXLjqmRYGW1LB/ZU7FS6Rkx2gRzo4k8Rv+8K/KzUHfFVRX61
jEbiPLzko0xL9D53kcEn0c+BhofK5jgeSWxItdmfuKjLTW4jWhLRlU+bcUXb6kSS
S3G6aF+L+foSUwoq63AS8QxCuabuhreJSB+BmcGUyjthCbK/0WjXYC6W/IJiRfBa
3ZTxBC/2vP3uq/AGRNh5YZoxHL8mSxDfn62F+2cqlJTTKR/O+KyDb1cusyvk3H04
KCDVLYPxwQQqK1Mqig==
=/3L8
-----END PGP SIGNATURE-----
Merge tag 'overflow-v4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull overflow updates from Kees Cook:
"This adds the new overflow checking helpers and adds them to the
2-factor argument allocators. And this adds the saturating size
helpers and does a treewide replacement for the struct_size() usage.
Additionally this adds the overflow testing modules to make sure
everything works.
I'm still working on the treewide replacements for allocators with
"simple" multiplied arguments:
*alloc(a * b, ...) -> *alloc_array(a, b, ...)
and
*zalloc(a * b, ...) -> *calloc(a, b, ...)
as well as the more complex cases, but that's separable from this
portion of the series. I expect to have the rest sent before -rc1
closes; there are a lot of messy cases to clean up.
Summary:
- Introduce arithmetic overflow test helper functions (Rasmus)
- Use overflow helpers in 2-factor allocators (Kees, Rasmus)
- Introduce overflow test module (Rasmus, Kees)
- Introduce saturating size helper functions (Matthew, Kees)
- Treewide use of struct_size() for allocators (Kees)"
* tag 'overflow-v4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
treewide: Use struct_size() for devm_kmalloc() and friends
treewide: Use struct_size() for vmalloc()-family
treewide: Use struct_size() for kmalloc()-family
device: Use overflow helpers for devm_kmalloc()
mm: Use overflow helpers in kvmalloc()
mm: Use overflow helpers in kmalloc_array*()
test_overflow: Add memory allocation overflow tests
overflow.h: Add allocation size calculation helpers
test_overflow: Report test failures
test_overflow: macrofy some more, do more tests for free
lib: add runtime test of check_*_overflow functions
compiler.h: enable builtin overflow checkers and add fallback code
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:
struct foo {
	int stuff;
	void *entry[];
};
instance = kmalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);
Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:
instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL);
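As a rough illustration of what such a helper guards against, the following user-space sketch computes the same "header plus count elements" size and saturates to SIZE_MAX when the arithmetic would overflow. The names flex_size() and foo_alloc() are hypothetical stand-ins, not the kernel's API, and the sketch assumes the GCC/Clang overflow builtins are available:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct foo {
	int stuff;
	void *entry[];	/* flexible array member */
};

/*
 * Hypothetical user-space stand-in for a struct_size()-style helper:
 * returns header + count * elem, or SIZE_MAX if that would overflow.
 */
static size_t flex_size(size_t header, size_t elem, size_t count)
{
	size_t bytes;

	assert(elem > 0);	/* an element always occupies some bytes */
	if (__builtin_mul_overflow(elem, count, &bytes) ||
	    __builtin_add_overflow(bytes, header, &bytes))
		return SIZE_MAX;	/* saturate instead of wrapping */
	return bytes;
}

/* Allocate a zeroed struct foo with room for count trailing entries. */
static struct foo *foo_alloc(size_t count)
{
	size_t bytes = flex_size(sizeof(struct foo), sizeof(void *), count);

	if (bytes == SIZE_MAX)
		return NULL;	/* refuse an impossible request */
	return calloc(1, bytes);
}
```

The point of saturating rather than wrapping is that an oversized request fails cleanly instead of silently producing a buffer that is too small.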
This patch makes the changes for kmalloc()-family (and kvmalloc()-family)
uses. It was done via automatic conversion with manual review for the
"CHECKME" non-standard cases noted below, using the following Coccinelle
script:
// pkey_cache = kmalloc(sizeof *pkey_cache + tprops->pkey_tbl_len *
// sizeof *pkey_cache->table, GFP_KERNEL);
@@
identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
expression GFP;
identifier VAR, ELEMENT;
expression COUNT;
@@
- alloc(sizeof(*VAR) + COUNT * sizeof(*VAR->ELEMENT), GFP)
+ alloc(struct_size(VAR, ELEMENT, COUNT), GFP)
// mr = kzalloc(sizeof(*mr) + m * sizeof(mr->map[0]), GFP_KERNEL);
@@
identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
expression GFP;
identifier VAR, ELEMENT;
expression COUNT;
@@
- alloc(sizeof(*VAR) + COUNT * sizeof(VAR->ELEMENT[0]), GFP)
+ alloc(struct_size(VAR, ELEMENT, COUNT), GFP)
// Same pattern, but can't trivially locate the trailing element name,
// or variable name.
@@
identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
expression GFP;
expression SOMETHING, COUNT, ELEMENT;
@@
- alloc(sizeof(SOMETHING) + COUNT * sizeof(ELEMENT), GFP)
+ alloc(CHECKME_struct_size(&SOMETHING, ELEMENT, COUNT), GFP)
Signed-off-by: Kees Cook <keescook@chromium.org>
Resource tracking is supposed to be dual licensed: GPL-2.0 and
OpenIB, but the SPDX tag was not compliant to it. Update the tag to
properly reflect license.
Fixes: 02d8883f52 ("RDMA/restrack: Add general infrastructure to track RDMA resources")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
ucma_process_join() will free the new allocated "mc" struct,
if there is any error after that, especially the copy_to_user().
But in parallel, ucma_leave_multicast() could find this "mc"
through idr_find() before ucma_process_join() frees it, since it
is already published.
So "mc" could be used in ucma_leave_multicast() after it has been
allocated and freed in ucma_process_join(), since we don't refcnt
it.
Fix this by separating "publish" from ID allocation, so that we
can get an ID first and publish it later after copy_to_user().
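The reserve-then-publish idea can be sketched in user space as follows. This is a simplified, single-threaded model with hypothetical names (id_reserve(), id_publish(), id_release(), id_find()); it is not the kernel's idr API and it leaves out all locking:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_IDS 8

struct mc { int id; };

/* A slot is free, reserved (ID taken, invisible to lookups) or published. */
enum slot_state { SLOT_FREE, SLOT_RESERVED, SLOT_PUBLISHED };

static struct { enum slot_state state; struct mc *obj; } table[MAX_IDS];

/* Step 1: take an ID without exposing any object. Returns -1 if full. */
static int id_reserve(void)
{
	for (int i = 0; i < MAX_IDS; i++) {
		if (table[i].state == SLOT_FREE) {
			table[i].state = SLOT_RESERVED;
			return i;
		}
	}
	return -1;
}

/* Step 2: once every fallible step has succeeded, expose the object. */
static void id_publish(int id, struct mc *obj)
{
	assert(id >= 0 && id < MAX_IDS && table[id].state == SLOT_RESERVED);
	table[id].obj = obj;
	table[id].state = SLOT_PUBLISHED;
}

/* Error path: give the ID back; no lookup can have seen the object. */
static void id_release(int id)
{
	assert(id >= 0 && id < MAX_IDS);
	table[id].obj = NULL;
	table[id].state = SLOT_FREE;
}

/* Lookups only ever return published objects. */
static struct mc *id_find(int id)
{
	if (id < 0 || id >= MAX_IDS || table[id].state != SLOT_PUBLISHED)
		return NULL;
	return table[id].obj;
}
```

Because a reserved slot never shows up in id_find(), the object can still be freed on an error path without any other caller holding a pointer to it.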
Fixes: c8f6a362bf ("RDMA/cma: Add multicast communication support")
Reported-by: Noam Rathaus <noamr@beyondsecurity.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This series comes to allow user space applications to monitor real time
traffic activity and events of the verbs objects it manages, e.g.:
ibv_qp, ibv_wq, ibv_flow.
This API enables creation of generic counters and defines how they are
mapped to a verbs object; the current mlx5 driver uses
this API for flow counters.
With this API, an application can monitor the entire life cycle of
object activity, defined here as a static counters attachment.
This API also allows dynamic counters monitoring of measurement points
for a partial period in the verbs object life cycle.
In addition it presents the implementation of the generic counters interface.
This is achieved by extending flow creation with a new flow count
specification type, which lets the user associate a previously created
counters object (made through the generic verbs counters interface) with
the created flow. Once associated, the user can read statistics by using
the read function of the generic counters interface.
The API includes:
1. create and destroy API for new counters objects
2. read the counters values from HW
Note:
The attach API, which lets an application define the measurement points per
object, is a user space only API; this data is passed to the kernel when the
counted object (e.g. flow) is created with the counters object.
-----BEGIN PGP SIGNATURE-----
iHQEABYIAB0WIQT1m3YD37UfMCUQBNwp8NhrnBAZsQUCWxIiqQAKCRAp8NhrnBAZ
sWJRAPYl06nEfQjRlW//ZE/pO2oKXbfEevg7nnbpe80ERlxLAQDA2LHAcU7ma/NC
hS5yxIq1gLSA27N+5qAoFVK8vJ5ZCg==
=EiAV
-----END PGP SIGNATURE-----
Merge tag 'verbs_flow_counters' of git://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git into for-next
Pull verbs counters series from Leon Romanovsky:
====================
Verbs flow counters support
This series comes to allow user space applications to monitor real time
traffic activity and events of the verbs objects it manages, e.g.: ibv_qp,
ibv_wq, ibv_flow.
The API enables creation of generic counters and defines how they are
mapped to a verbs object; the current mlx5 driver uses this API
for flow counters.
With this API, an application can monitor the entire life cycle of object
activity, defined here as a static counters attachment. This API also
allows dynamic counters monitoring of measurement points for a partial
period in the verbs object life cycle.
In addition it presents the implementation of the generic counters
interface.
This is achieved by extending flow creation with a new flow count
specification type, which lets the user associate a previously created
counters object (made through the generic verbs counters interface) with
the created flow. Once associated, the user can read statistics by using
the read function of the generic counters interface.
The API includes:
1. create and destroy API for new counters objects
2. read the counters values from HW
Note:
The attach API, which lets an application define the measurement points
per object, is a user space only API; this data is passed to the kernel
when the counted object (e.g. flow) is created with the counters object.
===================
* tag 'verbs_flow_counters':
IB/mlx5: Add counters read support
IB/mlx5: Add flow counters read support
IB/mlx5: Add flow counters binding support
IB/mlx5: Add counters create and destroy support
IB/uverbs: Add support for flow counters
IB/core: Add support for flow counters
IB/core: Support passing uhw for create_flow
IB/uverbs: Add read counters support
IB/core: Introduce counters read verb
IB/uverbs: Add create/destroy counters support
IB/core: Introduce counters object and its create/destroy
IB/uverbs: Add an ib_uobject getter to ioctl() infrastructure
net/mlx5: Export flow counter related API
net/mlx5: Use flow counter pointer as input to the query function
The struct ib_uverbs_flow_spec_action_count associates a counters object
with the flow.
After this association, the flow counters can be read via the counters
object.
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Raed Salem <raeds@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This is required when user-space drivers need to pass extra information
regarding how to handle this flow steering specification.
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This patch exposes the read counters verb to user space applications. With
this verb the user can read the hardware counters which are associated
with the counters object.
The application needs to provide sufficient memory to hold the
statistics.
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Raed Salem <raeds@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
A user space application that uses the counters functionality is expected
to allocate/release the counters resources by calling the create/destroy
verbs, and in turn gets a unique handle that can be used to attach the
counters to its counted type.
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Raed Salem <raeds@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Previously, the user had to dig inside the attribute to get the uobject.
Add a helper function that correctly extracts it (and does the required
checks) for them.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
There is no need to check the existence of mad_queue, because the pointer
is already dereferenced before the call to dequeue_mad().
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
There is no need to bring down the whole machine just because an unknown
event was received. It is better to ignore it silently.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
If the CM work list is empty, the work pointer will be NULL,
so instead of crashing the kernel it is better to abort processing
of the work items.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch hoists the common processing of the disassociate_ucontext
callback function into the IB core code, since this code is common
to every ib_device driver.
Signed-off-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This once might have made sense when cache.c was in a different module
from device.c, but today it is just obfuscation. Get rid of the wrappers
and call roce_gid_mgmt_init()/cleanup() directly.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
When a GID entry is invalid, EAGAIN is returned. This is an incorrect error
code: there is nothing that will make this GID entry valid again in
bounded time.
Some user space tools fail incorrectly if EAGAIN is returned here, and
this represents a small ABI change from earlier kernels.
The first patch in the Fixes list makes entries that were valid before
to become invalid, allowing this code to trigger, while the second patch
in the Fixes list introduced the wrong EAGAIN.
Therefore revert the return result to EINVAL which matches the historical
expectations of the ibv_query_gid_type() API of the libibverbs user space
library.
Cc: <stable@vger.kernel.org>
Fixes: 598ff6bae6 ("IB/core: Refactor GID modify code for RoCE")
Fixes: 03db3a2d81 ("IB/core: Add RoCE GID table management")
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Update mlx4 to support user MR creation against read-only memory, previously
it required the memory to be writable.
Based on rdma for-rc due to dependencies.
* mr_fix: (2 commits)
IB/mlx4: Mark user MR as writable if actual virtual memory is writable
IB/core: Make testing MR flags for writability a static inline function
Make the MR writability flags check, which is performed in umem.c,
a static inline function in file ib_verbs.h
This allows the function to be used by low-level infiniband drivers.
Cc: <stable@vger.kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
There are several places a gid table is accessed.
Add a tiny helper function rdma_gid_table() to avoid code
duplication at such places.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Instead of open coding memcmp() to check whether a given GID is zero or
not, use a helper function to do so, and replace instances of
memcpy(z,&zgid) with memset.
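A minimal user-space sketch of such a helper is shown below. The struct gid type and the names gid_is_zero() and gid_clear() are illustrative assumptions, not the kernel's definitions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Illustrative 128-bit GID; a stand-in for the kernel's own GID type. */
struct gid { uint8_t raw[16]; };

/* Hypothetical helper: true when every byte of the GID is zero. */
static bool gid_is_zero(const struct gid *gid)
{
	static const struct gid zero = { { 0 } };

	assert(gid != NULL);
	return memcmp(gid, &zero, sizeof(*gid)) == 0;
}

/* Clearing a GID needs no zero template to copy from; memset() is enough. */
static void gid_clear(struct gid *gid)
{
	assert(gid != NULL);
	memset(gid, 0, sizeof(*gid));
}
```

Keeping the comparison in one helper means callers state their intent ("is this GID zero?") instead of repeating the memcmp() arguments at every site.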
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
In commit 357d23c811a7 ("Remove the obsolete libibcm library")
in rdma-core [1], we removed obsolete library which used the
/dev/infiniband/ucmX interface.
Following multiple syzkaller reports about non-sanitized
user input in the UCMA module, the short audit reveals the same
issues in UCM module too.
It is better to disable this interface in the kernel,
before syzkaller team invests time and energy to harden
this unused interface.
[1] https://github.com/linux-rdma/rdma-core/pull/279
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Remove duplicate declaration of gid_cache_wq.
Fixes: d41861942 ("IB/core: Add generic function to extract IB speed from netdev")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Add a helper function for iwarp drivers to be able to map an
rdma_cm_id to an iw_cm_id. This is useful for dumping driver specific
NLDEV/RESTRACK connection state.
Add a helper to return the rdma_cm_id pointer from the rdma_restrack
pointer. This is needed for rdma drivers to map a res entry back to
the public rdma_cm_id struct.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add a new MPLS steering match filter that can match against
a single MPLS tag field.
Since the MPLS header can reside in different locations in the packet's
protocol stack as well as be encapsulated with a tunnel protocol, it
is required to know the exact location of the header in the protocol
stack.
Therefore, when including the MPLS protocol spec in the specs list,
it is mandatory to provide the list in an ordered manner, so
that it represents the actual header order in a matching packet.
Drivers that process the spec list and apply the matching rule
should treat the position of the MPLS spec in the spec list as the
actual location of the MPLS label in the packet's protocol stack.
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Add a new GRE steering match filter that can match against
key and protocol fields.
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
During CM request processing flow, ah_attr is initialized twice.
First based on wc, then based on the primary path record.
ah_attr initialization from the path record can fail, which leaves ah_attr
zeroed out.
Therefore, always initialize ah_attr on stack during reinitialization
phase. If ah_attr init is successful, use the new ah_attr by
overwriting the old one. If the ah_attr init fails, continue to use the
last ah_attr.
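The "build on the stack, commit only on success" pattern described above can be sketched like this. The struct layout and both function names are hypothetical and only model the control flow, not the CM code itself:

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Illustrative attribute block; not the kernel's address handle type. */
struct ah_attr { int dlid; int sl; };

/* Hypothetical initialiser that can fail and clobbers *out either way. */
static int ah_attr_init_from_path(struct ah_attr *out, int dlid, int sl)
{
	memset(out, 0, sizeof(*out));
	if (dlid <= 0)
		return -EINVAL;
	out->dlid = dlid;
	out->sl = sl;
	return 0;
}

/* Re-initialise *cur; on failure *cur is left exactly as it was. */
static int ah_attr_reinit(struct ah_attr *cur, int dlid, int sl)
{
	struct ah_attr tmp;	/* scratch copy on the stack */
	int ret;

	assert(cur != NULL);
	ret = ah_attr_init_from_path(&tmp, dlid, sl);
	if (ret)
		return ret;	/* keep using the previous attributes */
	*cur = tmp;		/* success: overwrite the old value */
	return 0;
}
```

Since the fallible initialiser only ever writes to the scratch copy, a failure cannot leave the live attributes zeroed.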
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
During CM LAP processing, ah_attr is reinitialized on receiving LAP
request. First likely during CM request processing.
ah_attr might get zeroed out if LAP processing fails.
Therefore, attempt to create new ah_attr for the LAP message.
If the initialization fails, continue with older ah_attr.
If the initialization passes, consider the new ah_attr by overwriting
the older one.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
AH attribute of the cm_id can be overwritten if LAP message is received
on CM request which is in progress. This bug got introduced to avoid
sleeping when spin lock is held as part of commit in Fixes tag.
Therefore validate the cm_id state first and continue to perform AV
ah_attr initialization.
Given that Alternative path related messages are not supported for
RoCE, init_av_from_response/path for such messages is ok to be called
from blocking context.
Fixes: 33f93e1ebc ("IB/cm: Fix sleeping while spin lock is held")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
User-space may invoke ibv_reg_mr and ibv_dereg_mr in different threads.
If ibv_dereg_mr is called after the thread which invoked ibv_reg_mr has
exited, get_pid_task will return NULL and ib_umem_release will not
decrease mm->pinned_vm.
Instead of using threads to locate the mm, use the overall tgid from the
ib_ucontext struct instead. This matches the behavior of ODP and
disassociate in handling the mm of the process that called ibv_reg_mr.
Cc: <stable@vger.kernel.org>
Fixes: 87773dd56d ("IB: ib_umem_release() should decrement mm->pinned_vm from ib_umem_get")
Signed-off-by: Lidong Chen <lidongchen@tencent.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
"return" statement at the end of void function is redundant, removing
it.
Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Qing Huang <qing.huang@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Remove mr iova attribute because we don't want to pass up kernel pointers.
Fixes: fccec5b89a ("RDMA/nldev: provide detailed MR information")
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Several items of conflict have arisen between the RDMA stack's for-rc
branch and upcoming for-next work:
9fd4350ba8 ("IB/rxe: avoid double kfree_skb") directly conflicts with
2e47350789 ("IB/rxe: optimize the function duplicate_request")
Patches already submitted by Intel for the hfi1 driver will fail to
apply cleanly without this merge
Other people on the mailing list have notified that their upcoming
patches also fail to apply cleanly without this merge
Signed-off-by: Doug Ledford <dledford@redhat.com>
_gid_table_setup_one() only performs GID table cache memory allocation,
marks entries as invalid (free) and marks the reserved entries.
At this point GID table is empty and no entries are added.
On dual port device if _gid_table_setup_one() fails to allocate the gid
table for 2nd port, there is no need to perform cleanup_gid_table_port()
to delete GID entries, as GID table is empty.
Therefore make use of existing gid_table_release_one() routine which
frees the GID table memory and avoid code duplication.
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
gid_table_reserve_default() always returns zero. Make it return void and
simplify error checking.
rdma_port is already calculated, use that while calling
gid_table_reserve_default() instead of recalculating it.
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
These help rdma drivers to fill out the driver entries.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Each driver can register a "fill entry" function with the restrack core.
This function will be called when filling out a resource, allowing the
driver to add driver-specific details. The details consist of a
nltable of nested attributes, that are in the form of <key, [print-type],
value> tuples. Both key and value attributes are mandatory. The key
nlattr must be a string, and the value nlattr can be one of the driver
attributes that are generic, but typed, allowing the attributes to be
validated. Currently the driver nlattr types include string, s32,
u32, s64, and u64. The print-type nlattr allows a driver to specify
an alternative display format for user tools displaying the attribute.
For example, a u32 attribute will default to "%u", but a print-type
attribute can be included for it to be displayed in hex. This allows
the user tool to print the number in the format desired by the driver.
More attrs can be defined as they become needed by drivers.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add a specific RDMA_NLDEV_ATTR_PAD attribute to be used for 64b
attribute padding. To preserve the ABI, make this attribute equal to
RDMA_NLDEV_ATTR_UNSPEC, which has a value of 0, because that has been
used up until now as the pad attribute.
Change all the previous use of 0 as the pad with this
new enum.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When commit [1] was added, SGID was queried to derive the SMAC address.
Then, later on during a refactor [2], SMAC was no longer needed. However,
the now useless GID query remained. Then during additional code changes
later on, the GID query was being done in such a way that it caused iWARP
queries to start breaking. Remove the useless GID query and resolve the
iWARP breakage at the same time.
This is discussed in [3].
[1] commit dd5f03beb4 ("IB/core: Ethernet L2 attributes in verbs/cm structures")
[2] commit 5c266b2304 ("IB/cm: Remove the usage of smac and vid of qp_attr and cm_av")
[3] https://www.spinics.net/lists/linux-rdma/msg63951.html
Suggested-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Currently, the kernel protects access to the agent ID allocator on a per
port basis using a spinlock, so it is impossible for two apps/threads on
the same port to get the same TID, but it is entirely possible for two
threads on different ports to end up with the same TID.
As this can be confusing (regardless of it being legal according to the
IB Spec 1.3, C13-18.1.1, in section 13.4.6.4 - TransactionID usage),
and as the rdma-core user space API for /dev/umad devices implies unique
TIDs even across ports, make the TID an atomic type so that no two
allocations, regardless of port number, will be the same.
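A small C11 sketch of the idea, assuming a single process-wide counter: agent_id_alloc() is a hypothetical name, and the sketch uses <stdatomic.h> rather than the kernel's atomic_t helpers:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* One counter shared by every port, instead of one counter per port. */
static _Atomic uint32_t next_agent_id;

/*
 * Hypothetical allocator: each call returns a value that no other caller
 * gets (until the 32-bit counter wraps), whatever port it is called for.
 */
static uint32_t agent_id_alloc(unsigned int port)
{
	assert(port >= 1);	/* IB port numbers start at 1 */
	(void)port;		/* the port no longer influences the ID */
	return atomic_fetch_add(&next_agent_id, 1) + 1;
}
```

With a per-port lock, two ports could each hand out the same value; a single atomic increment gives uniqueness across ports without needing a lock at all.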
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch fixes a crash that happens due to access to an
uninitialized DM pointer within the MR object.
The change makes sure the DM pointer in the MR object is set to
NULL during a non-DM MR creation to prevent a false indication
that this MR is related to a DM in the dereg flow.
Fixes: be934cca9e ("IB/uverbs: Add device memory registration ioctl support")
Reported-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch adds a check in the ib_uverbs_rereg_mr flow to make
sure there's no attempt to rereg a device memory MR to regular MR.
In such case the command will fail with -EINVAL status.
Fixes: be934cca9e ("IB/uverbs: Add device memory registration ioctl support")
Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In the cases where iwpm_hash_bucket is NULL and where function
get_mapinfo_hash_bucket returns NULL then the map_info is never added
to hash_bucket_head and hence there is a leak of map_info. Fix this
by nullifying hash_bucket_head and if that is null we know that
that map_info was not added to hash_bucket_head and hence map_info
should be free'd.
Detected by CoverityScan, CID#1222481 ("Resource Leak")
Fixes: 30dc5e63d6 ("RDMA/core: Add support for iWARP Port Mapper user space service")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
There are a few issues with validation of netdevice and listen id lookup
for IB (IPoIB) while processing incoming CM request as below.
1. While performing lookup of bind_list in cma_ps_find(), net namespace
of the netdevice can get deleted in cma_exit_net(), resulting in use
after free access of idr and/or net namespace structures.
This lookup occurs from the workqueue context (and not userspace
context where net namespace is always valid).
CPU0                                CPU1
====                                ====
bind_list = cma_ps_find();
                                    move netdevice to new namespace
                                    delete net namespace
                                    cma_exit_net()
                                    idr_destroy(idr);
[..]
cma_find_listener(bind_list, ..);
2. While netdevice is validated for IP address in given net namespace,
netdevice's net namespace and/or ifindex can change in
cma_get_net_dev() and cma_match_net_dev().
Above issues are overcome by using rcu lock along with netdevice
UP/DOWN state as described below.
When a net namespace is getting deleted, netdevice is closed and
shutdown before moving it back to init_net namespace.
change_net_namespace() synchronizes with any existing use of netdevice
before changing the netdev properties such as net or ifindex.
Once the netdevice IFF_UP flag is cleared, such fields are not guaranteed
to be valid.
Therefore, rcu lock along with netdevice state check ensures that,
while route lookup and cm_id lookup is in progress, netdevice of
interest won't migrate to any other net namespace.
This ensures that associated net namespace of netdevice won't get
deleted while rcu lock is held for netdevice which is in IFF_UP state.
Fixes: fa20105e09 ("IB/cma: Add support for network namespaces")
Fixes: 4be74b42a6 ("IB/cma: Separate port allocation to network namespaces")
Fixes: f887f2ac87 ("IB/cma: Validate routing of incoming requests")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Previously, if a method contained mandatory attributes in a namespace
that wasn't given by the user, these attributes weren't validated.
Fix this by iterating over all specification namespaces.
Fixes: fac9658cab ("IB/core: Add new ioctl interface")
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Before [1], when MAC address of the netdevice is changed, default GID is
supposed to get deleted and added back which affects the node and/or port
GUID in below sequence.
netdevice_event()
 -> NETDEV_CHANGEADDR
   default_del_cmd()
     del_netdev_default_ips()
       bond_delete_netdev_default_gids()
         ib_cache_gid_set_default_gid()
           ib_cache_gid_del()
   add_cmd()
   [..]
However, ib_cache_gid_del() was not getting invoked in non bonding
scenarios because event_ndev and rdma_ndev are same.
Therefore, fix such condition to ignore checking upper device when event
ndev and rdma_ndev are same; similar to bond_set_netdev_default_gids().
With this fix ib_cache_gid_del() is invoked correctly; however
ib_cache_gid_del() doesn't find the default GID for deletion because
find_gid() was given default_gid = false with
GID_ATTR_FIND_MASK_DEFAULT set.
But it was getting overwritten by ib_cache_gid_set_default_gid() later
on as part of add_cmd().
Therefore, mac address change used to work for default GID.
With refactor series [1], this incorrect behavior is detected.
Therefore,
when deleting default GID, set default_gid and set MASK flag.
when deleting IP based GID, clear default_gid and set MASK flag.
[1] https://patchwork.kernel.org/patch/10319151/
Fixes: 238fdf48f2 ("IB/core: Add RoCE table bonding support")
Fixes: 598ff6bae6 ("IB/core: Refactor GID modify code for RoCE")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When IPv6 link local address is removed, if it matches with the default
GID, the default GID(s) get removed which may not be a desired behavior.
This behavior is introduced by refactor work in Fixes tag.
When IPv6 link address is removed, removing its equivalent RoCEv2 GID
which exactly matches with default RoCEv2 GID, is right thing to do.
However achieving it correctly requires lot more changes, likely in
roce_gid_mgmt.c and core/cache.c. This should be done as independent
patch.
Therefore, this patch preserves behavior of not deleteing default GIDs.
This is done by providing explicit hint to consider default GID property
using mask and default_gid; similar to add_gid().
Fixes: 598ff6bae6 ("IB/core: Refactor GID modify code for RoCE")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Default GIDs are marked reserved at the start of the GID table at index
0 and 1 by gid_table_reserve_default(). Currently, when a default GID is
requested, it can still allocate an empty slot which was not marked as
RESERVED for default GID, which is incorrect.
At least in current code flow of roce_gid_mgmt.c, in theory we can
still request to allocate more than one/two default GIDs depending
on how upper devices are setup.
Therefore, it is better for cache layer to only allow our reserved slots
to be used by default GID allocation requests.
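A minimal sketch of the slot selection rule, assuming a simplified table.
struct slot and find_free_slot() are hypothetical names. The rule that
non-default requests skip reserved slots is an assumption of this sketch;
the text above only states the rule for default GID requests:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct slot {
	bool reserved_default;	/* set once when the table is set up */
	bool in_use;
};

/*
 * Return a free slot index, or -1 if none is available.
 * A default GID request may only land in a reserved slot; in this sketch
 * any other request may only land in a non-reserved slot.
 */
static int find_free_slot(const struct slot *tbl, size_t n, bool want_default)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (tbl[i].in_use)
			continue;
		if (tbl[i].reserved_default != want_default)
			continue;
		return (int)i;
	}
	return -1;
}
```

Once both reserved slots are occupied, a further default GID request fails
instead of spilling into an ordinary free slot.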
Fixes: 598ff6bae6 ("IB/core: Refactor GID modify code for RoCE")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The RDMA CM will select a source device and address by consulting
the routing table if no source address is passed into
rdma_resolve_addr(). Userspace will ask for this by passing an
all-zero source address in the RESOLVE_IP command. Unfortunately
the new check for non-zero address size rejects this with EINVAL,
which breaks valid userspace applications.
Fix this by explicitly allowing a zero address family for the source.
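As a rough userspace sketch of the check, with hypothetical helper names
(addr_size_sketch() and resolve_args_ok() are not the ucma functions): a
source whose family is AF_UNSPEC, which is what an all-zero sockaddr carries
on Linux, is accepted, while any other source must belong to a known family.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Size of the address for a handled family, 0 if the family is unknown. */
static size_t addr_size_sketch(const struct sockaddr *sa)
{
	switch (sa->sa_family) {
	case AF_INET:
		return sizeof(struct sockaddr_in);
	case AF_INET6:
		return sizeof(struct sockaddr_in6);
	default:
		return 0;
	}
}

static bool resolve_args_ok(const struct sockaddr *src,
			    const struct sockaddr *dst)
{
	/* An AF_UNSPEC source means: let the routing table choose. */
	if (src->sa_family != AF_UNSPEC && !addr_size_sketch(src))
		return false;
	/* The destination must always have a known, non-zero size. */
	return addr_size_sketch(dst) != 0;
}
```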
Fixes: 2975d5de64 ("RDMA/ucma: Check AF family prior resolving address")
Cc: <stable@vger.kernel.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This is done by auditing all callers of ucma_get_ctx and switching the
ones that unconditionally touch ->device to ucma_get_ctx_dev. This covers
a little less than half of the call sites.
The 11 remaining call sites to ucma_get_ctx() were manually audited.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
With gcc-4.1.2:
drivers/infiniband/core/uverbs_std_types_flow_action.c:366: error: unknown field ‘ptr’ specified in initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:367: error: unknown field ‘type’ specified in initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:367: warning: missing braces around initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:367: warning: (near initialization for ‘uverbs_flow_action_esp_keymat[0].<anonymous>.<anonymous>’)
drivers/infiniband/core/uverbs_std_types_flow_action.c:368: error: unknown field ‘min_len’ specified in initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:368: warning: excess elements in union initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:368: warning: (near initialization for ‘uverbs_flow_action_esp_keymat[0].<anonymous>’)
drivers/infiniband/core/uverbs_std_types_flow_action.c:368: error: unknown field ‘len’ specified in initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:368: warning: excess elements in union initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:368: warning: (near initialization for ‘uverbs_flow_action_esp_keymat[0].<anonymous>’)
drivers/infiniband/core/uverbs_std_types_flow_action.c:369: error: unknown field ‘flags’ specified in initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:369: warning: excess elements in union initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:369: warning: (near initialization for ‘uverbs_flow_action_esp_keymat[0].<anonymous>’)
drivers/infiniband/core/uverbs_std_types_flow_action.c:376: error: unknown field ‘ptr’ specified in initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:377: error: unknown field ‘type’ specified in initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:377: warning: missing braces around initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:377: warning: (near initialization for ‘uverbs_flow_action_esp_replay[0].<anonymous>.<anonymous>’)
drivers/infiniband/core/uverbs_std_types_flow_action.c:379: error: unknown field ‘len’ specified in initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:379: warning: excess elements in union initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:379: warning: (near initialization for ‘uverbs_flow_action_esp_replay[0].<anonymous>’)
drivers/infiniband/core/uverbs_std_types_flow_action.c:383: error: unknown field ‘ptr’ specified in initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:384: error: unknown field ‘type’ specified in initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:385: error: unknown field ‘min_len’ specified in initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:385: warning: excess elements in union initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:385: warning: (near initialization for ‘uverbs_flow_action_esp_replay[1].<anonymous>’)
drivers/infiniband/core/uverbs_std_types_flow_action.c:385: error: unknown field ‘len’ specified in initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:385: warning: excess elements in union initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:385: warning: (near initialization for ‘uverbs_flow_action_esp_replay[1].<anonymous>’)
drivers/infiniband/core/uverbs_std_types_flow_action.c:386: error: unknown field ‘flags’ specified in initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:386: warning: excess elements in union initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:386: warning: (near initialization for ‘uverbs_flow_action_esp_replay[1].<anonymous>’)
Add the missing braces to fix this.
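The shape of the fix can be illustrated with a stand-in structure;
struct attr_spec_sketch and its fields are hypothetical and much simpler than
the real uverbs attribute spec. As the errors above suggest, the old compiler
does not look inside an anonymous union when resolving a designator, so the
union is opened with an explicit pair of braces:

```c
#include <assert.h>

/* Hypothetical stand-in for an attribute spec with an anonymous union. */
struct attr_spec_sketch {
	union {
		struct {
			int type;
			unsigned short len;
			unsigned short min_len;
		} ptr;
		struct {
			int type;
			int obj_type;
		} obj;
	};
};

/*
 * The inner pair of braces opens the anonymous union positionally, so the
 * designator .ptr is looked up inside the union rather than directly in
 * struct attr_spec_sketch.
 */
static const struct attr_spec_sketch keymat_sketch[] = {
	[0] = { { .ptr = { .type = 1, .len = 32, .min_len = 16 } } },
};
```

Newer compilers accept both the braced and the unbraced form, so the extra
braces cost nothing there.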
Fixes: 2eb9beaee5 ("IB/uverbs: Add flow_action create and destroy verbs")
Fixes: 7d12f8d5a1 ("IB/uverbs: Add modify ESP flow_action")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The only thing it does is block module unload while work is posted from
rdma_resolve_ip().
However, this is not the right place to do this. The users of
rdma_resolve_ip() must ensure their own module does not unload until
rdma_resolve_ip() calls the callback, or until rdma_addr_cancel() is
called.
Similarly callers to rdma_addr_find_l2_eth_by_grh() must ensure their
module does not unload while they are calling code.
The only two users are already safe, so there is no need for this.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Currently rdma_addr_cancel() does not prevent the callback from being used,
which is surprising and hard to reason about. There does not appear to be a
bug here, as the only user of this API does refcount properly; fix it
only to increase clarity.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Now that the work queue is used directly to launch and track the work
there is no need for the second processing function to do 'all list
entries'. Just schedule all entries onto the main work queue directly.
We can also drop all of the useless list sorting now, as the workqueue
sorts by expiration time automatically.
This change requires switching the lock to a spinlock, as netdev notifiers
are called in an atomic context. This is now easy since the lock does
not need to be held across the lookup, which is already single
threaded due to the work queue.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Validating input parameters should be done before getting the cm_id
otherwise it can leak a cm_id reference.
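The ordering can be sketched with a toy reference count. ctx_get(),
ctx_put(), set_option_sketch() and the size limit are hypothetical and only
model the pattern, not the ucma code:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_OPTLEN_SKETCH 128	/* hypothetical limit */

struct ctx_sketch {
	int refs;
};

static struct ctx_sketch *ctx_get(struct ctx_sketch *c)
{
	c->refs++;
	return c;
}

static void ctx_put(struct ctx_sketch *c)
{
	c->refs--;
}

static int set_option_sketch(struct ctx_sketch *c, size_t optlen)
{
	struct ctx_sketch *ctx;

	/* Reject bad input first: nothing has been acquired yet. */
	if (optlen > MAX_OPTLEN_SKETCH)
		return -EINVAL;

	ctx = ctx_get(c);
	/* ... apply the option while holding the reference ... */
	ctx_put(ctx);
	return 0;
}
```

Had the length check come after ctx_get(), the early return would skip
ctx_put() and leave the count one too high.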
Fixes: 6a21dfc0d0 ("RDMA/ucma: Limit possible option size")
Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
- Fix RDMA uapi headers to actually compile in userspace and be more
complete
- Three shared with netdev pull requests from Mellanox:
* 7 patches, mostly to net with 1 IB related one at the back. This
series addresses an IRQ performance issue (patch 1), cleanups related to
the fix for the IRQ performance problem (patches 2-6), and then extends
the fragmented completion queue support that already exists in the net
side of the driver to the ib side of the driver (patch 7).
* Mostly IB, with 5 patches to net that are needed to support the remaining
10 patches to the IB subsystem. This series extends the current
'representor' framework when the mlx5 driver is in switchdev mode from
being a netdev only construct to being a netdev/IB dev construct. The IB
dev is limited to raw Eth queue pairs only, but by having an IB dev of
this type attached to the representor for a switchdev port, it enables
DPDK to work on the switchdev device.
* All net related, but needed as infrastructure for the rdma driver
- Updates for the hns, i40iw, bnxt_re, cxgb3 and cxgb4 drivers
- SRP performance updates
- IB uverbs write path cleanup patch series from Leon
- Add RDMA_CM support to ib_srpt. This is disabled by default. Users need to
set the port for ib_srpt to listen on in configfs in order for it to be
enabled (/sys/kernel/config/target/srpt/discovery_auth/rdma_cm_port)
- TSO and Scatter FCS support in mlx4
- Refactor of modify_qp routine to resolve problems seen while working on new
code that is forthcoming
- More refactoring and updates of RDMA CM for containers support from Parav
- mlx5 'fine grained packet pacing', 'ipsec offload' and 'device memory'
user API features
- Infrastructure updates for the new IOCTL interface, based on increased usage
- ABI compatibility bug fixes to fully support 32 bit userspace on 64 bit
kernel as was originally intended. See the commit messages for
extensive details
- Syzkaller bugs and code cleanups motivated by them
Merge tag 'for-linus-unmerged' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"Doug and I are at a conference next week so if another PR is sent I
expect it to only be bug fixes. Parav noted yesterday that there are
some fringe case behavior changes in his work that he would like to
fix, and I see that Intel has a number of rc looking patches for HFI1
they posted yesterday.
Parav is again the biggest contributor by patch count with his ongoing
work to enable container support in the RDMA stack, followed by Leon
doing syzkaller inspired cleanups, though most of the actual fixing
went to RC.
There is one uncomfortable series here fixing the user ABI to actually
work as intended in 32 bit mode. There are lots of notes in the commit
messages, but the basic summary is we don't think there is an actual
32 bit kernel user of drivers/infiniband for several good reasons.
However we are seeing people want to use a 32 bit user space with 64
bit kernel, which didn't completely work today. So in fixing it we
required a 32 bit rxe user to upgrade their userspace. rxe users are
still already quite rare and we think a 32 bit one is non-existing.
- Fix RDMA uapi headers to actually compile in userspace and be more
complete
- Three shared with netdev pull requests from Mellanox:
* 7 patches, mostly to net with 1 IB related one at the back.
This series addresses an IRQ performance issue (patch 1),
cleanups related to the fix for the IRQ performance problem
(patches 2-6), and then extends the fragmented completion queue
support that already exists in the net side of the driver to the
ib side of the driver (patch 7).
* Mostly IB, with 5 patches to net that are needed to support the
remaining 10 patches to the IB subsystem. This series extends
the current 'representor' framework when the mlx5 driver is in
switchdev mode from being a netdev only construct to being a
netdev/IB dev construct. The IB dev is limited to raw Eth queue
pairs only, but by having an IB dev of this type attached to the
representor for a switchdev port, it enables DPDK to work on the
switchdev device.
* All net related, but needed as infrastructure for the rdma
driver
- Updates for the hns, i40iw, bnxt_re, cxgb3 and cxgb4 drivers
- SRP performance updates
- IB uverbs write path cleanup patch series from Leon
- Add RDMA_CM support to ib_srpt. This is disabled by default. Users
need to set the port for ib_srpt to listen on in configfs in order
for it to be enabled
(/sys/kernel/config/target/srpt/discovery_auth/rdma_cm_port)
- TSO and Scatter FCS support in mlx4
- Refactor of modify_qp routine to resolve problems seen while
working on new code that is forthcoming
- More refactoring and updates of RDMA CM for containers support from
Parav
- mlx5 'fine grained packet pacing', 'ipsec offload' and 'device
memory' user API features
- Infrastructure updates for the new IOCTL interface, based on
increased usage
- ABI compatibility bug fixes to fully support 32 bit userspace on 64
bit kernel as was originally intended. See the commit messages for
extensive details
- Syzkaller bugs and code cleanups motivated by them"
* tag 'for-linus-unmerged' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (199 commits)
IB/rxe: Fix for oops in rxe_register_device on ppc64le arch
IB/mlx5: Device memory mr registration support
net/mlx5: Mkey creation command adjustments
IB/mlx5: Device memory support in mlx5_ib
net/mlx5: Query device memory capabilities
IB/uverbs: Add device memory registration ioctl support
IB/uverbs: Add alloc/free dm uverbs ioctl support
IB/uverbs: Add device memory capabilities reporting
IB/uverbs: Expose device memory capabilities to user
RDMA/qedr: Fix wmb usage in qedr
IB/rxe: Removed GID add/del dummy routines
RDMA/qedr: Zero stack memory before copying to user space
IB/mlx5: Add ability to hash by IPSEC_SPI when creating a TIR
IB/mlx5: Add information for querying IPsec capabilities
IB/mlx5: Add IPsec support for egress and ingress
{net,IB}/mlx5: Add ipsec helper
IB/mlx5: Add modify_flow_action_esp verb
IB/mlx5: Add implementation for create and destroy action_xfrm
IB/uverbs: Introduce ESP steering match filter
IB/uverbs: Add modify ESP flow_action
...
Adding new ioctl method for the MR object - REG_DM_MR.
This command can be used by users to register an allocated
device memory buffer as an MR and receive lkey and rkey
to be used within work requests.
It is added as a new method under the MR object and using a new
ib_device callback - reg_dm_mr.
The command creates a standard ib_mr object which represents the
registered memory.
Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This change adds uverbs support for allocation/freeing
of device memory commands.
A new uverbs object is defined of type idr to represent
and track the new resource type allocation per context.
The API requires provider driver to implement 2 new ib_device
callbacks - one for allocation and one for deallocation which
return and accept (respectively) the ib_dm object which represents
the allocated memory on the device.
The support is added via the ioctl command infrastructure
only.
Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This change allows vendors to report the device memory capability,
max_dm_size, to the user via a uverbs command.
Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Adding a new ESP steering match filter that could match against
spi and seq used in IPSec protocol.
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
flow_actions of ESP type could be modified during runtime. This could be
common for example when ESN should be changed. Adding a new
UVERBS_FLOW_ACTION_ESP_MODIFY method for changing ESP parameters of an
existing ESP flow_action.
The new method uses the UVERBS_FLOW_ACTION_ESP_CREATE attributes, but
adds a new IB_FLOW_ACTION_ESP_FLAGS_MOD_ESP_ATTRS which means ESP_ATTRS
should be changed.
In addition, we add a new FLOW_ACTION_ESP_REPLAY_NONE replay type that
could be used when one wants to disable a replay protection over a
specific flow_action.
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Binding a flow_action to flow steering rule requires using a new
specification. Therefore, adding such an IB_FLOW_SPEC_ACTION_HANDLE flow
specification.
Flow steering rules could use flow_action(s), and because of that we need
to avoid deleting flow_action(s) as long as they're being used.
Moreover, when the attached rules are deleted, the action_handle reference
count should be decremented. Introduce a new mechanism of flow
resources to keep track of the attached action_handle(s). Later on, this
mechanism should be extended to other attached flow steering resources
like flow counters.
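A toy model of such tracking, with hypothetical types and function names
(struct flow_sketch, flow_attach_action() and friends are illustrative, not
the uverbs structures): each rule remembers the actions it attached, bumps
their use count, and drops it again when the rule is destroyed.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_ACTIONS_SKETCH 4

struct action_sketch {
	int usecnt;	/* number of flow rules currently using the action */
};

struct flow_sketch {
	struct action_sketch *actions[MAX_ACTIONS_SKETCH];
	size_t num_actions;
};

/* Record the action in the rule's resources and take a reference. */
static int flow_attach_action(struct flow_sketch *f, struct action_sketch *a)
{
	if (f->num_actions == MAX_ACTIONS_SKETCH)
		return -ENOSPC;
	a->usecnt++;
	f->actions[f->num_actions++] = a;
	return 0;
}

/* Destroying the rule releases every action it was holding. */
static void flow_destroy(struct flow_sketch *f)
{
	while (f->num_actions)
		f->actions[--f->num_actions]->usecnt--;
}

/* An action can only go away once no rule uses it. */
static int action_destroy(struct action_sketch *a)
{
	return a->usecnt ? -EBUSY : 0;
}
```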
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
A verbs application may receive and transmit packets using a data
path pipeline. Sometimes, the first stage in the receive pipeline or
the last stage in the transmit pipeline involves transforming a
packet, either in order to make it easier for later stages to process
it or to prepare it for transmission over the wire. Such transformation
could be stripping/encapsulating the packet (e.g. vxlan),
decrypting/encrypting it (e.g. ipsec), altering headers, doing some
complex FPGA changes, etc.
Some hardware could do such transformations without software data path
intervention at all. The flow steering API supports steering a
packet (either to a QP or dropping it) and some simple packet
immutable actions (e.g. tagging a packet). Complex actions, that may
change the packet, could bloat the flow steering API extensively.
Sometimes the same action should be applied to several flows.
In this case, it's easier to bind several flows to the same action and
modify it than change all matching flows.
Introducing a new flow_action object that abstracts any packet
transformation (out of a standard and well defined set of actions).
This flow_action object could be tied to a flow steering rule via a
new specification.
Currently, we support the esp flow_action, which encrypts or decrypts a
packet according to the given parameters. However, we present a
flexible schema that could be used for other transformation actions tied
to flow rules.
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The current implementation of kern_spec_to_ib_spec_filter, which takes
a uAPI based flow steering specification and creates the respective kernel
API flow steering structure, gets an ib_uverbs_flow_spec structure.
The new flow_action uAPI gets a match mask and filter from user-space
which aren't encoded in the flow steering's ib_uverbs_flow_spec structure.
Export the logic out of kern_spec_to_ib_spec_filter so that it takes
user-space blobs rather than an ib_uverbs_flow_spec structure.
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Methods sometimes need to get one attribute out of a group of
pre-defined attributes. This is an enum-like behavior. Since
this is a common requirement, we add a new ENUM attribute to the
generic uverbs ioctl() layer. This attribute is embedded in methods,
like any other attributes we currently have. ENUM attributes point to
an array of standard UVERBS_ATTR_PTR_IN. The user-space encodes the
enum's attribute id in the id field and the internal PTR_IN attr id in
the enum_data.elem_id field. This ENUM attribute could be shared by
several attributes and it can get UVERBS_ATTR_SPEC_F_MANDATORY flag,
stating this attribute must be supported by the kernel, like any other
attribute.
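The two-level addressing can be sketched as follows. The structures and
enum_lookup() are hypothetical simplifications and do not reproduce the
uverbs ioctl layout. The outer id selects the ENUM attribute and elem_id
selects one PTR_IN spec from its array:

```c
#include <assert.h>
#include <stddef.h>

struct ptr_in_sketch {
	unsigned short min_len;
};

struct enum_attr_sketch {
	unsigned short id;			/* id of the ENUM attribute */
	const struct ptr_in_sketch *elems;	/* inner PTR_IN specs */
	size_t num_elems;
};

/* Resolve (id, elem_id) to the inner spec, or NULL if either is unknown. */
static const struct ptr_in_sketch *
enum_lookup(const struct enum_attr_sketch *attrs, size_t n,
	    unsigned short id, unsigned short elem_id)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (attrs[i].id != id)
			continue;
		if (elem_id >= attrs[i].num_elems)
			return NULL;
		return &attrs[i].elems[elem_id];
	}
	return NULL;
}
```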
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Now that ib_gid_attr contains device, port and index, simplify the
provider APIs add_gid() and del_gid() to use device, port and index
fields from the ib_gid_attr attributes structure.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Code is refactored to prepare separate functions for RoCE which can do more
complex operations related to reference counting, while still
maintaining code readability. This includes
(a) Simplification to not perform netdevice checks and modifications
for IB link layer.
(b) Do not add RoCE GID entry which has NULL netdevice; instead return
an error.
(c) If GID addition fails at provider level add_gid(), do not add the
entry in the cache and keep the entry marked as INVALID.
(d) Simplify and reuse the ib_cache_gid_add()/del() routines so that they
can be used even for modifying default GIDs. This avoids some code
duplication in modifying default GIDs.
(e) find_gid() routine refers to the data entry flags to qualify a GID
as valid or invalid GID rather than depending on attributes and zeroness
of the GID content.
(f) gid_table_reserve_default() sets the GID default attribute at the
beginning while setting up the GID table. There is no need to use the
default_gid flag in low level functions such as write_gid(), add_gid(),
del_gid(), as they never need to update the DEFAULT property of the GID
entry during a GID table update.
As a result of this refactor, reserved GID 0:0:0:0:0:0:0:0 is no longer
searchable as described below.
A unicast GID entry of 0:0:0:0:0:0:0:0 is Reserved GID as per the IB
spec version 1.3 section 4.1.1, point (6) whose snippet is below.
"The unicast GID address 0:0:0:0:0:0:0:0 is reserved - referred to as
the Reserved GID. It shall never be assigned to any endport. It shall
not be used as a destination address or in a global routing header
(GRH)."
GID table cache now only stores valid GID entries. Before this patch,
Reserved GID 0:0:0:0:0:0:0:0 was searchable in the GID table using
ib_find_cached_gid_by_port() and other similar find routines.
Zero GID is no longer searchable as it shall not be present in a GRH or
path record entry as described in IB spec version 1.3 section 4.1.1,
point (6), section 12.7.10 and section 12.7.20.
ib_cache_update() is simplified to check the link layer once, use a
unified locking scheme for all link layers, and drop the temporary gid
table allocation/free logic.
Additionally,
(a) Expand ib_gid_attr to store port and index so that GID query
routines can get port and index information from the attribute structure.
(b) Expand ib_gid_attr to store device as well so that in future code when
GID reference counting is done, device is used to reach back to the GID
table entry.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Currently the following inconsistencies exist.
1. ib_query_gid() returns GID from the software cache for a RoCE port
and returns GID from the HCA for an IB port.
This is incorrect because software GID cache is maintained regardless
of HCA port type.
2. GID is queried from the HCA via ib_query_gid() and updated in the
software cache for IB link layer. Both of them might not be in sync.
ULPs such as SRP initiator, SRP target, IPoIB driver have historically
used ib_query_gid() API to query the GID. However, CM used the cached
version during CM processing. When the software cache was introduced, this
inconsistency remained.
In order to simplify, improve readability and avoid the link layer
specific inconsistencies above, this patch brings the following changes.
1. ib_query_gid() always refers to the cache layer regardless of link
layer.
2. The cache module, which reads the GID entry from the HCA and builds the
cache, directly invokes the HCA provider verb's query_gid() callback
function.
3. ib_query_port() is called at an early stage, where the GID cache is not
yet built, while reading the port immutable property. Therefore it needs to
read the default GID from the HCA for IB link layer to publish the
subnet prefix.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
ib_query_gid() fetches the GID from the software cache maintained in
ib_core for RoCE ports.
Therefore, simplify the provider drivers for RoCE to treat query_gid()
callback as never called for RoCE, and only require non-RoCE devices to
implement it.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Check to make sure that ctx->cm_id->device is set before we use it.
Otherwise userspace can trigger a NULL dereference by doing
RDMA_USER_CM_CMD_SET_OPTION on an ID that is not bound to a device.
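A minimal sketch of the guard, using stand-in structures; the *_sketch
types, the max_opt field and the returned error codes are assumptions of
this example and are not taken from ucma.c. The device pointer is tested
before anything is read through it:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct device_sketch {
	int max_opt;
};

struct cm_id_sketch {
	struct device_sketch *device;	/* NULL until the ID is bound */
};

struct ucontext_sketch {
	struct cm_id_sketch *cm_id;
};

static int set_option_checked(struct ucontext_sketch *ctx, int opt)
{
	/* An unbound ID has no device; refuse instead of dereferencing NULL. */
	if (!ctx->cm_id->device)
		return -EINVAL;

	return opt <= ctx->cm_id->device->max_opt ? 0 : -ENOSYS;
}
```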
Cc: <stable@vger.kernel.org>
Reported-by: <syzbot+a67bc93e14682d92fc2f@syzkaller.appspotmail.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Minor conflicts in drivers/net/ethernet/mellanox/mlx5/core/en_rep.c,
we had some overlapping changes:
1) In 'net' MLX5E_PARAMS_LOG_{SQ,RQ}_SIZE -->
MLX5E_REP_PARAMS_LOG_{SQ,RQ}_SIZE
2) In 'net-next' params->log_rq_size is renamed to be
params->log_rq_mtu_frames.
3) In 'net-next' params->hard_mtu is added.
Signed-off-by: David S. Miller <davem@davemloft.net>
The rdma_cm_state enum is internal to the rdma_cm kernel module.
It is not required to expose state enums to ULP modules.
So let's keep its scope limited to the rdma_cm module in the cma_priv.h file.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Make the dst_entry pointer a const struct dst_entry * to improve code
readability and to make sure that dst structure fields are not modified by
the various functions which are using it.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This is already used in many places, get the rest of them too, only
to make the code a bit clearer & simpler.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Export the net device name and index to easily find connection
between IB devices and relevant net devices.
We also updated the comment regarding the devices without FW.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
rtnl_lock() is used everywhere, and contention is very high.
When someone wants to iterate over alive net namespaces,
there is no way to do that without the exclusive lock.
But the exclusive rtnl_lock() in such places is overkill,
and it just increases the contention. Yes, there is already
for_each_net_rcu() in the kernel, but it requires rcu_read_lock(),
so the iteration can't sleep. Also, sometimes it may be necessary
to really prevent net_namespace_list growth, so for_each_net_rcu()
does not fit there.
This patch introduces a new rw_semaphore, which will be used
instead of rtnl_mutex to protect net_namespace_list. It is
sleepable and allows non-exclusive iteration over the net
namespaces list. It allows us to stop using rtnl_lock()
in several places (which is done in the next patches) and reduces
the time we hold rtnl_mutex. Here we just add the new lock,
while the explanations of why rtnl_lock() can be removed are
in the next patches.
Fine-grained locks are generally better than one big lock,
so let's do that with net_namespace_list while the situation
allows it.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the rdma_port_space enum is being passed between user and kernel for
user cm_id setup, we need it in a UAPI header. So add it to
rdma_user_cm.h.
This also fixes the cm_id restrack changes which pass up the port space
value via the RDMA_NLDEV_ATTR_RES_PS attribute.
Fixes: 00313983cd ("RDMA/nldev: provide detailed CM_ID information")
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
There are several places in the ucma ABI where userspace can pass in a
sockaddr but set the address family to AF_IB. When that happens,
rdma_addr_size() will return a size bigger than sizeof struct sockaddr_in6,
and the ucma kernel code might end up copying past the end of a buffer
not sized for a struct sockaddr_ib.
Fix this by introducing new variants
int rdma_addr_size_in6(struct sockaddr_in6 *addr);
int rdma_addr_size_kss(struct __kernel_sockaddr_storage *addr);
that are type-safe for the types used in the ucma ABI and return 0 if the
size computed is bigger than the size of the type passed in. We can use
these new variants to check what size userspace has passed in before
copying any addresses.
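The idea behind the type-safe variants can be sketched in userspace.
AF_IB_SKETCH, struct sockaddr_ib_sketch and the *_sketch functions are
stand-ins defined only for this example. The generic size is computed from
the family, and the typed wrapper returns 0 whenever that size would not fit
in the type it was handed:

```c
#include <assert.h>
#include <stddef.h>
#include <netinet/in.h>
#include <sys/socket.h>

#define AF_IB_SKETCH 27	/* stand-in family value for this sketch */

/* Stand-in for a sockaddr type that is larger than sockaddr_in6. */
struct sockaddr_ib_sketch {
	sa_family_t family;
	unsigned char data[46];
};

/* Size implied by the address family alone, 0 if the family is unknown. */
static size_t addr_size_any(const struct sockaddr *sa)
{
	switch (sa->sa_family) {
	case AF_INET:
		return sizeof(struct sockaddr_in);
	case AF_INET6:
		return sizeof(struct sockaddr_in6);
	case AF_IB_SKETCH:
		return sizeof(struct sockaddr_ib_sketch);
	default:
		return 0;
	}
}

/* Return 0 when the family implies more bytes than the buffer type holds. */
static size_t addr_size_in6_sketch(const struct sockaddr_in6 *addr)
{
	size_t len = addr_size_any((const struct sockaddr *)addr);

	return len <= sizeof(*addr) ? len : 0;
}
```

A caller that treats 0 as an error can then never be told to copy more bytes
than a struct sockaddr_in6 contains.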
Reported-by: <syzbot+6800425d54ed3ed8135d@syzkaller.appspotmail.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
IB core maintains the GID cache entries for the GID table.
This cache table has to be maintained regardless of HCA's
support of GID table.
For IB and iWarp ports, cache is created by querying the HCA.
For RoCE cache is created based on netdev events.
Therefore just refer to the RoCE port property of the {device, port} to
decide whether to build cache by querying HCA or from netdev events.
There is no need to check whether the HCA supports a GID table or not.
ib_cache_update() referred to RoCE attribute before validating
port. Though in all current callers port is valid, it is incorrect
to query RoCE port property before validating the port. Therefore,
rdma_protocol_roce() check is done after rdma_is_port_valid() verifies
that port is valid.
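The ordering of the two checks can be sketched like this; struct
ibdev_sketch, the 1-based port numbering and the return values are
assumptions of the example, not the ib_core definitions:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MAX_PORTS_SKETCH 4

struct ibdev_sketch {
	unsigned int num_ports;			/* ports are numbered from 1 */
	bool is_roce[MAX_PORTS_SKETCH];		/* indexed by port - 1 */
};

static bool port_valid_sketch(const struct ibdev_sketch *dev,
			      unsigned int port)
{
	return port >= 1 && port <= dev->num_ports;
}

/*
 * Returns 1 if the cache would be built by querying the HCA, 0 if it is
 * maintained from netdev events (RoCE), or -EINVAL for a bad port.
 */
static int cache_update_sketch(const struct ibdev_sketch *dev,
			       unsigned int port)
{
	if (!port_valid_sketch(dev, port))
		return -EINVAL;

	/* Only now is it safe to index the per-port properties. */
	return dev->is_roce[port - 1] ? 0 : 1;
}
```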
Fixes: 115b68aa6e ("IB/ocrdma: Removed GID add/del null routines")
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Even though the API is only used by the IPoIB driver, it is incorrect to
refer to the RoCE GID table property when searching for a GID.
Search for the GID only on ports with the IB link layer.
Fixes: dbb12562f7 ("IB/{core, ipoib}: Simplify ib_find_gid to search only for IB link layer")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
ib_find_gid_by_filter() searches GID with filter only for RoCE link
layer regardless of HCA's support for GID table.
Therefore, the right way to look up is to compare the RoCE port property
and not the GID table property.
Fixes: 99b27e3b5d ("IB/cache: Add ib_find_gid_by_filter cache API")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Due to following reasons, GID table event is generated regardless of GID
table property.
1. GID table cache is maintained at ib core layer regardless of link layer.
2. GID change event has no relation with IB link layer.
3. GID change event also doesn't depend on whether HCA supports GID table
or not.
Fixes: f3906bd360 ("IB/core: Refactor GID cache's ib_dispatch_event")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Due to below reasons, it is better to not support alternate path receive
messages for RoCE in near term.
1. Alternate path for RoCE is not supported at rdmacm layer.
2. It is not supported in uverbs/core layer for RoCE.
3. Alternate path for IPv6 for link local address cannot resolve route
deterministically without a valid incoming interface id, whose use case
makes sense only with dual port mode.
4. init_av_from_path while processing LAP messages for IB and RoCE can
lead to adding a duplicate AV entry into the port list, which leads to
list corruption.
5. rdma-core, a well known userspace implementation, has removed
support of libucm, which uses the ucm.ko module, the only module that
can trigger alternate path related messages.
6. ucm kernel module is requested to be removed from the IB core in
patch [1].
[1] https://patchwork.kernel.org/patch/10268503/
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Currently, access to the hardware stats buffer isn't protected; this can
result in multiple writes and reads at the same time to the same
memory location. This can lead to providing an incorrect value to
the user. Add a mutex to protect against it.
Fixes: b40f4757da ("IB/core: Make device counter infrastructure dynamic")
Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The rdma_ucm_event_resp is a different length on 32 and 64 bit compiles.
The kernel requires it to be the expected length or longer so 32 bit
builds running on a 64 bit kernel will not work.
Retain full compat by having all kernels accept a struct with or without
the trailing reserved field.
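A rough sketch of the length handling, assuming a hypothetical response layout (struct demo_event_resp and demo_copy_resp() are illustrative names, not the ucma definitions): the minimum accepted length stops before the trailing reserved field, and the copy is capped at the full structure size.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical response layout; only the trailing 'reserved' field matters. */
struct demo_event_resp {
	uint64_t uid;
	uint32_t id;
	uint32_t event;
	uint32_t status;
	uint32_t reserved;	/* trailing field missing from the shorter ABI */
};

/*
 * Copy the response into a caller buffer of 'out_len' bytes.
 * Returns the number of bytes written, or -1 if the buffer is too short
 * even for the layout without the trailing reserved field.
 */
static long demo_copy_resp(void *out, size_t out_len,
			   const struct demo_event_resp *resp)
{
	size_t min_len = offsetof(struct demo_event_resp, reserved);
	size_t len;

	if (out_len < min_len)
		return -1;

	/* Never write more than the caller asked for or than we have. */
	len = out_len < sizeof(*resp) ? out_len : sizeof(*resp);
	memcpy(out, resp, len);
	return (long)len;
}
```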
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Ensure that device exists prior to accessing its properties.
Reported-by: <syzbot+71655d44855ac3e76366@syzkaller.appspotmail.com>
Fixes: 7521663857 ("RDMA/cma: Export rdma cm interface to userspace")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Synchronous pernet_operations are not allowed anymore.
All are asynchronous. So, drop the structure member.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, a CM request for RoCE follows this flow:
rdma_create_id()
rdma_resolve_addr()
rdma_resolve_route()
For RC QPs:
rdma_connect()
->cma_connect_ib()
->ib_send_cm_req()
->cm_init_av_by_path()
->ib_init_ah_attr_from_path()
For UD QPs:
rdma_connect()
->cma_resolve_ib_udp()
->ib_send_cm_sidr_req()
->cm_init_av_by_path()
->ib_init_ah_attr_from_path()
In both the flows, route is already resolved before sending CM requests.
Therefore, code is refactored to avoid resolving route second time in
ib_cm layer.
ib_init_ah_attr_from_path() is extended to resolve route when it is not
yet resolved for RoCE link layer. This is achieved by caller setting
route_resolved field in path record whenever it has route already
resolved.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Fun set of conflict resolutions here...
For the mac80211 stuff, these were fortunately just parallel
adds. Trivially resolved.
In drivers/net/phy/phy.c we had a bug fix in 'net' that moved the
function phy_disable_interrupts() earlier in the file, whilst in
'net-next' the phy_error() call from this function was removed.
In net/ipv4/xfrm4_policy.c, David Ahern's changes to remove the
'rt_table_id' member of rtable collided with a bug fix in 'net' that
added a new struct member "rt_mtu_locked" which needs to be copied
over here.
The mlxsw driver conflict consisted of net-next separating
the span code and definitions into separate files, whilst
a 'net' bug fix made some changes to that moved code.
The mlx5 infiniband conflict resolution was quite non-trivial,
the RDMA tree's merge commit was used as a guide here, and
here are their notes:
====================
Due to bug fixes found by the syzkaller bot and taken into the for-rc
branch after development for the 4.17 merge window had already started
being taken into the for-next branch, there were fairly non-trivial
merge issues that would need to be resolved between the for-rc branch
and the for-next branch. This merge resolves those conflicts and
provides a unified base upon which ongoing development for 4.17 can
be based.
Conflicts:
drivers/infiniband/hw/mlx5/main.c - Commit 42cea83f95
(IB/mlx5: Fix cleanup order on unload) added to for-rc and
commit b5ca15ad7e (IB/mlx5: Add proper representors support)
added as part of the devel cycle both needed to modify the
init/de-init functions used by mlx5. To support the new
representors, the new functions added by the cleanup patch
needed to be made non-static, and the init/de-init list
added by the representors patch needed to be modified to
match the init/de-init list changes made by the cleanup
patch.
Updates:
drivers/infiniband/hw/mlx5/mlx5_ib.h - Update function
prototypes added by representors patch to reflect new function
names as changed by cleanup patch
drivers/infiniband/hw/mlx5/ib_rep.c - Update init/de-init
stage list to match new order from cleanup patch
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
ib_query_gid() in commit [1] refers to RoCE GID table capability of
the HCA using rdma_cap_roce_gid_table().
ib_core maintains the GID table cache regardless of the HCA provider
drivers capability to maintain RoCE GID table.
Therefore, whether to return a GID table entry from the software cache or
from HCA should be done based on whether the port is RoCE or not.
[1] commit 03db3a2d81 ("IB/core: Add RoCE GID table management")
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The restrack clean routine had a simple but powerful WARN_ON check
to see if all resources are cleared prior to releasing the device.
The WARN_ON check performed very well, but the lack of information
about which device caused the resource leak, the object type and its
origin made debugging fun and challenging at the same time.
The fact that all dumps were the same because restrack_clean() is
called in dealloc() didn't help either.
So let's fix the spelling error and convert WARN_ON to be more debug
friendly. The dmesg cut below gives an example of how the output
will look for the case fixed in patch [1]
[ 438.421372] restrack: ------------[ cut here ]------------
[ 438.423448] restrack: BUG: RESTRACK detected leak of resources on mlx5_2
[ 438.425600] restrack: Kernel PD object allocated by mlx5_ib is not freed
[ 438.427753] restrack: Kernel CQ object allocated by mlx5_ib is not freed
[ 438.429660] restrack: ------------[ cut here ]------------
[1] https://patchwork.kernel.org/patch/10298695/
Cc: Michal Kalderon <Michal.Kalderon@cavium.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The option size check is using optval instead of optlen
causing the set option call to fail. Use the correct
field, optlen, for size check.
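The distinction can be illustrated with a small sketch. The command layout and the DEMO_MAX_OPTLEN limit are assumptions made for the example; the point is only that the bound applies to the length field, while optval carries a user pointer whose numeric value is unrelated to the option size.

```c
#include <errno.h>
#include <stdint.h>

#define DEMO_MAX_OPTLEN 4096u	/* hypothetical upper bound on option size */

/* Hypothetical set-option command as passed in from userspace. */
struct demo_set_option {
	uint64_t optval;	/* user pointer to the option data */
	uint32_t optlen;	/* size of the option data in bytes */
};

static int demo_check_option(const struct demo_set_option *cmd)
{
	/*
	 * Comparing cmd->optval here would reject nearly every request,
	 * because a pointer value is almost always larger than the limit.
	 */
	if (cmd->optlen > DEMO_MAX_OPTLEN)
		return -EINVAL;

	return 0;
}
```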
Fixes: 6a21dfc0d0 ("RDMA/ucma: Limit possible option size")
Signed-off-by: Chien Tin Tung <chien.tin.tung@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Enable the ioctl() uAPI for IB by default if the standard write()
uAPI (INFINIBAND_USER_ACCESS) is enabled. Verbs that are
also available under the old write() uAPI are put inside a new
INFINIBAND_EXP_LEGACY_VERBS_NEW_UAPI Kconfig.
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Currently, all objects are declared in uverbs_std_types. This could lead
to a huge file once we implement all objects, methods and handlers.
Move each object to its own file to keep the files smaller and more
readable. uverbs_std_types.c will only contain the parsing tree
definition and objects without any methods.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The ioctl() based uverbs is based on merging feature trees. This teaches
the generic parser how to parse methods according to the provider's
support. In order to support merging with the common objects, export
the common-object-tree to the provider drivers.
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Previously, we've used UVERBS_ATTR_SPEC_F_MIN_SZ for extending existing
attributes. With this flag, the kernel accepted anything bigger than the
minimum size it specified. This is unsafe, since in order to safely
extend an attribute, we need to make sure the unknown part is zeroed.
Replace UVERBS_ATTR_SPEC_F_MIN_SZ with
UVERBS_ATTR_SPEC_F_MIN_SZ_OR_ZERO, which essentially checks that the
unknown part is zero. In addition, attributes are now decorated with
UVERBS_ATTR_TYPE and UVERBS_ATTR_STRUCT, so we can provide the minimum
and known length.
Users of this flag need to use the copy_from_or_zero functions/macros.
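The "minimum size or zero" rule can be sketched in plain C as below. This is an illustrative model only; demo_copy_from_or_zero() and its parameters are hypothetical and do not reproduce the uverbs macros.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if every byte in buf[0..len) is zero. */
static bool demo_all_zero(const unsigned char *buf, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if (buf[i])
			return false;
	return true;
}

/*
 * Copy a user attribute of 'user_len' bytes into a struct whose known
 * size is 'known_len'. Input shorter than 'min_len' is rejected, input
 * shorter than 'known_len' is zero-extended, and input longer than
 * 'known_len' is accepted only if the unknown tail is all zeros.
 * Returns 0 on success, -1 otherwise.
 */
static int demo_copy_from_or_zero(void *dst, size_t min_len, size_t known_len,
				  const void *src, size_t user_len)
{
	const unsigned char *in = src;

	if (user_len < min_len)
		return -1;
	if (user_len > known_len &&
	    !demo_all_zero(in + known_len, user_len - known_len))
		return -1;

	memset(dst, 0, known_len);
	memcpy(dst, in, user_len < known_len ? user_len : known_len);
	return 0;
}
```

With this rule, a newer userspace can pass a longer structure to an older kernel as long as it leaves the fields the kernel does not know about set to zero.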
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Downstream patches extend uverbs_attr_spec with new fields.
In order to save space, we move the type and flags fields to
the various attribute flavors contained in the union.
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Extend the uverbs_ioctl header with driver_id and another reserved
field. driver_id should be used in order to identify the driver.
Since every driver could have its own parsing tree, this is necessary
for strace support.
Downstream patches take off the EXPERIMENTAL flag from the ioctl() IB
support and thus we add some reserved fields for future usage.
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Use macros to make names consistent in ioctl() uAPI:
The ioctl() uAPI works with object-method hierarchy. The method part
also states which handler should be executed when this method is called
from user-space. Therefore, we need to tie method, method's id, method's
handler and the object owning this method together.
Previously, this was done through explicit developer chosen names.
This makes grepping the code harder. Change the method's name,
method's handler and object's name to be automatically generated based
on the ids.
The headers are split in a way so they can be included and used by
user-space. One header strictly contains structures that are used
directly by user-space applications, whereas another header is used by
the internal library (i.e. libibverbs) to form the ioctl() commands.
A third header simply contains the required general command structure.
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The error path in ucma_create_id() left ctx in the list of contexts
belonging to the ucma file descriptor. An attempt to close this file
descriptor causes use-after-free accesses while iterating over that list.
Fixes: 7521663857 ("RDMA/cma: Export rdma cm interface to userspace")
Reported-by: <syzbot+dcfd344365a56fbebd0f@syzkaller.appspotmail.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Use rdma_is_port_valid() which performs port validity check instead of
open coding the same check.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Before commit f1b65df5a2 ("IB/mlx5: Add support for active_width and
active_speed in RoCE"), the mlx5_ib driver set default active_width and
active_speed to IB_WIDTH_4X and IB_SPEED_QDR.
Now, the active_width and active_speed are zeros if the RoCE port
is in DOWN state. The speed string should be set to " SDR" instead of
a blank string when active_speed is zero.
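A minimal sketch of the fallback, assuming a reduced set of speed values (the demo_speed enum and demo_speed_str() are hypothetical and cover only three speeds): folding the zero case into the SDR/default branch guarantees a non-empty string.

```c
/* Hypothetical subset of speed encodings used only for this example. */
enum demo_speed {
	DEMO_SPEED_SDR = 1,
	DEMO_SPEED_DDR = 2,
	DEMO_SPEED_QDR = 4,
};

/* Map an active_speed value to the suffix printed after the rate. */
static const char *demo_speed_str(unsigned int active_speed)
{
	switch (active_speed) {
	case DEMO_SPEED_DDR:
		return " DDR";
	case DEMO_SPEED_QDR:
		return " QDR";
	case DEMO_SPEED_SDR:
	default:	/* includes 0, as reported while the port is DOWN */
		return " SDR";
	}
}
```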
Signed-off-by: Honggang Li <honli@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Before commit [1], rdma_addr_find_l2_eth_by_grh() was an exported function
and therefore declaration in include/rdma/ib_addr.h was fine.
But now that its scope is limited to the ib_core module, it's better to
have it in core_priv.h.
[1] commit 1060f86534 ("IB/{core/cm}: Fix generating a return AH for
RoCEE")
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Introduce and use helper function get_cm_port_from_path() to get
cm_port based on the path record entry.
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Resolving route for RoCE for a path record is needed only for the
received CM requests.
Therefore,
(a) ib_init_ah_attr_from_path() is refactored first to isolate the
code of resolving route.
(b) Setting dlid, path bits is not needed for RoCE.
Additionally ah attribute initialization is done from the path record
entry, so it is better to refer to path record entry type for
different link layer instead of ah attribute type while initializing
ah attribute itself.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Add and use helper function add_cm_id_to_port_list() to attach
cm_id to port list.
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
rdma_resolve_ip_route() is used only by ib_core module. Therefore it is
removed as an exported symbol.
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
rdma_protocol_roce() API from the ib_core already provides a way to
detect whether a given device+port is RoCE or not.
Therefore, make use of it and avoid implementing it again in rdmacm
module.
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
ah_attr contains the port number to which cm_id is bound. However, while
searching for GID table for matching GID entry, the port number is
ignored.
This could cause the wrong GID to be used when the ah_attr is converted to
an AH.
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The return status of ib_init_ah_from_mcmember() is ignored by
cma_ib_mc_handler(). Honor it and return error event if ah attribute
initialization failed.
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
ib_find_gid() is only used by IPoIB driver. For IB link layer, GID table
entries are not based on netdevice. Netdevice parameter is unused here.
Therefore, it is removed.
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Exported symbol's comments should be with function definition and not in
the header file. Therefore comments of ib_find_cached_gid() and
ib_find_cached_gid_by_port() functions are moved closer to their
definitions.
The function name in the comment is different from the actual function
name; fix it to be the same as ib_cache_gid_find_by_filter().
Also, the current comment section of ib_find_cached_gid_by_port() contains
the description of ib_find_cached_gid(); fix that as well.
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Due to bug fixes found by the syzkaller bot and taken into the for-rc
branch after development for the 4.17 merge window had already started
being taken into the for-next branch, there were fairly non-trivial
merge issues that would need to be resolved between the for-rc branch
and the for-next branch. This merge resolves those conflicts and
provides a unified base upon which ongoing development for 4.17 can
be based.
Conflicts:
drivers/infiniband/hw/mlx5/main.c - Commit 42cea83f95
(IB/mlx5: Fix cleanup order on unload) added to for-rc and
commit b5ca15ad7e (IB/mlx5: Add proper representors support)
added as part of the devel cycle both needed to modify the
init/de-init functions used by mlx5. To support the new
representors, the new functions added by the cleanup patch
needed to be made non-static, and the init/de-init list
added by the representors patch needed to be modified to
match the init/de-init list changes made by the cleanup
patch.
Updates:
drivers/infiniband/hw/mlx5/mlx5_ib.h - Update function
prototypes added by representors patch to reflect new function
names as changed by cleanup patch
drivers/infiniband/hw/mlx5/ib_rep.c - Update init/de-init
stage list to match new order from cleanup patch
Signed-off-by: Doug Ledford <dledford@redhat.com>
gcc-4.4.4 has issues with initialization of anonymous unions.
drivers/infiniband/core/verbs.c: In function '__ib_drain_sq':
drivers/infiniband/core/verbs.c:2204: error: unknown field 'wr_cqe' specified in initializer
drivers/infiniband/core/verbs.c:2204: warning: initialization makes integer from pointer without a cast
Work around this.
Fixes: a1ae7d0345 ("RDMA/core: Avoid that ib_drain_qp() triggers an out-of-bounds stack access")
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Steve Wise <swise@opengridcomputing.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
cma_port_is_unique() allows local port reuse if the quad (source
address and port, destination address and port) for this connection
is unique. However, if the destination info is zero or unspecified, it
can't make a correct decision but still allows port reuse. For example,
sometimes rdma_bind_addr() is called with unspecified destination and
reusing the port can lead to creating a connection with a duplicate quad,
after the destination is resolved. The issue manifests when MPI scale-up
tests hang after the duplicate quad is used.
Set the destination address family and add checks for zero destination
address and port to prevent source port reuse based on invalid destination.
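The decision can be modelled roughly as follows. This is a simplified IPv4-only sketch with hypothetical types (struct demo_endpoint, struct demo_id) and a hypothetical helper, not the cma_port_is_unique() logic itself: port reuse is refused whenever either destination is unspecified, because the quad cannot be compared in that case.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical IPv4-only endpoint and connection identifiers. */
struct demo_endpoint {
	uint32_t addr;
	uint16_t port;
};

struct demo_id {
	struct demo_endpoint src;
	struct demo_endpoint dst;
};

static bool demo_dst_unspecified(const struct demo_endpoint *dst)
{
	return dst->addr == 0 || dst->port == 0;
}

/* May 'new_id' share its source port with the existing 'cur_id'? */
static bool demo_port_reuse_ok(const struct demo_id *new_id,
			       const struct demo_id *cur_id)
{
	/* Without a real destination on both sides, refuse reuse. */
	if (demo_dst_unspecified(&new_id->dst) ||
	    demo_dst_unspecified(&cur_id->dst))
		return false;

	/* A different destination port or address makes the quad unique. */
	if (new_id->dst.port != cur_id->dst.port)
		return true;
	if (new_id->dst.addr != cur_id->dst.addr)
		return true;

	/* Same destination: only a different source address is unique. */
	return new_id->src.addr != cur_id->src.addr;
}
```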
Fixes: 19b752a19d ("IB/cma: Allow port reuse for rdma_id")
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
All callers of ib_modify_qp_is_ok() provide an enum ib_qp_state value,
which makes the out-of-range checks redundant. Let's remove them
together with updating the function signature to return a boolean result.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The QP state is an internal enum which is checked at the driver
level by calling ib_modify_qp_is_ok(). Move this check closer
to the user and leave kernel users to be checked by the compiler.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Implement RDMA nldev netlink interface to get detailed CM_ID information.
Because cm_id's are attached to rdma devices in various work queue
contexts, the pid and task information at restrack_add() time is sometimes
not useful. For example, an nvme/f host connection cm_id ends up being
bound to a device in a work queue context and the resulting pid at attach
time no longer exists after connection setup. So instead we mark all
cm_id's created via the rdma_ucm as "user", and all others as "kernel".
This required tweaking the restrack code a little. It also required
wrapping some rdma_cm functions to allow passing the module name string.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Move struct rdma_id_private to a new header cma_priv.h so the resource
tracking services in core/nldev.c can read useful information about cm_ids.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Create a common dumpit function that can be used by all common resource
types. This reduces code replication and simplifies the code as we add
more resource types.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Simplify res_to_dev() to make it easier to read/maintain.
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The QP state is limited and declared in enum ib_qp_state,
but ucma user was able to supply any possible (u32) value.
Reported-by: syzbot+0df1ab766f8924b1edba@syzkaller.appspotmail.com
Fixes: 7521663857 ("RDMA/cma: Export rdma cm interface to userspace")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Users of ucma are supposed to provide the size of the option.
In most paths it is supposed to be equal to u8 or u16, but
that is not the case for the IB path record, where it can be a
multiple of struct ib_path_rec_data.
This patch takes the simplest possible approach and prevents providing
values larger than can be allocated.
Reported-by: syzbot+a38b0e9f694c379ca7ce@syzkaller.appspotmail.com
Fixes: 7ce86409ad ("RDMA/ucma: Allow user space to set service type")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The returned resolved_dev might be NULL, as ifindex is a transient number.
Skipping the NULL check of resolved_dev might crash the kernel.
Therefore perform a NULL check before accessing resolved_dev.
Additionally, rdma_resolve_ip_route() invokes addr_resolve(), which
performs the check and address translation for loopback ifindex.
Checking it again in rdma_resolve_ip_route() is not helpful, so
the code is simplified to avoid the IFF_LOOPBACK check.
Fixes: 200298326b ("IB/core: Validate route when we init ah")
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Fix warning limit for kernel stack consumption:
drivers/infiniband/core/cq.c: In function 'ib_process_cq_direct':
drivers/infiniband/core/cq.c:78:1: error: the frame size of 1032 bytes
is larger than 1024 bytes [-Werror=frame-larger-than=]
Using smaller ib_wc array on the stack brings us comfortably below that
limit again.
Fixes: 246d8b184c ("IB/cq: Don't force IB_POLL_DIRECT poll context for ib_process_cq_direct")
Reported-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Sergey Gorenko <sergeygo@mellanox.com>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
IPv6 does path selection for multipath routes deep in the lookup
functions. The next patch adds L4 hash option and needs the skb
for the forward path. To get the skb to the relevant FIB lookup
functions it needs to go through the fib rules layer, so add a
lookup_data argument to the fib_lookup_arg struct.
Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Omit an extra message for a memory allocation failure in this function.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Maintaining the uobjects list is mandatory, hoist it into the common
rdma_alloc_commit_uobject() function and inline it as there is now
only one caller.
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
During the IB device registration process, if query_device() fails or if
ib_core fails to register sysfs entries, rdma cgroup cleanup is
skipped.
Cc: <stable@vger.kernel.org> # v4.2+
Fixes: 4be3a4fa51 ("IB/core: Fix kernel crash during fail to initialize device")
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
These pernet_operations just create and destroy IDR.
So, we mark them as async.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The proper return error is -EOPNOTSUPP and not -ENOSYS, so update
all places in verbs.c to match this semantics.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Simplify the code by directly checking the availability of extended
command flag instead of doing multiple shift operations.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Variables internal to the kernel don't need to be
declared with user types. This patch converts such occurrences
in ib_uverbs_write().
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Move all header validation logic to be performed before SRCU read lock.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The SRCU read lock protects the IB device pointer
and doesn't need to be called before copying user
provided header.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
There is no need to take the SRCU lock before checking
file->ucontext, so do that check before taking the lock.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The check based on index is not sufficient because
IB_USER_VERBS_EX_CMD_CREATE_CQ = IB_USER_VERBS_CMD_CREATE_CQ
and IB_USER_VERBS_CMD_CREATE_CQ <= IB_USER_VERBS_CMD_OPEN_QP,
so if we execute IB_USER_VERBS_EX_CMD_CREATE_CQ this code checks
ib_dev->uverbs_cmd_mask not ib_dev->uverbs_ex_cmd_mask.
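The fix amounts to selecting the mask from whether the command is extended rather than from its index. A rough sketch, using a hypothetical struct demo_dev and helper rather than the uverbs structures:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical device with separate support masks for the two families. */
struct demo_dev {
	uint64_t cmd_mask;	/* supported regular commands */
	uint64_t ex_cmd_mask;	/* supported extended commands */
};

/*
 * Regular and extended commands can share the same index, so the index
 * alone cannot tell which mask applies; the caller states the family.
 */
static bool demo_command_supported(const struct demo_dev *dev,
				   unsigned int command, bool extended)
{
	uint64_t mask = extended ? dev->ex_cmd_mask : dev->cmd_mask;

	if (command >= 64)
		return false;

	return (mask >> command) & 1;
}
```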
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Move all command header processing into separate function
and perform those checks before acquiring SRCU read lock.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The non-existing command is supposed to return -EOPNOTSUPP, but the
current code returns different errors for different flows for the
same failure. This patch unifies those flows.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Command that doesn't exist means that it is not supported,
so update code to return -EOPNOTSUPP in case of failure.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Fail as early as possible if not enough header data
was provided.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Since commit f21519b23c ("IB/core: extended command: an
improved infrastructure for uverbs commands"), the uverbs
supports extra flags as an input to the command interface.
However actually, there is only one flag available and used,
so it is better to refactor the code, so the resolution and
report to the users is done as early as possible.
As part of this change, we changed the return value of the failure case
from ENOSYS to EINVAL to be consistent with the rest of the flag checks.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Update sizeof() users to be consistent with coding style.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The function validate_command_mask() returns only two results: success
or failure, so convert it to return bool instead of 0 and -1.
Reported-by: Noa Osherovich <noaos@mellanox.com>
Reviewed-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
uaccess_kernel() isn't sufficient to determine if an rdma resource is
user-mode or not. For example, resources allocated in the add_one()
function of an ib_client get falsely labeled as user mode, when they
are kernel mode allocations, e.g. MAD QPs.
The result is that these qps are skipped over during a nldev query
because of an erroneous namespace mismatch.
So now we determine if the resource is user-mode by looking at the object
struct's uobject or similar pointer to know if it was allocated for user
mode applications.
Fixes: 02d8883f52 ("RDMA/restrack: Add general infrastructure to track RDMA resources")
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Update all the flows to ensure that function pointer exists prior
to accessing it.
This is much safer than checking the uverbs_ex_mask variable, especially
since we know that test isn't working properly and will be removed
in -next.
This prevents a user-triggerable oops.
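A minimal userspace sketch of the idea, assuming a hypothetical sparse
handler table (toy_cmd_table and toy_dispatch are illustrative names):
the handler pointer itself is tested before the call, and a missing
handler is reported as -EOPNOTSUPP.

```c
#include <errno.h>
#include <stddef.h>

typedef int (*toy_handler_t)(int arg);

static int toy_double(int arg)
{
	return arg * 2;
}

/* Sparse table: slot 1 has no handler and stays NULL. */
static const toy_handler_t toy_cmd_table[] = {
	[0] = toy_double,
	[2] = toy_double,
};

#define TOY_CMD_COUNT (sizeof(toy_cmd_table) / sizeof(toy_cmd_table[0]))

int toy_dispatch(unsigned int cmd, int arg)
{
	/* Check the range first, then that the pointer really exists. */
	if (cmd >= TOY_CMD_COUNT || !toy_cmd_table[cmd])
		return -EOPNOTSUPP;

	return toy_cmd_table[cmd](arg);
}
```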
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
There is no matching lock for this mutex. Git history suggests this is
just a missed remnant from an earlier version of the function before
this locking was moved into uverbs_free_xrcd.
Originally this lock was protecting the xrcd_table_delete().
=====================================
WARNING: bad unlock balance detected!
4.15.0+ #87 Not tainted
-------------------------------------
syzkaller223405/269 is trying to release lock (&uverbs_dev->xrcd_tree_mutex) at:
[<00000000b8703372>] ib_uverbs_close_xrcd+0x195/0x1f0
but there are no more locks to release!
other info that might help us debug this:
1 lock held by syzkaller223405/269:
#0: (&uverbs_dev->disassociate_srcu){....}, at: [<000000005af3b960>] ib_uverbs_write+0x265/0xef0
stack backtrace:
CPU: 0 PID: 269 Comm: syzkaller223405 Not tainted 4.15.0+ #87
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
Call Trace:
dump_stack+0xde/0x164
? dma_virt_map_sg+0x22c/0x22c
? ib_uverbs_write+0x265/0xef0
? console_unlock+0x502/0xbd0
? ib_uverbs_close_xrcd+0x195/0x1f0
print_unlock_imbalance_bug+0x131/0x160
lock_release+0x59d/0x1100
? ib_uverbs_close_xrcd+0x195/0x1f0
? lock_acquire+0x440/0x440
? lock_acquire+0x440/0x440
__mutex_unlock_slowpath+0x88/0x670
? wait_for_completion+0x4c0/0x4c0
? rdma_lookup_get_uobject+0x145/0x2f0
ib_uverbs_close_xrcd+0x195/0x1f0
? ib_uverbs_open_xrcd+0xdd0/0xdd0
ib_uverbs_write+0x7f9/0xef0
? cyc2ns_read_end+0x10/0x10
? ib_uverbs_open_xrcd+0xdd0/0xdd0
? uverbs_devnode+0x110/0x110
? cyc2ns_read_end+0x10/0x10
? cyc2ns_read_end+0x10/0x10
? sched_clock_cpu+0x18/0x200
__vfs_write+0x10d/0x700
? uverbs_devnode+0x110/0x110
? kernel_read+0x170/0x170
? __fget+0x358/0x5d0
? security_file_permission+0x93/0x260
vfs_write+0x1b0/0x550
SyS_write+0xc7/0x1a0
? SyS_read+0x1a0/0x1a0
? trace_hardirqs_on_thunk+0x1a/0x1c
entry_SYSCALL_64_fastpath+0x1e/0x8b
RIP: 0033:0x4335c9
Cc: syzkaller <syzkaller@googlegroups.com>
Cc: <stable@vger.kernel.org> # 4.11
Fixes: fd3c7904db ("IB/core: Change idr objects to use the new schema")
Reported-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Once the uobj is committed, another thread can immediately destroy it,
which in the worst case can result in a use-after-free of the restrack
objects.
Cc: syzkaller <syzkaller@googlegroups.com>
Fixes: 08f294a152 ("RDMA/core: Add resource tracking for create and destroy CQs")
Reported-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The command number is not bounds checked against the command mask before it
is shifted, resulting in an ubsan hit. This does not cause malfunction since
the command number is eventually bounds checked, but we can make this ubsan
clean by moving the bounds check to before the mask check.
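A minimal sketch of the ordering, assuming a hypothetical helper
(toy_command_supported is not the kernel function): the command number
is range-checked before it is used as a shift count, because shifting a
64-bit value by 64 or more is undefined behaviour in C.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if bit `command` is set in `mask`. */
bool toy_command_supported(uint64_t mask, unsigned int command)
{
	/* Bounds check first, so the shift below is always defined. */
	if (command >= 64)
		return false;

	return (mask & (UINT64_C(1) << command)) != 0;
}
```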
================================================================================
UBSAN: Undefined behaviour in
drivers/infiniband/core/uverbs_main.c:647:21
shift exponent 207 is too large for 64-bit type 'long long unsigned int'
CPU: 0 PID: 446 Comm: syz-executor3 Not tainted 4.15.0-rc2+ #61
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
Call Trace:
dump_stack+0xde/0x164
? dma_virt_map_sg+0x22c/0x22c
ubsan_epilogue+0xe/0x81
__ubsan_handle_shift_out_of_bounds+0x293/0x2f7
? debug_check_no_locks_freed+0x340/0x340
? __ubsan_handle_load_invalid_value+0x19b/0x19b
? lock_acquire+0x440/0x440
? lock_acquire+0x19d/0x440
? __might_fault+0xf4/0x240
? ib_uverbs_write+0x68d/0xe20
ib_uverbs_write+0x68d/0xe20
? __lock_acquire+0xcf7/0x3940
? uverbs_devnode+0x110/0x110
? cyc2ns_read_end+0x10/0x10
? sched_clock_cpu+0x18/0x200
? sched_clock_cpu+0x18/0x200
__vfs_write+0x10d/0x700
? uverbs_devnode+0x110/0x110
? kernel_read+0x170/0x170
? __fget+0x35b/0x5d0
? security_file_permission+0x93/0x260
vfs_write+0x1b0/0x550
SyS_write+0xc7/0x1a0
? SyS_read+0x1a0/0x1a0
? trace_hardirqs_on_thunk+0x1a/0x1c
entry_SYSCALL_64_fastpath+0x18/0x85
RIP: 0033:0x448e29
RSP: 002b:00007f033f567c58 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007f033f5686bc RCX: 0000000000448e29
RDX: 0000000000000060 RSI: 0000000020001000 RDI: 0000000000000012
RBP: 000000000070bea0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
R13: 00000000000056a0 R14: 00000000006e8740 R15: 0000000000000000
================================================================================
Cc: syzkaller <syzkaller@googlegroups.com>
Cc: <stable@vger.kernel.org> # 4.5
Fixes: 2dbd5186a3 ("IB/core: IB/core: Allow legacy verbs through extended interfaces")
Reported-by: Noa Osherovich <noaos@mellanox.com>
Reviewed-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
If remove_commit fails then the lock is left locked while the uobj still
exists. Eventually the kernel will deadlock.
lockdep detects this and says:
test/4221 is leaving the kernel with locks still held!
1 lock held by test/4221:
#0: (&ucontext->cleanup_rwsem){.+.+}, at: [<000000001e5c7523>] rdma_explicit_destroy+0x37/0x120 [ib_uverbs]
Fixes: 4da70da23e ("IB/core: Explicitly destroy an object while keeping uobject")
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This is really being used as an assert that the expected usecnt
is being held and implicitly that the usecnt is valid. Rename it to
assert_uverbs_usecnt and tighten the checks to only accept valid
values of usecnt (eg 0 and < -1 are invalid).
The tighter checks make the assertion cover more cases and make it
more likely to find bugs via syzkaller/etc.
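A minimal sketch of the tightened check. It assumes that -1 marks an
exclusive holder and that a positive value counts shared holders; only
the statement that 0 and values below -1 are invalid comes from the
text above, and toy_usecnt_is_valid is a hypothetical name.

```c
#include <stdbool.h>

/*
 * Assumed encoding: -1 means held exclusively, > 0 means held shared.
 * Everything else (0, or anything below -1) is rejected.
 */
bool toy_usecnt_is_valid(int usecnt, bool exclusive)
{
	if (exclusive)
		return usecnt == -1;

	return usecnt > 0;
}
```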
Fixes: 3832125624 ("IB/core: Add support for idr types")
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The race is between lookup_get_idr_uobject and
uverbs_idr_remove_uobj -> uverbs_uobject_put.
We deliberately do not call synchronize_rcu after the idr_remove in
uverbs_idr_remove_uobj for performance reasons, instead we call
kfree_rcu() during uverbs_uobject_put.
However, this means we can obtain pointers to uobj's that have
already been released and must protect against krefing them
using kref_get_unless_zero.
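The get-unless-zero pattern can be sketched in userspace with C11
atomics. This is an illustration of the technique only, not the kernel's
kref implementation; toy_obj and its helpers are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct toy_obj {
	atomic_uint refcount;
};

/* Take a reference only while the object is still live (count > 0). */
bool toy_get_unless_zero(struct toy_obj *o)
{
	unsigned int old = atomic_load(&o->refcount);

	while (old != 0) {
		/* On failure, `old` is refreshed with the current count. */
		if (atomic_compare_exchange_weak(&o->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true if this was the last one. */
bool toy_put(struct toy_obj *o)
{
	return atomic_fetch_sub(&o->refcount, 1) == 1;
}
```

A lookup that races with the final put then sees a zero count and fails
instead of resurrecting an object that is about to be freed.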
==================================================================
BUG: KASAN: use-after-free in copy_ah_attr_from_uverbs.isra.2+0x860/0xa00
Read of size 4 at addr ffff88005fda1ac8 by task syz-executor2/441
CPU: 1 PID: 441 Comm: syz-executor2 Not tainted 4.15.0-rc2+ #56
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xd4
print_address_description+0x73/0x290
kasan_report+0x25c/0x370
? copy_ah_attr_from_uverbs.isra.2+0x860/0xa00
copy_ah_attr_from_uverbs.isra.2+0x860/0xa00
? uverbs_try_lock_object+0x68/0xc0
? modify_qp.isra.7+0xdc4/0x10e0
modify_qp.isra.7+0xdc4/0x10e0
ib_uverbs_modify_qp+0xfe/0x170
? ib_uverbs_query_qp+0x970/0x970
? __lock_acquire+0xa11/0x1da0
ib_uverbs_write+0x55a/0xad0
? ib_uverbs_query_qp+0x970/0x970
? ib_uverbs_query_qp+0x970/0x970
? ib_uverbs_open+0x760/0x760
? futex_wake+0x147/0x410
? sched_clock_cpu+0x18/0x180
? check_prev_add+0x1680/0x1680
? do_futex+0x3b6/0xa30
? sched_clock_cpu+0x18/0x180
__vfs_write+0xf7/0x5c0
? ib_uverbs_open+0x760/0x760
? kernel_read+0x110/0x110
? lock_acquire+0x370/0x370
? __fget+0x264/0x3b0
vfs_write+0x18a/0x460
SyS_write+0xc7/0x1a0
? SyS_read+0x1a0/0x1a0
? trace_hardirqs_on_thunk+0x1a/0x1c
entry_SYSCALL_64_fastpath+0x18/0x85
RIP: 0033:0x448e29
RSP: 002b:00007f443fee0c58 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007f443fee16bc RCX: 0000000000448e29
RDX: 0000000000000078 RSI: 00000000209f8000 RDI: 0000000000000012
RBP: 000000000070bea0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
R13: 0000000000008e98 R14: 00000000006ebf38 R15: 0000000000000000
Allocated by task 1:
kmem_cache_alloc_trace+0x16c/0x2f0
mlx5_alloc_cmd_msg+0x12e/0x670
cmd_exec+0x419/0x1810
mlx5_cmd_exec+0x40/0x70
mlx5_core_mad_ifc+0x187/0x220
mlx5_MAD_IFC+0xd7/0x1b0
mlx5_query_mad_ifc_gids+0x1f3/0x650
mlx5_ib_query_gid+0xa4/0xc0
ib_query_gid+0x152/0x1a0
ib_query_port+0x21e/0x290
mlx5_port_immutable+0x30f/0x490
ib_register_device+0x5dd/0x1130
mlx5_ib_add+0x3e7/0x700
mlx5_add_device+0x124/0x510
mlx5_register_interface+0x11f/0x1c0
mlx5_ib_init+0x56/0x61
do_one_initcall+0xa3/0x250
kernel_init_freeable+0x309/0x3b8
kernel_init+0x14/0x180
ret_from_fork+0x24/0x30
Freed by task 1:
kfree+0xeb/0x2f0
mlx5_free_cmd_msg+0xcd/0x140
cmd_exec+0xeba/0x1810
mlx5_cmd_exec+0x40/0x70
mlx5_core_mad_ifc+0x187/0x220
mlx5_MAD_IFC+0xd7/0x1b0
mlx5_query_mad_ifc_gids+0x1f3/0x650
mlx5_ib_query_gid+0xa4/0xc0
ib_query_gid+0x152/0x1a0
ib_query_port+0x21e/0x290
mlx5_port_immutable+0x30f/0x490
ib_register_device+0x5dd/0x1130
mlx5_ib_add+0x3e7/0x700
mlx5_add_device+0x124/0x510
mlx5_register_interface+0x11f/0x1c0
mlx5_ib_init+0x56/0x61
do_one_initcall+0xa3/0x250
kernel_init_freeable+0x309/0x3b8
kernel_init+0x14/0x180
ret_from_fork+0x24/0x30
The buggy address belongs to the object at ffff88005fda1ab0
which belongs to the cache kmalloc-32 of size 32
The buggy address is located 24 bytes inside of
32-byte region [ffff88005fda1ab0, ffff88005fda1ad0)
The buggy address belongs to the page:
page:00000000d5655c19 count:1 mapcount:0 mapping: (null)
index:0xffff88005fda1fc0
flags: 0x4000000000000100(slab)
raw: 4000000000000100 0000000000000000 ffff88005fda1fc0 0000000180550008
raw: ffffea00017f6780 0000000400000004 ffff88006c803980 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff88005fda1980: fc fc fb fb fb fb fc fc fb fb fb fb fc fc fb fb
ffff88005fda1a00: fb fb fc fc fb fb fb fb fc fc 00 00 00 00 fc fc
ffff88005fda1a80: fb fb fb fb fc fc fb fb fb fb fc fc fb fb fb fb
ffff88005fda1b00: fc fc 00 00 00 00 fc fc fb fb fb fb fc fc fb fb
ffff88005fda1b80: fb fb fc fc fb fb fb fb fc fc fb fb fb fb fc fc
==================================================================
Cc: syzkaller <syzkaller@googlegroups.com>
Cc: <stable@vger.kernel.org> # 4.11
Fixes: 3832125624 ("IB/core: Add support for idr types")
Reported-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This clarifies the design intention that time between allocate and
commit has the uobj exclusive to the caller. We already guarantee
this by delaying publishing the uobj pointer via idr_insert,
fd_install, list_add, etc.
Additionally holding the usecnt lock during this period provides
extra clarity and more protection against future mistakes.
Fixes: 3832125624 ("IB/core: Add support for idr types")
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
If the same attribute is listed twice by the user in the ioctl attribute
list then error unwind can cause the kernel to deref garbage.
This happens when an object with WRITE access is sent twice. The second
parse properly fails but corrupts the state required for the error unwind
it triggers.
Fix this by making duplicates in the attribute list invalid. This is
not something we need to support.
The ioctl interface is currently recommended to be disabled in Kconfig.
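One way to reject duplicates is a seen-bitmap over the attribute ids. A
minimal sketch, assuming ids below 64 and a hypothetical
toy_check_attr_ids helper (the real parser is structured differently):

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_ATTR_ID 64

/* Returns 0 if every id is in range and unique, -EINVAL otherwise. */
int toy_check_attr_ids(const uint16_t *ids, size_t count)
{
	uint64_t seen = 0;

	for (size_t i = 0; i < count; i++) {
		uint64_t bit;

		if (ids[i] >= TOY_MAX_ATTR_ID)
			return -EINVAL;

		bit = UINT64_C(1) << ids[i];
		if (seen & bit)
			return -EINVAL;	/* the same attribute listed twice */
		seen |= bit;
	}
	return 0;
}
```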
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
32 bit processes running on a 64 bit kernel call compat_ioctl so that
implementations can revise any structure layout issues. Point compat_ioctl
at our normal ioctl because:
- All our structures are designed to be the same on 32 and 64 bit, ie we
use __aligned_u64 when required and are careful to manage padding.
- Any pointers are stored in u64's and userspace is expected
to prepare them properly.
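A minimal sketch of such a layout, assuming a hypothetical toy_cmd
structure and the GCC/Clang aligned attribute (the kernel's
__aligned_u64 is modelled here by toy_aligned_u64):

```c
#include <stddef.h>
#include <stdint.h>

/* Force 8-byte alignment even on ABIs that align uint64_t to 4 bytes. */
typedef uint64_t toy_aligned_u64 __attribute__((aligned(8)));

struct toy_cmd {
	uint32_t handle;
	uint32_t reserved;	/* explicit padding, never compiler-inserted */
	toy_aligned_u64 buf;	/* user pointer carried as a 64-bit integer */
};

/* These hold for both 32-bit and 64-bit builds. */
_Static_assert(offsetof(struct toy_cmd, buf) == 8, "buf must sit at offset 8");
_Static_assert(sizeof(struct toy_cmd) == 16, "no hidden tail padding");

uint64_t toy_ptr_to_u64(const void *p)
{
	return (uint64_t)(uintptr_t)p;
}

void *toy_u64_to_ptr(uint64_t v)
{
	return (void *)(uintptr_t)v;
}
```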
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Fix a bug in uverbs_ioctl_merge that looked at the object's iterator
number instead of the method's iterator number when merging methods.
While we're at it, make the uverbs_ioctl_merge code a bit more clear
and faster.
Fixes: 118620d368 ('IB/core: Add uverbs merge trees functionality')
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The union approach will get the endianness wrong sometimes if the kernel's
pointer size is 32 bits resulting in EFAULTs when trying to copy to/from
user.
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The rule for the API is pointers less than 8 bytes are inlined into
the .data field of the attribute. Fix the creation of the driver udata
struct to follow this rule and point to the .data itself when the size
is less than 8 bytes.
Otherwise if the UHW struct is less than 8 bytes the driver will get
EFAULT during copy_from_user.
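A minimal sketch of the inline rule, assuming a hypothetical toy_attr
layout. Whether a payload of exactly 8 bytes counts as inline is an
assumption made here (the sketch treats anything that fits in .data as
inline); the text above only says "less than 8 bytes".

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical attribute: .data is either the payload or a pointer. */
struct toy_attr {
	uint16_t len;
	uint64_t data;
};

/* Return the address the payload should be read from. */
const void *toy_attr_payload(const struct toy_attr *attr)
{
	/* Small payloads live inside .data itself (assumed threshold). */
	if (attr->len <= sizeof(attr->data))
		return &attr->data;

	/* Larger payloads are referenced by the pointer stored in .data. */
	return (const void *)(uintptr_t)attr->data;
}
```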
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This fixes several bugs around the copy_to/from user path:
- copy_to used the user provided size of the attribute
and could copy data beyond the end of the kernel buffer into
userspace.
- copy_from didn't know the size of the kernel buffer and
could have left kernel memory unexpectedly un-initialized.
- copy_from did not use the user length to determine if the
attribute data is inlined or not.
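The first two points can be sketched with plain memcpy() standing in for
copy_to_user()/copy_from_user(). The helper names are hypothetical; the
idea is that both lengths are known, the copy is clamped to the smaller
one, and any kernel-side remainder is zeroed.

```c
#include <stddef.h>
#include <string.h>

static size_t toy_min(size_t a, size_t b)
{
	return a < b ? a : b;
}

/* "copy_to": never read past the end of the kernel-side source. */
size_t toy_copy_out(void *user, size_t user_len,
		    const void *kern, size_t kern_len)
{
	size_t n = toy_min(user_len, kern_len);

	memcpy(user, kern, n);
	return n;
}

/* "copy_from": clamp the copy, then zero what the user did not supply. */
size_t toy_copy_in(void *kern, size_t kern_len,
		   const void *user, size_t user_len)
{
	size_t n = toy_min(kern_len, user_len);

	memcpy(kern, user, n);
	memset((char *)kern + n, 0, kern_len - n);
	return n;
}
```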
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Resource tracking of XRCD objects is not implemented in current
version of restrack and hence can be removed.
Fixes: 02d8883f52 ("RDMA/restrack: Add general infrastructure to track RDMA resources")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This is the mindless scripted replacement of kernel use of POLL*
variables as described by Al, done by this script:
for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
done
with de-mangling cleanups yet to come.
NOTE! On almost all architectures, the EPOLL* constants have the same
values as the POLL* constants do. But the keyword here is "almost".
For various bad reasons they aren't the same, and epoll() doesn't
actually work quite correctly in some cases due to this on Sparc et al.
The next patch from Al will sort out the final differences, and we
should be all done.
Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Clean up some function signatures in rxe for clarity
- Tidy the RDMA netlink header to remove unimplemented constants
- bnxt_re driver fixes, one is a regression this window.
- Minor hns driver fixes
- Various fixes from Dan Carpenter and his tool
- Fix IRQ cleanup race in HFI1
- HFI1 performance optimizations and a fix to report counters in the right units
- Fix for an IPoIB startup sequence race with the external manager
- Oops fix for the new kabi path
- Endian cleanups for hns
- Fix for mlx5 related to the new automatic affinity support
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull more rdma updates from Doug Ledford:
"Items of note:
- two patches fix a regression in the 4.15 kernel. The 4.14 kernel
worked fine with NVMe over Fabrics and mlx5 adapters. That broke in
4.15. The fix is here.
- one of the patches (the endian notation patch from Lijun) looks
like a lot of lines of change, but it's mostly mechanical in
nature. It amounts to the biggest chunk of change in it (it's about
2/3rds of the overall pull request).
Summary:
- Clean up some function signatures in rxe for clarity
- Tidy the RDMA netlink header to remove unimplemented constants
- bnxt_re driver fixes, one is a regression this window.
- Minor hns driver fixes
- Various fixes from Dan Carpenter and his tool
- Fix IRQ cleanup race in HFI1
- HFI1 performance optimizations and a fix to report counters in the right units
- Fix for an IPoIB startup sequence race with the external manager
- Oops fix for the new kabi path
- Endian cleanups for hns
- Fix for mlx5 related to the new automatic affinity support"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (38 commits)
net/mlx5: increase async EQ to avoid EQ overrun
mlx5: fix mlx5_get_vector_affinity to start from completion vector 0
RDMA/hns: Fix the endian problem for hns
IB/uverbs: Use the standard kConfig format for experimental
IB: Update references to libibverbs
IB/hfi1: Add 16B rcvhdr trace support
IB/hfi1: Convert kzalloc_node and kcalloc to use kcalloc_node
IB/core: Avoid a potential OOPs for an unused optional parameter
IB/core: Map iWarp AH type to undefined in rdma_ah_find_type
IB/ipoib: Fix for potential no-carrier state
IB/hfi1: Show fault stats in both TX and RX directions
IB/hfi1: Remove blind constants from 16B update
IB/hfi1: Convert PortXmitWait/PortVLXmitWait counters to flit times
IB/hfi1: Do not override given pcie_pset value
IB/hfi1: Optimize process_receive_ib()
IB/hfi1: Remove unnecessary fecn and becn fields
IB/hfi1: Look up ibport using a pointer in receive path
IB/hfi1: Optimize packet type comparison using 9B and bypass code paths
IB/hfi1: Compute BTH only for RDMA_WRITE_LAST/SEND_LAST packet
IB/hfi1: Remove dependence on qp->s_hdrwords
...
The ev_file is an optional parameter for CQ creation. If the parameter
is not passed, the ev_file pointer will be NULL. Using that pointer
to set the cq_context will result in an OOPs.
Verify that ev_file is not NULL before using.
Cc: <stable@vger.kernel.org> # 4.14.x
Fixes: 9ee79fce36 ("IB/core: Add completion queue (cq) object actions")
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
We should return -ENOMEM if the allocation fails. The current code
accidentally returns success.
Fixes: bf3c5a93c5 ("RDMA/nldev: Provide global resource utilization")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
- Misc small driver fixups to
bnxt_re/hfi1/qib/hns/ocrdma/rdmavt/vmw_pvrdma/nes
- Several major feature adds to bnxt_re driver: SRIOV VF RoCE support,
HugePages support, extended hardware stats support, and SRQ support
- A notable number of fixes to the i40iw driver from debugging scale up
testing
- More work to enable the new hip08 chip in the hns driver
- Misc small ULP fixups to srp/srpt/ipoib
- Preparation for srp initiator and target to support the RDMA-CM
protocol for connections
- Add RDMA-CM support to srp initiator, srp target is still a WIP
- Fixes for a couple of places where ipoib could spam the dmesg log
- Fix encode/decode of FDR/EDR data rates in the core
- Many patches from Parav with ongoing work to clean up inconsistencies
and bugs in RoCE support around the rdma_cm
- mlx5 driver support for the userspace features 'thread domain', 'wallclock
timestamps' and 'DV Direct Connected transport'. Support for the firmware
dual port RoCE capability
- Core support for more than 32 rdma devices in the char dev allocation
- kernel doc updates from Randy Dunlap
- New netlink uAPI for inspecting RDMA objects similar in spirit to 'ss'
- One minor change to the kobject code acked by GKH
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull RDMA subsystem updates from Jason Gunthorpe:
"Overall this cycle did not have any major excitement, and did not
require any shared branch with netdev.
Lots of driver updates, particularly of the scale-up and performance
variety. The largest body of core work was Parav's patches fixing and
restructuring some of the core code to make way for future RDMA
containerization.
Summary:
- misc small driver fixups to
bnxt_re/hfi1/qib/hns/ocrdma/rdmavt/vmw_pvrdma/nes
- several major feature adds to bnxt_re driver: SRIOV VF RoCE
support, HugePages support, extended hardware stats support, and
SRQ support
- a notable number of fixes to the i40iw driver from debugging scale
up testing
- more work to enable the new hip08 chip in the hns driver
- misc small ULP fixups to srp/srpt/ipoib
- preparation for srp initiator and target to support the RDMA-CM
protocol for connections
- add RDMA-CM support to srp initiator, srp target is still a WIP
- fixes for a couple of places where ipoib could spam the dmesg log
- fix encode/decode of FDR/EDR data rates in the core
- many patches from Parav with ongoing work to clean up
inconsistencies and bugs in RoCE support around the rdma_cm
- mlx5 driver support for the userspace features 'thread domain',
'wallclock timestamps' and 'DV Direct Connected transport'. Support
for the firmware dual port RoCE capability
- core support for more than 32 rdma devices in the char dev
allocation
- kernel doc updates from Randy Dunlap
- new netlink uAPI for inspecting RDMA objects similar in spirit to 'ss'
- one minor change to the kobject code acked by Greg KH"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (259 commits)
RDMA/nldev: Provide detailed QP information
RDMA/nldev: Provide global resource utilization
RDMA/core: Add resource tracking for create and destroy PDs
RDMA/core: Add resource tracking for create and destroy CQs
RDMA/core: Add resource tracking for create and destroy QPs
RDMA/restrack: Add general infrastructure to track RDMA resources
RDMA/core: Save kernel caller name when creating PD and CQ objects
RDMA/core: Use the MODNAME instead of the function name for pd callers
RDMA: Move enum ib_cq_creation_flags to uapi headers
IB/rxe: Change RDMA_RXE kconfig to use select
IB/qib: remove qib_keys.c
IB/mthca: remove mthca_user.h
RDMA/cm: Fix access to uninitialized variable
RDMA/cma: Use existing netif_is_bond_master function
IB/core: Avoid SGID attributes query while converting GID from OPA to IB
RDMA/mlx5: Avoid memory leak in case of XRCD dealloc failure
IB/umad: Fix use of unprotected device pointer
IB/iser: Combine substrings for three messages
IB/iser: Delete an unnecessary variable initialisation in iser_send_data_out()
IB/iser: Delete an error message for a failed memory allocation in iser_send_data_out()
...
Pull poll annotations from Al Viro:
"This introduces a __bitwise type for POLL### bitmap, and propagates
the annotations through the tree. Most of that stuff is as simple as
'make ->poll() instances return __poll_t and do the same to local
variables used to hold the future return value'.
Some of the obvious brainos found in process are fixed (e.g. POLLIN
misspelled as POLL_IN). At that point the amount of sparse warnings is
low and most of them are for genuine bugs - e.g. ->poll() instance
deciding to return -EINVAL instead of a bitmap. I hadn't touched those
in this series - it's large enough as it is.
Another problem it has caught was eventpoll() ABI mess; select.c and
eventpoll.c assumed that corresponding POLL### and EPOLL### were
equal. That's true for some, but not all of them - EPOLL### are
arch-independent, but POLL### are not.
The last commit in this series separates userland POLL### values from
the (now arch-independent) kernel-side ones, converting between them
in the few places where they are copied to/from userland. AFAICS, this
is the least disruptive fix preserving poll(2) ABI and making epoll()
work on all architectures.
As it is, it's simply broken on sparc - try to give it EPOLLWRNORM and
it will trigger only on what would've triggered EPOLLWRBAND on other
architectures. EPOLLWRBAND and EPOLLRDHUP, OTOH, are never triggered
at all on sparc. With this patch they should work consistently on all
architectures"
* 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
make kernel-side POLL... arch-independent
eventpoll: no need to mask the result of epi_item_poll() again
eventpoll: constify struct epoll_event pointers
debugging printk in sg_poll() uses %x to print POLL... bitmap
annotate poll(2) guts
9p: untangle ->poll() mess
->si_band gets POLL... bitmap stored into a user-visible long field
ring_buffer_poll_wait() return value used as return value of ->poll()
the rest of drivers/*: annotate ->poll() instances
media: annotate ->poll() instances
fs: annotate ->poll() instances
ipc, kernel, mm: annotate ->poll() instances
net: annotate ->poll() instances
apparmor: annotate ->poll() instances
tomoyo: annotate ->poll() instances
sound: annotate ->poll() instances
acpi: annotate ->poll() instances
crypto: annotate ->poll() instances
block: annotate ->poll() instances
x86: annotate ->poll() instances
...
Implement RDMA nldev netlink interface to get detailed information on each
QP in the system. This includes the owning process or kernel ULP and
detailed information from the qp_attrs.
Currently only the dumpit variant is implemented.
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Expose through the netlink interface the global per-device utilization of
the supported object types.
Provide both dumpit and doit callbacks.
As an example of possible output from rdmatool for a system with 5
mlx5 cards:
$ rdma res
1: mlx5_0: qp 4 cq 5 pd 3
2: mlx5_1: qp 4 cq 5 pd 3
3: mlx5_2: qp 4 cq 5 pd 3
4: mlx5_3: qp 2 cq 3 pd 2
5: mlx5_4: qp 4 cq 5 pd 3
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Track create and destroy operations of PD objects.
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Track create and destroy operations of CQ objects.
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Track create and destroy operations of QP objects.
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The RDMA subsystem has a very strict set of objects to work with, but it
completely lacks tracking facilities and has no visibility of resource
utilization.
The following patch adds such infrastructure to keep track of RDMA
resources to help with debugging of user space applications. The primary
user of this infrastructure is RDMA nldev netlink (following patches), to
be exposed to userspace via rdmatool, but it is not limited to that.
At this stage, the main three objects (PD, CQ and QP) are added, and more
will be added later.
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The KBUILD_MODNAME variable contains the module name and it is known for
kernel users during compilation, so let's reuse it to track the owners.
Followup patches will store this for resource tracking.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The ndev will be initialized and held only for successful
ib_get_cached_gid(), otherwise it is garbage stack memory.
Calling dev_put() in failure path is wrong.
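A minimal userspace sketch of the rule, with hypothetical toy_lookup()
and toy_put() helpers standing in for ib_get_cached_gid() and dev_put():
the output pointer is only meaningful after a successful lookup, so the
failure path must return without releasing it.

```c
#include <errno.h>
#include <stddef.h>

struct toy_dev {
	int refs;
};

/* On success stores a referenced device in *out; on failure *out is untouched. */
int toy_lookup(struct toy_dev *table, size_t n, size_t index,
	       struct toy_dev **out)
{
	if (index >= n)
		return -EINVAL;

	table[index].refs++;
	*out = &table[index];
	return 0;
}

void toy_put(struct toy_dev *dev)
{
	dev->refs--;
}

int toy_use(struct toy_dev *table, size_t n, size_t index)
{
	struct toy_dev *dev;	/* deliberately uninitialized, like ndev */
	int ret;

	ret = toy_lookup(table, n, index, &dev);
	if (ret)
		return ret;	/* no put here: dev was never set */

	/* ... use dev ... */
	toy_put(dev);
	return 0;
}
```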
Fixes: 16c72e4028 ("IB/cm: Refactor to avoid setting path record software only fields")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When checking whether the current netdev is the bond master interface,
use kernel API netif_is_bond_master() instead of hardcoding the check.
No functionality is changed.
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
SGID attributes are not used during OPA to IB GID conversion.
Therefore don't query it.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The ib_write_umad() is protected by taking the umad file mutex.
However, it accesses file->port->ib_dev -- which is protected only by the
port's mutex (field file_mutex).
The ib_umad_remove_one() calls ib_umad_kill_port() which sets
port->ib_dev to NULL under the port mutex (NOT the file mutex).
It then sets the mad agent to "dead" under the umad file mutex.
This is a race condition -- because there is a window where
port->ib_dev is NULL, while the agent is not "dead".
As a result, we saw stack traces like:
[16490.678059] BUG: unable to handle kernel NULL pointer dereference at 00000000000000b0
[16490.678246] IP: ib_umad_write+0x29c/0xa3a [ib_umad]
[16490.678333] PGD 0 P4D 0
[16490.678404] Oops: 0000 [#1] SMP PTI
[16490.678466] Modules linked in: rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx4_en(OE) ptp pps_core mlx4_ib(OE-) ib_core(OE) mlx4_core(OE) mlx_compat
(OE) memtrack(OE) devlink mst_pciconf(OE) mst_pci(OE) netconsole nfsv3 nfs_acl nfs lockd grace fscache cfg80211 rfkill esp6_offload esp6 esp4_offload esp4 sunrpc kvm_intel kvm ppdev parport_pc irqbypass
parport joydev i2c_piix4 virtio_balloon cirrus drm_kms_helper ttm drm e1000 serio_raw virtio_pci virtio_ring virtio ata_generic pata_acpi qemu_fw_cfg [last unloaded: mlxfw]
[16490.679202] CPU: 4 PID: 3115 Comm: sminfo Tainted: G OE 4.14.13-300.fc27.x86_64 #1
[16490.679339] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu2 04/01/2014
[16490.679477] task: ffff9cf753890000 task.stack: ffffaf70c26b0000
[16490.679571] RIP: 0010:ib_umad_write+0x29c/0xa3a [ib_umad]
[16490.679664] RSP: 0018:ffffaf70c26b3d90 EFLAGS: 00010202
[16490.679747] RAX: 0000000000000010 RBX: ffff9cf75610fd80 RCX: 0000000000000000
[16490.679856] RDX: 0000000000000001 RSI: 00007ffdf2bfd714 RDI: ffff9cf6bb2a9c00
In the above trace, ib_umad_write is trying to dereference the NULL
file->port->ib_dev pointer.
Fix this by using the agent's device pointer (the device field
in struct ib_mad_agent) -- which IS protected by the umad file mutex.
Cc: <stable@vger.kernel.org> # v4.11
Fixes: 44c58487d5 ("IB/core: Define 'ib' and 'roce' rdma_ah_attr types")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Returning EOPNOTSUPP is problematic because it can also be
returned by the method function, and we use it in quite a few
places in drivers these days.
Instead, dedicate EPROTONOSUPPORT to indicate that the ioctl framework
is enabled but the requested object and method are not supported by
the kernel. No other case will return this code, and it lets userspace
know to fall back to write().
grep says we do not use it today in drivers/infiniband subsystem.
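A caller could act on this distinction roughly as sketched below. This is
only an illustration in userspace C: invoke_method(), the method_fn type and
the three stub transports are hypothetical names, not part of any real
library or of this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical transport callback: returns 0 or a negative errno. */
typedef int (*method_fn)(void *ctx);

/*
 * Try the ioctl path first. Only -EPROTONOSUPPORT means "the kernel does
 * not know this object/method", so only that code triggers the write()
 * fallback. Any other error, including -EOPNOTSUPP coming from a driver
 * method, is returned to the caller unchanged.
 */
static int invoke_method(method_fn try_ioctl, method_fn try_write, void *ctx)
{
	int ret;

	assert(try_ioctl != NULL && try_write != NULL);
	ret = try_ioctl(ctx);
	if (ret == -EPROTONOSUPPORT)
		return try_write(ctx);
	return ret;
}

/* Stand-ins for real transports, used only to exercise the sketch. */
static int ioctl_unsupported(void *ctx) { (void)ctx; return -EPROTONOSUPPORT; }
static int ioctl_driver_error(void *ctx) { (void)ctx; return -EOPNOTSUPP; }
static int write_ok(void *ctx) { (void)ctx; return 0; }
```

With these stubs, an unsupported ioctl falls back to the write path, while a
driver-level -EOPNOTSUPP is passed through without a fallback.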
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
rdma_dev_addr contains the net namespace pointer. When referring to
bound_dev_if of the rdma_dev_addr, use the net namespace of the
rdma_cm_id that is stored in rdma_dev_addr.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
cma_validate_port uses rdma_dev_addr to validate the port of the cm_id.
It needs to honor the net namespace which is setup during cm_id creation
when finding netdevice.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Pass the rdma_cm_id so that multiple fields of the rdma_dev_addr
structure can be accessed, instead of passing each individual field.
This is needed to access some additional fields in followup patches.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
If valid netdevice is not found for RoCE, GID table should not be
searched with NULL netdevice.
Doing so causes the search routines to ignore the netdev argument and may
match the wrong GID table entry if the netdev is deleted.
Fixes: abae1b71dd ("IB/cma: cma_validate_port should verify the port and netdevice")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Make use of rdma_read_gids() API to read SGID and DGID which returns
correct GIDs for RoCE and other transports.
rdma_addr_get_dgid() for RoCE for client side connections returns MAC
address, instead of DGID.
rdma_addr_get_sgid() for RoCE doesn't return correct SGID for IPv6 and
when more than one IP address is assigned to the netdevice.
Therefore use transport agnostic rdma_read_gids() API provided by rdma_cm
module.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This patch introduces an API that allows legacy applications to query
GIDs for a rdma_cm_id which is used during connection establishment.
GIDs are stored and created differently for iWarp, IB and RoCE transports.
Therefore rdma_read_gids() returns GID for all the transports hiding
such internal details to caller.
It is usable for client side and server side connections.
In general continued use of GID based addressing outside of IB is
discouraged, so rdma_read_gids() should not be used by any new ULPs.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Polling the completion queue directly does not interfere
with the existing polling logic, hence drop the requirement.
Be aware that running ib_process_cq_direct with non IB_POLL_DIRECT
CQ may trigger concurrent CQ processing.
This can be used for polling mode ULPs.
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Reported-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
[maxg: added wcs array argument to __ib_process_cq]
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
No need to initialize completion and WR in case we fail
during QP modification.
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
gcc-8 reports
drivers/infiniband/core/cma_configfs.c: In function 'make_cma_dev':
./include/linux/string.h:245:9: warning: '__builtin_strncpy' specified
bound 64 equals destination size [-Wstringop-truncation]
We need to use strlcpy() to make sure the string is nul-terminated.
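The difference from strncpy() can be sketched in userspace C as follows.
bounded_copy() is a hypothetical stand-in written for this illustration; it
mimics the documented strlcpy() contract (copy at most size - 1 bytes, always
terminate) and is not the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for strlcpy(): copy at most size - 1 bytes and
 * always nul-terminate when size is non-zero. Returns strlen(src), so a
 * return value >= size tells the caller that truncation happened.
 */
static size_t bounded_copy(char *dst, const char *src, size_t size)
{
	size_t len = strlen(src);

	assert(dst != NULL);
	if (size != 0) {
		size_t n = len < size - 1 ? len : size - 1;

		memcpy(dst, src, n);
		dst[n] = '\0';
	}
	return len;
}
```

Unlike strncpy() with a bound equal to the destination size, the last byte of
the destination is reserved for the terminator even when the source is too
long.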
Signed-off-by: Xiongfeng Wang <xiongfeng.wang@linaro.org>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This matches what the userspace copy of this header has been doing
for a while. imm_data is an opaque 4 byte array carried over the network,
and invalidate_rkey is in CPU byte order.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Resolving DMAC for RoCE is applicable to only Connected mode QPs.
So resolve DMAC only for Connected mode QPs.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Instead of returning 0 (success) for RoCE scenarios where DMAC should
not be resolved, avoid such attempt and make code consistent with
ib_create_user_ah().
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Currently ah_attr is initialized by the ib_cm layer for rdma_cm
based applications. For RoCE transport ah_attr.roce.dmac is already
initialized by ib_cm, rdma_cm either from wc, path record, route
resolve, explicit path record setting depending on active or passive
side QP. Therefore avoid resolving DMAC for QP of kernel consumers.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Currently qp->port stores the port number whenever IB_QP_PORT
QP attribute mask is set (during QP state transition to INIT state).
This port number should be stored for the real QP when XRC target QP
is used.
Follow the ib_modify_qp() implementation and hide the access to ->real_qp.
Fixes: a512c2fbef ("IB/core: Introduce modify QP operation with udata")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Fix kernel-doc warning for ib_fmr_pool_map_phys() and also format it
with function description and text spacing.
../drivers/infiniband/core/fmr_pool.c:404: warning: Excess function parameter 'pool' description in 'ib_fmr_pool_map_phys'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: linux-doc@vger.kernel.org
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Change function parameter name in kernel-doc notation and other comments
to eliminate a kernel-doc warning.
../drivers/infiniband/core/verbs.c:1790: warning: Excess function parameter 'wq_init_attr' description in 'ib_create_wq'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: linux-doc@vger.kernel.org
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The 'if' logic in ucma_query_path was broken when OPA was introduced
and started to treat RoCE paths as OPA paths. Invert the logic
of the 'if' so only OPA paths are treated as OPA paths.
Otherwise the path records returned to rdma_cma users are mangled
when in RoCE mode.
Fixes: 5752075144 ("IB/SA: Add OPA path record type")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
rdma_set_ib_path() missed setting path record fields for RoCE
transport when RoCE support was added.
This results in errors such as an incorrect ndev, an incorrect destination
mac address and an incorrect GID type when user space attempts to set a raw
IB path using the RoCE IB path compatibility mapping.
Fixes: 3c86aa70bf ("RDMA/cm: Add RDMA CM support for IBoE devices")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Since 2006 no rdmacm based application has made use of setting multiple
path records through the rdma_set_ib_paths API.
Therefore the code is simplified to allow setting one path record entry.
Now that it sets only a single path, it is renamed to reflect the same.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Introduce a helper function to set path record L2 fields for RoCE.
This includes setting GID type, destination mac address and netdev
ifindex.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The net namespace is set in addr during create_rdma_id(),
cma_resolve_iboe_route() should use that instead of the
init namespace.
The original code was added in commit fa20105e09 ("IB/cma: Add support
for network namespaces"), but this path wasn't in use back then.
This patch updates the code to use right namespace, as preparation
for improving namespace support.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
There is a need to increase number of possible char devices to support
large number of SR-IOV instances. The current limit is in the range of
64-128 devices/ports. Increase it to support up to 1024.
The patch performs the following steps to refactor the code:
1. Removes the split bitmap for fixed and overflow dev numbers.
2. Pre-allocates the non-legacy major number range during driver
initialization, chosen for simplicity.
3. Add new define (RDMA_MAX_PORTS) that is shared between all drivers.
This is the maximum total number of ports on all struct ib_devices.
4. Set RDMA_MAX_PORTS to 1024.
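The single-bitmap scheme from step 1 can be sketched in userspace C as
below. This is an illustration only: MAX_PORTS mirrors the RDMA_MAX_PORTS
value named above, while dev_map, dev_num_alloc() and dev_num_free() are
hypothetical names and do not come from the patch.

```c
#include <assert.h>
#include <limits.h>

#define MAX_PORTS 1024	/* mirrors the RDMA_MAX_PORTS value above */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define MAP_WORDS ((MAX_PORTS + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* One flat bitmap; no separate "fixed" and "overflow" ranges. */
static unsigned long dev_map[MAP_WORDS];

/* Claim the lowest free device number, or return -1 if all are in use. */
static int dev_num_alloc(void)
{
	for (unsigned int i = 0; i < MAX_PORTS; i++) {
		unsigned long bit = 1UL << (i % BITS_PER_WORD);

		if (!(dev_map[i / BITS_PER_WORD] & bit)) {
			dev_map[i / BITS_PER_WORD] |= bit;
			return (int)i;
		}
	}
	return -1;
}

/* Release a device number previously returned by dev_num_alloc(). */
static void dev_num_free(int num)
{
	assert(num >= 0 && num < MAX_PORTS);
	dev_map[num / BITS_PER_WORD] &= ~(1UL << (num % BITS_PER_WORD));
}
```

The sketch does no locking of its own; it assumes the caller already
serializes allocation and release.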
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Remove the locks that protect character device bitmaps of
uverbs, umad and issm.
The character device bitmaps are accessed in "client->add" and
"client->remove" calls from ib_register_device and ib_unregister_device
respectively. These calls are already protected by the "device_mutex"
mutex. Thus, the spinlocks are not needed.
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma fixes from Doug Ledford:
- One line fix to mlx4 error flow (same as mlx5 fix in last pull
request, just in the mlx4 driver)
- Fix a race condition in the IPoIB driver. This patch is larger than
just a one line fix, but resolves a race condition in a fairly
straight forward manner
- Fix a locking issue in the RDMA netlink code. This patch is also
larger than I would like for a late -rc. It has, however, had a week
to bake in the rdma tree prior to this pull request
- One line fix to fix granting remote machine access to memory that
they don't need and shouldn't have
- One line fix to correct the fact that our sgid/dgid pair is swapped
from what you would expect when receiving an incoming connection
request
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
IB/srpt: Fix ACL lookup during login
IB/srpt: Disable RDMA access by the initiator
RDMA/netlink: Fix locking around __ib_get_device_by_index
IB/ipoib: Fix race condition in neigh creation
IB/mlx4: Fix mlx4_ib_alloc_mr error flow
Merging in 12 patch series from Bart that required changes in the
current for-rc branch in order to apply cleanly.
Signed-off-by: Doug Ledford <dledford@redhat.com>
When mlx5_ib_add is called determine if the mlx5 core device being
added is capable of dual port RoCE operation. If it is, determine
whether it is a master device or a slave device using the
num_vhca_ports and affiliate_nic_vport_criteria capabilities.
If the device is a slave, attempt to find a master device to affiliate it
with. Devices that can be affiliated will share a system image guid. If
none are found place it on a list of unaffiliated ports. If a master is
found bind the port to it by configuring the port affiliation in the NIC
vport context.
Similarly when mlx5_ib_remove is called determine the port type. If it's
a slave port, unaffiliate it from the master device, otherwise just
remove it from the unaffiliated port list.
The IB device is registered as a multiport device, even if a 2nd port is
not available for affiliation. When the 2nd port is affiliated later the
GID cache must be refreshed in order to get the default GIDs for the 2nd
port in the cache. Export roce_rescan_device to provide a mechanism to
refresh the cache after a new port is bound.
In a multiport configuration all IB object (QP, MR, PD, etc) related
commands should flow through the master mlx5_core_dev, other commands
must be sent to the slave port mlx5_core_mdev. An interface is provided
to get the correct mdev for non IB object commands.
Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
It always returns 0. Change return type to void.
Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The cases for FDR/EDR signalling speed were missing in
ib_rate_to_mult and mult_to_ib_rate, giving wrong return values when
drivers convert static rate to/from inter-packet-delay.
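The shape of such a pair of conversions can be sketched in userspace C as
follows. The enum, its names and the table entries are assumptions made for
this illustration (multiples of a 2.5 Gb/s base rate, rounded to the nearest
integer); they are not copied from the kernel sources.

```c
#include <assert.h>

/* Hypothetical rate enum; the names are not the kernel's. */
enum sketch_rate {
	SKETCH_RATE_UNKNOWN,
	SKETCH_RATE_2_5_GBPS,
	SKETCH_RATE_10_GBPS,
	SKETCH_RATE_14_GBPS,	/* FDR signalling speed */
	SKETCH_RATE_25_GBPS,	/* EDR signalling speed */
};

/* Rate as a multiple of 2.5 Gb/s, rounded to nearest; -1 if unknown. */
static int rate_to_mult(enum sketch_rate rate)
{
	switch (rate) {
	case SKETCH_RATE_2_5_GBPS: return 1;
	case SKETCH_RATE_10_GBPS:  return 4;
	case SKETCH_RATE_14_GBPS:  return 6;	/* 14 / 2.5 = 5.6 */
	case SKETCH_RATE_25_GBPS:  return 10;
	default:                   return -1;
	}
}

/* Inverse mapping; multipliers without a table entry map to UNKNOWN. */
static enum sketch_rate mult_to_rate(int mult)
{
	switch (mult) {
	case 1:  return SKETCH_RATE_2_5_GBPS;
	case 4:  return SKETCH_RATE_10_GBPS;
	case 6:  return SKETCH_RATE_14_GBPS;
	case 10: return SKETCH_RATE_25_GBPS;
	default: return SKETCH_RATE_UNKNOWN;
	}
}

/* Round-trip check: every rate must map back to itself. */
static void check_round_trip(enum sketch_rate rate)
{
	assert(mult_to_rate(rate_to_mult(rate)) == rate);
}
```

A rate that is missing from either switch falls through to the default
branch, which is the kind of wrong return value the message describes.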
Signed-off-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Commit 1a1c116f3d ("RDMA/netlink: Simplify the put_msg and put_attr")
removed the nlmsg_len calculation in ibnl_put_attr, causing netlink messages
to miss the source and destination addresses.
Fixes: 1a1c116f3d ("RDMA/netlink: Simplify the put_msg and put_attr")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Holding locks is mandatory when calling __ib_device_get_by_index,
otherwise there are races during the list iteration with device removal.
Since the locks are static to device.c, __ib_device_get_by_index can
never be called correctly by any user outside the file.
Make the function static and provide a safe function that gets the
correct locks and returns a kref'd pointer. Fix all callers.
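The pattern can be sketched in userspace C with a mutex and a plain counter.
All names here (struct sketch_dev, dev_find_locked(), dev_get_by_index(),
dev_put()) are hypothetical, and a pthread mutex stands in for the locks
that are private to device.c.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct sketch_dev {
	unsigned int index;
	int refcount;		/* protected by list_lock in this sketch */
	struct sketch_dev *next;
};

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static struct sketch_dev *dev_list;

/* Unlocked walk: only valid while list_lock is held, so it stays static. */
static struct sketch_dev *dev_find_locked(unsigned int index)
{
	for (struct sketch_dev *d = dev_list; d; d = d->next)
		if (d->index == index)
			return d;
	return NULL;
}

/* Safe entry point: take the lock, look up, and pin the result. */
static struct sketch_dev *dev_get_by_index(unsigned int index)
{
	struct sketch_dev *d;

	pthread_mutex_lock(&list_lock);
	d = dev_find_locked(index);
	if (d)
		d->refcount++;
	pthread_mutex_unlock(&list_lock);
	return d;
}

/* Drop a reference taken by dev_get_by_index(). */
static void dev_put(struct sketch_dev *d)
{
	pthread_mutex_lock(&list_lock);
	assert(d->refcount > 0);
	d->refcount--;
	pthread_mutex_unlock(&list_lock);
}
```

Because the reference is taken before the lock is dropped, the caller never
holds a pointer that a concurrent removal could free underneath it.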
Fixes: e5c9469efc ("RDMA/netlink: Add nldev device doit implementation")
Fixes: c3f66f7b00 ("RDMA/netlink: Implement nldev port doit callback")
Fixes: 7d02f605f0 ("RDMA/netlink: Add nldev port dumpit implementation")
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The NLDEV commands are using IB device indexes and names as a handle
for netlink communications. Put all relevant code into one function
to remove code duplication in followup patches.
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
There is an existing function to decrease reference counter
of the device, let's use it.
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The request_module() call is internally wrapped by CONFIG_MODULE,
so there is no need to check it in our RDMA code too.
Refactor to simplify the code.
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma fixes from Jason Gunthorpe:
"This is the next batch of for-rc patches from RDMA. It includes the
fix for the ipoib regression I mentioned last time, and the result of
a fairly major debugging effort to get iser working reliably on cxgb4
hardware - it turns out the cxgb4 driver was not handling QP error
flushing properly causing iser to fail.
- cxgb4 fix for an iser testing failure as debugged by Steve and
Sagi. The problem was a driver bug in the handling of shutting down
a QP.
- Various vmw_pvrdma fixes for bogus WARN_ON, missed resource free on
error unwind and a use after free bug
- Improper congestion counter values on mlx5 when link aggregation is
enabled
- ipoib lockdep regression introduced in this merge window
- hfi1 regression supporting the device in a VM introduced in a
recent patch
- Typo that breaks future uAPI compatibility in the verbs core
- More SELinux related oops fixing
- Fix an oops during error unwind in mlx5"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
IB/mlx5: Fix mlx5_ib_alloc_mr error flow
IB/core: Verify that QP is security enabled in create and destroy
IB/uverbs: Fix command checking as part of ib_uverbs_ex_modify_qp()
IB/mlx5: Serialize access to the VMA list
IB/hfi: Only read capability registers if the capability exists
IB/ipoib: Fix lockdep issue found on ipoib_ib_dev_heavy_flush
IB/mlx5: Fix congestion counters in LAG mode
RDMA/vmw_pvrdma: Avoid use after free due to QP/CQ/SRQ destroy
RDMA/vmw_pvrdma: Use refcount_dec_and_test to avoid warning
RDMA/vmw_pvrdma: Call ib_umem_release on destroy QP path
iw_cxgb4: when flushing, complete all wrs in a chain
iw_cxgb4: reflect the original WR opcode in drain cqes
iw_cxgb4: Only validate the MSN for successful completions
Patches for 4.16 that are dependent on patches sent to 4.15-rc.
These are small clean ups for the vmw_pvrdma and i40iw drivers.
* 'from-rc' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git:
RDMA/vmw_pvrdma: Remove usage of BIT() from UAPI header
RDMA/vmw_pvrdma: Use refcount_t instead of atomic_t
RDMA/vmw_pvrdma: Use more specific sizeof in kcalloc
RDMA/vmw_pvrdma: Clarify QP and CQ is_kernel logic
RDMA/vmw_pvrdma: Add UAR SRQ macros in ABI header file
i40iw: Change accelerated flag to bool
Delete ibnl_chk_listeners() and its kernel-doc comments from the
core_priv.h header file. There is no such function.
Fixes: 233c195583 ("RDMA/netlink: Reduce exposure of RDMA netlink functions")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The rq/sq->psn is 24 bits as defined in the IB spec, therefore we mask
out the 8 most significant bits to avoid overflow in modify_qp.
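A minimal userspace sketch of the masking, assuming the PSN arrives in a
32-bit field; PSN_MASK and psn_from_user() are hypothetical names used only
for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PSN_MASK 0xffffffu	/* low 24 bits */

static_assert(PSN_MASK == (1u << 24) - 1, "PSN is 24 bits wide");

/* Keep the low 24 bits and drop the 8 most significant bits. */
static uint32_t psn_from_user(uint32_t raw)
{
	return raw & PSN_MASK;
}
```

A value that already fits in 24 bits passes through unchanged.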
Signed-off-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The XRC target QP create flow sets up qp_sec only if there is an IB link with
LSM security enabled. However, several other related uAPI entry points blindly
follow the qp_sec NULL pointer, resulting in a possible oops.
Check for NULL before using qp_sec.
Cc: <stable@vger.kernel.org> # v4.12
Fixes: d291f1a652 ("IB/core: Enforce PKey security on QPs")
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
If the input command length is larger than the kernel supports an error should
be returned in case the unsupported bytes are not cleared, instead of the
other way around. This matches what all other callers of ib_is_udata_cleared
do and will avoid user ABI problems in the future.
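The intended check can be sketched in userspace C as follows.
tail_is_cleared() and check_cmd_len() are hypothetical helpers written for
this illustration, and the error code is an assumption; only the direction
of the test reflects the description above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* True if bytes [offset, offset + len) of buf are all zero. */
static bool tail_is_cleared(const unsigned char *buf, size_t offset,
			    size_t len)
{
	for (size_t i = 0; i < len; i++)
		if (buf[offset + i] != 0)
			return false;
	return true;
}

/*
 * known_len is the command size this (hypothetical) kernel understands.
 * A longer command is accepted only if the extra bytes are all zero;
 * the error is returned when they are NOT cleared.
 */
static int check_cmd_len(const unsigned char *cmd, size_t cmd_len,
			 size_t known_len)
{
	assert(cmd != NULL);
	if (cmd_len > known_len &&
	    !tail_is_cleared(cmd, known_len, cmd_len - known_len))
		return -EOPNOTSUPP;
	return 0;
}
```

With the test inverted, a command with a zeroed tail would be rejected and
one carrying unknown non-zero fields would be accepted.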
Cc: <stable@vger.kernel.org> # v4.10
Fixes: 189aba99e7 ("IB/uverbs: Extend modify_qp and support packet pacing")
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Use rdma_cap_opa_mad() to check for OPA to promote code reuse.
Signed-off-by: Don Hiatt <don.hiatt@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
SA queries SM for class port info when there is a LID_CHANGE event.
When a base lid is configured before fm is started, i.e. when smlid is
not yet assigned, SA handles the LID_CHANGE event and tries to query SM
with lid 0. This will cause a hang.
[ 1106.958820] INFO: task kworker/2:0:23 blocked for more than 120 seconds.
[ 1106.965082] Tainted: G O 4.12.0+ #1
[ 1106.969602] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
this message.
[ 1106.977227] kworker/2:0 D 0 23 2 0x00000000
[ 1106.977250] Workqueue: infiniband update_ib_cpi [ib_core]
[ 1106.977261] Call Trace:
[ 1106.977273] __schedule+0x28e/0x860
[ 1106.977285] schedule+0x36/0x80
[ 1106.977298] schedule_timeout+0x1a3/0x2e0
[ 1106.977310] ? radix_tree_iter_tag_clear+0x1b/0x20
[ 1106.977322] ? idr_alloc+0x64/0x90
[ 1106.977334] wait_for_completion+0xe3/0x140
[ 1106.977347] ? wake_up_q+0x80/0x80
[ 1106.977369] update_ib_cpi+0x163/0x210 [ib_core]
[ 1106.977381] process_one_work+0x147/0x370
[ 1106.977394] worker_thread+0x4a/0x390
[ 1106.977406] kthread+0x109/0x140
[ 1106.977418] ? process_one_work+0x370/0x370
[ 1106.977430] ? kthread_park+0x60/0x60
[ 1106.977443] ret_from_fork+0x22/0x30
Always ensure a proper smlid is assigned before querying SM for cpi.
Fixes: ee1c60b1bf ("IB/SA: Modify SA to implicitly cache Class Port info")
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Venkata Sandeep Dhanalakota <venkata.s.dhanalakota@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
These duplicate includes have been found with scripts/checkincludes.pl but
they have been removed manually to avoid removing false positives.
Signed-off-by: Pravin Shedge <pravin.shedge4linux@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When path ah_attr initialization from path record
fails, ib_cm_send_rej() uses av.ah_attr fields to send out reject
message. In such cases initialization of path record software fields
is not needed. Code is simplified for same.
Additionally in current code in cm_req_handler, when ib_get_cached_gid
fails for a given sgid_index of the GID of the GRH of the incoming CM MAD,
error code 12 is sent. This error code refers to primary GID in incoming
CM REQ and not to the GID in the MAD packet.
Therefore code is refactored to send code 5 (unsupported request) for such
error.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Hal Rosenstock <hal@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Currently ib_init_ah_from_wc initializes address handle attributes and
not the address handle object itself.
To avoid confusion between ah_attr vs ah, ib_init_ah_from_wc is
renamed to ib_init_ah_attr_from_wc to reflect that it initializes
ah_attr.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Since ib_init_ah_from_path initializes the address handle attribute, it is
renamed to reflect so.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When LAP messages are used for RoCE, the flow below can sleep while a
spin lock is held.
cm_lap_handler
->spin_lock
-> <..switch_case..>
-> cm_init_av_for_response
-> ib_init_ah_from_wc
-> rdma_addr_find_l2_eth_by_grh
wait_for_completion()
Therefore ah attribute initialization is done for incoming lap requests
outside of the lock context.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
cm_init_av_by_path depends on ib_init_ah_from_path to initialize ah
attribute and ib_init_ah_from_path() can fail, such error should not
be ignored.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
cm_init_av_for_response depends on ib_init_ah_from_wc() whose return
status is ignored.
ib_init_ah_from_wc() can fail and its return status should be handled as
done in this patch.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Currently there are no users of ib_find_gid for RoCE transport. It is
only used by IPoIB.
Therefore it is simplified to ignore RoCE ports and the GID type check which
was previously done for every port.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
rdma_copy_addr copies the ifindex to bound_dev_if.
Therefore avoid copying it again after rdma_copy_addr call is completed.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Since no caller needs vlan, rdma_translate_ip is simplified to avoid
vlan pointer.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
rdma_addr_find_smac_by_sgid() is an exported symbol that is not used by
any kernel module. Therefore it is removed.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
rdma_resolve_ip already copies 'addr' to its dev_addr argument.
Remove the duplicate memcpy and since it was the only user, remove the
'addr' member from resolve_cb_context.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
ib_find_gid_by_filter() is used only by ib_core, therefore avoid
exporting it.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Currently the found variable is checked on every GID entry comparison
miss. This is not needed, because a failure of either of the two
comparisons already indicates that the GID has not been found yet.
So refactor to avoid this check and copy the GID index when a match is found.
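A minimal sketch of this loop shape, assuming a flat table and an optional
filter callback. struct gid, gid_filter_fn and find_gid_index() are
hypothetical names for illustration and do not reproduce the ib_core cache
code.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct gid { uint8_t raw[16]; };

/* Optional extra predicate applied to a GID that already matched by value. */
typedef bool (*gid_filter_fn)(const struct gid *gid, void *ctx);

/*
 * Returns the index of the first matching entry, or -ENOENT.
 * A failed comparison simply moves on to the next entry; no separate
 * "found" flag needs to be consulted on each miss.
 */
static int find_gid_index(const struct gid *table, int len,
			  const struct gid *want,
			  gid_filter_fn filter, void *ctx)
{
	for (int i = 0; i < len; i++) {
		if (memcmp(&table[i], want, sizeof(*want)) != 0)
			continue;
		if (filter && !filter(&table[i], ctx))
			continue;
		return i;	/* the index is produced only on a hit */
	}
	return -ENOENT;
}
```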
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Type cast from void to struct find_gid_index_context is not needed.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Introduce and use helper functions to initialize work for the address
resolved and route resolved events, avoiding code duplication in a few
places.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Avoid setting path record type twice for RoCE.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The current code checks for a NULL ndev twice. The second check is
always unnecessary, because during the route resolving stage the device
address must be bound to a netdevice interface.
This patch simplifies the check.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
As the function name suggests cma_resolve_iboe_route() resolves RoCE
route. However, its default GID type is IB_GID_TYPE_IB and not
IB_GID_TYPE_ROCE, even though both are mapped to the same enum value.
Change default GID type to IB_GID_TYPE_ROCE.
cma_iboe_set_mgid() is updated to reflect the RoCEv2 GID check.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
In the ib_umem structure, npages holds the original number of sg entries,
while nmap is the number of DMA blocks returned by dma_map_sg().
Fixes: c5d76f130b ('IB/core: Add umem function to read data from user-space')
Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Add debug prints to the error paths in the connection manager control
flows, to help debug connection management problems.
Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The code was using the src size when formatting the dst. They are almost
certainly the same value but it reads wrong.
Fixes: ce117ffac2 ("RDMA/cma: Export AF_IB statistics")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
ib_security_modify_qp and ib_security_pkey_access are core internal
functions. So avoid exporting them.
ib_security_pkey_access is used only when security hooks are enabled, so
avoid defining it otherwise.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
RoCEv1 does not use the IPv6 stack to resolve the link local DGID since it
uses GID address. It forms the DMAC directly from the DGID.
The code became confused and also tried to use this bypass for RoCEv2
packets; however, RoCEv2 always uses an IP address in the GID and must
always use ARP or neighbor discovery to get the DMAC address.
Now that rdma_addr_find_l2_eth_by_grh() supports resolving a link local
address to find the destination MAC address, let's make use of it.
This aligns it to how the rest of the IPv6 stack resolves link local
destination IPv6 address.
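For illustration, the RoCEv1 shortcut of forming the DMAC directly from a
link local DGID can be sketched as below. The sketch assumes the interface
identifier is a modified EUI-64 derived from the MAC address; ll_gid_to_mac()
is a hypothetical name and the code is not copied from the kernel helper.

```c
#include <stdint.h>
#include <string.h>

/*
 * Assumption: gid[8..15] is a modified EUI-64 interface identifier,
 * i.e. MAC bytes 0-2, then ff:fe, then MAC bytes 3-5, with the
 * universal/local bit of the first byte inverted.
 */
static void ll_gid_to_mac(const uint8_t gid[16], uint8_t mac[6])
{
	memcpy(mac, &gid[8], 3);	/* first three MAC bytes */
	memcpy(mac + 3, &gid[13], 3);	/* skip the ff:fe filler bytes */
	mac[0] ^= 2;			/* undo the universal/local bit flip */
}
```

Under that assumption, fe80::268a:7ff:fe55:4661 maps to 24:8a:07:55:46:61.
No such shortcut exists for RoCEv2, where the GID carries an IP address and
the DMAC has to come from ARP or neighbor discovery.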
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When computing a UD reverse path (return AH) from a WC the code was not
doing a route lookup anchored in a specific netdevice. This caused several
bugs, including broken IPv6 link-local address support in RoCEv2. [1]
This fixes the lookup by determining the GID table entry that the HW
matched to the SGID for the WC and then using the netdevice from that
entry to perform the route and ND lookup for the 'DGID' to build a return
AH.
RoCE GID table management ensures that the right upper netdevices of the
physical netdevices are added. Therefore init_ah_from_wc doesn't need to
perform such a check.
Now that route lookup is done based on the netdevice of the GID entry,
simplify code to not have ifindex and vlan pointers. As part of that,
refactor to have netdevice as input parameter. This is already discussed
at [2].
Finally ib_init_ah_from_wc resolves dmac for unicast GID in similar way as
what ib_resolve_eth_dmac() does. So ib_resolve_eth_dmac is refactored to
split for unicast and non unicast GIDs, so that it can be reused by
ib_init_ah_from_wc.
While we are at refactoring ib_resolve_eth_dmac(), it is further
simplified
(a) to avoid hoplimit as optional parameter, as there is only one
user who always queries hoplimit.
(b) for empty line.
(c) avoided zero initialization of ret.
(d) removed as exported symbol as only ib core uses it.
For IPv6, this is tested using simple rping test as below.
rping -sv -a ::0
rping -c -a fe80::268a:7ff:fe55:4661%ens2f1 -C 1 -v -d
[1] https://www.spinics.net/lists/linux-rdma/msg45690.html
[2] https://www.spinics.net/lists/linux-rdma/msg45710.html
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Reported-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma fixes from Jason Gunthorpe:
"More fixes from testing done on the rc kernel, including more SELinux
testing. Looking forward, lockdep found a regression today in ipoib
which is still being fixed.
Summary:
- Fix for SELinux on the umad SMI path. Some old hardware does not
fill the PKey properly exposing another bug in the newer SELinux
code.
- Check the input port as we can exceed array bounds from this user
supplied value
- Users are unable to use the hash field support as they want due to
incorrect checks on the field restrictions, correct that so the
feature works as intended
- User triggerable oops in the NETLINK_RDMA handler
- cxgb4 driver fix for a bad interaction with CQ flushing in iser
caused by patches in this merge window, and bad CQ flushing during
normal close.
- Unbalanced memalloc_noio in ipoib in an error path"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
IB/ipoib: Restore MM behavior in case of tx_ring allocation failure
iw_cxgb4: only insert drain cqes if wq is flushed
iw_cxgb4: only clear the ARMED bit if a notification is needed
RDMA/netlink: Fix general protection fault
IB/mlx4: Fix RSS hash fields restrictions
IB/core: Don't enforce PKey security on SMI MADs
IB/core: Bound check alternate path port number
With gcc-4.1.2:
drivers/infiniband/core/iwpm_util.c: In function ‘iwpm_send_mapinfo’:
drivers/infiniband/core/iwpm_util.c:647: warning: ‘ret’ may be used uninitialized in this function
Indeed, if nl_client is not found in any of the scanned hash buckets, ret
will be used uninitialized.
Preinitialize ret to -EINVAL to fix this.
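A minimal sketch of the pattern, assuming a small chained hash table.
struct map_entry, BUCKETS, send_entry() and send_mapinfo() are hypothetical
names and do not reproduce iwpm_util.c.

```c
#include <errno.h>
#include <stddef.h>

#define BUCKETS 4

struct map_entry {
	int nl_client;
	struct map_entry *next;
};

/* Hypothetical per-entry action; counts calls and always succeeds. */
static int send_entry(const struct map_entry *e, int *sent)
{
	(void)e;
	(*sent)++;
	return 0;
}

static int send_mapinfo(struct map_entry *const buckets[BUCKETS],
			int nl_client, int *sent)
{
	int ret = -EINVAL;	/* result when no entry matches nl_client */

	for (int i = 0; i < BUCKETS; i++) {
		for (const struct map_entry *e = buckets[i]; e; e = e->next) {
			if (e->nl_client != nl_client)
				continue;
			ret = send_entry(e, sent);
			if (ret)
				return ret;
		}
	}
	return ret;
}
```

Without the initializer, the final return would read an indeterminate value
whenever the loops complete without a match.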
Fixes: 30dc5e63d6 ("RDMA/core: Add support for iWARP Port Mapper user space service")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Fix ptr_ret.cocci warnings:
drivers/infiniband/core/uverbs_cmd.c:1156:1-3: WARNING: PTR_ERR_OR_ZERO can be used
Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR
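To show the equivalence, here is a userspace model of the error-pointer
helpers. The definitions below are simplified re-implementations written for
this sketch (the kernel versions live in include/linux/err.h), and
alloc_obj(), check_before() and check_after() are hypothetical.

```c
#include <errno.h>
#include <stdbool.h>

#define MAX_ERRNO 4095

/* Simplified userspace models of the kernel helpers. */
static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline bool IS_ERR(const void *ptr)
{
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}
static inline int PTR_ERR_OR_ZERO(const void *ptr)
{
	return IS_ERR(ptr) ? (int)PTR_ERR(ptr) : 0;
}

/* Hypothetical allocator returning a valid pointer or an error pointer. */
static void *alloc_obj(int fail)
{
	static int obj;

	return fail ? ERR_PTR(-ENOMEM) : (void *)&obj;
}

/* Shape flagged by ptr_ret.cocci. */
static int check_before(const void *obj)
{
	if (IS_ERR(obj))
		return (int)PTR_ERR(obj);
	return 0;
}

/* Same behavior in a single expression. */
static int check_after(const void *obj)
{
	return PTR_ERR_OR_ZERO(obj);
}
```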
Generated by: scripts/coccinelle/api/ptr_ret.cocci
Signed-off-by: Vasyl Gomonovych <gomonovych@gmail.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
ULPs do not understand OPA GIDs and will reject CM requests
if the sgid does not match the local_gid. In order to
fix this behavior we convert the OPA GID back to an IB GID.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Don Hiatt <don.hiatt@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Per the infiniband spec an SMI MAD can have any PKey. Checking the pkey
on SMI MADs is not necessary, and it seems that some older adapters
using the mthca driver don't follow the convention of using the default
PKey, resulting in false denials, or errors querying the PKey cache.
SMI MAD security is still enforced, only agents allowed to manage the
subnet are able to receive or send SMI MADs.
Reported-by: Chris Blake <chrisrblake93@gmail.com>
Cc: <stable@vger.kernel.org> # v4.12
Fixes: 47a2b338fe ("IB/core: Enforce security on management datagrams")
Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The alternate port number is used as an array index in the IB
security implementation, invalid values can result in a kernel panic.
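The kind of check involved can be sketched as follows. struct device,
port_is_valid() and get_port_slot() are hypothetical names, and the 1-based
port numbering is an assumption made for this example only.

```c
#include <errno.h>
#include <stddef.h>

#define NUM_PORTS 2

struct device {
	unsigned int start_port;	/* assumed first valid port number */
	unsigned int end_port;		/* assumed last valid port number */
	int per_port_data[NUM_PORTS];
};

static int port_is_valid(const struct device *dev, unsigned int port)
{
	return port >= dev->start_port && port <= dev->end_port;
}

/* Validate the user-supplied port before it is used as an array index. */
static int get_port_slot(struct device *dev, unsigned int alt_port,
			 int **slot)
{
	if (!port_is_valid(dev, alt_port))
		return -EINVAL;

	*slot = &dev->per_port_data[alt_port - dev->start_port];
	return 0;
}
```

Rejecting the value up front turns an out-of-range index into an ordinary
error return for the caller.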
Cc: <stable@vger.kernel.org> # v4.12
Fixes: d291f1a652 ("IB/core: Enforce PKey security on QPs")
Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
You may need to pull my latest PGP key from the GPG key servers for this, I am
not certain if the subkey update will make it to kernel.org's WKD before you
need it.
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma fixes from Jason Gunthorpe:
"Here is the first rc pull request for RDMA. This includes an important
core fix for a regression in iWarp if SELinux is enabled, a fix for a
compilation regression introduced in this merge window, and one
obscure kconfig combination that oops's the kernel.
For drivers, we have hns fixes needed to make their devices work on
certain ARM IOMMU configurations, a stack data leak for hfi1, and
various testing discovered -rc bug fixes for i40iw.
This cycle we pushed back on the driver maintainers to have better
commit messages for -rc material"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
IB/core: Only enforce security for InfiniBand
RDMA/hns: Get rid of page operation after dma_alloc_coherent
RDMA/hns: Get rid of virt_to_page and vmap calls after dma_alloc_coherent
RDMA/hns: Fix the issue of IOVA not page continuous in hip08
IB/core: Init subsys if compiled to vmlinuz-core
RDMA/cma: Make sure that PSN is not over max allowed
i40iw: Notify user of established connection after QP in RTS
i40iw: Move MPA request event for loopback after connect
i40iw: Correct ARP index mask
i40iw: Do not free sqbuf when event is I40IW_TIMER_TYPE_CLOSE
i40iw: Allocate a sdbuf per CQP WQE
IB: INFINIBAND should depend on HAS_DMA
IB/hfi1: Initialize bth1 in 16B rc ack builder
For now the only LSM security enforcement mechanism available is
specific to InfiniBand. Bypass enforcement for non-IB link types.
This fixes a regression where modify_qp fails for iWARP because
querying the PKEY returns -EINVAL.
Cc: Paul Moore <paul@paul-moore.com>
Cc: Don Dutile <ddutile@redhat.com>
Cc: stable@vger.kernel.org
Reported-by: Potnuri Bharat Teja <bharat@chelsio.com>
Fixes: d291f1a65232 ("IB/core: Enforce PKey security on QPs")
Fixes: 47a2b338fe63 ("IB/core: Enforce security on management datagrams")
Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Tested-by: Potnuri Bharat Teja <bharat@chelsio.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Once infiniband is compiled as a core component its subsystem must be
enabled before device initialization. Otherwise there is a NULL pointer
dereference during mlx4_core init, calltrace:
->device_add
if (dev->class) {
deref dev->class->p =>NULLPTR
#Config
CONFIG_NET_DEVLINK=y
CONFIG_MAY_USE_DEVLINK=y
CONFIG_MLX4_EN=y
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This patch limits the initial value for PSN to 24 bits as
the spec requires.
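As a sketch, limiting a 32-bit random value to 24 bits is a single mask.
PSN_MASK and initial_psn() are hypothetical names used only for this
illustration.

```c
#include <stdint.h>

#define PSN_MASK 0x00ffffffu	/* PSN is a 24-bit field */

/* Reduce an arbitrary 32-bit random value to a valid initial PSN. */
static uint32_t initial_psn(uint32_t random_value)
{
	return random_value & PSN_MASK;
}
```

For example, initial_psn(0x12345678) yields 0x345678.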
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Until there is a solution to the dma-to-dax vs truncate problem it is
not safe to allow RDMA to create long standing memory registrations
against filesystem-dax vmas.
Link: http://lkml.kernel.org/r/151068941011.7446.7766030590347262502.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes: 3565fce3a6 ("mm, x86: get_user_pages() for dax mappings")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jason Gunthorpe <jgg@mellanox.com>
Acked-by: Doug Ledford <dledford@redhat.com>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Inki Dae <inki.dae@samsung.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Joonyoung Shim <jy0922.shim@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Seung-Woo Kim <sw0312.kim@samsung.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>