get_netdev: get the net_device on the physical port of the IB transport port. In
port aggregation mode it is required to return the netdev of the active port.
modify_gid: note for a change in the RoCE gid cache. Handle this by writing to
the harsware GID table. It is possible that indexes in cahce and hardware tables
won't match so a translation is required when modifying a QP or creating an
address handle.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Handling bonding and other devices require us to all all GIDs of the
net-devices which are upper-devices of the RoCE port related
net-device.
Active-backup configurations imposes even more challenges as the
default GID should only be set on the active devices (this is
necessary as otherwise the same MAC could be used for several
slaves and thus several slaves will have identical GIDs).
Managing these configurations are done by listening to:
(a) NETDEV_CHANGEUPPER event
(1) if a related net-device is linked, delete all inactive
slaves default GIDs and add the upper device GIDs.
(2) if a related net-device is unlinked, delete all upper GIDs
and add the default GIDs.
(b) NETDEV_BONDING_FAILOVER:
(1) delete the bond GIDs from inactive slaves
(2) delete the inactive slave's default GIDs
(3) Add the bond GIDs to the active slave.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Smatch says that, based on the indenting, we should probably add curly
braces here.
Fixes: 03db3a2d81 ('IB/core: Add RoCE GID table management')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
RoCE GIDs are based on IP addresses configured on Ethernet net-devices
which relate to the RDMA (RoCE) device port.
Currently, each of the low-level drivers that support RoCE (ocrdma,
mlx4) manages its own RoCE port GID table. As there's nothing which is
essentially vendor specific, we generalize that, and enhance the RDMA
core GID cache to do this job.
In order to populate the GID table, we listen for events:
(a) netdev up/down/change_addr events - if a netdev is built onto
our RoCE device, we need to add/delete its IPs. This involves
adding all GIDs related to this ndev, add default GIDs, etc.
(b) inet events - add new GIDs (according to the IP addresses)
to the table.
For programming the port RoCE GID table, providers must implement
the add_gid and del_gid callbacks.
RoCE GID management requires us to state the associated net_device
alongside the GID. This information is necessary in order to manage
the GID table. For example, when a net_device is removed, its
associated GIDs need to be removed as well.
RoCE mandates generating a default GID for each port, based on the
related net-device's IPv6 link local. In contrast to the GID based on
the regular IPv6 link-local (as we generate GID per IP address),
the default GID is also available when the net device is down (in
order to support loopback).
Locking is done as follows:
The patch modify the GID table code both for new RoCE drivers
implementing the add_gid/del_gid callbacks and for current RoCE and
IB drivers that do not. The flows for updating the table are
different, so the locking requirements are too.
While updating RoCE GID table, protection against multiple writers is
achieved via mutex_lock(&table->lock). Since writing to a table
requires us to find an entry (possible a free entry) in the table and
then modify it, this mutex protects both the find_gid and write_gid
ensuring the atomicity of the action.
Each entry in the GID cache is protected by rwlock. In RoCE, writing
(usually results from netdev notifier) involves invoking the vendor's
add_gid and del_gid callbacks, which could sleep.
Therefore, an invalid flag is added for each entry. Updates for RoCE are
done via a workqueue, thus sleeping is permitted.
In IB, updates are done in write_lock_irq(&device->cache.lock), thus
write_gid isn't allowed to sleep and add_gid/del_gid are not called.
When passing net-device into/out-of the GID cache, the device
is always passed held (dev_hold).
The code uses a single work item for updating all RDMA devices,
following a netdev or inet notifier.
The patch moves the cache from being a client (which was incorrect,
as the cache is part of the IB infrastructure) to being explicitly
initialized/freed when a device is registered/removed.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This gets rid of the weird in-between state where struct ib_device
was allocated but the kobject didn't work.
Consequently ib_device_release is now guaranteed to be called in
all situations and we needn't duplicate its kfrees on error paths.
Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Fully replaced by a more generic and suitable
ib_alloc_mr.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Use ib_alloc_mr with specific parameters.
Change the existing callers.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This was added in a thought of uniting all mr allocation
and deallocation routines but the fact is we have a single
deallocation routine already, ib_dereg_mr.
And, move mlx5_ib_destroy_mr specific logic into mlx5_ib_dereg_mr
(includes only signature stuff for now).
And, fixup the only callers (iser/isert) accordingly.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When no matching listening ID is found for a given request, the net_dev
that was used to find the request isn't released.
Fixes: 0b3ca768fc ("IB/cma: Use found net_dev for passive connections")
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Now that there are no ib_cm clients using the compare_data feature for
matching IB CM requests' private data, remove the compare_data parameter of
ib_cm_listen and remove the code implementing the feature.
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Use ib_cm_insert_listen to create listening IB CM IDs or share existing
ones if needed. When given a request on a specific CM ID, the code now
matches the request to the RDMA CM ID based on the request parameters, so
it no longer needs to rely on the ib_cm's private data matching
capabilities.
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When receiving a new connection in cma_req_handler, we actually already
know the net_dev that is used for the connection's creation. Instead of
calling cma_translate_addr to resolve the new connection id's source
address, just use the net_dev that was found.
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Pass incoming request parameters through the relevant IPv4/IPv6 routing
tables and make sure the network stack is configured to handle such
requests.
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Instead of relying on a the ib_cm module to check an incoming CM request's
private data header, add these checks to the RDMA CM module. This allows a
following patch to to clean up the ib_cm interface and remove the code that
looks into the private headers. It will also allow supporting namespaces in
RDMA CM by making these checks namespace aware later on.
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The rdma_cm module will later use the P_Key from the BTH to de-mux
requests.
See discussion at:
http://www.spinics.net/lists/netdev/msg336067.html
Cc: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Cc: Liran Liss <liranl@mellanox.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add helper functions to access the IDRs by port-space and port number.
Pass around the port-space enum in cma.c instead of using pointers to
port-space IDRs.
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Yotam Kenneth <yotamke@mellanox.com>
Signed-off-by: Shachar Raindel <raindel@mellanox.com>
Signed-off-by: Guy Shapiro <guysh@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When receiving a connection request, rdma_cm needs to associate the request
with a network device, in order to disambiguate requests. To do this, it
needs to know the request's destination IP. For this the module needs to
allow getting this information from the private data in the request packet,
instead of relying on the information already being in the listening RDMA
CM ID.
When creating a new incoming connection ID, the code in
cma_save_ip{4,6}_info can no longer rely on the listener's private data to
find the port number, so it reads it from the requested service ID.
Signed-off-by: Guy Shapiro <guysh@mellanox.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Yotam Kenneth <yotamke@mellanox.com>
Signed-off-by: Shachar Raindel <raindel@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Enabling network namespaces for RDMA CM will allow processes on different
namespaces to listen on the same port. In order to leave namespace support
out of the CM layer, this requires that multiple RDMA CM IDs will be able
to share a single CM ID.
This patch adds infrastructure to retrieve an existing listening ib_cm_id,
based on its device and service ID, or create a new one if one does not
already exist. It also adds a reference count for such instances
(cm_id_private.listen_sharecount), and prevents cm_destroy_id from
destroying a CM if it is still shared. See the relevant discussion [1].
[1] Re: [PATCH v3 for-next 05/13] IB/cm: Reference count ib_cm_ids
http://www.spinics.net/lists/netdev/msg328860.html
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Expose the service ID on an incoming CM or SIDR request to the event
handler. This will allow the RDMA CM module to de-multiplex connection
requests based on the information encoded in the service ID.
Acked-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In the case of IPoIB, and maybe in other cases, the network device is
managed by an upper-layer protocol (ULP). In order to expose this
network device to other users of the IB device, let ULPs implement
a callback that returns network device according to connection parameters.
The IB device and port, together with the P_Key and the GID should
be enough to uniquely identify the ULP net device. However, in current
kernels there can be multiple IPoIB interfaces created with the same GID.
Furthermore, such configuration may be desireable to support ipvlan-like
configurations for RDMA CM with IPoIB. To resolve the device in these
cases the code will also take the IP address as an additional input.
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Yotam Kenneth <yotamke@mellanox.com>
Signed-off-by: Shachar Raindel <raindel@mellanox.com>
Signed-off-by: Guy Shapiro <guysh@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
An ib_client callback that is called with the lists_rwsem locked only for
read is protected from changes to the IB client lists, but not from
ib_unregister_device() freeing its client data. This is because
ib_unregister_device() will remove the device from the device list with
lists_rwsem locked for write, but perform the rest of the cleanup,
including the call to remove() without that lock.
Mark client data that is undergoing de-registration with a new going_down
flag in the client data context. Lock the client data list with lists_rwsem
for write in addition to using the spinlock, so that functions calling the
callback would be able to lock only lists_rwsem for read and let callbacks
sleep.
Since ib_unregister_client() now marks the client data context, no need for
remove() to search the context again, so pass the client data directly to
remove() callbacks.
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Currently the RDMA subsystem's device list and client list are protected by
a single mutex. This prevents adding user-facing APIs that iterate these
lists, since using them may cause a deadlock. The patch attempts to solve
this problem by adding a read-write semaphore to protect the lists. Readers
now don't need the mutex, and are safe just by read-locking the semaphore.
The ib_register_device, ib_register_client, ib_unregister_device, and
ib_unregister_client functions are modified to lock the semaphore for write
during their respective list modification. Also, in order to make sure
client callbacks are called only between add() and remove() calls, the code
is changed to only add items to the lists after the add() calls and remove
from the lists before the remove() calls.
This patch attempts to solve a similar need [1] that was seen in the RoCE
v2 patch series.
[1] http://www.spinics.net/lists/linux-rdma/msg24733.html
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Cc: Matan Barak <matanb@mellanox.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Resolving a link-local IPv6 address with an unspecified source address
was broken by commit 5462eddd7a, which prevented the IPv6 stack from
learning the scope id of the link-local IPv6 address, causing random
failures as the IP stack chose a random link to resolve the address on.
This commit 5462eddd7a made us bail out of cma_check_linklocal early if
the address passed in was not an IPv6 link-local address. On the address
resolution path, the address passed in is the source address; if the
source address is the unspecified address, which is not link-local, we
will bail out early.
This is mostly correct, but if the destination address is a link-local
address, then we will be following a link-local route, and we'll need to
tell the IPv6 stack what the scope id of the destination address is.
This used to be done by last line of cma_check_linklocal, which is
skipped when bailing out early:
dev_addr->bound_dev_if = sin6->sin6_scope_id;
(In cma_bind_addr, the sin6_scope_id of the source address is set to the
sin6_scope_id of the destination address, so this is correct)
This line is required in turn for the following line, L279 of
addr6_resolve, to actually inform the IPv6 stack of the scope id:
fl6.flowi6_oif = addr->bound_dev_if;
Since we can only know we are in this failure case when we have access
to both the source IPv6 address and destination IPv6 address, we have to
deal with this further up the stack. So detect this failure case in
cma_bind_addr, and set bound_dev_if to the destination address scope id
to correct it.
Signed-off-by: Spencer Baugh <sbaugh@catern.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Something like this:
CPU A CPU B
Acked-by: Sean Hefty <sean.hefty@intel.com>
======================== ================================
ucma_destroy_id()
wait_for_completion()
.. anything
ucma_put_ctx()
complete()
.. continues ...
ucma_leave_multicast()
mutex_lock(mut)
atomic_inc(ctx->ref)
mutex_unlock(mut)
ucma_free_ctx()
ucma_cleanup_multicast()
mutex_lock(mut)
kfree(mc)
rdma_leave_multicast(mc->ctx->cm_id,..
Fix it by latching the ref at 0. Once it goes to 0 mc and ctx cannot
leave the mutex(mut) protection.
The other atomic_inc in ucma_get_ctx is OK because mutex(mut) protects
it from racing with ucma_destroy_id.
Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Acked-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The verbs are obsolete. The ib_rereg_phys_mr() verb is not used by
kernel ULPs, and the last ib_reg_phys_mr() call site in the kernel
tree has now been removed.
Two staging tree call sites remain in the Lustre client. The Lustre
team has been notified of the deprecation of reg_phys_mr.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
ib_ucm_release_dev clears the wrong bit if devnum is greater
than IB_UCM_MAX_DEVICES.
Signed-off-by: Carol L Soto <clsoto@linux.vnet.ibm.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Fixes to allow clients to make remove mapping requests, after
they have provided the user space service with the mapping
information, they are using when the service is restarted.
1) Adding IWPM_REG_VALID, IWPM_REG_INCOMPL and IWPM_REG_UNDEF
registration types for the port mapper clients and functions
to set/check the registration type.
2) If the port mapper user space service is not available to register
the client, then its registration stays IWPM_REG_UNDEF and the
registration isn't checked until the service becomes available
(no mappings are possible, if the user space service isn't running).
3) After the service is restarted, the user space port mapper pid is set
to valid and the client registration is set to IWPM_REG_INCOMPL
to allow the client to make remove mapping requests.
Signed-off-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Whenever ib_cm gets remove_one call, like when there is a hot-unplug
event, the driver should mark itself as going_down and confirm that no
new works are going to be queued for that device.
so, the order of the actions are:
1. mark the going_down bit.
2. flush the wq.
3. [make sure no new works for that device.]
4. unregister mad agent.
otherwise, works that are already queued can be scheduled after the mad
agent was freed.
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The define OPA_LID_PERMISSIVE is big endian and was compared to the
cpu endian variable opa_drslid.
Problem caught by 0-day build infrastructure.
Fixes: 8e4349d13f (IB/mad: Add final OPA MAD processing)
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: John, Jubin <jubin.john@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Persuant to Liran's comments on node_type on linux-rdma
mailing list:
In an effort to reform the RDMA core and ULPs to minimize use of
node_type in struct ib_device, an additional bit is added to
struct ib_device for is_switch (IB switch). This is needed
to be initialized by any IB switch device driver. This is a
NEW requirement on such device drivers which are all
"out of tree".
In addition, an ib_switch helper was added to ib_verbs.h
based on the is_switch device bit rather than node_type
(although those should be consistent).
The RDMA core (MAD, SMI, agent, sa_query, multicast, sysfs)
as well as (IPoIB and SRP) ULPs are updated where
appropriate to use this new helper. In some cases,
the helper is now used under the covers of using
rdma_[start end]_port rather than the open coding
previously used.
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-By: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Hal Rosenstock <hal@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
- A large cleanup of how device capabilities are checked for various
features
- Additional cleanups in the MAD processing
- Update to the srp driver
- Creation and use of centralized log message helpers
- Add const to a number of args to calls and clean up call chain
- Add support for extended cq create verb
- Add support for timestamps on cq completion
- Add support for processing OPA MAD packets
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJVeyzqAAoJELgmozMOVy/di3wP/jml4F9crvQn7UBJjGm/rgcI
wzZ2GZTqxQE8dn+W6gQsdKOzy0Ibxx5UYGp9ruInuxAcVh9t1PcylanasaiGMEtY
mrGRFjipJ9jYa+yDQDTQi8EFMClZuMSvtRLKjzYITudCXQck37V+F5YlP6VphjX7
JeiM4a+4rD0ukk5PKGvUw51sP1eawKtEdUvnqcOEI2tJgQmzJBP4mXrhVtS/0wSc
Pi8TRN5QKi3Drom/tK9QQ/ncoYngi4BKLfszCeU373HJq6qXqsxBYvs3jX6MPzfv
Aooj272JxBgCYxkmEfECezDpmi3PbWDJjXj/xCLjfhjISDtHHHVLGVMODZpwUEsL
2wBgwlzdajVopSbSLvsjQNtQw25s7sDWpu+TFKbS0u+W2d0ZOyipM1Xeje+OtDHQ
clhwvDhgSfeI/bJ1YdtNLbvINrwsfZD213zD+WH21A/9weAVr3hEfTuSaNFiTiRn
5yywP36TM0wH90KhiWoLrztcHvoE5p7kGuqzv04MRjrMMNHEJK2/IhWvT97Ewngu
vWrZl7QRzXYcGspCOp2aJW9Wr2rhGRrv28TF+thpNrIJOB2JM4q4koCKZCcI0s2D
E6pY2YQSzvrA/ZSfcWIg4yhugcycIJkOf7ur2N/U43cwGXtaCzPWVnKMApmdnVOO
ZEMwD3OZ1OGcCHLhRL8Y
=yISf
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull rdma updates from Doug Ledford:
- a large cleanup of how device capabilities are checked for various
features
- additional cleanups in the MAD processing
- update to the srp driver
- creation and use of centralized log message helpers
- add const to a number of args to calls and clean up call chain
- add support for extended cq create verb
- add support for timestamps on cq completion
- add support for processing OPA MAD packets
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (92 commits)
IB/mad: Add final OPA MAD processing
IB/mad: Add partial Intel OPA MAD support
IB/mad: Add partial Intel OPA MAD support
IB/core: Add OPA MAD core capability flag
IB/mad: Add support for additional MAD info to/from drivers
IB/mad: Convert allocations from kmem_cache to kzalloc
IB/core: Add ability for drivers to report an alternate MAD size.
IB/mad: Support alternate Base Versions when creating MADs
IB/mad: Create a generic helper for DR forwarding checks
IB/mad: Create a generic helper for DR SMP Recv processing
IB/mad: Create a generic helper for DR SMP Send processing
IB/mad: Split IB SMI handling from MAD Recv handler
IB/mad cleanup: Generalize processing of MAD data
IB/mad cleanup: Clean up function params -- find_mad_agent
IB/mlx4: Add support for CQ time-stamping
IB/mlx4: Add mmap call to map the hardware clock
IB/core: Pass hardware specific data in query_device
IB/core: Add timestamp_mask and hca_core_clock to query_device
IB/core: Extend ib_uverbs_create_cq
IB/core: Add CQ creation time-stamping flag
...
For devices which support OPA MADs
1) Use previously defined SMP support functions.
2) Pass correct base version to ib_create_send_mad when processing OPA MADs.
3) Process out_mad_key_index returned by agents for a response. This is
necessary because OPA SMP packets must carry a valid pkey.
4) Carry the correct segment size (OPA vs IBTA) of RMPP messages within
ib_mad_recv_wc.
5) Handle variable length OPA MADs by:
* Adjusting the 'fake' WC for locally routed SMP's to represent the
proper incoming byte_len
* out_mad_size is used from the local HCA agents
1) when sending agent responses on the wire
2) when passing responses through the local_completions
function
NOTE: wc.byte_len includes the GRH length and therefore is different
from the in_mad_size specified to the local HCA agents.
out_mad_size should _not_ include the GRH length as it is added
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add OPA SMP processing functionality.
Define the new OPA SMP format, create support functions for this format using
the previously defined helper functions as appropriate.
These functions are defined in this patch and used in the final OPA MAD support
patch.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch is the first of 3 which adds processing of OPA MADs
1) Add Intel Omni-Path Architecture defines
2) Increase max management version to accommodate OPA
3) update ib_create_send_mad
If the device supports OPA MADs and the MAD being sent is the OPA base
version alter the MAD size and sg lengths as appropriate
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In order to support alternate sized MADs (and variable sized MADs on OPA
devices) add in/out MAD size parameters to the process_mad core call.
In addition, add an out_mad_pkey_index to communicate the pkey index the driver
wishes the MAD stack to use when sending OPA MAD responses.
The out MAD size and the out MAD PKey index are required by the MAD
stack to generate responses on OPA devices.
Furthermore, the in and out MAD parameters are made generic by specifying them
as ib_mad_hdr rather than ib_mad.
Drivers are modified as needed and are protected by BUG_ON flags if the MAD
sizes passed to them is incorrect.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch implements allocating alternate receive MAD buffers within the MAD
stack. Support for OPA to send/recv variable sized MADs is implemented later.
1) Convert MAD allocations from kmem_cache to kzalloc
kzalloc is more flexible to support devices with different sized MADs
and research and testing showed that the current use of kmem_cache does
not provide performance benefits over kzalloc.
2) Change struct ib_mad_private to use a flex array for the mad data
3) Allocate ib_mad_private based on the size specified by devices in
rdma_max_mad_size.
4) Carry the allocated size in ib_mad_private to be used when processing
ib_mad_private objects.
5) Alter DMA mappings based on the mad_size of ib_mad_private.
6) Replace the use of sizeof and static defines as appropriate
7) Add appropriate casts for the MAD data when calling processing
functions.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add max MAD size to the device immutable data set and have all drivers that
support MADs report the current IB MAD size (IB_MGMT_MAD_SIZE) to the core.
Verify MAD size data in both the MAD core and when reading the immutable data.
OPA drivers will report alternate MAD sizes in subsequent patches.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In preparation to support the new OPA MAD Base version, add a base version
parameter to ib_create_send_mad and set it to IB_MGMT_BASE_VERSION for current
users.
Definition of the new base version and it's processing will occur in later
patches.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
IB and OPA SMPs share the same processing algorithm but have different header
formats and permissive LID detection.
Add a helper function which is generic to processing the DR forwarding checks which
can be used by both IB and OPA SMP code.
Use this function in the current IB function smi_check_forward_dr_smp.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
IB and OPA SMPs share the same processing algorithm but have different header
formats and permissive LID detection.
Add a helper function which is generic to processing DR SMP Recv messages which
can be used by both IB and OPA SMP code.
Use this function in the current IB function smi_handle_dr_smp_recv.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
IB and OPA SMPs share the same processing algorithm but have different header
formats and permissive LID detection.
Add a helper function which is generic to processing DR SMP Send messages which
can be used by both IB and OPA SMP code.
Use this function in the current IB function smi_handle_dr_smp_send.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Make a helper function to process Directed Route SMPs to be called by the IB
MAD Recv Handler, ib_mad_recv_done_handler.
This cleans up the MAD receive handler code a bit and allows for us to better
share the SMP processing code between IB and OPA SMPs.
IB and OPA SMPs share the same processing algorithm but have different header
formats and permissive LID detection. Therefore this and subsequent patches
split the common processing code from the IB specific code in anticipation of
sharing those algorithms with the OPA code.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
ib_find_send_mad only needs access to the MAD header not the full IB MAD.
Change the local variable to ib_mad_hdr and change the corresponding cast.
This allows for clean usage of this function with both IB and OPA MADs because
OPA MADs carry the same header as IB MADs.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
find_mad_agent only needs read only access to the MAD header. Update the
ib_mad pointer to be const ib_mad_hdr. Adjust call tree.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Vendors should be able to pass vendor specific data to/from
user-space via query_device uverb. In order to do this,
we need to pass the vendors' specific udata.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In order to expose timestamp we need to expose two new attributes in
query_device to be used for CQ completion time-stamping:
timestamp_mask - how many bits are valid in the timestamp, where timestamp
values could be 64bits the most.
hca_core_clock - timestamp is given in HW cycles, the frequency in KHZ units
of the HCA, necessary in order to convert cycles to seconds.
This is added both to ib_query_device and its respective uverbs counterpart.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
ib_uverbs_ex_create_cq follows the extension verbs
mechanism. New features (for example, CQ creation flags
field which is added in a downstream patch) could used
via user-space libraries without breaking the ABI.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Currently, ib_create_cq uses cqe and comp_vecotr instead
of the extendible ib_cq_init_attr struct.
Earlier patches already changed the vendors to work with
ib_cq_init_attr. This patch changes the consumers too.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add a new ib_cq_init_attr structure which contains the
previous cqe (minimum number of CQ entries) and comp_vector
(completion vector) in addition to a new flags field.
All vendors' create_cq callbacks are changed in order
to work with the new API.
This commit does not change any functionality.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-By: Devesh Sharma <devesh.sharma@avagotech.com> to patch #2
Signed-off-by: Doug Ledford <dledford@redhat.com>
Registering an event handler is done for a device. This device may have
one RoCE port (no SA cap) and one InfiniBand port (has SA cap).
Therefore, warning from the event handler about a specific port that
doesn't have SA cap is correct but pollutes the kernel log without a
need.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In order to support constant callers of agent_send_response we add const
specifiers to the its pointer arguments.
Adjust the call tree accordingly.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Hal Rosenstock <hal@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
rdma-cma/iw_cm: Export tos field to iwarp providers
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Support for using UD and AF_IB is currently broken. The
IB_CM_SIDR_REQ_RECEIVED message is not handled properly in
cma_save_net_info() and we end up falling into code that will try and
process the request as ipv4/ipv6, which will end up failing.
The resolution is to add a check for the SIDR_REQ and call
cma_save_ib_info() with a NULL path record. Change cma_save_ib_info()
to copy the src sib info from the listen_id when the path record is NULL.
Reported-by: Hari Shankar <Hari.Shankar@netapp.com>
Signed-off-by: Matt Finlay <matt@mellanox.com>
Acked-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
After discussion upstream, it was agreed to transition the usage of iboe
in the kernel to roce. This keeps our terminology consistent with what
was finalized in the IBTA Annex 16 and IBTA Annex 17 publications.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Problem reported by: Ted Kim <ted.h.kim@oracle.com>:
We have a case where a Linux system and a non-Linux system are
trying to interoperate. The Linux host is the active side and
starts the connection establishment, but later decides to not go
through with the connection setup and does rdma_destroy_id().
The rdma_destroy_id() eventually works its way down to cm_destroy_id()
in core/cm.c, where a REJ is sent. The non-Linux system
has some trouble recognizing the REJ because of:
A. CM states which can't receive the REJ
B. Some issues about REJ formatting (missing comm ID)
ISSUE A: That part of the spec says, a Consumer Reject REJ can be
sent for a connection abort, but it goes further
and says: can send a REJ message with a "Consumer Reject"
Reason code if they are in a CM state (i.e. REP
Rcvd, MRA(REP) Sent, REQ Rcvd, MRA Sent) that allows
a REJ to be sent (lines 35-38).
Of the states listed there in that sentence, it would
seem to limit the active side to using the Consumer Reject
(for the abort case) in just the REP-Rcvd and MRA-REP-Sent
states. That is basically only after the active side
sees a REP (or alternatively goes down the state transitions
to timeout in which case a Timeout REJ is sent).
As a fix, in cm-destroy-id() move the IB-CM-MRA-REQ-RCVD case
to the same as REQ-SENT. Essentially, make a REJ sent after
getting an MRA on active side a timeout rather than Consumer-
Reject, which is arguably more correct with the CM state
diagrams previous to getting a REP.
Signed-off-by: Ted Kim <ted.h.kim@oracle.com>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
As of commit 5eb620c81c "IB/core: Add helpers for uncached GID and P_Key
searches"; pkey_tbl_len and gid_tbl_len are immutable data which are stored in
the ib_device.
The per port core capability flags to be added later are also immutable data to
be stored in the ib_device object.
In preparation for this create a structure for per port immutable data and
place the pkey and gid table lengths within this structure.
"get_port_immutable" is added as a mandatory device function to allow the
drivers to fill in this data.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The addition of the rdma_cap_ib_mad is technically broken in ib_umad_remove_one
because the loop "i" value is not a port value.
This bug resulted in the ib_umad failing to properly remove its resources when
the core capability functions were converted to bit fields.
NOTE: e17371d73908 did not result in broken behavior on its own. It was only
an issue when the implementation of rdma_cap_ib_mad was changed.
Pass the port value to rdma_cap_ib_mad.
Fixes: e17371d73908 ("IB/Verbs: Use management helper rdma_cap_ib_mad()")
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Use the new common rdma_[start|end]_port functions instead of using
local variables and figuring it out on the fly.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The following functions only need read access to the data passed to them.
ib_mad_kernel_rmpp_agent
is_rmpp_data_mad
rcv_has_same_gid
ib_find_send_mad
Clarify with const specifiers
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
rcv_has_same_class only needs access to the MAD header
specify WR and Receive WC as const
Reviewed-By: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
ib_response_mad only needs read access to the MAD header, not write access
to the entire mad struct, so replace struct ib_mad with const struct
ib_mad_hdr
Reviewed-By: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
validate_mad only needs read access to the MAD header, not write access
to the entire mad struct, so replace struct ib_mad with const struct
ib_mad_hdr
Reviewed-By: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Some of us keep revisiting the code to decode enumerations that
appear in out logs. Let's borrow the nice logging helpers that
exists in xprtrdma and rds for CMA events, IB events and WC statuses.
Reviewd-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Return values of 0 do not make sense for functions which return enum
smi_action
Reviewed-By: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Acked-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
is_rmpp_data_mad is more descriptive for this function.
Reviewed-By: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Previously start_port and end_port were defined in 2 places, cache.c and
device.c and this prevented their use in other modules.
Make these common functions, change the name to reflect the rdma
name space, and update existing users.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Introduce helper rdma_cap_eth_ah() to help us check if the port of an
IB device support Ethernet Address Handler.
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Introduce helper rdma_cap_af_ib() to help us check if the port of an
IB device support Native Infiniband Address.
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Introduce helper rdma_cap_ib_mcast() to help us check if the port of an
IB device support Infiniband Multicast.
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Introduce helper rdma_cap_ib_sa() to help us check if the port of an
IB device support Infiniband Subnet Administration.
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Introduce helper rdma_cap_iw_cm() to help us check if the port of an
IB device support IWARP Communication Manager.
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Introduce helper rdma_cap_ib_cm() to help us check if the port of an
IB device support Infiniband Communication Manager.
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Introduce helper rdma_cap_ib_smi() to help us check if the port of an
IB device support Infiniband Subnet Management Interface.
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Introduce helper rdma_cap_ib_mad() to help us check if the port of an
IB device support Infiniband Management Datagrams.
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Use raw management helpers to reform rest part in IB-core cma.
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Reform cma_acquire_dev() with management helpers, introduce
cma_validate_port() to make the code more clean.
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Use raw management helpers to reform mcast related part in IB-core cma.
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Use raw management helpers to reform route related part in IB-core cma.
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Use raw management helpers to reform cm related part in IB-core cma/ucm.
Few checks focus on the device cm type rather than the port capability,
directly pass port 1 works currently, but can't support mixing cm type
device in future.
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Use raw management helpers to reform IB-core verbs
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Use raw management helpers to reform IB-core multicast.
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Use raw management helpers to reform IB-core sa_query.
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Use raw management helpers to reform IB-core cm.
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Use raw management helpers to reform IB-core mad/agent/user_mad.
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The string iwpm_ulib_name is recorded in a nlmsg as a netlink attribute.
Without this fix parsing of the nlmsg by the userspace port mapper service fails
because of unknown attribute length, causing the port mapper service not to
register the client, which has sent the nlmsg.
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Cc: <stable@vger.kernel.org> #v3.16
Reviewed-By: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Addresses the following kernel logs seen during boot of sparc systems:
Kernel unaligned access at TPC[103bce50] cm_find_listen+0x34/0xf8 [ib_cm]
Kernel unaligned access at TPC[103bce50] cm_find_listen+0x34/0xf8 [ib_cm]
Kernel unaligned access at TPC[103bce50] cm_find_listen+0x34/0xf8 [ib_cm]
Kernel unaligned access at TPC[103bce50] cm_find_listen+0x34/0xf8 [ib_cm]
Kernel unaligned access at TPC[103bce50] cm_find_listen+0x34/0xf8 [ib_cm]
Signed-off-by: David Ahern <david.ahern@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
While unmapping an ODP writable page, the dirty bit of the page is set. In
order to do so, the head of the compound page is found.
Currently, the compound head is found even on non-writable pages, where it is
never used, leading to unnecessary cpu barrier that impacts performance.
This patch moves the search for the compound head to be done only when needed.
Signed-off-by: Guy Shapiro <guysh@mellanox.com>
Acked-by: Shachar Raindel <raindel@mellanox.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Currently, while mapping or unmapping pages for ODP, the umem mutex is locked
and unlocked once for each page. Such lock/unlock operation take few tens to
hundreds of nsecs. This makes a significant impact when mapping or unmapping few
MBs of memory.
To avoid this, the mutex should be locked only once per operation, and not per
page.
Signed-off-by: Guy Shapiro <guysh@mellanox.com>
Acked-by: Shachar Raindel <raindel@mellanox.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add functionality to enable the port mapper on the passive side to provide to its
clients the actual (non-mapped) ip/tcp address information of the connecting peer
1) Adding remote_info_cb() to process the address info of the connecting peer
The address info is provided by the user space port mapper service when
the connection is initiated by the peer
2) Adding a hash list to store the remote address info
3) Adding functionality to add/remove the remote address info
After the info has been provided to the port mapper client,
it is removed from the hash list
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When accepting a new IPv4 connect to an IPv6 socket, the CMA tries to
canonize the address family to IPv4, but does not properly process
the listening sockaddr to get the listening port, and does not properly
set the address family of the canonized sockaddr.
Fixes: e51060f08a ("IB: IP address based RDMA connection manager")
Cc: <stable@vger.kernel.org>
Reported-By: Yotam Kenneth <yotamke@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Hello,
When an application using XRCs abruptly terminates, the mmaped pages
of the CQ buffers are leaked.
This comes from the fact that when resources are released in
ib_uverbs_cleanup_ucontext(), we fail to release the CQs because their
refcount is not 0.
When creating an XRC SRQ, we increment the associated CQ refcount.
This refcount is only decremented when the SRQ is released.
Therefore we need to release the SRQs prior to the CQs to make sure
that all references to the CQs are gone before trying to release these.
Signed-off-by: Sebastien Dugue <sebastien.dugue@bull.net>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In a call to ib_umem_get(), if address is 0x0 and size is
already page aligned, check added in commit 8494057ab5
("IB/uverbs: Prevent integer overflow in ib_umem_get address
arithmetic") will refuse to register a memory region that
could otherwise be valid (provided vm.mmap_min_addr sysctl
and mmap_low_allowed SELinux knobs allow userspace to map
something at address 0x0).
This patch allows back such registration: ib_umem_get()
should probably don't care of the base address provided it
can be pinned with get_user_pages().
There's two possible overflows, in (addr + size) and in
PAGE_ALIGN(addr + size), this patch keep ensuring none
of them happen while allowing to pin memory at address
0x0. Anyway, the case of size equal 0 is no more (partially)
handled as 0-length memory region are disallowed by an
earlier check.
Link: http://mid.gmane.org/cover.1428929103.git.ydroneaud@opteya.com
Cc: <stable@vger.kernel.org> # 8494057ab5 ("IB/uverbs: Prevent integer overflow in ib_umem_get address arithmetic")
Cc: Shachar Raindel <raindel@mellanox.com>
Cc: Jack Morgenstein <jackm@mellanox.com>
Cc: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
If ib_umem_get() is called with a size equal to 0 and an
non-page aligned address, one page will be pinned and a
0-sized umem will be returned to the caller.
This should not be allowed: it's not expected for a memory
region to have a size equal to 0.
This patch adds a check to explicitly refuse to register
a 0-sized region.
Link: http://mid.gmane.org/cover.1428929103.git.ydroneaud@opteya.com
Cc: <stable@vger.kernel.org>
Cc: Shachar Raindel <raindel@mellanox.com>
Cc: Jack Morgenstein <jackm@mellanox.com>
Cc: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Properly verify that the resulting page aligned end address is larger
than both the start address and the length of the memory area requested.
Both the start and length arguments for ib_umem_get are controlled by
the user. A misbehaving user can provide values which will cause an
integer overflow when calculating the page aligned end address.
This overflow can cause also miscalculation of the number of pages
mapped, and additional logic issues.
Addresses: CVE-2014-8159
Cc: <stable@vger.kernel.org>
Signed-off-by: Shachar Raindel <raindel@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Add on-demand paging capabilities reporting to the extended query device verb.
Yann Droneaud writes:
Note: as offsetof() is used to retrieve the size of the lower chunk
of the response, beware that it only works if the upper chunk
is right after, without any implicit padding. And, as the size of
the latter chunk is added to the base size, implicit padding at the
end of the structure is not taken in account. Both point must be
taken in account when extending the uverbs functionalities.
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Reviewed-by: Yann Droneaud <ydroneaud@opteya.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Add extensible query device capabilities verb to allow adding new features.
ib_uverbs_ex_query_device is added and copy_query_dev_fields is used to copy
capability fields to be used by both ib_uverbs_query_device and
ib_uverbs_ex_query_device.
Following the discussion about this patch [1], the code now validates
the command's comp_mask is zero, returning -EINVAL for unknown values,
in order to allow extending the verb in the future.
The verb also checks the user-space provided response buffer size and
only fills in capabilities that will fit in the buffer. In attempt to
follow the spirit of presentation [2] by Tzahi Oved that was presented
during OpenFabrics Alliance International Developer Workshop 2013, the
comp_mask bits will only describe which fields are valid. Furthermore,
fields that can simply be cleared when they are not supported, do not
require a comp_mask bit at all. The verb returns a response_length
field containing the actual number of bytes written by the kernel, so
that a newer version running on an older kernel can tell which fields
were actually returned.
[1] [PATCH v1 0/5] IB/core: extended query device caps cleanup for v3.19
http://thread.gmane.org/gmane.linux.kernel.api/7889/
[2] https://www.openfabrics.org/images/docs/2013_Dev_Workshop/Tues_0423/2013_Workshop_Tues_0830_Tzahi_Oved-verbs_extensions_ofa_2013-tzahio.pdf
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Reviewed-by: Yann Droneaud <ydroneaud@opteya.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
When the last on-demand paging MR is released the notifier count is
left non-zero so that concurrent page faults will have to abort. If a
new MR is then registered, the counter is reset. However, the decision
is made to put the new MR in the list waiting for the notifier count
to reach zero, before the counter is reset. An invalidation or another
MR registration can release the MR to handle page faults, but without
such an event the MR can wait forever.
The patch fixes this issue by adding a check whether the MR is the
first on-demand paging MR when deciding whether it is ready to handle
page faults. If it is the first MR, we know that there are no mmu
notifiers running in parallel to the registration.
Fixes: 882214e2b1 ("IB/core: Implement support for MMU notifiers regarding on demand paging regions")
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Shachar Raindel <raindel@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The deadlock occurs in __uverbs_modify_qp: we take a lock (idr_read_qp)
and in case of failure in ib_resolve_eth_l2_attrs we don't release
it (put_qp_read). Fix that.
Fixes: ed4c54e5b4 ("IB/core: Resolve Ethernet L2 addresses when modifying QP")
Signed-off-by: Moshe Lazer <moshel@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
When marshaling a user path to the kernel struct ib_sa_path, we need
to zero smac and dmac and set the vlan id to the "no vlan" value.
This is to ensure that Ethernet attributes are not used with
InfiniBand QPs.
Fixes: dd5f03beb4 ("IB/core: Ethernet L2 attributes in verbs/cm structures")
Signed-off-by: Ilya Nelkenbaum <ilyan@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
While commit 7e36ef8205 ("IB/core: Temporarily disable
ex_query_device uverb") is correct as it makes the extended
QUERY_DEVICE uverb (which came as part of commit 5a77abf9a9
("IB/core: Add support for extended query device caps") and commit
860f10a799 ("IB/core: Add flags for on demand paging support")) not
available to userspace, it doesn't address the initial issue regarding
ib_copy_to_udata() [1][2].
Additionally, further discussions around this new uverb seems to
conclude it would require a different data structure than the one
currently described in <rdma/ib_user_verbs.h> [3].
Both of these issues require a revert of the changes, so this patch
partially reverts commit 8cdd312cfe ("IB/mlx5: Implement the ODP
capability query verb") and commit 860f10a799 ("IB/core: Add flags
for on demand paging support") and fully reverts commit 5a77abf9a9
("IB/core: Add support for extended query device caps").
[1] "Re: [PATCH v3 06/17] IB/core: Add support for extended query device caps"
http://mid.gmane.org/1418733236.2779.26.camel@opteya.com
[2] "Re: [PATCH] IB/core: Temporarily disable ex_query_device uverb"
http://mid.gmane.org/1423067503.3030.83.camel@opteya.com
[3] "RE: [PATCH v1 1/5] IB/uverbs: ex_query_device: answer must not depend on request's comp_mask"
http://mid.gmane.org/2807E5FD2F6FDA4886F6618EAC48510E0CC12C30@CRSMSX101.amr.corp.intel.com
Cc: Eli Cohen <eli@mellanox.com>
Cc: Haggai Eran <haggaie@mellanox.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Cc: Sagi Grimberg <sagig@mellanox.com>
Cc: Shachar Raindel <raindel@mellanox.com>
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Commit 5a77abf9a9 ("IB/core: Add support for extended query device caps")
added a new extended verb to query the capabilities of RDMA devices, but the
semantics of this verb are still under debate [1].
Don't expose this verb to userspace until the ABI is nailed down.
[1] [PATCH v1 0/5] IB/core: extended query device caps cleanup for v3.19
http://www.spinics.net/lists/linux-rdma/msg22904.html
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Reviewed-by: Yann Droneaud <ydroneaud@opteya.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
* Add an interval tree implementation for ODP umems. Create an
interval tree for each ucontext (including a count of the number of
ODP MRs in this context, semaphore, etc.), and register ODP umems in
the interval tree.
* Add MMU notifiers handling functions, using the interval tree to
notify only the relevant umems and underlying MRs.
* Register to receive MMU notifier events from the MM subsystem upon
ODP MR registration (and unregister accordingly).
* Add a completion object to synchronize the destruction of ODP umems.
* Add mechanism to abort page faults when there's a concurrent invalidation.
The way we synchronize between concurrent invalidations and page
faults is by keeping a counter of currently running invalidations, and
a sequence number that is incremented whenever an invalidation is
caught. The page fault code checks the counter and also verifies that
the sequence number hasn't progressed before it updates the umem's
page tables. This is similar to what the kvm module does.
In order to prevent the case where we register a umem in the middle of
an ongoing notifier, we also keep a per ucontext counter of the total
number of active mmu notifiers. We only enable new umems when all the
running notifiers complete.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Shachar Raindel <raindel@mellanox.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Yuval Dagan <yuvalda@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
* Extend the umem struct to keep the ODP related data.
* Allocate and initialize the ODP related information in the umem
(page_list, dma_list) and freeing as needed in the end of the run.
* Store a reference to the process PID struct in the ucontext. Used to
safely obtain the task_struct and the mm during fault handling,
without preventing the task destruction if needed.
* Add 2 helper functions: ib_umem_odp_map_dma_pages and
ib_umem_odp_unmap_dma_pages. These functions get the DMA addresses
of specific pages of the umem (and, currently, pin them).
* Support for page faults only - IB core will keep the reference on
the pages used and call put_page when freeing an ODP umem
area. Invalidations support will be added in a later patch.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Shachar Raindel <raindel@mellanox.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
* Add a configuration option for enable on-demand paging support in
the infiniband subsystem (CONFIG_INFINIBAND_ON_DEMAND_PAGING). In a
later patch, this configuration option will select the MMU_NOTIFIER
configuration option to enable mmu notifiers.
* Add a flag for on demand paging (ODP) support in the IB device capabilities.
* Add a flag to request ODP MR in the access flags to reg_mr.
* Fail registrations done with the ODP flag when the low-level driver
doesn't support this.
* Change the conditions in which an MR will be writable to explicitly
specify the access flags. This is to avoid making an MR writable just
because it is an ODP MR.
* Add a ODP capabilities to the extended query device verb.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Shachar Raindel <raindel@mellanox.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Add extensible query device capabilities verb to allow adding new features.
ib_uverbs_ex_query_device is added and copy_query_dev_fields is used to
copy capability fields to be used by both ib_uverbs_query_device and
ib_uverbs_ex_query_device.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
In some drivers there's a need to read data from a user space area
that was pinned using ib_umem when running from a different process
context.
The ib_umem_copy_from function allows reading data from the physical
pages pinned in the ib_umem struct.
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
In order to allow umems that do not pin memory, we need the umem to
keep track of its region's address.
This makes the offset field redundant, and so this patch removes it.
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Address resolution always does a context switch to a work-queue to
deliver the address resolution event. When the IP address is already
cached in the system ARP table, we're going through the following:
chain:
rdma_resolve_ip --> addr_resolve (cache hit) -->
which ends up with:
queue_req --> set_timeout (now) --> mod_delayed_work(,, delay=1)
We actually do realize that the timeout should be zero, but the code
forces it to a minimum of one jiffie.
Using one jiffie as the minimum delay value results in sub-optimal
scheduling of executing this work item by the workqueue, which on the
below testbed costs about 3-4ms out of 12ms total time.
To fix that, we let the minimum delay to be zero. Note that the
connect step times change too, as there are address resolution calls
from that flow.
The results were taken from running both client and server on the
same node, over mlx4 RoCE port.
before -->
step total ms max ms min us us / conn
create id : 0.01 0.01 6.00 6.00
resolve addr : 4.02 4.01 4013.00 4016.00
resolve route: 0.18 0.18 182.00 183.00
create qp : 1.15 1.15 1150.00 1150.00
connect : 6.73 6.73 6730.00 6731.00
disconnect : 0.55 0.55 549.00 550.00
destroy : 0.01 0.01 9.00 9.00
after -->
step total ms max ms min us us / conn
create id : 0.01 0.01 6.00 6.00
resolve addr : 0.05 0.05 49.00 52.00
resolve route: 0.21 0.21 207.00 208.00
create qp : 1.10 1.10 1104.00 1104.00
connect : 1.22 1.22 1220.00 1221.00
disconnect : 0.71 0.71 713.00 713.00
destroy : 0.01 0.01 9.00 9.00
Signed-off-by: Or Kehati <ork@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Acked-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Applications can request that the SM assign an MGID by passing a mcast
member request containing MGID = 0. When the SM responds by sending
the allocated MGID, this MGID replaces the 0-MGID in the multicast group.
However, the MGID field in the group is also the key field in the IB
core multicast code rbtree containing the multicast groups for the
port.
Since this is a key field, correct handling requires that the group
entry be deleted from the rbtree and then re-inserted with the new
key, so that the table structure is properly maintained.
The current code does not do this correctly. Correct operation
requires that if the key-field gid has changed at all, it should be
deleted and re-inserted.
Note that when inserting, if the new MGID is zero (not the case here
but the code should handle this correctly), we allow duplicate entries
for 0-MGIDs.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Acked-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
For RoCE, resolution of layer 2 address attributes forces no VLAN if
link-local GIDs are used. This patch allows applications to choose
the VLAN ID for link-local based RoCE GIDs by setting IB_QP_VID in
their QP attribute mask, and prevents the core from overriding this
choice.
Cc: Ursula Braun <ursula.braun@de.ibm.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
In ib_uverbs_open_qp, the sharable xrc target qp is created as a
"pseudo" qp and added to a list of qp's sharing the same physical
QP. This is done before the "pseudo" qp is assigned a uobject.
There is a race condition here if an async event arrives at the
physical qp. If the event is handled after the pseudo qp is added to
the list, but before it is assigned a uobject, the kernel crashes in
ib_uverbs_qp_event_handler, due to trying to dereference a NULL
uobject pointer.
Note that simply checking for non-NULL is not enough, due to error
flows in ib_uverbs_open_qp. If the failure is after assigning the
uobject, but before the qp has fully been created, we still have a
problem.
Thus, in ib_uverbs_qp_event_handler, we test that the uobject is
present, and also that it is live.
Reported-by: Matthew Finlay <matt@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
During create-ah from userspace, uverbs is sending garbage data in
attr.dmac and attr.vlan_id. This patch sets attr.dmac and
attr.vlan_id to zero.
Fixes: dd5f03beb4 ("IB/core: Ethernet L2 attributes in verbs/cm structures")
Signed-off-by: Devesh Sharma <devesh.sharma@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Clear the reserved field of struct ib_uverbs_async_event_desc which is
copied to user space.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Yann Droneaud <ydroneaud@opteya.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
When marsheling a user path to the kernel struct ib_sa_path, need
to zero smac, dmac and set the vlan id to the "no vlan" value.
Fixes: dd5f03beb4 ("IB/core: Ethernet L2 attributes in verbs/cm structures")
Reported-by: Aleksey Senin <alekseys@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
In debugging an application that receives -ENOMEM from ib_reg_mr(), I
found that ib_umem_get() can fail because the pinned_vm count has
wrapped causing it to always be larger than the lock limit even with
RLIMIT_MEMLOCK set to RLIM_INFINITY.
The wrapping of pinned_vm occurs because the process that calls
ib_reg_mr() will have its mm->pinned_vm count incremented. Later a
different process with a different mm_struct than the one that
allocated the ib_umem struct ends up releasing it which results in
decrementing the new processes mm->pinned_vm count past zero and
wrapping.
I'm not entirely sure what circumstances cause a different process to
release the ib_umem than the one that allocated it but the kernel
stack trace of the freeing process from my situation looks like the
following:
Call Trace:
[<ffffffff814d64b1>] dump_stack+0x19/0x1b
[<ffffffffa0b522a5>] ib_umem_release+0x1f5/0x200 [ib_core]
[<ffffffffa0b90681>] mlx4_ib_destroy_qp+0x241/0x440 [mlx4_ib]
[<ffffffffa0b4d93c>] ib_destroy_qp+0x12c/0x170 [ib_core]
[<ffffffffa0cc7129>] ib_uverbs_close+0x259/0x4e0 [ib_uverbs]
[<ffffffff81141cba>] __fput+0xba/0x240
[<ffffffff81141e4e>] ____fput+0xe/0x10
[<ffffffff81060894>] task_work_run+0xc4/0xe0
[<ffffffff810029e5>] do_notify_resume+0x95/0xa0
[<ffffffff814e3dd0>] int_signal+0x12/0x17
The following patch fixes the issue by storing the pid struct of the
process that calls ib_umem_get() so that ib_umem_release and/or
ib_umem_account() can properly decrement the pinned_vm count of the
correct mm_struct.
Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Reviewed-by: Shachar Raindel <raindel@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Using the new registration mechanism, define a flag that indicates the
user wishes to process RMPP messages in user space rather than have
the kernel process them.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Registrations options are specified through flags. Definitions of flags will
be in subsequent patches.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Registration failures can be difficult to debug from userspace. This
gives more visibility.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Use dev_* style print when struct device is available.
Also combine previously line broken user-visible strings as per
Documentation/CodingStyle:
"However, never break user-visible strings such as printk messages,
because that breaks the ability to grep for them."
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
[ Remove PFX so the patch actually builds. - Roland ]
Signed-off-by: Roland Dreier <roland@purestorage.com>
Use dev_* style print when struct device is available.
Also combine previously line broken user-visible strings as per
Documentation/CodingStyle:
"However, never break user-visible strings such as printk messages,
because that breaks the ability to grep for them."
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
If the user creates a listening cm_id with backlog of 0 the IWCM ends
up not allowing any connection requests at all. The correct behavior
is for the IWCM to pick a default value if the user backlog parameter
is zero.
Lustre from version 1.8.8 onward uses a backlog of 0, which breaks
iwarp support without this fix.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Memory re-registration is a feature that enables changing the
attributes of a memory region registered by user-space, including PD,
translation (address and length) and access flags.
Add the required support in uverbs and the kernel verbs API.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This patch adds iWARP Port Mapper (IWPM) Version 2 support. The iWARP
Port Mapper implementation is based on the port mapper specification
section in the Sockets Direct Protocol paper -
http://www.rdmaconsortium.org/home/draft-pinkerton-iwarp-sdp-v1.0.pdf
Existing iWARP RDMA providers use the same IP address as the native
TCP/IP stack when creating RDMA connections. They need a mechanism to
claim the TCP ports used for RDMA connections to prevent TCP port
collisions when other host applications use TCP ports. The iWARP Port
Mapper provides a standard mechanism to accomplish this. Without this
service it is possible for RDMA application to bind/listen on the same
port which is already being used by native TCP host application. If
that happens the incoming TCP connection data can be passed to the
RDMA stack with error.
The iWARP Port Mapper solution doesn't contain any changes to the
existing network stack in the kernel space. All the changes are
contained with the infiniband tree and also in user space.
The iWARP Port Mapper service is implemented as a user space daemon
process. Source for the IWPM service is located at
http://git.openfabrics.org/git?p=~tnikolova/libiwpm-1.0.0/.git;a=summary
The iWARP driver (port mapper client) sends to the IWPM service the
local IP address and TCP port it has received from the RDMA
application, when starting a connection. The IWPM service performs a
socket bind from user space to get an available TCP port, called a
mapped port, and communicates it back to the client. In that sense,
the IWPM service is used to map the TCP port, which the RDMA
application uses to any port available from the host TCP port
space. The mapped ports are used in iWARP RDMA connections to avoid
collisions with native TCP stack which is aware that these ports are
taken. When an RDMA connection using a mapped port is terminated, the
client notifies the IWPM service, which then releases the TCP port.
The message exchange between the IWPM service and the iWARP drivers
(between user space and kernel space) is implemented using netlink
sockets.
1) Netlink interface functions are added: ibnl_unicast() and
ibnl_mulitcast() for sending netlink messages to user space
2) The signature of the existing ibnl_put_msg() is changed to be more
generic
3) Two netlink clients are added: RDMA_NL_NES, RDMA_NL_C4IW
corresponding to the two iWarp drivers - nes and cxgb4 which use
the IWPM service
4) Enums are added to enumerate the attributes in the netlink
messages, which are exchanged between the user space IWPM service
and the iWARP drivers
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: PJ Waskiewicz <pj.waskiewicz@solidfire.com>
[ Fold in range checking fixes and nlh_next removal as suggested by Dan
Carpenter and Steve Wise. Fix sparse endianness in hash. - Roland ]
Signed-off-by: Roland Dreier <roland@purestorage.com>
Avoid that closing /dev/infiniband/umad<n> or /dev/infiniband/issm<n>
triggers a use-after-free. __fput() invokes f_op->release() before it
invokes cdev_put(). Make sure that the ib_umad_device structure is
freed by the cdev_put() call instead of f_op->release(). This avoids
that changing the port mode from IB into Ethernet and back to IB
followed by restarting opensmd triggers the following kernel oops:
general protection fault: 0000 [#1] PREEMPT SMP
RIP: 0010:[<ffffffff810cc65c>] [<ffffffff810cc65c>] module_put+0x2c/0x170
Call Trace:
[<ffffffff81190f20>] cdev_put+0x20/0x30
[<ffffffff8118e2ce>] __fput+0x1ae/0x1f0
[<ffffffff8118e35e>] ____fput+0xe/0x10
[<ffffffff810723bc>] task_work_run+0xac/0xe0
[<ffffffff81002a9f>] do_notify_resume+0x9f/0xc0
[<ffffffff814b8398>] int_signal+0x12/0x17
Reference: https://bugzilla.kernel.org/show_bug.cgi?id=75051
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Yann Droneaud <ydroneaud@opteya.com>
Cc: <stable@vger.kernel.org> # 3.x: 8ec0a0e6b5: IB/umad: Fix error handling
Signed-off-by: Roland Dreier <roland@purestorage.com>
The ports kobject isn't being released during error flow in device
registration. This patch refactors the ports kobject cleanup into a
single function called from both the error flow in device registration
and from the unregistration function.
A couple of attributes aren't being deleted (iw_stats_group, and
ib_class_attributes). While this may be handled implicitly by the
destruction of their kobjects, it seems better to handle all the
attributes the same way.
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
[ Make free_port_list_attributes() static. - Roland ]
Signed-off-by: Roland Dreier <roland@purestorage.com>
When encountering an error during the add_port function, adding a port
to sysfs, the port kobject is freed without being deleted from sysfs.
Instead of freeing it directly, the patch uses kobject_put to release
the kobject and delete it.
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The ib_core module will call kobject_get on the parent object of each
kobject it creates. This is redundant since kobject_add does that
anyway.
As a side effect, this patch should fix leaking the ports kobject and
the device kobject during unregister flow, since the previous code
didn't seem to take into account the kobject_get calls on behalf of
the child kobjects.
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Fix a few functions that are declared with __attribute_const__ in the
ib_verbs.h header file but defined without it in verbs.c. This gets rid
of the following sparse warnings:
drivers/infiniband/core/verbs.c:51:5: error: symbol 'ib_rate_to_mult' redeclared with different type (originally declared at include/rdma/ib_verbs.h:469) - different modifiers
drivers/infiniband/core/verbs.c:68:14: error: symbol 'mult_to_ib_rate' redeclared with different type (originally declared at include/rdma/ib_verbs.h:607) - different modifiers
drivers/infiniband/core/verbs.c:85:5: error: symbol 'ib_rate_to_mbps' redeclared with different type (originally declared at include/rdma/ib_verbs.h:476) - different modifiers
drivers/infiniband/core/verbs.c:111:1: error: symbol 'rdma_node_get_transport' redeclared with different type (originally declared at include/rdma/ib_verbs.h:84) - different modifiers
Signed-off-by: Roland Dreier <roland@purestorage.com>
Properly convert gfp_t & result to bool to fix:
drivers/infiniband/core/sa_query.c:621:33: warning: incorrect type in initializer (different base types)
drivers/infiniband/core/sa_query.c:621:33: expected bool [unsigned] [usertype] preload
drivers/infiniband/core/sa_query.c:621:33: got restricted gfp_t
Signed-off-by: Roland Dreier <roland@purestorage.com>
Avoid leaking a kref count in ib_umad_open() if port->ib_dev == NULL
or if nonseekable_open() fails.
Avoid leaking a kref count, that sm_sem is kept down and also that the
IB_PORT_SM capability mask is not cleared in ib_umad_sm_open() if
nonseekable_open() fails.
Since container_of() never returns NULL, remove the code that tests
whether container_of() returns NULL.
Moving the kref_get() call from the start of ib_umad_*open() to the
end is safe since it is the responsibility of the caller of these
functions to ensure that the cdev pointer remains valid until at least
when these functions return.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: <stable@vger.kernel.org>
[ydroneaud@opteya.com: rework a bit to reduce the amount of code changed]
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
[ nonseekable_open() can't actually fail, but.... - Roland ]
Signed-off-by: Roland Dreier <roland@purestorage.com>
The code that resolves the passive side source MAC within the rdma_cm
connection request handler was both redundant and buggy, so remove it.
It was redundant since later, when an RC QP is modified to RTR state,
the resolution will take place in the ib_core module. It was buggy
because this callback also deals with UD SIDR exchange, for which we
incorrectly looked at the REQ member of the CM event and dereferenced
a random value.
Fixes: dd5f03beb4 ("IB/core: Ethernet L2 attributes in verbs/cm structures")
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Running with DMA_API_DEBUG enabled and not checking for DMA mapping
errors triggers a kernel stack trace with "DMA-API: device driver
failed to check map error" message. Add these checks to the MAD
module, both to be be more robust and also eliminate these
false-positive stack traces.
Signed-off-by: Yan Burman <yanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Introduce a verbs interface for signature-related operations. A
signature handover operation configures the layouts of data and
protection attributes both in memory and wire domains.
Signature operations are:
- INSERT:
Generate and insert protection information when handing over
data from input space to output space.
- validate and STRIP:
Validate protection information and remove it when handing over
data from input space to output space.
- validate and PASS:
Validate protection information and pass it when handing over
data from input space to output space.
Once the signature handover opration is done, the HCA will offload
data integrity generation/validation while performing the actual data
transfer.
Additions:
1. HCA signature capabilities in device attributes
Verbs provider supporting signature handover operations fills
relevant fields in device attributes structure returned by
ib_query_device.
2. QP creation flag IB_QP_CREATE_SIGNATURE_EN
Creating a QP that will carry signature handover operations may
require some special preparations from the verbs provider. So we
add QP creation flag IB_QP_CREATE_SIGNATURE_EN to declare that the
created QP may carry out signature handover operations. Expose
signature support to verbs layer (no support for now).
3. New send work request IB_WR_REG_SIG_MR
Signature handover work request. This WR will define the signature
handover properties of the memory/wire domains as well as the
domains layout. The purpose of this work request is to bind all
the needed information for the signature operation:
- data to be transferred: wr->sg_list (ib_sge).
* The raw data, pre-registered to a single MR (normally, before
signature, this MR would have been used directly for the data
transfer)
- data protection guards: sig_handover.prot (ib_sge).
* The data protection buffer, pre-registered to a single MR, which
contains the data integrity guards of the raw data blocks.
Note that it may not always exist, only in cases where the user is
interested in storing protection guards in memory.
- signature operation attributes: sig_handover.sig_attrs.
* Tells the HCA how to validate/generate the protection information.
Once the work request is executed, the memory region that will
describe the signature transaction will be the sig_mr. The
application can now go ahead and send the sig_mr.rkey or use the
sig_mr.lkey for data transfer.
4. New Verb ib_check_mr_status
check_mr_status verb checks the status of the memory region post
transaction. The first check that may be used is
IB_MR_CHECK_SIG_STATUS, which will indicate if any signature
errors are pending for a specific signature-enabled ib_mr. This
verb is a lightwight check and is allowed to be taken from
interrupt context. An application must call this verb after it is
known that the actual data transfer has finished.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This commit introduces verbs for creating/destoying memory
regions which will allow new types of memory key operations such
as protected memory registration.
Indirect memory registration is registering several (one
of more) pre-registered memory regions in a specific layout.
The Indirect region may potentialy describe several regions
and some repitition format between them.
Protected Memory registration is registering a memory region
with various data integrity attributes which will describe protection
schemes that will be handled by the HCA in an offloaded manner.
These memory regions will be applicable for a new REG_SIG_MR
work request introduced later in this patchset.
In the future these routines may replace or implement current memory
regions creation routines existing today:
- ib_reg_user_mr
- ib_alloc_fast_reg_mr
- ib_get_dma_mr
- ib_dereg_mr
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This patch refactors the IB core umem code and vendor drivers to use a
linear (chained) SG table instead of chunk list. With this change the
relevant code becomes clearer—no need for nested loops to build and
use umem.
Signed-off-by: Shachar Raindel <raindel@mellanox.com>
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Pull networking updates from David Miller:
1) BPF debugger and asm tool by Daniel Borkmann.
2) Speed up create/bind in AF_PACKET, also from Daniel Borkmann.
3) Correct reciprocal_divide and update users, from Hannes Frederic
Sowa and Daniel Borkmann.
4) Currently we only have a "set" operation for the hw timestamp socket
ioctl, add a "get" operation to match. From Ben Hutchings.
5) Add better trace events for debugging driver datapath problems, also
from Ben Hutchings.
6) Implement auto corking in TCP, from Eric Dumazet. Basically, if we
have a small send and a previous packet is already in the qdisc or
device queue, defer until TX completion or we get more data.
7) Allow userspace to manage ipv6 temporary addresses, from Jiri Pirko.
8) Add a qdisc bypass option for AF_PACKET sockets, from Daniel
Borkmann.
9) Share IP header compression code between Bluetooth and IEEE802154
layers, from Jukka Rissanen.
10) Fix ipv6 router reachability probing, from Jiri Benc.
11) Allow packets to be captured on macvtap devices, from Vlad Yasevich.
12) Support tunneling in GRO layer, from Jerry Chu.
13) Allow bonding to be configured fully using netlink, from Scott
Feldman.
14) Allow AF_PACKET users to obtain the VLAN TPID, just like they can
already get the TCI. From Atzm Watanabe.
15) New "Heavy Hitter" qdisc, from Terry Lam.
16) Significantly improve the IPSEC support in pktgen, from Fan Du.
17) Allow ipv4 tunnels to cache routes, just like sockets. From Tom
Herbert.
18) Add Proportional Integral Enhanced packet scheduler, from Vijay
Subramanian.
19) Allow openvswitch to mmap'd netlink, from Thomas Graf.
20) Key TCP metrics blobs also by source address, not just destination
address. From Christoph Paasch.
21) Support 10G in generic phylib. From Andy Fleming.
22) Try to short-circuit GRO flow compares using device provided RX
hash, if provided. From Tom Herbert.
The wireless and netfilter folks have been busy little bees too.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2064 commits)
net/cxgb4: Fix referencing freed adapter
ipv6: reallocate addrconf router for ipv6 address when lo device up
fib_frontend: fix possible NULL pointer dereference
rtnetlink: remove IFLA_BOND_SLAVE definition
rtnetlink: remove check for fill_slave_info in rtnl_have_link_slave_info
qlcnic: update version to 5.3.55
qlcnic: Enhance logic to calculate msix vectors.
qlcnic: Refactor interrupt coalescing code for all adapters.
qlcnic: Update poll controller code path
qlcnic: Interrupt code cleanup
qlcnic: Enhance Tx timeout debugging.
qlcnic: Use bool for rx_mac_learn.
bonding: fix u64 division
rtnetlink: add missing IFLA_BOND_AD_INFO_UNSPEC
sfc: Use the correct maximum TX DMA ring size for SFC9100
Add Shradha Shah as the sfc driver maintainer.
net/vxlan: Share RX skb de-marking and checksum checks with ovs
tulip: cleanup by using ARRAY_SIZE()
ip_tunnel: clear IPCB in ip_tunnel_xmit() in case dst_link_failure() is called
net/cxgb4: Don't retrieve stats during recovery
...
Fix the below "make W=1" build warning:
drivers/infiniband/core/iwcm.c: In function ‘destroy_cm_id’:
drivers/infiniband/core/iwcm.c:330: warning: variable ‘ret’ set but not used
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
If addr is not a linklocal address, the code incorrectly fails to
return and ends up assigning the scope ID to the scope id of the
address, which is wrong. Fix by checking if it's a link local address
first, and immediately return 0 if not.
Signed-off-by: Somnath Kotur <somnath.kotur@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Add the missing unlock before return from function cm_init_qp_rtr_attr()
in the error handling case.
Fixes: dd5f03beb4 ("IB/core: Ethernet L2 attributes in verbs/cm structures")
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Roland Dreier <roland@purestorage.com>
IP based addressing introduces the usage of rdma_addr_find_dmac_by_grh()
within ib_core. Since this function is declared in ib_addr, ib_addr
should be a part of the core INFINIBAND modules, rather than
INFINIBAND_ADDR_TRANS.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Existing user space applications provide only IBoE L3 address
attributes to the kernel when they issue a modify QP modify. To work
with them and let such apps (plus kernel consumers which don't use the
RDMA-CM) keep working transparently under the IBoE GID IP addressing
changes, add an Eth L2 address resolution helper.
Signed-off-by: Moni Shoua <monis@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Currently, the IB core and specifically the RDMA-CM assumes that IBoE
(RoCE) gids encode related Ethernet netdevice interface MAC address
and possibly VLAN id.
Change GIDs to be treated as they encode interface IP address.
Since Ethernet layer 2 address parameters are not longer encoded
within gids, we have to extend the Infiniband address structures (e.g.
ib_ah_attr) with layer 2 address parameters, namely mac and vlan.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Add the complementary RDMA_NODE_USNIC_UDP for RDMA_TRANSPORT_USNIC_UDP.
Signed-off-by: Upinder Malhi <umalhi@cisco.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This patch removes the net_random and net_srandom macros and replaces
them with direct calls to the prandom ones. As new commits only seem to
use prandom_u32 there is no use to keep them around.
This change makes it easier to grep for users of prandom_u32.
Signed-off-by: Aruna-Hewapathirane <aruna.hewapathirane@gmail.com>
Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch add the support for Ethernet L2 attributes in the
verbs/cm/cma structures.
When dealing with L2 Ethernet, we should use smac, dmac, vlan ID and priority
in a similar manner that the IB L2 (and the L4 PKEY) attributes are used.
Thus, those attributes were added to the following structures:
* ib_ah_attr - added dmac
* ib_qp_attr - added smac and vlan_id, (sl remains vlan priority)
* ib_wc - added smac, vlan_id
* ib_sa_path_rec - added smac, dmac, vlan_id
* cm_av - added smac and vlan_id
For the path record structure, extra care was taken to avoid the new
fields when packing it into wire format, so we don't break the IB CM
and SA wire protocol.
On the active side, the CM fills. its internal structures from the
path provided by the ULP. We add there taking the ETH L2 attributes
and placing them into the CM Address Handle (struct cm_av).
On the passive side, the CM fills its internal structures from the WC
associated with the REQ message. We add there taking the ETH L2
attributes from the WC.
When the HW driver provides the required ETH L2 attributes in the WC,
they set the IB_WC_WITH_SMAC and IB_WC_WITH_VLAN flags. The IB core
code checks for the presence of these flags, and in their absence does
address resolution from the ib_init_ah_from_wc() helper function.
ib_modify_qp_is_ok is also updated to consider the link layer. Some
parameters are mandatory for Ethernet link layer, while they are
irrelevant for IB. Vendor drivers are modified to support the new
function signature.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Add RDMA_TRANSPORT_USNIC_UDP which will be used by usNIC.
Signed-off-by: Upinder Malhi <umalhi@cisco.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This patch adds a check on the output buffer with access_ok(VERIFY_WRITE, ...)
to ensure the whole buffer is in userspace memory before using the
pointer in uverbs functions. If the buffer or a subset of it is not
valid, returns -EFAULT to the caller.
This will also catch invalid buffer before the final call to
copy_to_user() which happen late in most uverb functions.
Just like the check in read(2) syscall, it's a sanity check to detect
invalid parameters provided by userspace. This particular check was added
in vfs_read() by Linus Torvalds for v2.6.12 with following commit message:
https://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/?id=fd770e66c9a65b14ce114e171266cf6f393df502
Make read/write always do the full "access_ok()" tests.
The actual user copy will do them too, but only for the
range that ends up being actually copied. That hides
bugs when the range has been clamped by file size or other
issues.
Note: there's no need to check input buffer since vfs_write() already does
access_ok(VERIFY_READ, ...) as part of write() syscall.
Link: http://marc.info/?i=cover.1387273677.git.ydroneaud@opteya.com
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Since ib_copy_from_udata() doesn't check yet the available input data
length before accessing userspace memory, an explicit check of this
length is required to prevent:
- reading past the user provided buffer,
- underflow when subtracting the expected command size from the input
length.
This will ensure the newly added flow steering uverbs don't try to
process truncated commands.
Link: http://marc.info/?i=cover.1386798254.git.ydroneaud@opteya.com>
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
If the flow_spec items parsed count does not match the number of items
declared in the flow_attr command, or if not all bytes are used for
flow_spec items (eg. trailing garbage), a log message is reported and
the function leave through the error path. Unfortunately the error
code is currently not set.
This patch set error code to -EINVAL in such cases, so that the error
is reported to userspace instead of silently fail.
Link: http://marc.info/?i=cover.1386798254.git.ydroneaud@opteya.com>
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
As noted by Daniel Vetter in its article "Botching up ioctls"[1]
"Check *all* unused fields and flags and all the padding for whether
it's 0, and reject the ioctl if that's not the case. Otherwise
your nice plan for future extensions is going right down the
gutters since someone *will* submit an ioctl struct with random
stack garbage in the yet unused parts. Which then bakes in the ABI
that those fields can never be used for anything else but garbage."
It's important to ensure that reserved fields are set to known value,
so that it will be possible to use them latter to extend the ABI.
The same reasonning apply to comp_mask field present in newer uverbs
command: per commit 22878dbc91 ("IB/core: Better checking of
userspace values for receive flow steering"), unsupported values in
comp_mask are rejected.
[1] http://blog.ffwll.ch/2013/11/botching-up-ioctls.html
Link: http://marc.info/?i=cover.1386798254.git.ydroneaud@opteya.com>
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
As noted by Daniel Vetter in its article "Botching up ioctls"[1]
"Check *all* unused fields and flags and all the padding for whether
it's 0, and reject the ioctl if that's not the case. Otherwise
your nice plan for future extensions is going right down the
gutters since someone *will* submit an ioctl struct with random
stack garbage in the yet unused parts. Which then bakes in the ABI
that those fields can never be used for anything else but garbage."
It's important to ensure that reserved fields are set to known value,
so that it will be possible to use them latter to extend the ABI.
The same reasonning apply to comp_mask field present in newer uverbs
command: per commit 22878dbc91 ("IB/core: Better checking of
userspace values for receive flow steering"), unsupported values in
comp_mask are rejected.
[1] http://blog.ffwll.ch/2013/11/botching-up-ioctls.html
Link: http://marc.info/?i=cover.1386798254.git.ydroneaud@opteya.com>
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Trying to have a ternary operator to choose between NULL (or 0) and the
real pointer value in invocations leads to an impossible choice between
a sparse error about a literal 0 used as a NULL pointer, and a gcc
warning about "pointer/integer type mismatch in conditional expression."
Rather than clutter the source with more casts, move the ternary
operator into a new INIT_UDATA_BUF_OR_NULL() macro, which makes it
easier to use and simplifies its callers.
Reported-by: Yann Droneaud <ydroneaud@opteya.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Userspace input buffer is not modified by kernel, so it can be 'const'.
This is also a prerequisite to remove the implicit cast
from INIT_UDATA().
Link: http://marc.info/?i=cover.1386798254.git.ydroneaud@opteya.com>
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
rem_ref() calls iwcm_deref_id(), which will wake up any blockers on
cm_id_priv->destroy_comp if the refcnt hits 0. That will unblock
someone in iw_destroy_cm_id() which will free the cmid. If that
happens before rem_ref() calls test_bit(IWCM_F_CALLBACK_DESTROY,
&cm_id_priv->flags), then the test_bit() will touch freed memory.
The fix is to read the bit first, then deref. We should never be in
iw_destroy_cm_id() with IWCM_F_CALLBACK_DESTROY set, and there is a
BUG_ON() to make sure of that.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
- Re-enable flow steering verbs with new improved userspace ABI
- Fixes for slow connection due to GID lookup scalability
- IPoIB fixes
- Many fixes to HW drivers including mlx4, mlx5, ocrdma and qib
- Further improvements to SRP error handling
- Add new transport type for Cisco usNIC
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.14 (GNU/Linux)
iQIcBAABCAAGBQJSil7BAAoJEENa44ZhAt0hbtgP/A+AmUalbOX6ZKzuOFxsrtY2
r55CX9b1JBeFM/Zhn2o6y+81lpCjkckJSggESMe4izNgocGw0nW4vYGN4SBynatj
y8sR9OSn+G3ihuENrzG41MJUGEa5WbcNMy4boN+Oa+qyTlV/WjLR7Fv4WbikK7Wm
o8FNlXiiDhMoGfHHG5J0MD0EQsnxuLDk2XP+ciu4tLtTs+wBka+gFK8WnMvztle3
gTeMNna5ilvCS2fdBxteuPA3KeDnJE9AgJSMJ2a4Rh+DR8uTgWYQ6n7amjmOc546
yhAKkoBkxPE10+Yj82WOPhCFxSeWcuSwJvpgv5dTVZ1XqUUcC1V3TEcZDHmyyHQ7
uPXgS1A+erBW3OYPBjZqtKvnHObscV12fL+rId3vIhcAQIbFroci08ZwPidEYRkn
fvwlEKcrIsBIpRXEyjlFCxsiiDnfq1wC1VayMR3jrIK0P6idf1SXf/geiRp9+RGT
wKUc0j51jvEx29qc65xuhEP9FQV9pCMxyd+FEE0d0KkjMz5hsIkjmcUcBbgF0CGg
GEyDPlgRLv+vmWDGpT8XraaV/0CJOEQDIgB4WSN87/AZ4UoNt7spW2xqsLsp1toy
5e0100tpWUleTPLe/Wig5GtBdagQ2jAUK1+186CP93pFPtkwc4/7X3hyp7qPIPTz
VDvT9DEy6zjSMCLpMcdo
=xxC+
-----END PGP SIGNATURE-----
Merge tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband
Pull infiniband/rdma updates from Roland Dreier:
- Re-enable flow steering verbs with new improved userspace ABI
- Fixes for slow connection due to GID lookup scalability
- IPoIB fixes
- Many fixes to HW drivers including mlx4, mlx5, ocrdma and qib
- Further improvements to SRP error handling
- Add new transport type for Cisco usNIC
* tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (66 commits)
IB/core: Re-enable create_flow/destroy_flow uverbs
IB/core: extended command: an improved infrastructure for uverbs commands
IB/core: Remove ib_uverbs_flow_spec structure from userspace
IB/core: Use a common header for uverbs flow_specs
IB/core: Make uverbs flow structure use names like verbs ones
IB/core: Rename 'flow' structs to match other uverbs structs
IB/core: clarify overflow/underflow checks on ib_create/destroy_flow
IB/ucma: Convert use of typedef ctl_table to struct ctl_table
IB/cm: Convert to using idr_alloc_cyclic()
IB/mlx5: Fix page shift in create CQ for userspace
IB/mlx4: Fix device max capabilities check
IB/mlx5: Fix list_del of empty list
IB/mlx5: Remove dead code
IB/core: Encorce MR access rights rules on kernel consumers
IB/mlx4: Fix endless loop in resize CQ
RDMA/cma: Remove unused argument and minor dead code
RDMA/ucma: Discard events for IDs not yet claimed by user space
IB/core: Add Cisco usNIC rdma node and transport types
RDMA/nes: Remove self-assignment from nes_query_qp()
IB/srp: Report receive errors correctly
...
This commit reverts commit 7afbddfae9 ("IB/core: Temporarily disable
create_flow/destroy_flow uverbs"). Since the uverbs extensions
functionality was experimental for v3.12, this patch re-enables the
support for them and flow-steering for v3.13.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Commit 400dbc9658 ("IB/core: Infrastructure for extensible uverbs
commands") added an infrastructure for extensible uverbs commands
while later commit 436f2ad05a ("IB/core: Export ib_create/destroy_flow
through uverbs") exported ib_create_flow()/ib_destroy_flow() functions
using this new infrastructure.
According to the commit 400dbc9658, the purpose of this
infrastructure is to support passing around provider (eg. hardware)
specific buffers when userspace issue commands to the kernel, so that
it would be possible to extend uverbs (eg. core) buffers independently
from the provider buffers.
But the new kernel command function prototypes were not modified to
take advantage of this extension. This issue was exposed by Roland
Dreier in a previous review[1].
So the following patch is an attempt to a revised extensible command
infrastructure.
This improved extensible command infrastructure distinguish between
core (eg. legacy)'s command/response buffers from provider
(eg. hardware)'s command/response buffers: each extended command
implementing function is given a struct ib_udata to hold core
(eg. uverbs) input and output buffers, and another struct ib_udata to
hold the hw (eg. provider) input and output buffers.
Having those buffers identified separately make it easier to increase
one buffer to support extension without having to add some code to
guess the exact size of each command/response parts: This should make
the extended functions more reliable.
Additionally, instead of relying on command identifier being greater
than IB_USER_VERBS_CMD_THRESHOLD, the proposed infrastructure rely on
unused bits in command field: on the 32 bits provided by command
field, only 6 bits are really needed to encode the identifier of
commands currently supported by the kernel. (Even using only 6 bits
leaves room for about 23 new commands).
So this patch makes use of some high order bits in command field to
store flags, leaving enough room for more command identifiers than one
will ever need (eg. 256).
The new flags are used to specify if the command should be processed
as an extended one or a legacy one. While designing the new command
format, care was taken to make usage of flags itself extensible.
Using high order bits of the commands field ensure that newer
libibverbs on older kernel will properly fail when trying to call
extended commands. On the other hand, older libibverbs on newer kernel
will never be able to issue calls to extended commands.
The extended command header includes the optional response pointer so
that output buffer length and output buffer pointer are located
together in the command, allowing proper parameters checking. This
should make implementing functions easier and safer.
Additionally the extended header ensure 64bits alignment, while making
all sizes multiple of 8 bytes, extending the maximum buffer size:
legacy extended
Maximum command buffer: 256KBytes 1024KBytes (512KBytes + 512KBytes)
Maximum response buffer: 256KBytes 1024KBytes (512KBytes + 512KBytes)
For the purpose of doing proper buffer size accounting, the headers
size are no more taken in account in "in_words".
One of the odds of the current extensible infrastructure, reading
twice the "legacy" command header, is fixed by removing the "legacy"
command header from the extended command header: they are processed as
two different parts of the command: memory is read once and
information are not duplicated: it's making clear that's an extended
command scheme and not a different command scheme.
The proposed scheme will format input (command) and output (response)
buffers this way:
- command:
legacy header +
extended header +
command data (core + hw):
+----------------------------------------+
| flags | 00 00 | command |
| in_words | out_words |
+----------------------------------------+
| response |
| response |
| provider_in_words | provider_out_words |
| padding |
+----------------------------------------+
| |
. <uverbs input> .
. (in_words * 8) .
| |
+----------------------------------------+
| |
. <provider input> .
. (provider_in_words * 8) .
| |
+----------------------------------------+
- response, if present:
+----------------------------------------+
| |
. <uverbs output space> .
. (out_words * 8) .
| |
+----------------------------------------+
| |
. <provider output space> .
. (provider_out_words * 8) .
| |
+----------------------------------------+
The overall design is to ensure that the extensible infrastructure is
itself extensible while begin more reliable with more input and bound
checking.
Note:
The unused field in the extended header would be perfect candidate to
hold the command "comp_mask" (eg. bit field used to handle
compatibility). This was suggested by Roland Dreier in a previous
review[2]. But "comp_mask" field is likely to be present in the uverb
input and/or provider input, likewise for the response, as noted by
Matan Barak[3], so it doesn't make sense to put "comp_mask" in the
header.
[1]:
http://marc.info/?i=CAL1RGDWxmM17W2o_era24A-TTDeKyoL6u3NRu_=t_dhV_ZA9MA@mail.gmail.com
[2]:
http://marc.info/?i=CAL1RGDXJtrc849M6_XNZT5xO1+ybKtLWGq6yg6LhoSsKpsmkYA@mail.gmail.com
[3]:
http://marc.info/?i=525C1149.6000701@mellanox.com
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Link: http://marc.info/?i=cover.1383773832.git.ydroneaud@opteya.com
[ Convert "ret ? ret : 0" to the equivalent "ret". - Roland ]
Signed-off-by: Roland Dreier <roland@purestorage.com>
The structure holding any types of flow_spec is of no use to
userspace. It would be wrong for userspace to do:
struct ib_uverbs_flow_spec flow_spec;
flow_spec.type = IB_FLOW_SPEC_TCP;
flow_spec.size = sizeof(flow_spec);
Instead, userspace should use the dedicated flow_spec structure for
- Ethernet : struct ib_uverbs_flow_spec_eth,
- IPv4 : struct ib_uverbs_flow_spec_ipv4,
- TCP/UDP : struct ib_uverbs_flow_spec_tcp_udp.
In other words, struct ib_uverbs_flow_spec is a "virtual" data
structure that can only be use by the kernel as an alias to the other.
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Link: http://marc.info/?i=cover.1383773832.git.ydroneaud@opteya.com
Signed-off-by: Roland Dreier <roland@purestorage.com>
This patch adds "flow" prefix to most of data structure added as part
of commit 436f2ad05a ("IB/core: Export ib_create/destroy_flow through
uverbs") to keep those names in sync with the data structures added in
commit 319a441d13 ("IB/core: Add receive flow steering support").
It's just a matter of translating 'ib_flow' to 'ib_uverbs_flow'.
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Link: http://marc.info/?i=cover.1383773832.git.ydroneaud@opteya.com
Signed-off-by: Roland Dreier <roland@purestorage.com>
Commit 436f2ad05a ("IB/core: Export ib_create/destroy_flow through
uverbs") added public data structures to support receive flow
steering. The new structs are not following the 'uverbs' pattern:
they're lacking the common prefix 'ib_uverbs'.
This patch replaces ib_kern prefix by ib_uverbs.
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Link: http://marc.info/?i=cover.1383773832.git.ydroneaud@opteya.com
Signed-off-by: Roland Dreier <roland@purestorage.com>
This patch fixes the following issues:
1. Unneeded checks were removed
2. Removed the fixed size out of flow_attr.size, thus simplifying the checks.
3. Remove a 32bit hole on 64bit systems with strict alignment in
struct ib_kern_flow_att by adding a reserved field.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This typedef is unnecessary and should just be removed.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Commit 3e6628c4b3 ("idr: introduce idr_alloc_cyclic()") adds a new
idr_alloc_cyclic() routine and converts several of these users to it.
This is just a missed one - add it.
Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Enforce the rule that when requesting remote write or atomic permissions, local
write must be indicated as well. See IB spec 11.2.8.2.
Spotted by: Hagay Abramovsky <hagaya@mellanox.com>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The dev variable is never assigned after being initialised.
Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Problem reported by Avneesh Pant <avneesh.pant@oracle.com>:
It looks like we are triggering a bug in RDMA CM/UCM interaction.
The bug specifically hits when we have an incoming connection
request and the connecting process dies BEFORE the passive end of
the connection can process the request i.e. it does not call
rdma_get_cm_event() to retrieve the initial connection event. We
were able to triage this further and have some additional
information now.
In the example below when P1 dies after issuing a connect request
as the CM id is being destroyed all outstanding connects (to P2)
are sent a reject message. We see this reject message being
received on the passive end and the appropriate CM ID created for
the initial connection message being retrieved in cm_match_req().
The problem is in the ucma_event_handler() code when this reject
message is delivered to it and the initial connect message itself
HAS NOT been delivered to the client. In fact the client has not
even called rdma_cm_get_event() at this stage so we haven't
allocated a new ctx in ucma_get_event() and updated the new
connection CM_ID to point to the new UCMA context.
This results in the reject message not being dropped in
ucma_event_handler() for the new connection request as the
(if (!ctx->uid)) block is skipped since the ctx it refers to is
the listen CM id context which does have a valid UID associated
with it (I believe the new CMID for the connection initially
uses the listen CMID -> context when it is created in
cma_new_conn_id). Thus the assumption that new events for a
connection can get dropped in ucma_event_handler() is incorrect
IF the initial connect request has not been retrieved in the
first case. We end up getting a CM Reject event on the listen CM
ID and our upper layer code asserts (in fact this event does not
even have the listen_id set as that only gets set up librdmacm
for connect requests).
The solution is to verify that the cm_id being reported in the event
is the same as the cm_id referenced by the ucma context. A mismatch
indicates that the ucma context corresponds to the listen. This fix
was validated by using a modified version of librdmacm that was able
to verify the problem and see that the reject message was indeed
dropped after this patch was applied.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This patch adds new rdma node and new rdma transport, and supporting
code used by Cisco's low latency driver called usNIC. usNIC uses its
own transport, distinct from IB and iWARP.
Signed-off-by: Upinder Malhi <umalhi@cisco.com>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
'op' is the already RDMA_NL_GET_OP() masked 'type'. No need to mask it again.
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Reviewed-by: Yann Droneaud <ydroneaud@opteya.com>
Acked-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Currently, we don't copy the immediate data from the userspace struct
to the kernel one when UD messages are being sent.
This patch makes sure that the immediate data is set correctly.
Signed-off-by: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Roland Dreier <roland@purestorage.com>
As a simple optimization that should speed up the vast majority of
connect attemps on IB devices, when we are searching for the GID of an
incoming connection in the cached GID lists of devices, search the
device that received the incoming connection request first. If we
don't find it there, then move on to other devices.
This reduces the time to perform 10,000 connections considerably.
Prior to this patch, a bad run of cmtime would look like this:
connect : 12399.26 12351.10 8609.00 1239.93
With this patch, it looks more like this:
connect : 5864.86 5799.80 8876.00 586.49
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The cma_acquire_dev function was changed by commit 3c86aa70bf
("RDMA/cm: Add RDMA CM support for IBoE devices") to use find_gid_port()
because multiport devices might have either IB or IBoE formatted gids.
The old function assumed that all ports on the same device used the
same GID format.
However, when it was changed to use find_gid_port(), we inadvertently
lost usage of the GID cache. This turned out to be a very costly
change. In our testing, each iteration through each index of the GID
table takes roughly 35us. When you have multiple devices in a system,
and the GID you are looking for is on one of the later devices, the
code loops through all of the GID indexes on all of the early devices
before it finally succeeds on the target device. This pathological
search behavior combined with 35us per GID table index retrieval
results in results such as the following from the cmtime application
that's part of the latest librdmacm git repo:
ib1:
step total ms max ms min us us / conn
create id : 29.42 0.04 1.00 2.94
bind addr : 186705.66 19.00 18556.00 18670.57
resolve addr : 41.93 9.68 619.00 4.19
resolve route: 486.93 0.48 101.00 48.69
create qp : 4021.95 6.18 330.00 402.20
connect : 68350.39 68588.17 24632.00 6835.04
disconnect : 1460.43 252.65-1862269.00 146.04
destroy : 41.16 0.04 2.00 4.12
ib0:
step total ms max ms min us us / conn
create id : 28.61 0.68 1.00 2.86
bind addr : 2178.86 2.95 201.00 217.89
resolve addr : 51.26 16.85 845.00 5.13
resolve route: 620.08 0.43 92.00 62.01
create qp : 3344.40 6.36 273.00 334.44
connect : 6435.99 6368.53 7844.00 643.60
disconnect : 5095.38 321.90 757.00 509.54
destroy : 37.13 0.02 2.00 3.71
Clearly, both the bind address and connect operations suffer
a huge penalty for being anything other than the default
GID on the first port in the system.
After applying this patch, the numbers now look like this:
ib1:
step total ms max ms min us us / conn
create id : 30.15 0.03 1.00 3.01
bind addr : 80.27 0.04 7.00 8.03
resolve addr : 43.02 13.53 589.00 4.30
resolve route: 482.90 0.45 100.00 48.29
create qp : 3986.55 5.80 330.00 398.66
connect : 7141.53 7051.29 5005.00 714.15
disconnect : 5038.85 193.63 918.00 503.88
destroy : 37.02 0.04 2.00 3.70
ib0:
step total ms max ms min us us / conn
create id : 34.27 0.05 1.00 3.43
bind addr : 26.45 0.04 1.00 2.64
resolve addr : 38.25 10.54 760.00 3.82
resolve route: 604.79 0.43 97.00 60.48
create qp : 3314.95 6.34 273.00 331.49
connect : 12399.26 12351.10 8609.00 1239.93
disconnect : 5096.76 270.72 1015.00 509.68
destroy : 37.10 0.03 2.00 3.71
It's worth noting that we still suffer a bit of a penalty on
connect to the wrong device, but the penalty is much less than
it used to be. Follow on patches deal with this penalty.
Many thanks to Neil Horman for helping to track the source of
slow function that allowed us to track down the fact that
the original patch I mentioned above backed out cache usage
and identify just how much that impacted the system.
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
On top of commit 366cddb40 "IB/rdma_cm: TOS <=> UP mapping for IBoE", add
support for case vlan egress map is used.
When the IBoE session is being set over a vlan, inherit the socket priority
to vlan priority mapping which was configured for the vlan device egress map.
Signed-off-by: Eyal Perry <eyalpe@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/usb/qmi_wwan.c
include/net/dst.h
Trivial merge conflicts, both were overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
The create_flow/destroy_flow uverbs and the associated extensions to
the user-kernel verbs ABI are under review and are too experimental to
freeze at this point.
So userspace is not exposed to experimental features and an uinstable
ABI, temporarily disable this for v3.12 (with a Kconfig option behind
staging to reenable it if desired).
The feature will be enabled after proper cleanup for v3.13.
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Link: http://marc.info/?i=cover.1381351016.git.ydroneaud@opteya.com
Link: http://marc.info/?i=cover.1381177342.git.ydroneaud@opteya.com
[ Add a Kconfig option to reenable these verbs. - Roland ]
Signed-off-by: Roland Dreier <roland@purestorage.com>
- Move sysctl_local_ports from a global variable into struct netns_ipv4.
- Modify inet_get_local_port_range to take a struct net, and update all
of the callers.
- Move the initialization of sysctl_local_ports into
sysctl_net_ipv4.c:ipv4_sysctl_init_net from inet_connection_sock.c
v2:
- Ensure indentation used tabs
- Fixed ip.h so it applies cleanly to todays net-next
v3:
- Compile fixes of strange callers of inet_get_local_port_range.
This patch now successfully passes an allmodconfig build.
Removed manual inlining of inet_get_local_port_range in ipv4_local_port_range
Originally-by: Samya <samya@twitter.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Large ocrdma HW driver update: add "fast register" work requests,
fixes, cleanups
- Add receive flow steering support for raw QPs
- Fix IPoIB neighbour race that leads to crash
- iSER updates including support for using "fast register" memory
registration
- IPv6 support for iWARP
- XRC transport fixes
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.14 (GNU/Linux)
iQIcBAABCAAGBQJSJ2drAAoJEENa44ZhAt0hpaUQAJ6EdNFh2bon9c/uHz1lw58A
DIvVniUwWGCRgpbsv/IxZDBVX3G5IKSW6U0Y3/JxMXeIF/cGOUaJZgCeKPiYNoNA
yxiEGI0ffvpWGAJUHSE2ARJKvgdjrpqGo5UzsiinEhJx3uczeZBcooosQTWDXxya
/qSWH3rARkic9abuaizkuVzdWEjWLNjyh2bWnqJ4HNZplE8RfSK4bk8QtvEmpn9Z
dNBKFFujysfHHGflLuSkFAkP1NdqlZeQ4/2uzD23p3YofbJwrJoJNFxkCfx3UUml
fjPptlhU6kZMCwSXsD24tAV8Exr/CCgmxriFIN/Xqhu4gvBELScRPPqPqFquhtXI
pP5hbG9/9P5C8BLgABe6IAvlU1lUyraR67bA78nYeKm2JpATE6T23Nx7nUgMgzRS
Ee3ZvZGZvk8BZ7zyQh8D7Aig1XOtzqA9D5nLUyEJ2aRPSnXJxyi0S3Fbn0vmOMb7
8CF+99KECG8fb4sjNAjX5Zm48+7WJd+WCd3t89oUzCbJKKpa1Chctph8VH7dEfke
JkEjOJhuAGqau4bIMZ2nZkISoTfSD7vbjPAxMPASS7fOdR7fe6AqDn9/LMAhWjpw
4Fb25ZW7rYf9+6jqA4MgMRFCTkXhR25FsqUUL9h8NaWai4F1dfVoZHbRvGIZah5n
5HQupXGquwES4y6pvHjc
=rabl
-----END PGP SIGNATURE-----
Merge tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband
Pull main batch of InfiniBand/RDMA changes from Roland Dreier:
- Large ocrdma HW driver update: add "fast register" work requests,
fixes, cleanups
- Add receive flow steering support for raw QPs
- Fix IPoIB neighbour race that leads to crash
- iSER updates including support for using "fast register" memory
registration
- IPv6 support for iWARP
- XRC transport fixes
* tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (54 commits)
RDMA/ocrdma: Fix compiler warning about int/pointer size mismatch
IB/iser: Fix redundant pointer check in dealloc flow
IB/iser: Fix possible memory leak in iser_create_frwr_pool()
IB/qib: Move COUNTER_MASK definition within qib_mad.h header guards
RDMA/ocrdma: Fix passing wrong opcode to modify_srq
RDMA/ocrdma: Fill PVID in UMC case
RDMA/ocrdma: Add ABI versioning support
RDMA/ocrdma: Consider multiple SGES in case of DPP
RDMA/ocrdma: Fix for displaying proper link speed
RDMA/ocrdma: Increase STAG array size
RDMA/ocrdma: Dont use PD 0 for userpace CQ DB
RDMA/ocrdma: FRMA code cleanup
RDMA/ocrdma: For ERX2 irrespective of Qid, num_posted offset is 24
RDMA/ocrdma: Fix to work with even a single MSI-X vector
RDMA/ocrdma: Remove the MTU check based on Ethernet MTU
RDMA/ocrdma: Add support for fast register work requests (FRWR)
RDMA/ocrdma: Create IRD queue fix
IB/core: Better checking of userspace values for receive flow steering
IB/mlx4: Add receive flow steering support
IB/core: Export ib_create/destroy_flow through uverbs
...
up with PTR_ERR_OR_ZERO(), and replacing or fixing all the usages.
This has been sitting in linux-next for a whole cycle.
Thanks,
Rusty.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJSJo+1AAoJENkgDmzRrbjxIC4QALJK95o8AUXuwUkl+2fmFkUt
hh2/PJ1vDYgk4Xt0J6hyoK7XMa0H1RkbBrROuDdsBnorMFpEsGcgdkUZte9ufoAS
97Bg+7N0KPbTB/S8vOwtW1vbERTJIVPN2uf6h1Wqm9Xc2puCh3HbMMr1AWMGu0WQ
NqY5+Zz8zecy1UOrMhEP6H1CjeQcL1w1DO6YM5ydeqlKNzAz+JMfDXriLPDwiE7+
XFPDF/O3Vtd2ckA7L70Lio7hfHwxV5U4WwFVfiwls98XB4jcZqDKIoh1r8z4SRgR
+0Rae2DN3BaOabGMr//5XdrzQVpwJTh5m2w8BAOHJvCJ9HR7Sq29UIN4u+TowZBy
L2xYo4dvFxkympwu5zEd3c7vHYWKIaqmSq5PIjr4gF/uIo2OeOTrpPIK782ZEYb7
e+qUgOEM05V9AmQZCrSZeP9u474Sj8ow3sCtWxfdRtwNfoEIcUXsNNJd/zDHlVtW
cEtXqc2xXIpcuUJQWlSaGp8fmRQjVZPzrLKYLM2m39ZcOOJbf5rzQAYS7hHPosIa
SK+YVux/+Zzi+Xo/vXq1OlM/SruCr5S7JOgCxLowoQ88vupgXME6uPyC8EO+QQ50
GsrHes5ZNLbk0uVsfcexIyojkUnyvDmmnDpv+1zdC6RgZLJQn8OXp5yNhHhnhrFT
BiHX6YFWtDDqRlVv8Q0F
=LeaW
-----END PGP SIGNATURE-----
Merge tag 'PTR_RET-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull PTR_RET() removal patches from Rusty Russell:
"PTR_RET() is a weird name, and led to some confusing usage. We ended
up with PTR_ERR_OR_ZERO(), and replacing or fixing all the usages.
This has been sitting in linux-next for a whole cycle"
[ There are still some PTR_RET users scattered about, with some of them
possibly being new, but most of them existing in Rusty's tree too. We
have that
#define PTR_RET(p) PTR_ERR_OR_ZERO(p)
thing in <linux/err.h>, so they continue to work for now - Linus ]
* tag 'PTR_RET-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
GFS2: Replace PTR_RET with PTR_ERR_OR_ZERO
Btrfs: volume: Replace PTR_RET with PTR_ERR_OR_ZERO
drm/cma: Replace PTR_RET with PTR_ERR_OR_ZERO
sh_veu: Replace PTR_RET with PTR_ERR_OR_ZERO
dma-buf: Replace PTR_RET with PTR_ERR_OR_ZERO
drivers/rtc: Replace PTR_RET with PTR_ERR_OR_ZERO
mm/oom_kill: remove weird use of ERR_PTR()/PTR_ERR().
staging/zcache: don't use PTR_RET().
remoteproc: don't use PTR_RET().
pinctrl: don't use PTR_RET().
acpi: Replace weird use of PTR_RET.
s390: Replace weird use of PTR_RET.
PTR_RET is now PTR_ERR_OR_ZERO(): Replace most.
PTR_RET is now PTR_ERR_OR_ZERO
- Don't allow unsupported comp_mask values, user should check
ibv_query_device to know which features are supported.
- Add a check in ib_uverbs_create_flow() to verify the size passed
from the user space.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Implement ib_uverbs_create_flow() and ib_uverbs_destroy_flow() to
support flow steering for user space applications.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Add infrastructure to support extended uverbs capabilities in a
forward/backward manner. Uverbs command opcodes which are based on
the verbs extensions approach should be greater or equal to
IB_USER_VERBS_CMD_THRESHOLD. They have new header format and
processed a bit differently.
Whenever a specific IB_USER_VERBS_CMD_XXX is extended, which practically means
it needs to have additional arguments, we will be able to add them without creating
a completely new IB_USER_VERBS_CMD_YYY command or bumping the uverbs ABI version.
This patch for itself doesn't provide the whole scheme which is also dependent
on adding a comp_mask field to each extended uverbs command struct.
The new header framework allows for future extension of the CMD arguments
(ib_uverbs_cmd_hdr.in_words, ib_uverbs_cmd_hdr.out_words) for an existing
new command (that is a command that supports the new uverbs command header format
suggested in this patch) w/o bumping ABI version and with maintaining backward
and formward compatibility to new and old libibverbs versions.
In the uverbs command we are passing both uverbs arguments and the provider arguments.
We split the ib_uverbs_cmd_hdr.in_words to ib_uverbs_cmd_hdr.in_words which will now carry only
uverbs input argument struct size and ib_uverbs_cmd_hdr.provider_in_words that will carry
the provider input argument size. Same goes for the response (the uverbs CMD output argument).
For example take the create_cq call and the mlx4_ib provider:
The uverbs layer gets libibverb's struct ibv_create_cq (named struct ib_uverbs_create_cq
in the kernel), mlx4_ib gets libmlx4's struct mlx4_create_cq (which includes struct
ibv_create_cq and is named struct mlx4_ib_create_cq in the kernel) and
in_words = sizeof(mlx4_create_cq)/4 .
Thus ib_uverbs_cmd_hdr.in_words carry both uverbs plus mlx4_ib input argument sizes,
where uverbs assumes it knows the size of its input argument - struct ibv_create_cq.
Now, if we wish to add a variable to struct ibv_create_cq, we can add a comp_mask field
to the struct which is basically bit field indicating which fields exists in the struct
(as done for the libibverbs API extension), but we need a way to tell what is the total
size of the struct and not assume the struct size is predefined (since we may get different
struct sizes from different user libibverbs versions). So we know at which point the
provider input argument (struct mlx4_create_cq) begins. Same goes for extending the
provider struct mlx4_create_cq. Thus we split the ib_uverbs_cmd_hdr.in_words to
ib_uverbs_cmd_hdr.in_words which will now carry only uverbs input argument struct size and
ib_uverbs_cmd_hdr.provider_in_words that will carry the provider (mlx4_ib) input argument size.
Signed-off-by: Igor Ivanov <Igor.Ivanov@itseez.com>
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The RDMA stack allows for applications to create IB_QPT_RAW_PACKET
QPs, which receive plain Ethernet packets, specifically packets that
don't carry any QPN to be matched by the receiving side. Applications
using these QPs must be provided with a method to program some
steering rule with the HW so packets arriving at the local port can be
routed to them.
This patch adds ib_create_flow(), which allow providing a flow
specification for a QP. When there's a match between the
specification and a received packet, the packet is forwarded to that
QP, in a the same way one uses ib_attach_multicast() for IB UD
multicast handling.
Flow specifications are provided as instances of struct ib_flow_spec_yyy,
which describe L2, L3 and L4 headers. Currently specs for Ethernet, IPv4,
TCP and UDP are defined. Flow specs are made of values and masks.
The input to ib_create_flow() is a struct ib_flow_attr, which contains
a few mandatory control elements and optional flow specs.
struct ib_flow_attr {
enum ib_flow_attr_type type;
u16 size;
u16 priority;
u32 flags;
u8 num_of_specs;
u8 port;
/* Following are the optional layers according to user request
* struct ib_flow_spec_yyy
* struct ib_flow_spec_zzz
*/
};
As these specs are eventually coming from user space, they are defined and
used in a way which allows adding new spec types without kernel/user ABI
change, just with a little API enhancement which defines the newly added spec.
The flow spec structures are defined with TLV (Type-Length-Value)
entries, which allows calling ib_create_flow() with a list of variable
length of optional specs.
For the actual processing of ib_flow_attr the driver uses the number
of specs and the size mandatory fields along with the TLV nature of
the specs.
Steering rules processing order is according to the domain over which
the rule is set and the rule priority. All rules set by user space
applicatations fall into the IB_FLOW_DOMAIN_USER domain, other domains
could be used by future IPoIB RFS and Ethetool flow-steering interface
implementation. Lower numerical value for the priority field means
higher priority.
The returned value from ib_create_flow() is a struct ib_flow, which
contains a database pointer (handle) provided by the HW driver to be
used when calling ib_destroy_flow().
Applications that offload TCP/IP traffic can also be written over IB
UD QPs. The ib_create_flow() / ib_destroy_flow() API is designed to
support UD QPs too. A HW driver can set IB_DEVICE_MANAGED_FLOW_STEERING
to denote support for flow steering.
The ib_flow_attr enum type supports usage of flow steering for promiscuous
and sniffer purposes:
IB_FLOW_ATTR_NORMAL - "regular" rule, steering according to rule specification
IB_FLOW_ATTR_ALL_DEFAULT - default unicast and multicast rule, receive
all Ethernet traffic which isn't steered to any QP
IB_FLOW_ATTR_MC_DEFAULT - same as IB_FLOW_ATTR_ALL_DEFAULT but only for multicast
IB_FLOW_ATTR_SNIFFER - sniffer rule, receive all port traffic
ALL_DEFAULT and MC_DEFAULT rules options are valid only for Ethernet link type.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Added reference counting mechanism for XRC target QPs between
ib_uqp_object and its ib_uxrcd_object. This prevents closing an XRC
domain that is still attached to a QP. In addition, add missing code
in ib_uverbs_destroy_srq() to handle ib_uxrcd_object reference
counting correctly when destroying an xsrq.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Fix a potential race when event occurrs on a target XRC QP and in the
middle of reporting that on its shared qps, one of them is destroyed
by user space application. Also add note for kernel consumers in
ib_verbs.h that they must not destroy the QP from within the handler.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Modify the type of local_addr and remote_addr fields in struct
iw_cm_id from struct sockaddr_in to struct sockaddr_storage to hold
IPv6 and IPv4 addresses uniformly.
Change the references of local_addr and remote_addr in cxgb4, cxgb3,
nes and amso drivers to match this. However to be able to actully run
traffic over IPv6, low-level drivers have to add code to support this.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
[ Fix unused variable warnings when INFINIBAND_NES_DEBUG not set.
- Roland ]
Signed-off-by: Roland Dreier <roland@purestorage.com>
Currently, QP1 is created using pkey index 0. This patch simply looks
for the index containing the default pkey, rather than hard-coding
pkey index 0.
This change will have no effect in native mode, since QP0 and QP1 are
created before the SM configures the port, so pkey table will still be
the default table defined by the IB Spec, in C10-123: "If non-volatile
storage is not used to hold P_Key Table contents, then if a PM
(Partition Manager) is not present, and prior to PM initialization of
the P_Key Table, the P_Key Table must act as if it contains a single
valid entry, at P_Key_ix = 0, containing the default partition
key. All other entries in the P_Key Table must be invalid."
Thus, in the native mode case, the driver will find the default pkey
at index 0 (so it will be no different than the hard-coding).
However, in SR-IOV mode, for VFs, the pkey table may be
paravirtualized, so that the VF's pkey index zero may not necessarily
be mapped to the real pkey index 0. For VFs, therefore, it is
important to find the virtual index which maps to the real default
pkey.
This commit does the following for QP1 creation:
1. Find the pkey index containing the default pkey, and use that index
if found. ib_find_pkey() returns the index of the
limited-membership default pkey (0x7FFF) if the full-member default
pkey is not in the table.
2. If neither form of the default pkey is found, use pkey index 0
(previous behavior).
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Calling cma_save_ib_info() for CM SIDR REQs results in a crash
accessing an invalid path record pointer.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
If a application is using AF_IB with a UD QP, but does not provide any
private data, we will end up accessing invalid memory. Check for this
case and handle it appropriately.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Building cma.o triggers this gcc warning:
drivers/infiniband/core/cma.c: In function ‘rdma_resolve_addr’:
drivers/infiniband/core/cma.c:465:23: warning: ‘port’ may be used uninitialized in this function [-Wmaybe-uninitialized]
drivers/infiniband/core/cma.c:426:5: note: ‘port’ was declared here
This is a false positive, as "port" will always be initialized if we're
at "found". But if we assign to "id_priv->id.port_num" directly, we can
drop "port". That will, obviously, silence gcc.
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
- AF_IB (native IB addressing) for CMA from Sean Hefty
- New mlx5 driver for Mellanox Connect-IB adapters (including post merge request fixes)
- SRP fixes from Bart Van Assche (including fix to first merge request)
- qib HW driver updates
- Resurrection of ocrdma HW driver development
- uverbs conversion to create fds with O_CLOEXEC set
- Other small changes and fixes
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABCAAGBQJR30TKAAoJEENa44ZhAt0h854P/jvAhK5u+XTM5VyjAi0DKJ7P
bWcsu+KxbOIFnjEdsYQl1mGP44gdO8GPZp7+JR5nDHDRpw9K76qy6QQiPbaF6Y8D
cZH8Xlq4hzBfElTWBkExEemPrVUUq77j03FE9TBatdLAtEyYkgrNyqr7Ys6zVwVK
ugR8nAahvnB7Jh1tsyZBBd9kfbWtXJnaGC8/Zk3Na4n4zXRAbr0DcnRF0sncTL38
VFnWbi33OQAxu5bsb2jGec/SNP3BbNwspFPjSCKqiiItRaCj13JiHhrKKvVk4RZe
hIRnPH47kjLRp2/PwBo6o+gTXZuRg48VGBx4CKUTwx1nCzPPN1iz9ZOfqUv9Qwcv
LX8mxC7QS/Yvud4KeEBsj6kotb80EkRF2KV5RkIKCxQiwetGD9127bZylC8ttxGw
2f6MzYtAGD4R4C10lO8N+59VugSg1xAvwsqz0a/jy2XyVHbI1ugQedzkB20x5WPY
51S08ABvtU9yIxIYrw2VEaa/5WN+XJ6+LpG9QBAGXdMLiCiiAe7n/YzyXI6AgwaW
Jl/uKr6H6/jEHUHKwkyqsmbpVGPhtGWu8deyr1oYvOEP4i48gcDqMQsfMcCISrQV
MeQU3hS/obykUlNeqjmMI2CXrecqSsiq0hXd4DLaSoZ2Rb4Drx2Wj6sTQLIAgL2q
GBYjHWMUpZXIFHQaH7am
=nZh8
-----END PGP SIGNATURE-----
Merge tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband
Pull InfiniBand/RDMA changes from Roland Dreier:
- AF_IB (native IB addressing) for CMA from Sean Hefty
- new mlx5 driver for Mellanox Connect-IB adapters (including post
merge request fixes)
- SRP fixes from Bart Van Assche (including fix to first merge request)
- qib HW driver updates
- resurrection of ocrdma HW driver development
- uverbs conversion to create fds with O_CLOEXEC set
- other small changes and fixes
* tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (66 commits)
mlx5: Return -EFAULT instead of -EPERM
IB/qib: Log all SDMA errors unconditionally
IB/qib: Fix module-level leak
mlx5_core: Adjust hca_cap.uar_page_sz to conform to Connect-IB spec
IB/srp: Let srp_abort() return FAST_IO_FAIL if TL offline
IB/uverbs: Use get_unused_fd_flags(O_CLOEXEC) instead of get_unused_fd()
mlx5_core: Fixes for sparse warnings
IB/mlx5: Make profile[] static in main.c
mlx5: Fix parameter type of health_handler_t
mlx5: Add driver for Mellanox Connect-IB adapters
IB/core: Add reserved values to enums for low-level driver use
IB/srp: Bump driver version and release date
IB/srp: Make HCA completion vector configurable
IB/srp: Maintain a single connection per I_T nexus
IB/srp: Fail I/O fast if target offline
IB/srp: Skip host settle delay
IB/srp: Avoid skipping srp_reset_host() after a transport error
IB/srp: Fix remove_one crash due to resource exhaustion
IB/qib: New transmitter tunning settings for Dell 1.1 backplane
IB/core: Fix error return code in add_port()
...
Pull networking updates from David Miller:
"This is a re-do of the net-next pull request for the current merge
window. The only difference from the one I made the other day is that
this has Eliezer's interface renames and the timeout handling changes
made based upon your feedback, as well as a few bug fixes that have
trickeled in.
Highlights:
1) Low latency device polling, eliminating the cost of interrupt
handling and context switches. Allows direct polling of a network
device from socket operations, such as recvmsg() and poll().
Currently ixgbe, mlx4, and bnx2x support this feature.
Full high level description, performance numbers, and design in
commit 0a4db187a9 ("Merge branch 'll_poll'")
From Eliezer Tamir.
2) With the routing cache removed, ip_check_mc_rcu() gets exercised
more than ever before in the case where we have lots of multicast
addresses. Use a hash table instead of a simple linked list, from
Eric Dumazet.
3) Add driver for Atheros CQA98xx 802.11ac wireless devices, from
Bartosz Markowski, Janusz Dziedzic, Kalle Valo, Marek Kwaczynski,
Marek Puzyniak, Michal Kazior, and Sujith Manoharan.
4) Support reporting the TUN device persist flag to userspace, from
Pavel Emelyanov.
5) Allow controlling network device VF link state using netlink, from
Rony Efraim.
6) Support GRE tunneling in openvswitch, from Pravin B Shelar.
7) Adjust SOCK_MIN_RCVBUF and SOCK_MIN_SNDBUF for modern times, from
Daniel Borkmann and Eric Dumazet.
8) Allow controlling of TCP quickack behavior on a per-route basis,
from Cong Wang.
9) Several bug fixes and improvements to vxlan from Stephen
Hemminger, Pravin B Shelar, and Mike Rapoport. In particular,
support receiving on multiple UDP ports.
10) Major cleanups, particular in the area of debugging and cookie
lifetime handline, to the SCTP protocol code. From Daniel
Borkmann.
11) Allow packets to cross network namespaces when traversing tunnel
devices. From Nicolas Dichtel.
12) Allow monitoring netlink traffic via AF_PACKET sockets, in a
manner akin to how we monitor real network traffic via ptype_all.
From Daniel Borkmann.
13) Several bug fixes and improvements for the new alx device driver,
from Johannes Berg.
14) Fix scalability issues in the netem packet scheduler's time queue,
by using an rbtree. From Eric Dumazet.
15) Several bug fixes in TCP loss recovery handling, from Yuchung
Cheng.
16) Add support for GSO segmentation of MPLS packets, from Simon
Horman.
17) Make network notifiers have a real data type for the opaque
pointer that's passed into them. Use this to properly handle
network device flag changes in arp_netdev_event(). From Jiri
Pirko and Timo Teräs.
18) Convert several drivers over to module_pci_driver(), from Peter
Huewe.
19) tcp_fixup_rcvbuf() can loop 500 times over loopback, just use a
O(1) calculation instead. From Eric Dumazet.
20) Support setting of explicit tunnel peer addresses in ipv6, just
like ipv4. From Nicolas Dichtel.
21) Protect x86 BPF JIT against spraying attacks, from Eric Dumazet.
22) Prevent a single high rate flow from overruning an individual cpu
during RX packet processing via selective flow shedding. From
Willem de Bruijn.
23) Don't use spinlocks in TCP md5 signing fast paths, from Eric
Dumazet.
24) Don't just drop GSO packets which are above the TBF scheduler's
burst limit, chop them up so they are in-bounds instead. Also
from Eric Dumazet.
25) VLAN offloads are missed when configured on top of a bridge, fix
from Vlad Yasevich.
26) Support IPV6 in ping sockets. From Lorenzo Colitti.
27) Receive flow steering targets should be updated at poll() time
too, from David Majnemer.
28) Fix several corner case regressions in PMTU/redirect handling due
to the routing cache removal, from Timo Teräs.
29) We have to be mindful of ipv4 mapped ipv6 sockets in
upd_v6_push_pending_frames(). From Hannes Frederic Sowa.
30) Fix L2TP sequence number handling bugs, from James Chapman."
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1214 commits)
drivers/net: caif: fix wrong rtnl_is_locked() usage
drivers/net: enic: release rtnl_lock on error-path
vhost-net: fix use-after-free in vhost_net_flush
net: mv643xx_eth: do not use port number as platform device id
net: sctp: confirm route during forward progress
virtio_net: fix race in RX VQ processing
virtio: support unlocked queue poll
net/cadence/macb: fix bug/typo in extracting gem_irq_read_clear bit
Documentation: Fix references to defunct linux-net@vger.kernel.org
net/fs: change busy poll time accounting
net: rename low latency sockets functions to busy poll
bridge: fix some kernel warning in multicast timer
sfc: Fix memory leak when discarding scattered packets
sit: fix tunnel update via netlink
dt:net:stmmac: Add dt specific phy reset callback support.
dt:net:stmmac: Add support to dwmac version 3.610 and 3.710
dt:net:stmmac: Allocate platform data only if its NULL.
net:stmmac: fix memleak in the open method
ipv6: rt6_check_neigh should successfully verify neigh if no NUD information are available
net: ipv6: fix wrong ping_v6_sendmsg return value
...
The macro get_unused_fd() is used to allocate a file descriptor with
default flags. Those default flags (0) can be "unsafe": O_CLOEXEC must
be used by default to not leak file descriptor across exec().
Replace calls to get_unused_fd() in uverbs with calls to
get_unused_fd_flags(O_CLOEXEC). Inheriting uverbs fds across exec()
cannot be used to do anything useful.
Based on a patch/suggestion from Yann Droneaud <ydroneaud@opteya.com>.
Signed-off-by: Roland Dreier <roland@purestorage.com>
Calling dev_set_name with a single paramter causes it to be handled as a
format string. Many callers are passing potentially dynamic string
content, so use "%s" in those cases to avoid any potential accidents,
including wrappers like device_create*() and bdi_register().
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix to return -ENOMEM in the add_port() error handling case instead of
0, as done elsewhere in this function.
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Report AF_IB source and destination addresses through netlink
interface.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Allow user space applications to join multicast groups using MGIDs
directly. MGIDs may be passed using AF_IB addresses. Since the
current multicast join command only supports addresses as large as
sockaddr_in6, define a new structure for joining addresses specified
using sockaddr_ib.
Since AF_IB allows the user to specify the qkey when resolving a
remote UD QP address, when joining the multicast group use the qkey
value, if one has been assigned.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Allow user space applications to call resolve_addr using AF_IB. To
support sockaddr_ib, we need to define a new structure capable of
handling the larger address size.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Support user space binding to addresses using AF_IB. Since
sockaddr_ib is larger than sockaddr_in6, we need to define a larger
structure when binding using AF_IB. This time we use sockaddr_storage
to cover future cases.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Several commands into the RDMA CM from user space are restricted to
supporting addresses which fit into a sockaddr_in6 structure: bind
address, resolve address, and join multicast.
With the addition of AF_IB, we need to support addresses which are
larger than sockaddr_in6. This will be done by adding new commands
that exchange address information using sockaddr_storage. However, to
support existing applications, we maintain the current commands and
structures, but rename them to indicate that they only support IPv4
and v6 addresses.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Part of address resolution is mapping IP addresses to IB GIDs. With
the changes to support querying larger addresses and more path records,
also provide a way to query IB GIDs after resolution completes.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Allow the rdma_ucm to query the IB service ID formed or allocated by
the rdma_cm by exporting the cma_get_service_id() functionality.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The current query_route call can return up to two path records. The
assumption being that one is the primary path, with optional support
for an alternate path. In both cases, the paths are assumed to be
reversible and are used to send CM MADs.
With the ability to manually set IB path data, the rdma cm can
eventually be capable of using up to 6 paths per connection:
forward primary, reverse primary,
forward alternate, reverse alternate,
reversible primary path for CM MADs
reversible alternate path for CM MADs.
(It is unclear at this time if IB routing will complicate this) In
order to handle more flexible routing topologies, add a new command to
report any number of paths.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Allow converting from struct ib_sa_path_rec to the IB defined SA path
record wire format. This will be used to report path data from the
rdma cm into user space.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The sockaddr structure for AF_IB is larger than sockaddr_in6. The
rdma cm user space ABI uses the latter to exchange address information
between user space and the kernel.
To support querying for larger addresses, define a new query command
that exchanges data using sockaddr_storage, rather than sockaddr_in6.
Unlike the existing query_route command, the new command only returns
address information. Route (i.e. path record) data is separated.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
If an rdma_cm_id is bound to AF_IB, with a wild card address, only
listen on IB devices.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Allow the user to specify the qkey when using AF_IB. The qkey is
added to struct rdma_ucm_conn_param in place of a reserved field, but
for backwards compatability, is only accessed if the associated
rdma_cm_id is using AF_IB.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
If the source or destination address is AF_IB, then do not reserve a
portion of the private data in the IB CM REQ or SIDR REQ messages for
the cma header. Instead, all private data should be exported to the
user. When AF_IB is used, the rdma cm does not have sufficient
information to fill in the cma header. Additionally, this will be
necessary to support any IB connection through the rdma cm interface,
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
With the removal of SDP related code, we can merge cma_get_net_info()
with cma_save_net_info(), since we're only ever dealing with a single
header format.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The SDP protocol was never merged upstream. Remove unused SDP related
code from the RDMA CM.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
cma_get_service_id() forms the service ID based on the port space and
port number of the rdma_cm_id. Extend the call to support AF_IB,
which contains the service ID directly. This will be needed to
support any arbitrary SID.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Allow rdma_resolve_route() to handle the case where the user specified
the source and destination addresses using AF_IB.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Allow the user to specify the remote address using AF_IB format. When
AF_IB is used, the remote address simply needs to be recorded, and no
resolution using ARP is done. The local address may still need to be
matched with a local IB device.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
If a user specifies AF_IB as the source address for a loopback
connection, limit the resolution to IB devices only.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Provide inline helpers to extract source and destination address data
from the rdma_cm_id.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
cma_resolve_loopback is called after an rdma_cm_id has been
bound to a specific sa_family and port. Once the
source sa_family for the id has been set, do not modify it.
Only the actual IP address portion of the source address
needs to be set.
As part of this fix, we can simplify setting the source address
by moving the loopback address assignment from cma_resolve_loopback
to cma_bind_loopback. cma_bind_loopback is only invoked when
the source address is the loopback address.
Finally, add loopback support for AF_IB as part of the change.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Modify rdma_bind_addr to allow the user to specify AF_IB when binding
to a device. AF_IB indicates that the user is not mapping an IP
address to the native IB addressing. (The mapping may have already
been done, or is not needed)
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The AF_IB uses a 64-bit service id (SID), which the user can control
through the use of a mask. The rdma_cm will assign values to the
unmasked portions of the SID based on the selected port space and port
number.
Because the IB spec divides the SID range into several regions, a
SID/mask combination may fall into one of the existing port space
ranges as defined by the RDMA CM IP Annex. Map the AF_IB SID to the
correct RDMA port space.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Add support for AF_IB to ip_addr_size, and rename the function to
account for the change. Give the compiler more control over whether
the call should be inline or not by moving the definition into the .c
file, removing the static inline, and exporting it.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Enhance checks for loopback and any address to support AF_IB in
addition to AF_INET and AF_INT6. This will allow future patches to
use AF_IB when binding and resolving addresses.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The rdma_cm only allows setting reuseaddr if the corresponding
rdma_cm_id is in the idle state. Allow setting this value in other
states. This brings the behavior more inline with sockets.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
So far, only net_device * could be passed along with netdevice notifier
event. This patch provides a possibility to pass custom structure
able to provide info that event listener needs to know.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
v2->v3: fix typo on simeth
shortened dev_getter
shortened notifier_info struct name
v1->v2: fix notifier_call parameter in call_netdevice_notifier()
Signed-off-by: David S. Miller <davem@davemloft.net>
The function cm_work_handler() cannot touch the cm_id after it derefs
it, because it might be freed on another concurrent thread. If there
are more work items queued for this cm_id, then we know there must be
more references because they are added when the work items are queued.
So in the while loop inside cm_work_handler(), after derefing, if the
queue is empty, then exit the function. Otherwise we know it's safe
to re-acquire the lock.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
For QPs of type IB_QPT_XRC_TGT the IB core assigns a common event
handler __ib_shared_qp_event_handler(), and the optionally supplied
event handler is stored. When the common handler is called it iterates
on all opened QPs and calles the original handlers without checking if
they are NULL. Fix that.
Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
I'm not sure why, but the hlist for each entry iterators were conceived
list_for_each_entry(pos, head, member)
The hlist ones were greedy and wanted an extra parameter:
hlist_for_each_entry(tpos, pos, head, member)
Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.
Besides the semantic patch, there was some manual work required:
- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small amount of places were using the 'node' parameter, this
was modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator
properly, so those had to be fixed up manually.
The semantic patch which is mostly the work of Peter Senna Tschudin is here:
@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
type T;
expression a,c,d,e;
identifier b;
statement S;
@@
-T b;
<+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
...+>
[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>