Currently ipoib choose cq completion vector based on port number, when HCA
only have one port, all the interface recv queue completion are bind to cq
completion vector 0.
To better distribute the load, use same method as __ib_alloc_cq_any to
choose completion vector, with the change, each interface now use
different completion vectors.
Link: https://lore.kernel.org/r/20201013074342.15867-1-jinpu.wang@cloud.ionos.com
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Reviewed-by: Gioh Kim <gi-oh.kim@cloud.ionos.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The IBTA specification has new speeds - HDR and NDR, supporting signaling
rate of 50Gb and 100Gb respectively. ethtool support of ipoib driver
translates IB speed to signaling rate. Added translation of HDR and NDR IB
types to rates of 50Gb and 100Gb ethernet speed.
Link: https://lore.kernel.org/r/20201026132904.1338526-1-leon@kernel.org
Signed-off-by: Meir Lichtinger <meirl@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Pull rdma updates from Jason Gunthorpe:
"A usual cycle for RDMA with a typical mix of driver and core subsystem
updates:
- Driver minor changes and bug fixes for mlx5, efa, rxe, vmw_pvrdma,
hns, usnic, qib, qedr, cxgb4, hns, bnxt_re
- Various rtrs fixes and updates
- Bug fix for mlx4 CM emulation for virtualization scenarios where
MRA wasn't working right
- Use tracepoints instead of pr_debug in the CM code
- Scrub the locking in ucma and cma to close more syzkaller bugs
- Use tasklet_setup in the subsystem
- Revert the idea that 'destroy' operations are not allowed to fail
at the driver level. This proved unworkable from a HW perspective.
- Revise how the umem API works so drivers make fewer mistakes using
it
- XRC support for qedr
- Convert uverbs objects RWQ and MW to new the allocation scheme
- Large queue entry sizes for hns
- Use hmm_range_fault() for mlx5 On Demand Paging
- uverbs APIs to inspect the GID table instead of sysfs
- Move some of the RDMA code for building large page SGLs into
lib/scatterlist"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (191 commits)
RDMA/ucma: Fix use after free in destroy id flow
RDMA/rxe: Handle skb_clone() failure in rxe_recv.c
RDMA/rxe: Move the definitions for rxe_av.network_type to uAPI
RDMA: Explicitly pass in the dma_device to ib_register_device
lib/scatterlist: Do not limit max_segment to PAGE_ALIGNED values
IB/mlx4: Convert rej_tmout radix-tree to XArray
RDMA/rxe: Fix bug rejecting all multicast packets
RDMA/rxe: Fix skb lifetime in rxe_rcv_mcast_pkt()
RDMA/rxe: Remove duplicate entries in struct rxe_mr
IB/hfi,rdmavt,qib,opa_vnic: Update MAINTAINERS
IB/rdmavt: Fix sizeof mismatch
MAINTAINERS: CISCO VIC LOW LATENCY NIC DRIVER
RDMA/bnxt_re: Fix sizeof mismatch for allocation of pbl_tbl.
RDMA/bnxt_re: Use rdma_umem_for_each_dma_block()
RDMA/umem: Move to allocate SG table from pages
lib/scatterlist: Add support in dynamic allocation of SG table from pages
tools/testing/scatterlist: Show errors in human readable form
tools/testing/scatterlist: Rejuvenate bit-rotten test
RDMA/ipoib: Set rtnl_link_ops for ipoib interfaces
RDMA/uverbs: Expose the new GID query API to user space
...
Report the "ipoib pkey", "mode" and "umcast" netlink attributes for every
IPoiB interface type, not just children created with 'ip link add'.
After setting the rtnl_link_ops for the parent interface, implement the
dellink() callback to block users from trying to remove it.
Fixes: 862096a8bb ("IB/ipoib: Add more rtnl_link_ops callbacks")
Link: https://lore.kernel.org/r/20201004132948.26669-1-kamalheib1@gmail.com
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Functions related to nested interface infrastructure such as
netdev_walk_all_{ upper | lower }_dev() pass both private functions
and "data" pointer to handle their own things.
At this point, the data pointer type is void *.
In order to make it easier to expand common variables and functions,
this new netdev_nested_priv structure is added.
In the following patch, a new member variable will be added into this
struct to fix the lockdep issue.
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In similar way to other IB objects, restore the ability to return error on
SRQ destroy. Strictly speaking, this change is not necessary, and provided
here to ensure a symmetrical interface like other destroy functions.
Fixes: 68e326dea1 ("RDMA: Handle SRQ allocations by IB/core")
Link: https://lore.kernel.org/r/20200907120921.476363-5-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Michael Guralnik says:
====================
This series handles IPoIB child interface creation with setting
interface's HW address.
In current implementation, lladdr requested by user is ignored and
overwritten. Child interface gets the same GID as the parent interface and
a QP number which is assigned by the underlying drivers.
In this series we fix this behavior so that user's requested address is
assigned to the newly created interface.
As specific QP number request is not supported for all vendors, QP number
requested by user will still be overwritten when this is not supported.
Behavior of creation of child interfaces through the sysfs mechanism or
without specifying a requested address, stays the same.
====================
Based on the mlx5-next branch at
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
due to dependencies.
* branch 'mlx5_ipoib_qpn':
RDMA/ipoib: Handle user-supplied address when creating child
net/mlx5: Enable QP number request when creating IPoIB underlay QP
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
ipoib_mcast_carrier_on_task() insanely open codes a rtnl_lock() such that
the only time flush_workqueue() can be called is if it also clears
IPOIB_FLAG_OPER_UP.
Thus the flush inside ipoib_flush_ah() will deadlock if it gets unlucky
enough, and lockdep doesn't help us to find it early:
CPU0 CPU1 CPU2
__ipoib_ib_dev_flush()
down_read(vlan_rwsem)
ipoib_vlan_add()
rtnl_trylock()
down_write(vlan_rwsem)
ipoib_mcast_carrier_on_task()
while (!rtnl_trylock())
msleep(20);
ipoib_flush_ah()
flush_workqueue(priv->wq)
Clean up the ah_reaper related functions and lifecycle to make sense:
- Start/Stop of the reaper should only be done in open/stop NDOs, not in
any other places
- cancel and flush of the reaper should only happen in the stop NDO.
cancel is only functional when combined with IPOIB_STOP_REAPER.
- Non-stop places were flushing the AH's just need to flush out dead AH's
synchronously and ignore the background task completely. It is fully
locked and harmless to leave running.
Which ultimately fixes the ABBA deadlock by removing the unnecessary
flush_workqueue() from the problematic place under the vlan_rwsem.
Fixes: efc82eeeae ("IB/ipoib: No longer use flush as a parameter")
Link: https://lore.kernel.org/r/20200625174219.290842-1-kamalheib1@gmail.com
Reported-by: Kamal Heib <kheib@redhat.com>
Tested-by: Kamal Heib <kheib@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Since commit 84af7a6194 ("checkpatch: kconfig: prefer 'help' over
'---help---'"), the number of '---help---' has been gradually
decreasing, but there are still more than 2400 instances.
This commit finishes the conversion. While I touched the lines,
I also fixed the indentation.
There are a variety of indentation styles found.
a) 4 spaces + '---help---'
b) 7 spaces + '---help---'
c) 8 spaces + '---help---'
d) 1 space + 1 tab + '---help---'
e) 1 tab + '---help---' (correct indentation)
f) 1 tab + 1 space + '---help---'
g) 1 tab + 2 spaces + '---help---'
In order to convert all of them to 1 tab + 'help', I ran the
following commend:
$ find . -name 'Kconfig*' | xargs sed -i 's/^[[:space:]]*---help---/\thelp/'
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Pull rdma updates from Jason Gunthorpe:
"A more active cycle than most of the recent past, with a few large,
long discussed works this time.
The RNBD block driver has been posted for nearly two years now, and
flowing through RDMA due to it also introducing a new ULP.
The removal of FMR has been a recurring discussion theme for a long
time.
And the usual smattering of features and bug fixes.
Summary:
- Various small driver bugs fixes in rxe, mlx5, hfi1, and efa
- Continuing driver cleanups in bnxt_re, hns
- Big cleanup of mlx5 QP creation flows
- More consistent use of src port and flow label when LAG is used and
a mlx5 implementation
- Additional set of cleanups for IB CM
- 'RNBD' network block driver and target. This is a network block
RDMA device specific to ionos's cloud environment. It brings strong
multipath and resiliency capabilities.
- Accelerated IPoIB for HFI1
- QP/WQ/SRQ ioctl migration for uverbs, and support for multiple
async fds
- Support for exchanging the new IBTA defiend ECE data during RDMA CM
exchanges
- Removal of the very old and insecure FMR interface from all ULPs
and drivers. FRWR should be preferred for at least a decade now"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (247 commits)
RDMA/cm: Spurious WARNING triggered in cm_destroy_id()
RDMA/mlx5: Return ECE DC support
RDMA/mlx5: Don't rely on FW to set zeros in ECE response
RDMA/mlx5: Return an error if copy_to_user fails
IB/hfi1: Use free_netdev() in hfi1_netdev_free()
RDMA/hns: Uninitialized variable in modify_qp_init_to_rtr()
RDMA/core: Move and rename trace_cm_id_create()
IB/hfi1: Fix hfi1_netdev_rx_init() error handling
RDMA: Remove 'max_map_per_fmr'
RDMA: Remove 'max_fmr'
RDMA/core: Remove FMR device ops
RDMA/rdmavt: Remove FMR memory registration
RDMA/mthca: Remove FMR support for memory registration
RDMA/mlx4: Remove FMR support for memory registration
RDMA/i40iw: Remove FMR leftovers
RDMA/bnxt_re: Remove FMR leftovers
RDMA/mlx5: Remove FMR leftovers
RDMA/core: Remove FMR pool API
RDMA/rds: Remove FMR support for memory registration
RDMA/srp: Remove support for FMR memory registration
...
When connected mode is set, and we have connected and datagram traffic in
parallel, ipoib might crash with double free of datagram skb.
The current mechanism assumes that the order in the completion queue is
the same as the order of sent packets for all QPs. Order is kept only for
specific QP, in case of mixed UD and CM traffic we have few QPs (one UD and
few CM's) in parallel.
The problem:
----------------------------------------------------------
Transmit queue:
-----------------
UD skb pointer kept in queue itself, CM skb kept in spearate queue and
uses transmit queue as a placeholder to count the number of total
transmitted packets.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 .........127
------------------------------------------------------------
NL ud1 UD2 CM1 ud3 cm2 cm3 ud4 cm4 ud5 NL NL NL ...........
------------------------------------------------------------
^ ^
tail head
Completion queue (problematic scenario) - the order not the same as in
the transmit queue:
1 2 3 4 5 6 7 8 9
------------------------------------
ud1 CM1 UD2 ud3 cm2 cm3 ud4 cm4 ud5
------------------------------------
1. CM1 'wc' processing
- skb freed in cm separate ring.
- tx_tail of transmit queue increased although UD2 is not freed.
Now driver assumes UD2 index is already freed and it could be used for
new transmitted skb.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 .........127
------------------------------------------------------------
NL NL UD2 CM1 ud3 cm2 cm3 ud4 cm4 ud5 NL NL NL ...........
------------------------------------------------------------
^ ^ ^
(Bad)tail head
(Bad - Could be used for new SKB)
In this case (due to heavy load) UD2 skb pointer could be replaced by new
transmitted packet UD_NEW, as the driver assumes its free. At this point
we will have to process two 'wc' with same index but we have only one
pointer to free.
During second attempt to free the same skb we will have NULL pointer
exception.
2. UD2 'wc' processing
- skb freed according the index we got from 'wc', but it was already
overwritten by mistake. So actually the skb that was released is the
skb of the new transmitted packet and not the original one.
3. UD_NEW 'wc' processing
- attempt to free already freed skb. NUll pointer exception.
The fix:
-----------------------------------------------------------------------
The fix is to stop using the UD ring as a placeholder for CM packets, the
cyclic ring variables tx_head and tx_tail will manage the UD tx_ring, a
new cyclic variables global_tx_head and global_tx_tail are introduced for
managing and counting the overall outstanding sent packets, then the send
queue will be stopped and waken based on these variables only.
Note that no locking is needed since global_tx_head is updated in the xmit
flow and global_tx_tail is updated in the NAPI flow only. A previous
attempt tried to use one variable to count the outstanding sent packets,
but it did not work since xmit and NAPI flows can run at the same time and
the counter will be updated wrongly. Thus, we use the same simple cyclic
head and tail scheme that we have today for the UD tx_ring.
Fixes: 2c104ea683 ("IB/ipoib: Get rid of the tx_outstanding variable in all modes")
Link: https://lore.kernel.org/r/20200527134705.480068-1-leon@kernel.org
Signed-off-by: Valentine Fatiev <valentinef@mellanox.com>
Signed-off-by: Alaa Hleihel <alaa@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Acked-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This is the prerequisite modification to the ipoib ulp to allow a
rdma netdev to obtain the default ndo ops for init/uninit/open/close.
This is accomplished by setting the netdev ops field within the
callback function passed to the netdev allocation routine which
in turn was passed into the rdma netdev allocation routine.
This allows the rdma netdev to call back into the ulp to create the
resources required for connected mode operation.
Additionally as the ulp is not re-entrant, when switching modes,
the number of real tx queues is set to 1 for the connected mode.
For datagram mode the number of real tx queues is set to the
actual number of tx queues specified at the netdev's allocation.
For the internal ulp netdev the number of tx queues defaults to 1.
It is up to the rdma netdev to specify the actual number it can support.
When the driver does not support a rdma netdev for acceleration,
(-ENOTSUPPORTED return code or the verbs function for allocation is
NULL) the ipoib ulp functions are unaffected by using the internal
netdev allocated by the ipoib ulp.
Link: https://lore.kernel.org/r/20200511160706.173205.19086.stgit@awfm-01.aw.intel.com
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Gary Leshner <Gary.S.Leshner@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Adds capability to create a qpn to be recognized as an accelerated
UD QP for ipoib.
This is accomplished by reserving 0x81 in byte[0] of the qpn as the
prefix for these qp types and reserving qpns between 0x810000 and
0x81ffff.
The hfi1 capability mask already contained a flag for the VNIC netdev.
This has been renamed and extended to include both VNIC and ipoib.
The rvt code to allocate qps now recognizes this flag and sets 0x81
into byte[0] of the qpn.
The code to allocate qpns is modified to reset the qpn numbering when it
is detected that a value is located in byte[0] for a UD QP and it is a
qpn being requested for net dev use. If it is a regular UD QP then it is
allowable to have bits set in byte[0] of the qpn and provide the
previously normal behavior.
The code to free the qpn now checks for the AIP prefix value of 0x81 and
removes it from the qpn before being freed so that the lower 16 bit
number can be reused.
This patch requires minor changes in the IB core and ipoib to facilitate
the creation of accelerated UP QPs.
Link: https://lore.kernel.org/r/20200511160607.173205.11757.stgit@awfm-01.aw.intel.com
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Gary Leshner <Gary.S.Leshner@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When a client is added it isn't allowed to fail, but all the client's have
various failure paths within their add routines.
This creates the very fringe condition where the client was added, failed
during add and didn't set the client_data. The core code will then still
call other client_data centric ops like remove(), rename(), get_nl_info(),
and get_net_dev_by_params() with NULL client_data - which is confusing and
unexpected.
If the add() callback fails, then do not call any more client ops for the
device, even remove.
Remove all the now redundant checks for NULL client_data in ops callbacks.
Update all the add() callbacks to return error codes
appropriately. EOPNOTSUPP is used for cases where the ULP does not support
the ib_device - eg because it only works with IB.
Link: https://lore.kernel.org/r/20200421172440.387069-1-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Acked-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Pull rdma updates from Jason Gunthorpe:
"The majority of the patches are cleanups, refactorings and clarity
improvements.
This cycle saw some more activity from Syzkaller, I think we are now
clean on all but one of those bugs, including the long standing and
obnoxious rdma_cm locking design defect. Continue to see many drivers
getting cleanups, with a few new user visible features.
Summary:
- Various driver updates for siw, bnxt_re, rxe, efa, mlx5, hfi1
- Lots of cleanup patches for hns
- Convert more places to use refcount
- Aggressively lock the RDMA CM code that syzkaller says isn't
working
- Work to clarify ib_cm
- Use the new ib_device lifecycle model in bnxt_re
- Fix mlx5's MR cache which seems to be failing more often with the
new ODP code
- mlx5 'dynamic uar' and 'tx steering' user interfaces"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (144 commits)
RDMA/bnxt_re: make bnxt_re_ib_init static
IB/qib: Delete struct qib_ivdev.qp_rnd
RDMA/hns: Fix uninitialized variable bug
RDMA/hns: Modify the mask of QP number for CQE of hip08
RDMA/hns: Reduce the maximum number of extend SGE per WQE
RDMA/hns: Reduce PFC frames in congestion scenarios
RDMA/mlx5: Add support for RDMA TX flow table
net/mlx5: Add support for RDMA TX steering
IB/hfi1: Call kobject_put() when kobject_init_and_add() fails
IB/hfi1: Fix memory leaks in sysfs registration and unregistration
IB/mlx5: Move to fully dynamic UAR mode once user space supports it
IB/mlx5: Limit the scope of struct mlx5_bfreg_info to mlx5_ib
IB/mlx5: Extend QP creation to get uar page index from user space
IB/mlx5: Extend CQ creation to get uar page index from user space
IB/mlx5: Expose UAR object and its alloc/destroy commands
IB/hfi1: Get rid of a warning
RDMA/hns: Remove redundant judgment of qp_type
RDMA/hns: Remove redundant assignment of wc->smac when polling cq
RDMA/hns: Remove redundant qpc setup operations
RDMA/hns: Remove meaningless prints
...
Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.
This driver did not previously reject unsupported parameters.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add ndo operation to the network driver that enables configuring
ipoib_get_vf_guid operation. The operation allows to get a VF port
and node GUIDs.
Signed-off-by: Danit Goldberg <danitg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Use accessor functions for skb fragment's page_offset instead
of direct references, in preparation for bvec conversion.
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull rdma updates from Jason Gunthorpe:
"A smaller cycle this time. Notably we see another new driver, 'Soft
iWarp', and the deletion of an ancient unused driver for nes.
- Revise and simplify the signature offload RDMA MR APIs
- More progress on hoisting object allocation boiler plate code out
of the drivers
- Driver bug fixes and revisions for hns, hfi1, efa, cxgb4, qib,
i40iw
- Tree wide cleanups: struct_size, put_user_page, xarray, rst doc
conversion
- Removal of obsolete ib_ucm chardev and nes driver
- netlink based discovery of chardevs and autoloading of the modules
providing them
- Move more of the rdamvt/hfi1 uapi to include/uapi/rdma
- New driver 'siw' for software based iWarp running on top of netdev,
much like rxe's software RoCE.
- mlx5 feature to report events in their raw devx format to userspace
- Expose per-object counters through rdma tool
- Adaptive interrupt moderation for RDMA (DIM), sharing the DIM core
from netdev"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (194 commits)
RMDA/siw: Require a 64 bit arch
RDMA/siw: Mark expected switch fall-throughs
RDMA/core: Fix -Wunused-const-variable warnings
rdma/siw: Remove set but not used variable 's'
rdma/siw: Add missing dependencies on LIBCRC32C and DMA_VIRT_OPS
RDMA/siw: Add missing rtnl_lock around access to ifa
rdma/siw: Use proper enumerated type in map_cqe_status
RDMA/siw: Remove unnecessary kthread create/destroy printouts
IB/rdmavt: Fix variable shadowing issue in rvt_create_cq
RDMA/core: Fix race when resolving IP address
RDMA/core: Make rdma_counter.h compile stand alone
IB/core: Work on the caller socket net namespace in nldev_newlink()
RDMA/rxe: Fill in wc byte_len with IB_WC_RECV_RDMA_WITH_IMM
RDMA/mlx5: Set RDMA DIM to be enabled by default
RDMA/nldev: Added configuration of RDMA dynamic interrupt moderation to netlink
RDMA/core: Provide RDMA DIM support for ULPs
linux/dim: Implement RDMA adaptive moderation (DIM)
IB/mlx5: Report correctly tag matching rendezvous capability
docs: infiniband: add it to the driver-api bookset
IB/mlx5: Implement VHCA tunnel mechanism in DEVX
...
Despite failure in ipoib_dev_init() we continue with initialization flow
and creation of child device. It causes to the situation where this child
device is added too early to parent device list.
Change the logic, so in case of failure we properly return error from
ipoib_dev_init() and add child only in success path.
Fixes: eaeb398425 ("IB/ipoib: Move init code to ndo_init")
Signed-off-by: Valentine Fatiev <valentinef@mellanox.com>
Reviewed-by: Feras Daoud <ferasda@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
For dependencies in next patches.
Resolve conflicts:
- Use uverbs_get_cleared_udata() with new cq allocation flow
- Continue to delete nes despite SPDX conflict
- Resolve list appends in mlx5_command_str()
- Use u16 for vport_rule stuff
- Resolve list appends in struct ib_client
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The InfiniBand docs are plain text with no markups. So, all we needed to
do were to add the title markups and some markup sequences in order to
properly parse tables, lists and literal blocks.
At its new index.rst, let's add a :orphan: while this is not linked to the
main index.rst file, in order to avoid build warnings.
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
in the case of IPoIB with SRIOV enabled hardware
ip link show command incorrecly prints
0 instead of a VF hardware address.
Before:
11: ib1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast
state UP mode DEFAULT group default qlen 256
link/infiniband
80:00:00:66:fe:80:00:00:00:00:00:00:24:8a:07:03:00:a4:3e:7c brd
00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
vf 0 MAC 00:00:00:00:00:00, spoof checking off, link-state disable,
trust off, query_rss off
...
After:
11: ib1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast
state UP mode DEFAULT group default qlen 256
link/infiniband
80:00:00:66:fe:80:00:00:00:00:00:00:24:8a:07:03:00:a4:3e:7c brd
00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
vf 0 link/infiniband
80:00:00:66:fe:80:00:00:00:00:00:00:24:8a:07:03:00:a4:3e:7c brd
00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff, spoof
checking off, link-state disable, trust off, query_rss off
v1->v2: just copy an address without modifing ifla_vf_mac
v2->v3: update the changelog
Signed-off-by: Denis Kirjanov <kda@linux-powerpc.org>
Acked-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The default action for unlisted tests is "not-supported", so given that
ipoib doesn't support ETH_SS_TEST, there is no need to check for it
in the case statements, just let it get caught by the default: case.
Fixes: e3614bc9dc ("IB/ipoib: Add readout of statistics using ethtool")
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add support for reporting link state for ipoib net devices.
$ ip l set dev mlx4_ib0 up
$ ethtool mlx4_ib0 | grep Link
Link detected: yes
$ ip l set dev mlx4_ib0 down
$ ethtool mlx4_ib0 | grep Link
Link detected: no
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
There are nothing to do from user side with knowledge that destroy CQ
fails.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Add SPDX license identifiers to all Make/Kconfig files which:
- Have no license information of any form
These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:
GPL-2.0-only
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Systemd triggers the following warning during IPoIB device load:
mlx5_core 0000:00:0c.0 ib0: "systemd-udevd" wants to know my dev_id.
Should it look at dev_port instead?
See Documentation/ABI/testing/sysfs-class-net for more info.
This is caused due to user space attempt to differentiate old systems
without dev_port and new systems with dev_port. In case dev_port will be
zero, the systemd will try to read dev_id instead.
There is no need to print a warning in such case, because it is valid
situation and it is needed to ensure systemd compatibility with old
kernels.
Link: https://github.com/systemd/systemd/blob/master/src/udev/udev-builtin-net_id.c#L358
Cc: <stable@vger.kernel.org> # 4.19
Fixes: f6350da41d ("IB/ipoib: Log sysfs 'dev_id' accesses from userspace")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When IPoIB receives an SM LID change event, it reacts by flushing its
path record cache and rejoining multicast groups. This is the same
behavior it performs when it receives a reregistration event. This
behavior is unnecessary as an SM may have database backup or
synchronization mechanisms which permit the SM location or LID to change
without loss of multicast membership and without impact to path records.
Both opensm and the OPA FM issue reregistration events if a new SM is
started (or restarted with a new config) or an SM event occurs which
results in loss of multicast membership records by the SM (such as
opensm failover) or the SM encounters new nodes with Active ports (such
as after joining 2 fabrics by connecting switches via ISLs). Hence this
event can be depended on as the trigger for IPoIB cache and multicast
flushing.
It appears that some drivers, such as qib, and hfi1 issue the
IB_EVENT_SM_CHANGE but other drivers such as mlx4 and mlx5 do not.
Empirical testing on Mellanox EDR using ibv_asyncwatch has confirmed
that Mellanox EDR HCAs do not generate SM change events and that opensm
does generate reregistration.
An SM LID change event is generated by the mentioned drivers to reflect
that sm_lid and/or sm_sl in the local port info has changed. The intent
of this event is to permit applications and ULPs which have a local copy
of this information (or an address handle using it) to update their
information.
The intent is that the reregistration event (caused by the SM via a bit
in Set(PortInfo)) be used to inform nodes that they need to rejoin
multicast groups, resubscribe for notices and potentially update path
records.
When an SM migrates or fails over, a SM LID change event can occur. In
response IPoIB discards path records and multicast membership and loses
connectivity until these records are restored via SA requests. In very
large fabrics, it may take minutes for the SM to be ready and for the SA
responses to be supplied. This can result in undesirable and
unnecessary IPoIB connectivity impacts. It also can result in an
unnecessary storm of SA queries from all nodes in a cluster potentially
followed by yet another storm if the SM issues the reregistration
request.
The fact the Mellanox HCAs do not even generate this event, is further
evidence that on modern IB fabrics there will be no ill side effects
from the proposed changes below to reduce the reaction by 3 kernel
components to this event. So these changes should be benign for Mellanox
IB fabrics and will benefit OPA fabrics while also making ib_core and
ULP behavor "correct" as intended by the IBTA spec and kernel RDMA event
APIs.
Address these issues by removing IB_EVENT_SM_CHANGE handling from ipoib.
IPoIB does not locally store sm_lid nor sm_sl, so it does not need to do
anything on SM LID change. IPoIB makes use of other ib_core components
to issue SA requests for it and those components correctly track SM LID
and SM LID changes.
Also in ib_core multicast handling, remove the test for
IB_EVENT_SM_CHANGE. This code is moving all multicast groups to the
error state, which will trigger rejoins. This code is used by IPoIB as
well as the connection manager and other clients of multicast groups.
This kernel module centralizes group membership status and joins since a
node can only join a given group once but multiple ULPs or applications
may want to join the same group. It makes use of the sa_query.c
component in ib_core, which correctly trackes SM LID and SL. This
component does not track SM LID nor SL itself and hence need not react
to their changes.
Similarly in the ib_core cache code remove the handling for the
IB_EVENT_SM_CHANGE. In this function. The ib_cache_update function
which is ultimately called is updating local copies of the pkey table,
gid table and lmc. It does not update nor retain sm_lid nor sm_sl. As
such it does not need to be called on an SM LID change. It technically
also does not need to be called on a reregistration. The LID_CHANGE,
PKEY_CHANGE, GID_CHANGE and port state change events (PORT_ERR,
PORT_ACTICE) should be sufficient triggers.
It is worth noting that the alternative of simply having the hfi1 and
qib drivers not generate the SM LID change event was explored. While
this would duplicate what Mellanox drivers do now, it is not the correct
behavior and removes the ability for an SM to migrate without requiring
reregistration. Since both opensm and OPA SM have mechanisms to backup
or synchronize registration information, it is desirable to let them
perform SM migrations (with LID or SL changes) without requiring
reregistration when they deem it appropriate.
Suggested-by: Todd Rimmer <todd.rimmer@intel.com>
Tested-by: Michael Brooks <michael.brooks@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Todd Rimmer <todd.rimmer@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
We have many loops iterating over all of the end port numbers on a struct
ib_device, simplify them with a for_each helper.
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Changed debug statements to use %s and __func__ instead
of hard-coded function's name.
Signed-off-by: Erez Alfasi <ereza@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Linux 5.0-rc5
Needed to merge the include/uapi changes so we have an up to date
single-tree for these files. Patches already posted are also expected to
need this for dependencies.
The function ipoib_intercept_dev_id_attr() is only used in ipoib_main.c
Fixes: f6350da41d ("IB/ipoib: Log sysfs 'dev_id' accesses from userspace")
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The following BUG was reported by kasan:
BUG: KASAN: use-after-free in ipoib_cm_tx_start+0x430/0x1390 [ib_ipoib]
Read of size 80 at addr ffff88034c30bcd0 by task kworker/u16:1/24020
Workqueue: ipoib_wq ipoib_cm_tx_start [ib_ipoib]
Call Trace:
dump_stack+0x9a/0xeb
print_address_description+0xe3/0x2e0
kasan_report+0x18a/0x2e0
? ipoib_cm_tx_start+0x430/0x1390 [ib_ipoib]
memcpy+0x1f/0x50
ipoib_cm_tx_start+0x430/0x1390 [ib_ipoib]
? kvm_clock_read+0x1f/0x30
? ipoib_cm_skb_reap+0x610/0x610 [ib_ipoib]
? __lock_is_held+0xc2/0x170
? process_one_work+0x880/0x1960
? process_one_work+0x912/0x1960
process_one_work+0x912/0x1960
? wq_pool_ids_show+0x310/0x310
? lock_acquire+0x145/0x440
worker_thread+0x87/0xbb0
? process_one_work+0x1960/0x1960
kthread+0x314/0x3d0
? kthread_create_worker_on_cpu+0xc0/0xc0
ret_from_fork+0x3a/0x50
Allocated by task 0:
kasan_kmalloc+0xa0/0xd0
kmem_cache_alloc_trace+0x168/0x3e0
path_rec_create+0xa2/0x1f0 [ib_ipoib]
ipoib_start_xmit+0xa98/0x19e0 [ib_ipoib]
dev_hard_start_xmit+0x159/0x8d0
sch_direct_xmit+0x226/0xb40
__dev_queue_xmit+0x1d63/0x2950
neigh_update+0x889/0x1770
arp_process+0xc47/0x21f0
arp_rcv+0x462/0x760
__netif_receive_skb_core+0x1546/0x2da0
netif_receive_skb_internal+0xf2/0x590
napi_gro_receive+0x28e/0x390
ipoib_ib_handle_rx_wc_rss+0x873/0x1b60 [ib_ipoib]
ipoib_rx_poll_rss+0x17d/0x320 [ib_ipoib]
net_rx_action+0x427/0xe30
__do_softirq+0x28e/0xc42
Freed by task 26680:
__kasan_slab_free+0x11d/0x160
kfree+0xf5/0x360
ipoib_flush_paths+0x532/0x9d0 [ib_ipoib]
ipoib_set_mode_rss+0x1ad/0x560 [ib_ipoib]
set_mode+0xc8/0x150 [ib_ipoib]
kernfs_fop_write+0x279/0x440
__vfs_write+0xd8/0x5c0
vfs_write+0x15e/0x470
ksys_write+0xb8/0x180
do_syscall_64+0x9b/0x420
entry_SYSCALL_64_after_hwframe+0x49/0xbe
The buggy address belongs to the object at ffff88034c30bcc8
which belongs to the cache kmalloc-512 of size 512
The buggy address is located 8 bytes inside of
512-byte region [ffff88034c30bcc8, ffff88034c30bec8)
The buggy address belongs to the page:
The following race between change mode and xmit flow is the reason for
this use-after-free:
Change mode Send packet 1 to GID XX Send packet 2 to GID XX
| | |
start | |
| | |
| | |
| Create new path for GID XX |
| and update neigh path |
| | |
| | |
| | |
flush_paths | |
| |
queue_work(cm.start_task) |
| Path for GID XX not found
| create new path
|
|
start_task runs with old
released path
There is no locking to protect the lifetime of the path through the
ipoib_cm_tx struct, so delete it entirely and always use the newly looked
up path under the priv->lock.
Fixes: 546481c281 ("IB/ipoib: Fix memory corruption in ipoib cm mode connect flow")
Signed-off-by: Feras Daoud <ferasda@mellanox.com>
Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When calling debugfs functions, there is no need to ever check the
return value. The function can work or not, but the code logic should
never do something different based on this.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Drop LIST_HEAD where the variable it declares is never used.
Commit 31c02e2157 ("IPoIB: Avoid using stale last_send counter
when reaping AHs") removed the uses, but not the declaration.
The semantic patch that fixes this problem is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
identifier x;
@@
- LIST_HEAD(x);
... when != x
// </smpl>
Fixes: 31c02e2157 ("IPoIB: Avoid using stale last_send counter when reaping AHs")
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Pull rdma updates from Jason Gunthorpe:
"This has been a fairly typical cycle, with the usual sorts of driver
updates. Several series continue to come through which improve and
modernize various parts of the core code, and we finally are starting
to get the uAPI command interface cleaned up.
- Various driver fixes for bnxt_re, cxgb3/4, hfi1, hns, i40iw, mlx4,
mlx5, qib, rxe, usnic
- Rework the entire syscall flow for uverbs to be able to run over
ioctl(). Finally getting past the historic bad choice to use
write() for command execution
- More functional coverage with the mlx5 'devx' user API
- Start of the HFI1 series for 'TID RDMA'
- SRQ support in the hns driver
- Support for new IBTA defined 2x lane widths
- A big series to consolidate all the driver function pointers into a
big struct and have drivers provide a 'static const' version of the
struct instead of open coding initialization
- New 'advise_mr' uAPI to control device caching/loading of page
tables
- Support for inline data in SRPT
- Modernize how umad uses the driver core and creates cdev's and
sysfs files
- First steps toward removing 'uobject' from the view of the drivers"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (193 commits)
RDMA/srpt: Use kmem_cache_free() instead of kfree()
RDMA/mlx5: Signedness bug in UVERBS_HANDLER()
IB/uverbs: Signedness bug in UVERBS_HANDLER()
IB/mlx5: Allocate the per-port Q counter shared when DEVX is supported
IB/umad: Start using dev_groups of class
IB/umad: Use class_groups and let core create class file
IB/umad: Refactor code to use cdev_device_add()
IB/umad: Avoid destroying device while it is accessed
IB/umad: Simplify and avoid dynamic allocation of class
IB/mlx5: Fix wrong error unwind
IB/mlx4: Remove set but not used variable 'pd'
RDMA/iwcm: Don't copy past the end of dev_name() string
IB/mlx5: Fix long EEH recover time with NVMe offloads
IB/mlx5: Simplify netdev unbinding
IB/core: Move query port to ioctl
RDMA/nldev: Expose port_cap_flags2
IB/core: uverbs copy to struct or zero helper
IB/rxe: Reuse code which sets port state
IB/rxe: Make counters thread safe
IB/mlx5: Use the correct commands for UMEM and UCTX allocation
...
Introduce a 'flags' field to destroy address handle callback and add a
flag that marks whether the callback is executed in an atomic context or
not.
This will allow drivers to wait for completion instead of polling for it
when it is allowed.
Signed-off-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>