Commit Graph

14095 Commits

Author SHA1 Message Date
Linus Torvalds
bf9ca811bb RDMA v6.8 merge window
Small cycle, with some typical driver updates
 
  - General code tidying in siw, hfi1, idrdma, usnic, hns rtrs and bnxt_re
 
  - Many small siw cleanups without an overeaching theme
 
  - Debugfs stats for hns
 
  - Fix a TX queue timeout in IPoIB and missed locking of the mcast list
 
  - Support more features of P7 devices in bnxt_re including a new work
    submission protocol
 
  - CQ interrupts for MANA
 
  - netlink stats for erdma
 
  - EFA multipath PCI support
 
  - Fix Incorrect MR invalidation in iser
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZaCQDQAKCRCFwuHvBreF
 YemEAQCTGebv0k2hbocDOmKml5awt8j9aDJX3aO7Zpfi0AYUtwEAzk+kgN4yAo+B
 Vinvpu171zry+QvmGJsXv2mtZkXH6QY=
 =HT3p
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "Small cycle, with some typical driver updates:

   - General code tidying in siw, hfi1, idrdma, usnic, hns rtrs and
     bnxt_re

   - Many small siw cleanups without an overeaching theme

   - Debugfs stats for hns

   - Fix a TX queue timeout in IPoIB and missed locking of the mcast
     list

   - Support more features of P7 devices in bnxt_re including a new work
     submission protocol

   - CQ interrupts for MANA

   - netlink stats for erdma

   - EFA multipath PCI support

   - Fix Incorrect MR invalidation in iser"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (66 commits)
  RDMA/bnxt_re: Fix error code in bnxt_re_create_cq()
  RDMA/efa: Add EFA query MR support
  IB/iser: Prevent invalidating wrong MR
  RDMA/erdma: Add hardware statistics support
  RDMA/erdma: Introduce dma pool for hardware responses of CMDQ requests
  IB/iser: iscsi_iser.h: fix kernel-doc warning and spellos
  RDMA/mana_ib: Add CQ interrupt support for RAW QP
  RDMA/mana_ib: query device capabilities
  RDMA/mana_ib: register RDMA device with GDMA
  RDMA/bnxt_re: Fix the sparse warnings
  RDMA/bnxt_re: Fix the offset for GenP7 adapters for user applications
  RDMA/bnxt_re: Share a page to expose per CQ info with userspace
  RDMA/bnxt_re: Add UAPI to share a page with user space
  IB/ipoib: Fix mcast list locking
  RDMA/mlx5: Expose register c0 for RDMA device
  net/mlx5: E-Switch, expose eswitch manager vport
  net/mlx5: Manage ICM type of SW encap
  RDMA/mlx5: Support handling of SW encap ICM area
  net/mlx5: Introduce indirect-sw-encap ICM properties
  RDMA/bnxt_re: Adds MSN table capability for Gen P7 adapters
  ...
2024-01-12 13:52:21 -08:00
Linus Torvalds
ab5f3fcb7c arm64 updates for 6.8
* for-next/cpufeature
 
   - Remove ARM64_HAS_NO_HW_PREFETCH copy_page() optimisation for ye olde
     Thunder-X machines.
   - Avoid mapping KPTI trampoline when it is not required.
   - Make CPU capability API more robust during early initialisation.
 
 * for-next/early-idreg-overrides
 
   - Remove dependencies on core kernel helpers from the early
     command-line parsing logic in preparation for moving this code
     before the kernel is mapped.
 
 * for-next/fpsimd
 
   - Restore kernel-mode fpsimd context lazily, allowing us to run fpsimd
     code sequences in the kernel with pre-emption enabled.
 
 * for-next/kbuild
 
   - Install 'vmlinuz.efi' when CONFIG_EFI_ZBOOT=y.
   - Makefile cleanups.
 
 * for-next/lpa2-prep
 
   - Preparatory work for enabling the 'LPA2' extension, which will
     introduce 52-bit virtual and physical addressing even with 4KiB
     pages (including for KVM guests).
 
 * for-next/misc
 
   - Remove dead code and fix a typo.
 
 * for-next/mm
 
   - Pass NUMA node information for IRQ stack allocations.
 
 * for-next/perf
 
   - Add perf support for the Synopsys DesignWare PCIe PMU.
   - Add support for event counting thresholds (FEAT_PMUv3_TH) introduced
     in Armv8.8.
   - Add support for i.MX8DXL SoCs to the IMX DDR PMU driver.
   - Minor PMU driver fixes and optimisations.
 
 * for-next/rip-vpipt
 
   - Remove what support we had for the obsolete VPIPT I-cache policy.
 
 * for-next/selftests
 
   - Improvements to the SVE and SME selftests.
 
 * for-next/stacktrace
 
   - Refactor kernel unwind logic so that it can used by BPF unwinding
     and, eventually, reliable backtracing.
 
 * for-next/sysregs
 
   - Update a bunch of register definitions based on the latest XML drop
     from Arm.
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmWWvKYQHHdpbGxAa2Vy
 bmVsLm9yZwAKCRC3rHDchMFjNIiTB/9agZBkEhZjP2sNDGyE4UFwawweWHkt2r8h
 WyvdwP91Z/AIsYSsGYu36J0l4pOnMKp/i6t+rt031SK4j+Q8hJYhSfDt3RvVbc0/
 Pz9D18V6cLrfq+Yxycqq9ufVdjs+m+CQ5WeLaRGmNIyEzJ/Jv/qrAN+2r603EeLP
 nq08qMZhDIQd2ZzbigCnGaNrTsVSafFfBFv1GsgDvnMZAjs1G6457A6zu+NatNUc
 +TMSG+3EawutHZZ2noXl0Ra7VOfIbVZFiUssxRPenKQByHHHR+QB2c/O1blri+dm
 XLMutvqO2/WvYGIfXO5koqZqvpVeR3zXxPwmGi5hQBsmOjtXzKd+
 =U4mo
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Will Deacon:
 "CPU features:

   - Remove ARM64_HAS_NO_HW_PREFETCH copy_page() optimisation for ye
     olde Thunder-X machines

   - Avoid mapping KPTI trampoline when it is not required

   - Make CPU capability API more robust during early initialisation

  Early idreg overrides:

   - Remove dependencies on core kernel helpers from the early
     command-line parsing logic in preparation for moving this code
     before the kernel is mapped

  FPsimd:

   - Restore kernel-mode fpsimd context lazily, allowing us to run
     fpsimd code sequences in the kernel with pre-emption enabled

  KBuild:

   - Install 'vmlinuz.efi' when CONFIG_EFI_ZBOOT=y

   - Makefile cleanups

  LPA2 prep:

   - Preparatory work for enabling the 'LPA2' extension, which will
     introduce 52-bit virtual and physical addressing even with 4KiB
     pages (including for KVM guests).

  Misc:

   - Remove dead code and fix a typo

  MM:

   - Pass NUMA node information for IRQ stack allocations

  Perf:

   - Add perf support for the Synopsys DesignWare PCIe PMU

   - Add support for event counting thresholds (FEAT_PMUv3_TH)
     introduced in Armv8.8

   - Add support for i.MX8DXL SoCs to the IMX DDR PMU driver.

   - Minor PMU driver fixes and optimisations

  RIP VPIPT:

   - Remove what support we had for the obsolete VPIPT I-cache policy

  Selftests:

   - Improvements to the SVE and SME selftests

  Stacktrace:

   - Refactor kernel unwind logic so that it can used by BPF unwinding
     and, eventually, reliable backtracing

  Sysregs:

   - Update a bunch of register definitions based on the latest XML drop
     from Arm"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (87 commits)
  kselftest/arm64: Don't probe the current VL for unsupported vector types
  efi/libstub: zboot: do not use $(shell ...) in cmd_copy_and_pad
  arm64: properly install vmlinuz.efi
  arm64/sysreg: Add missing system instruction definitions for FGT
  arm64/sysreg: Add missing system register definitions for FGT
  arm64/sysreg: Add missing ExtTrcBuff field definition to ID_AA64DFR0_EL1
  arm64/sysreg: Add missing Pauth_LR field definitions to ID_AA64ISAR1_EL1
  arm64: memory: remove duplicated include
  arm: perf: Fix ARCH=arm build with GCC
  arm64: Align boot cpucap handling with system cpucap handling
  arm64: Cleanup system cpucap handling
  MAINTAINERS: add maintainers for DesignWare PCIe PMU driver
  drivers/perf: add DesignWare PCIe PMU driver
  PCI: Move pci_clear_and_set_dword() helper to PCI header
  PCI: Add Alibaba Vendor ID to linux/pci_ids.h
  docs: perf: Add description for Synopsys DesignWare PCIe PMU driver
  arm64: irq: set the correct node for shadow call stack
  Revert "perf/arm_dmc620: Remove duplicate format attribute #defines"
  arm64: fpsimd: Implement lazy restore for kernel mode FPSIMD
  arm64: fpsimd: Preserve/restore kernel mode NEON at context switch
  ...
2024-01-08 16:32:09 -08:00
Linus Torvalds
c604110e66 vfs-6.8.misc
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZZUxRQAKCRCRxhvAZXjc
 ov/QAQDzvge3oQ9MEymmOiyzzcF+HhAXBr+9oEsYJjFc1p0TsgEA61gXjZo7F1jY
 KBqd6znOZCR+Waj0kIVJRAo/ISRBqQc=
 =0bRl
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.8.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
 "This contains the usual miscellaneous features, cleanups, and fixes
  for vfs and individual fses.

  Features:

   - Add Jan Kara as VFS reviewer

   - Show correct device and inode numbers in proc/<pid>/maps for vma
     files on stacked filesystems. This is now easily doable thanks to
     the backing file work from the last cycles. This comes with
     selftests

  Cleanups:

   - Remove a redundant might_sleep() from wait_on_inode()

   - Initialize pointer with NULL, not 0

   - Clarify comment on access_override_creds()

   - Rework and simplify eventfd_signal() and eventfd_signal_mask()
     helpers

   - Process aio completions in batches to avoid needless wakeups

   - Completely decouple struct mnt_idmap from namespaces. We now only
     keep the actual idmapping around and don't stash references to
     namespaces

   - Reformat maintainer entries to indicate that a given subsystem
     belongs to fs/

   - Simplify fput() for files that were never opened

   - Get rid of various pointless file helpers

   - Rename various file helpers

   - Rename struct file members after SLAB_TYPESAFE_BY_RCU switch from
     last cycle

   - Make relatime_need_update() return bool

   - Use GFP_KERNEL instead of GFP_USER when allocating superblocks

   - Replace deprecated ida_simple_*() calls with their current ida_*()
     counterparts

  Fixes:

   - Fix comments on user namespace id mapping helpers. They aren't
     kernel doc comments so they shouldn't be using /**

   - s/Retuns/Returns/g in various places

   - Add missing parameter documentation on can_move_mount_beneath()

   - Rename i_mapping->private_data to i_mapping->i_private_data

   - Fix a false-positive lockdep warning in pipe_write() for watch
     queues

   - Improve __fget_files_rcu() code generation to improve performance

   - Only notify writer that pipe resizing has finished after setting
     pipe->max_usage otherwise writers are never notified that the pipe
     has been resized and hang

   - Fix some kernel docs in hfsplus

   - s/passs/pass/g in various places

   - Fix kernel docs in ntfs

   - Fix kcalloc() arguments order reported by gcc 14

   - Fix uninitialized value in reiserfs"

* tag 'vfs-6.8.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (36 commits)
  reiserfs: fix uninit-value in comp_keys
  watch_queue: fix kcalloc() arguments order
  ntfs: dir.c: fix kernel-doc function parameter warnings
  fs: fix doc comment typo fs tree wide
  selftests/overlayfs: verify device and inode numbers in /proc/pid/maps
  fs/proc: show correct device and inode numbers in /proc/pid/maps
  eventfd: Remove usage of the deprecated ida_simple_xx() API
  fs: super: use GFP_KERNEL instead of GFP_USER for super block allocation
  fs/hfsplus: wrapper.c: fix kernel-doc warnings
  fs: add Jan Kara as reviewer
  fs/inode: Make relatime_need_update return bool
  pipe: wakeup wr_wait after setting max_usage
  file: remove __receive_fd()
  file: stop exposing receive_fd_user()
  fs: replace f_rcuhead with f_task_work
  file: remove pointless wrapper
  file: s/close_fd_get_file()/file_close_fd()/g
  Improve __fget_files_rcu() code generation (and thus __fget_light())
  file: massage cleanup of files that failed to open
  fs/pipe: Fix lockdep false-positive in watchqueue pipe_write()
  ...
2024-01-08 10:26:08 -08:00
Dan Carpenter
d24b923f1d RDMA/bnxt_re: Fix error code in bnxt_re_create_cq()
Return -ENOMEM if get_zeroed_page() fails.  Don't return success.

Fixes: e275919d96 ("RDMA/bnxt_re: Share a page to expose per CQ info with userspace")
Link: https://lore.kernel.org/r/d714306e-b7d7-4e89-b973-a9ff0f260c78@moroto.mountain
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Acked-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-01-08 11:41:49 +02:00
Michael Margolin
2307157c85 RDMA/efa: Add EFA query MR support
Add EFA driver uapi definitions and register a new query MR method that
currently returns the physical interconnects the device is using to
reach the MR. Update admin definitions and efa-abi accordingly.

Reviewed-by: Anas Mousa <anasmous@amazon.com>
Reviewed-by: Firas Jahjah <firasj@amazon.com>
Signed-off-by: Michael Margolin <mrgolin@amazon.com>
Link: https://lore.kernel.org/r/20240104095155.10676-1-mrgolin@amazon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-01-07 12:02:27 +02:00
Sergey Gorenko
2f1888281e IB/iser: Prevent invalidating wrong MR
The iser_reg_resources structure has two pointers to MR but only one
mr_valid field. The implementation assumes that we use only *sig_mr when
pi_enable is true. Otherwise, we use only *mr. However, it is only
sometimes correct. Read commands without protection information occur even
when pi_enble is true. For example, the following SCSI commands have a
Data-In buffer but never have protection information: READ CAPACITY (16),
INQUIRY, MODE SENSE(6), MAINTENANCE IN. So, we use
*sig_mr for some SCSI commands and *mr for the other SCSI commands.

In most cases, it works fine because the remote invalidation is applied.
However, there are two cases when the remote invalidation is not
applicable.
 1. Small write commands when all data is sent as an immediate.
 2. The target does not support the remote invalidation feature.

The lazy invalidation is used if the remote invalidation is impossible.
Since, at the lazy invalidation, we always invalidate the MR we want to
use, the wrong MR may be invalidated.

To fix the issue, we need a field per MR that indicates the MR needs
invalidation. Since the ib_mr structure already has such a field, let's
use ib_mr.need_inval instead of iser_reg_resources.mr_valid.

Fixes: b76a439982 ("IB/iser: Use IB_WR_REG_MR_INTEGRITY for PI handover")
Link: https://lore.kernel.org/r/20231219072311.40989-1-sergeygo@nvidia.com
Acked-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Sergey Gorenko <sergeygo@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-01-04 16:37:03 -04:00
Cheng Xu
63a43a675c RDMA/erdma: Add hardware statistics support
First, we add a new command to query hardware statistics, and then
implement two functions: ib_device_ops.alloc_hw_port_stats and
ib_device_ops.get_hw_stats to allow rdma tool can get the statistics
of erdma device.

Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
Link: https://lore.kernel.org/r/20231227084800.99091-3-chengyou@linux.alibaba.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-12-30 17:23:17 +02:00
Cheng Xu
68cf9d82f7 RDMA/erdma: Introduce dma pool for hardware responses of CMDQ requests
Hardware response, such as the result of query statistics, may be too
long to be directly accommodated within the CQE structure. To address
this, we introduce a DMA pool to hold the hardware's responses of CMDQ
requests.

Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
Link: https://lore.kernel.org/r/20231227084800.99091-2-chengyou@linux.alibaba.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-12-30 17:23:17 +02:00
Randy Dunlap
d42fafb895 IB/iser: iscsi_iser.h: fix kernel-doc warning and spellos
Drop one kernel-doc comment to prevent a warning:

iscsi_iser.h:313: warning: Excess struct member 'mr' description in 'iser_device'

and spell 2 words correctly (buffer and deferred).

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Max Gurtovoy <mgurtovoy@nvidia.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Leon Romanovsky <leonro@nvidia.com>
Cc: linux-rdma@vger.kernel.org
Link: https://lore.kernel.org/r/20231222234623.25231-1-rdunlap@infradead.org
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Acked-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-12-26 10:40:46 +02:00
Long Li
c15d7802a4 RDMA/mana_ib: Add CQ interrupt support for RAW QP
At probing time, the MANA core code allocates EQs for supporting interrupts
on Ethernet queues. The same interrupt mechanisum is used by RAW QP.

Use the same EQs for delivering interrupts on the CQ for the RAW QP.

Signed-off-by: Long Li <longli@microsoft.com>
Link: https://lore.kernel.org/r/1702692255-23640-4-git-send-email-longli@linuxonhyperv.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-12-20 10:29:03 +02:00
Long Li
2c20e20b22 RDMA/mana_ib: query device capabilities
With RDMA device registered, use it to query on hardware capabilities and
cache this information for future query requests to the driver.

Signed-off-by: Long Li <longli@microsoft.com>
Link: https://lore.kernel.org/r/1702692255-23640-3-git-send-email-longli@linuxonhyperv.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-12-20 10:25:40 +02:00
Long Li
a7f0636d22 RDMA/mana_ib: register RDMA device with GDMA
Software client needs to register with the RDMA management interface on
the SoC to access more features, including querying device capabilities
and RC queue pair.

Signed-off-by: Long Li <longli@microsoft.com>
Link: https://lore.kernel.org/r/1702692255-23640-2-git-send-email-longli@linuxonhyperv.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-12-20 10:25:40 +02:00
Selvin Xavier
82a8903a9f RDMA/bnxt_re: Fix the sparse warnings
Fix the following warnings reported

drivers/infiniband/hw/bnxt_re/qplib_rcfw.c:909:27: warning: invalid assignment: |=
drivers/infiniband/hw/bnxt_re/qplib_rcfw.c:909:27:    left side has type restricted __le16
drivers/infiniband/hw/bnxt_re/qplib_rcfw.c:909:27:    right side has type unsigned long
...
drivers/infiniband/hw/bnxt_re/qplib_fp.c:1620:44: warning: invalid assignment: |=
drivers/infiniband/hw/bnxt_re/qplib_fp.c:1620:44:    left side has type restricted __le64
drivers/infiniband/hw/bnxt_re/qplib_fp.c:1620:44:    right side has type unsigned long long

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202312200537.HoNqPL5L-lkp@intel.com/
Fixes: 07f830ae49 ("RDMA/bnxt_re: Adds MSN table capability for Gen P7 adapters")
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://lore.kernel.org/r/1703046717-8914-1-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-12-20 10:10:56 +02:00
Selvin Xavier
9248f363d0 RDMA/bnxt_re: Fix the offset for GenP7 adapters for user applications
User Doorbell page indexes start at an offset for GenP7 adapters.
Fix the offset that will be used for user doorbell page indexes.

Fixes: a62d685814 ("RDMA/bnxt_re: Update the BAR offsets")
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://lore.kernel.org/r/1702987900-5363-1-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-12-20 09:45:45 +02:00
Selvin Xavier
e275919d96 RDMA/bnxt_re: Share a page to expose per CQ info with userspace
Gen P7 adapters needs to share a toggle bits information received
in kernel driver with the user space. User space needs this
info during the request notify call back to arm the CQ.

User space application can get this page using the
UAPI routines. Library will mmap this page and get the
toggle bits to be used in the next ARM Doorbell.

Uses a hash list to map the CQ structure from the CQ ID.
CQ structure is retrieved from the hash list while the
library calls the UAPI routine to get the toggle page
mapping. Currently the full page is mapped per CQ. This
can be optimized to enable multiple CQs from the same
application share the same page and different offsets
in the page.

Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://lore.kernel.org/r/1702535484-26844-3-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-12-17 15:35:58 +02:00
Selvin Xavier
9b0a7a2cb8 RDMA/bnxt_re: Add UAPI to share a page with user space
Gen P7 adapters require to share a toggle value for CQ
and SRQ. This is received by the driver as part of
interrupt notifications and needs to be shared with the
user space. Add a new UAPI infrastructure to get the
shared page for CQ and SRQ.

Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://lore.kernel.org/r/1702535484-26844-2-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-12-17 15:35:58 +02:00
Shuai Xue
ad6534c626 PCI: Add Alibaba Vendor ID to linux/pci_ids.h
The Alibaba Vendor ID (0x1ded) is now used by Alibaba elasticRDMA ("erdma")
and will be shared with the upcoming PCIe PMU ("dwc_pcie_pmu"). Move the
Vendor ID to linux/pci_ids.h so that it can shared by several drivers
later.

Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>	# pci_ids.h
Tested-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Link: https://lore.kernel.org/r/20231208025652.87192-3-xueshuai@linux.alibaba.com
Signed-off-by: Will Deacon <will@kernel.org>
2023-12-13 13:35:41 +00:00
Daniel Vacek
4f973e211b IB/ipoib: Fix mcast list locking
Releasing the `priv->lock` while iterating the `priv->multicast_list` in
`ipoib_mcast_join_task()` opens a window for `ipoib_mcast_dev_flush()` to
remove the items while in the middle of iteration. If the mcast is removed
while the lock was dropped, the for loop spins forever resulting in a hard
lockup (as was reported on RHEL 4.18.0-372.75.1.el8_6 kernel):

    Task A (kworker/u72:2 below)       | Task B (kworker/u72:0 below)
    -----------------------------------+-----------------------------------
    ipoib_mcast_join_task(work)        | ipoib_ib_dev_flush_light(work)
      spin_lock_irq(&priv->lock)       | __ipoib_ib_dev_flush(priv, ...)
      list_for_each_entry(mcast,       | ipoib_mcast_dev_flush(dev = priv->dev)
          &priv->multicast_list, list) |
        ipoib_mcast_join(dev, mcast)   |
          spin_unlock_irq(&priv->lock) |
                                       |   spin_lock_irqsave(&priv->lock, flags)
                                       |   list_for_each_entry_safe(mcast, tmcast,
                                       |                  &priv->multicast_list, list)
                                       |     list_del(&mcast->list);
                                       |     list_add_tail(&mcast->list, &remove_list)
                                       |   spin_unlock_irqrestore(&priv->lock, flags)
          spin_lock_irq(&priv->lock)   |
                                       |   ipoib_mcast_remove_list(&remove_list)
   (Here, `mcast` is no longer on the  |     list_for_each_entry_safe(mcast, tmcast,
    `priv->multicast_list` and we keep |                            remove_list, list)
    spinning on the `remove_list` of   |  >>>  wait_for_completion(&mcast->done)
    the other thread which is blocked  |
    and the list is still valid on     |
    it's stack.)

Fix this by keeping the lock held and changing to GFP_ATOMIC to prevent
eventual sleeps.
Unfortunately we could not reproduce the lockup and confirm this fix but
based on the code review I think this fix should address such lockups.

crash> bc 31
PID: 747      TASK: ff1c6a1a007e8000  CPU: 31   COMMAND: "kworker/u72:2"
--
    [exception RIP: ipoib_mcast_join_task+0x1b1]
    RIP: ffffffffc0944ac1  RSP: ff646f199a8c7e00  RFLAGS: 00000002
    RAX: 0000000000000000  RBX: ff1c6a1a04dc82f8  RCX: 0000000000000000
                                  work (&priv->mcast_task{,.work})
    RDX: ff1c6a192d60ac68  RSI: 0000000000000286  RDI: ff1c6a1a04dc8000
           &mcast->list
    RBP: ff646f199a8c7e90   R8: ff1c699980019420   R9: ff1c6a1920c9a000
    R10: ff646f199a8c7e00  R11: ff1c6a191a7d9800  R12: ff1c6a192d60ac00
                                                         mcast
    R13: ff1c6a1d82200000  R14: ff1c6a1a04dc8000  R15: ff1c6a1a04dc82d8
           dev                    priv (&priv->lock)     &priv->multicast_list (aka head)
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #5 [ff646f199a8c7e00] ipoib_mcast_join_task+0x1b1 at ffffffffc0944ac1 [ib_ipoib]
 #6 [ff646f199a8c7e98] process_one_work+0x1a7 at ffffffff9bf10967

crash> rx ff646f199a8c7e68
ff646f199a8c7e68:  ff1c6a1a04dc82f8 <<< work = &priv->mcast_task.work

crash> list -hO ipoib_dev_priv.multicast_list ff1c6a1a04dc8000
(empty)

crash> ipoib_dev_priv.mcast_task.work.func,mcast_mutex.owner.counter ff1c6a1a04dc8000
  mcast_task.work.func = 0xffffffffc0944910 <ipoib_mcast_join_task>,
  mcast_mutex.owner.counter = 0xff1c69998efec000

crash> b 8
PID: 8        TASK: ff1c69998efec000  CPU: 33   COMMAND: "kworker/u72:0"
--
 #3 [ff646f1980153d50] wait_for_completion+0x96 at ffffffff9c7d7646
 #4 [ff646f1980153d90] ipoib_mcast_remove_list+0x56 at ffffffffc0944dc6 [ib_ipoib]
 #5 [ff646f1980153de8] ipoib_mcast_dev_flush+0x1a7 at ffffffffc09455a7 [ib_ipoib]
 #6 [ff646f1980153e58] __ipoib_ib_dev_flush+0x1a4 at ffffffffc09431a4 [ib_ipoib]
 #7 [ff646f1980153e98] process_one_work+0x1a7 at ffffffff9bf10967

crash> rx ff646f1980153e68
ff646f1980153e68:  ff1c6a1a04dc83f0 <<< work = &priv->flush_light

crash> ipoib_dev_priv.flush_light.func,broadcast ff1c6a1a04dc8000
  flush_light.func = 0xffffffffc0943820 <ipoib_ib_dev_flush_light>,
  broadcast = 0x0,

The mcast(s) on the `remove_list` (the remaining part of the ex `priv->multicast_list`):

crash> list -s ipoib_mcast.done.done ipoib_mcast.list -H ff646f1980153e10 | paste - -
ff1c6a192bd0c200          done.done = 0x0,
ff1c6a192d60ac00          done.done = 0x0,

Reported-by: Yuya Fujita-bishamonten <fj-lsoft-rh-driver@dl.jp.fujitsu.com>
Signed-off-by: Daniel Vacek <neelx@redhat.com>
Link: https://lore.kernel.org/all/20231212080746.1528802-1-neelx@redhat.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-12-12 10:28:39 +02:00
Leon Romanovsky
afcda192db Expose c0 and SW encap ICM for RDMA
These two series from Mark and Shun extend RDMA mlx5 API.

Mark's series provides c0 register used to match egress
traffic sent by local device.

Shun's series adds new type for ICM area.

Link: https://lore.kernel.org/all/cover.1701871118.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-12-12 09:04:59 +02:00
Mark Bloch
d727d27db5 RDMA/mlx5: Expose register c0 for RDMA device
This patch introduces improvements for matching egress traffic sent by the
local device. When applicable, all egress traffic from the local vport is
now tagged with the provided value. This enhancement is particularly useful
for FDB steering purposes.

The primary focus of this update is facilitating the transmission of
traffic from the hypervisor to a VF. To achieve this, one must initiate an
SQ on the hypervisor and subsequently create a rule in the FDB that matches
on the eswitch manager vport and the SQN of the aforementioned SQ.

Obtaining the SQN can be had from SQ opened, and the eswitch manager vport
match can be substituted with the register c0 value exposed by this patch.

Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://lore.kernel.org/r/aa4120a91c98ff1c44f1213388c744d4cb0324d6.1701871118.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-12-12 09:04:07 +02:00
Shun Hao
a429ec96c0 RDMA/mlx5: Support handling of SW encap ICM area
New type for this ICM area, now the user can allocate/deallocate
the new type of SW encap ICM memory, to store the encap header data
which are managed by SW.

Signed-off-by: Shun Hao <shunh@nvidia.com>
Link: https://lore.kernel.org/r/546fe43fc700240709e30acf7713ec6834d652bd.1701871118.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-12-12 09:03:57 +02:00
Selvin Xavier
07f830ae49 RDMA/bnxt_re: Adds MSN table capability for Gen P7 adapters
GenP7 HW expects an MSN table instead of PSN table. Check
for the HW retransmission capability and populate the MSN
table if HW retansmission is supported.

Signed-off-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://lore.kernel.org/r/1701946060-13931-7-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-12-11 09:56:29 +02:00
Selvin Xavier
cdae3936b2 RDMA/bnxt_re: Doorbell changes
Update the Doorbell routines to support the latest HW
definitions. Use common routine to prepare the Doorbell
key.

Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://lore.kernel.org/r/1701946060-13931-6-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-12-11 09:56:29 +02:00
Selvin Xavier
6027c20dad RDMA/bnxt_re: Get the toggle bits from CQ completions
Get the toggle bits from CQ completions. For older adapters
these values are 0.

Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://lore.kernel.org/r/1701946060-13931-5-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-12-11 09:56:28 +02:00
Selvin Xavier
880a5dd188 RDMA/bnxt_re: Update the HW interface definitions
Adds HW interface definitions to support the new
chip revision.

Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://lore.kernel.org/r/1701946060-13931-4-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-12-11 09:56:28 +02:00
Selvin Xavier
a62d685814 RDMA/bnxt_re: Update the BAR offsets
Update the BAR offsets for handling GenP7 adapters.
Use the values populated by L2 driver for getting the
Doorbell offsets.

Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://lore.kernel.org/r/1701946060-13931-3-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-12-11 09:56:28 +02:00
Selvin Xavier
1801d87b35 RDMA/bnxt_re: Support new 5760X P7 devices
Add basic support for 5760X P7 devices. Add new chip
revisions. The first version support is similar to
the existing P5 adapters. Extend the current support
for P5 adapters to P7 also.

Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://lore.kernel.org/r/1701946060-13931-2-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-12-11 09:56:28 +02:00
Chengchang Tang
288f535951 RDMA/hns: Fix memory leak in free_mr_init()
When a reserved QP fails to be created, the memory of the remaining
created reserved QPs is leaked.

Fixes: 70f9252158 ("RDMA/hns: Use the reserved loopback QPs to free MR before destroying MPT")
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://lore.kernel.org/r/20231207114231.2872104-6-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-12-07 15:09:16 +02:00
Chengchang Tang
f31683a522 RDMA/hns: Remove unnecessary checks for NULL in mtr_alloc_bufs()
ib_umem_get() never return NULL.

Fixes: 3c873161a0 ("RDMA/hns: Add support for addressing when hopnum is 0")
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://lore.kernel.org/r/20231207114231.2872104-5-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-12-07 15:09:16 +02:00
Junxian Huang
7243396aaf RDMA/hns: Add a max length of gid table
IB-core and rdma-core restrict the sgid_index specified by users,
which is uint8_t/u8 data type, to only be within the range of 0-255,
so it's meaningless to support excessively large gid_table_len.

On the other hand, ib-core creates as many sysfs gid files as
gid_table_len, most of which are not only useless because of the
reason above, but also greatly increase the traversal time of
the sysfs gid files for applications.

This patch limits the maximum length of gid table to 256.

Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://lore.kernel.org/r/20231207114231.2872104-4-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-12-07 15:09:16 +02:00
Junxian Huang
d3f4020a21 RDMA/hns: Response dmac to userspace
While creating AH, dmac is already resolved in kernel. Response dmac
to userspace so that userspace doesn't need to resolve dmac repeatedly.

Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://lore.kernel.org/r/20231207114231.2872104-3-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-12-07 15:09:16 +02:00
Chengchang Tang
95f6b40082 RDMA/hns: Rename the interrupts
Now, different devices may have the same interrupt name, which
makes it difficult for users to distinguish between these
interrupts.

Modify the naming style to be consistent with our network devices.
Before:
"hns-aeq-0"
"hns-ceq-0"
...

Now:
"hns-0000:35:00.0-aeq-0"
"hns-0000:35:00.0-ceq-0"
...

Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://lore.kernel.org/r/20231207114231.2872104-2-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-12-07 15:09:16 +02:00
Guoqing Jiang
b7a2768a1c RDMA/siw: Call orq_get_current if possible
Use orq_get_current() in siw_orq_empty().

Link: https://lore.kernel.org/r/20231203092655.28102-5-guoqing.jiang@linux.dev
Acked-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-12-04 20:13:30 -04:00
Guoqing Jiang
0b988c1bee RDMA/siw: Set qp_state in siw_query_qp
Run test_query_rc_qp against siw failed since siw didn't set qp_state
accordingly. To address it, introduce siw_qp_state_to_ib_qp_state
which convert SIW_QP_STATE_IDLE to IB_QPS_INIT which is similar as
in cxgb4.

rdma-core# ./build/bin/run_tests.py --dev siw0 tests.test_qp.QPTest.test_query_rc_qp -v
test_query_rc_qp (tests.test_qp.QPTest)
Queries an RC QP after creation. Verifies that its properties are as ... FAIL

======================================================================
FAIL: test_query_rc_qp (tests.test_qp.QPTest)
Queries an RC QP after creation. Verifies that its properties are as
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/gjiang/rdma-core/tests/test_qp.py", line 284, in test_query_rc_qp
    self.query_qp_common_test(e.IBV_QPT_RC)
  File "/home/gjiang/rdma-core/tests/test_qp.py", line 265, in query_qp_common_test
    self.verify_qp_attrs(caps, e.IBV_QPS_INIT, qp_init_attr, qp_attr)
  File "/home/gjiang/rdma-core/tests/test_qp.py", line 239, in verify_qp_attrs
    self.assertEqual(state, attr.qp_state)
AssertionError: <ibv_qp_state.IBV_QPS_INIT: 1> != 0

----------------------------------------------------------------------
Ran 1 test in 0.057s

FAILED (failures=1)

Link: https://lore.kernel.org/r/20231203092655.28102-4-guoqing.jiang@linux.dev
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Acked-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-12-04 20:13:30 -04:00
Guoqing Jiang
51ac45a663 RDMA/siw: Reduce memory usage of struct siw_rx_stream
We can reduce the memory of the struct by move some of it's
member.

Before,

	/* size: 144, cachelines: 3, members: 17 */
	/* sum members: 124, holes: 3, sum holes: 12 */
	/* sum bitfield members: 7 bits (0 bytes) */
	/* padding: 7 */
	/* bit_padding: 1 bits */

After

	/* size: 128, cachelines: 2, members: 17 */
	/* padding: 3 */
	/* bit_padding: 1 bits */

Link: https://lore.kernel.org/r/20231203092655.28102-3-guoqing.jiang@linux.dev
Acked-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-12-04 20:13:30 -04:00
Guoqing Jiang
84de14baf8 RDMA/siw: Move tx_cpu ahead
We can reduce one cacheline for the usage of struct siw_qp.

Before,

	/* size: 1928, cachelines: 31, members: 38 */
	/* sum members: 1920, holes: 2, sum holes: 8 */
	/* paddings: 4, sum paddings: 13 */
	/* forced alignments: 3 */

after

	/* size: 1920, cachelines: 30, members: 38 */
	/* paddings: 4, sum paddings: 13 */
	/* forced alignments: 3 */

Link: https://lore.kernel.org/r/20231203092655.28102-2-guoqing.jiang@linux.dev
Acked-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-12-04 20:13:30 -04:00
Shifeng Li
e3e82fcb79 RDMA/irdma: Avoid free the non-cqp_request scratch
When creating ceq_0 during probing irdma, cqp.sc_cqp will be sent as a
cqp_request to cqp->sc_cqp.sq_ring. If the request is pending when
removing the irdma driver or unplugging its aux device, cqp.sc_cqp will be
dereferenced as wrong struct in irdma_free_pending_cqp_request().

  PID: 3669   TASK: ffff88aef892c000  CPU: 28  COMMAND: "kworker/28:0"
   #0 [fffffe0000549e38] crash_nmi_callback at ffffffff810e3a34
   #1 [fffffe0000549e40] nmi_handle at ffffffff810788b2
   #2 [fffffe0000549ea0] default_do_nmi at ffffffff8107938f
   #3 [fffffe0000549eb8] do_nmi at ffffffff81079582
   #4 [fffffe0000549ef0] end_repeat_nmi at ffffffff82e016b4
      [exception RIP: native_queued_spin_lock_slowpath+1291]
      RIP: ffffffff8127e72b  RSP: ffff88aa841ef778  RFLAGS: 00000046
      RAX: 0000000000000000  RBX: ffff88b01f849700  RCX: ffffffff8127e47e
      RDX: 0000000000000000  RSI: 0000000000000004  RDI: ffffffff83857ec0
      RBP: ffff88afe3e4efc8   R8: ffffed15fc7c9dfa   R9: ffffed15fc7c9dfa
      R10: 0000000000000001  R11: ffffed15fc7c9df9  R12: 0000000000740000
      R13: ffff88b01f849708  R14: 0000000000000003  R15: ffffed1603f092e1
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
  -- <NMI exception stack> --
   #5 [ffff88aa841ef778] native_queued_spin_lock_slowpath at ffffffff8127e72b
   #6 [ffff88aa841ef7b0] _raw_spin_lock_irqsave at ffffffff82c22aa4
   #7 [ffff88aa841ef7c8] __wake_up_common_lock at ffffffff81257363
   #8 [ffff88aa841ef888] irdma_free_pending_cqp_request at ffffffffa0ba12cc [irdma]
   #9 [ffff88aa841ef958] irdma_cleanup_pending_cqp_op at ffffffffa0ba1469 [irdma]
   #10 [ffff88aa841ef9c0] irdma_ctrl_deinit_hw at ffffffffa0b2989f [irdma]
   #11 [ffff88aa841efa28] irdma_remove at ffffffffa0b252df [irdma]
   #12 [ffff88aa841efae8] auxiliary_bus_remove at ffffffff8219afdb
   #13 [ffff88aa841efb00] device_release_driver_internal at ffffffff821882e6
   #14 [ffff88aa841efb38] bus_remove_device at ffffffff82184278
   #15 [ffff88aa841efb88] device_del at ffffffff82179d23
   #16 [ffff88aa841efc48] ice_unplug_aux_dev at ffffffffa0eb1c14 [ice]
   #17 [ffff88aa841efc68] ice_service_task at ffffffffa0d88201 [ice]
   #18 [ffff88aa841efde8] process_one_work at ffffffff811c589a
   #19 [ffff88aa841efe60] worker_thread at ffffffff811c71ff
   #20 [ffff88aa841eff10] kthread at ffffffff811d87a0
   #21 [ffff88aa841eff50] ret_from_fork at ffffffff82e0022f

Fixes: 44d9e52977 ("RDMA/irdma: Implement device initialization definitions")
Link: https://lore.kernel.org/r/20231130081415.891006-1-lishifeng@sangfor.com.cn
Suggested-by: "Ismail, Mustafa" <mustafa.ismail@intel.com>
Signed-off-by: Shifeng Li <lishifeng@sangfor.com.cn>
Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-12-04 20:02:41 -04:00
Mike Marciniszyn
03769f72d6 RDMA/irdma: Fix support for 64k pages
Virtual QP and CQ require a 4K HW page size but the driver passes
PAGE_SIZE to ib_umem_find_best_pgsz() instead.

Fix this by using the appropriate 4k value in the bitmap passed to
ib_umem_find_best_pgsz().

Fixes: 693a5386ef ("RDMA/irdma: Split mr alloc and free into new functions")
Link: https://lore.kernel.org/r/20231129202143.1434-4-shiraz.saleem@intel.com
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-12-04 20:02:41 -04:00
Mike Marciniszyn
0a5ec366de RDMA/irdma: Ensure iWarp QP queue memory is OS paged aligned
The SQ is shared for between kernel and used by storing the kernel page
pointer and passing that to a kmap_atomic().

This then requires that the alignment is PAGE_SIZE aligned.

Fix by adding an iWarp specific alignment check.

Fixes: e965ef0e7b ("RDMA/irdma: Split QP handler into irdma_reg_user_mr_type_qp")
Link: https://lore.kernel.org/r/20231129202143.1434-3-shiraz.saleem@intel.com
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-12-04 20:02:41 -04:00
Mike Marciniszyn
4fbc3a52cd RDMA/core: Fix umem iterator when PAGE_SIZE is greater then HCA pgsz
64k pages introduce the situation in this diagram when the HCA 4k page
size is being used:

 +-------------------------------------------+ <--- 64k aligned VA
 |                                           |
 |              HCA 4k page                  |
 |                                           |
 +-------------------------------------------+
 |                   o                       |
 |                                           |
 |                   o                       |
 |                                           |
 |                   o                       |
 +-------------------------------------------+
 |                                           |
 |              HCA 4k page                  |
 |                                           |
 +-------------------------------------------+ <--- Live HCA page
 |OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO| <--- offset
 |                                           | <--- VA
 |                MR data                    |
 +-------------------------------------------+
 |                                           |
 |              HCA 4k page                  |
 |                                           |
 +-------------------------------------------+
 |                   o                       |
 |                                           |
 |                   o                       |
 |                                           |
 |                   o                       |
 +-------------------------------------------+
 |                                           |
 |              HCA 4k page                  |
 |                                           |
 +-------------------------------------------+

The VA addresses are coming from rdma-core in this diagram can be
arbitrary, but for 64k pages, the VA may be offset by some number of HCA
4k pages and followed by some number of HCA 4k pages.

The current iterator doesn't account for either the preceding 4k pages or
the following 4k pages.

Fix the issue by extending the ib_block_iter to contain the number of DMA
pages like comment [1] says and by using __sg_advance to start the
iterator at the first live HCA page.

The changes are contained in a parallel set of iterator start and next
functions that are umem aware and specific to umem since there is one user
of the rdma_for_each_block() without umem.

These two fixes prevents the extra pages before and after the user MR
data.

Fix the preceding pages by using the __sq_advance field to start at the
first 4k page containing MR data.

Fix the following pages by saving the number of pgsz blocks in the
iterator state and downcounting on each next.

This fix allows for the elimination of the small page crutch noted in the
Fixes.

Fixes: 10c75ccb54 ("RDMA/umem: Prevent small pages from being returned by ib_umem_find_best_pgsz()")
Link: https://lore.kernel.org/r/20231129202143.1434-2-shiraz.saleem@intel.com
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-12-04 20:02:41 -04:00
Christian Brauner
3652117f85 eventfd: simplify eventfd_signal()
Ever since the eventfd type was introduced back in 2007 in commit
e1ad7468c7 ("signal/timer/event: eventfd core") the eventfd_signal()
function only ever passed 1 as a value for @n. There's no point in
keeping that additional argument.

Link: https://lore.kernel.org/r/20231122-vfs-eventfd-signal-v2-2-bd549b14ce0c@kernel.org
Acked-by: Xu Yilun <yilun.xu@intel.com>
Acked-by: Andrew Donnellan <ajd@linux.ibm.com> # ocxl
Acked-by: Eric Farman <farman@linux.ibm.com>  # s390
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-28 14:08:38 +01:00
Jack Wang
50af5d12f7 RDMA/IPoIB: Add tx timeout work to recover queue stop situation
As we sometime run into TX timeout from IPoIB, queue seems stopped
and can't recover. Diff with Mellanox OFED show Mellanox driver
has timeout work to recover in such case.

Add TX timeout work/NAPI work to recover such case.

Also increase the watchdog_timeo to 10 seconds, so more tolerant to
error.

Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Link: https://lore.kernel.org/r/20231121130316.126364-3-jinpu.wang@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-11-26 11:32:00 +02:00
Jack Wang
753fff78f4 RDMA/IPoIB: Fix error code return in ipoib_mcast_join
Return the error code in case of ib_sa_join_multicast fail.

Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Link: https://lore.kernel.org/r/20231121130316.126364-2-jinpu.wang@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-11-26 11:31:36 +02:00
Shifeng Li
2b78832f50 RDMA/irdma: Fix UAF in irdma_sc_ccq_get_cqe_info()
When removing the irdma driver or unplugging its aux device, the ccq
queue is released before destorying the cqp_cmpl_wq queue.
But in the window, there may still be completion events for wqes. That
will cause a UAF in irdma_sc_ccq_get_cqe_info().

[34693.333191] BUG: KASAN: use-after-free in irdma_sc_ccq_get_cqe_info+0x82f/0x8c0 [irdma]
[34693.333194] Read of size 8 at addr ffff889097f80818 by task kworker/u67:1/26327
[34693.333194]
[34693.333199] CPU: 9 PID: 26327 Comm: kworker/u67:1 Kdump: loaded Tainted: G           O     --------- -t - 4.18.0 #1
[34693.333200] Hardware name: SANGFOR Inspur/NULL, BIOS 4.1.13 08/01/2016
[34693.333211] Workqueue: cqp_cmpl_wq cqp_compl_worker [irdma]
[34693.333213] Call Trace:
[34693.333220]  dump_stack+0x71/0xab
[34693.333226]  print_address_description+0x6b/0x290
[34693.333238]  ? irdma_sc_ccq_get_cqe_info+0x82f/0x8c0 [irdma]
[34693.333240]  kasan_report+0x14a/0x2b0
[34693.333251]  irdma_sc_ccq_get_cqe_info+0x82f/0x8c0 [irdma]
[34693.333264]  ? irdma_free_cqp_request+0x151/0x1e0 [irdma]
[34693.333274]  irdma_cqp_ce_handler+0x1fb/0x3b0 [irdma]
[34693.333285]  ? irdma_ctrl_init_hw+0x2c20/0x2c20 [irdma]
[34693.333290]  ? __schedule+0x836/0x1570
[34693.333293]  ? strscpy+0x83/0x180
[34693.333296]  process_one_work+0x56a/0x11f0
[34693.333298]  worker_thread+0x8f/0xf40
[34693.333301]  ? __kthread_parkme+0x78/0xf0
[34693.333303]  ? rescuer_thread+0xc50/0xc50
[34693.333305]  kthread+0x2a0/0x390
[34693.333308]  ? kthread_destroy_worker+0x90/0x90
[34693.333310]  ret_from_fork+0x1f/0x40

Fixes: 44d9e52977 ("RDMA/irdma: Implement device initialization definitions")
Signed-off-by: Shifeng Li <lishifeng1992@126.com>
Link: https://lore.kernel.org/r/20231121101236.581694-1-lishifeng1992@126.com
Acked-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-11-22 13:52:52 +02:00
Kalesh AP
422b19f7f0 RDMA/bnxt_re: Correct module description string
The word "Driver" is repeated twice in the "modinfo bnxt_re"
output description. Fix it.

Fixes: 1ac5a40479 ("RDMA/bnxt_re: Add bnxt_re RoCE driver")
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://lore.kernel.org/r/1700555387-6277-1-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-11-22 13:51:48 +02:00
Supriti Singh
640233258e RDMA/rtrs: Use %pe to print errors
While printing error, replace %ld by %pe. %pe prints a string
whereas %ld would print an error code.

Signed-off-by: Supriti Singh <supriti.singh@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Link: https://lore.kernel.org/r/20231120154146.920486-10-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-11-22 13:43:34 +02:00
Supriti Singh
e76f514dc9 RDMA/rtrs-clt: Use %pe to print errors
While printing error, replace %ld by %pe. %pe prints a string
whereas %ld would print an error code.

Signed-off-by: Supriti Singh <supriti.singh@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Link: https://lore.kernel.org/r/20231120154146.920486-9-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-11-22 13:42:59 +02:00
Jack Wang
0c8bb6eb70 RDMA/rtrs-clt: Remove the warnings for req in_use check
As we chain the WR during write request: memory registration,
rdma write, local invalidate, if only the last WR fail to send due
to send queue overrun, the server can send back the reply, while
client mark the req->in_use to false in case of error in rtrs_clt_req
when error out from rtrs_post_rdma_write_sg.

Fixes: 6a98d71dae ("RDMA/rtrs: client: main functionality")
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Reviewed-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Link: https://lore.kernel.org/r/20231120154146.920486-8-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-11-22 13:40:55 +02:00
Jack Wang
6d09f6f7d7 RDMA/rtrs-clt: Fix the max_send_wr setting
For each write request, we need Request, Response Memory Registration,
Local Invalidate.

Fixes: 6a98d71dae ("RDMA/rtrs: client: main functionality")
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Reviewed-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Link: https://lore.kernel.org/r/20231120154146.920486-7-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-11-22 13:40:55 +02:00
Md Haris Iqbal
c4d32e77fc RDMA/rtrs-srv: Destroy path files after making sure no IOs in-flight
Destroying path files may lead to the freeing of rdma_stats. This creates
the following race.

An IO is in-flight, or has just passed the session state check in
process_read/process_write. The close_work gets triggered and the function
rtrs_srv_close_work() starts and does destroy path which frees the
rdma_stats. After this the function process_read/process_write resumes and
tries to update the stats through the function rtrs_srv_update_rdma_stats

This commit solves the problem by moving the destroy path function to a
later point. This point makes sure any inflights are completed. This is
done by qp drain, and waiting for all in-flights through ops_id.

Fixes: 9cb8374804 ("RDMA/rtrs: server: main functionality")
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Santosh Kumar Pradhan <santosh.pradhan@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Link: https://lore.kernel.org/r/20231120154146.920486-6-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-11-22 13:40:55 +02:00