With write path fast memory registration, we need less memory for
each request.
With fast memory registration, we can reduce max_send_sge to save
memory usage.
Also convert the kmalloc_array to kcalloc.
Link: https://lore.kernel.org/r/20210621055340.11789-4-jinpu.wang@ionos.com
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Reviewed-by: Md Haris Iqbal <haris.iqbal@cloud.ionos.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
With fast memory registration in write path, we can reduce
the memory consumption by using less max_send_sge, support IO bigger
than 116 KB (29 segments * 4 KB) without splitting, and it also
make the IO path more symmetric.
To avoid some times MR reg failed, waiting for the invalidation to finish
before the new mr reg. Introduce a refcount, only finish the request
when both local invalidation and io reply are there.
Link: https://lore.kernel.org/r/20210621055340.11789-3-jinpu.wang@ionos.com
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Dima Stepanov <dmitrii.stepanov@ionos.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Introduce tail wr, we can send as the last wr, we want to send the local
invalidate wr after rdma wr in later patch.
While at it, also fix coding style issue.
Link: https://lore.kernel.org/r/20210621055340.11789-2-jinpu.wang@ionos.com
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Reviewed-by: Md Haris Iqbal <haris.iqbal@cloud.ionos.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Changing ucontext ABI response structure to pass wqe_mode to user library.
A flag in comp_mask has been set to indicate presence of wqe_mode.
Moved wqe-mode ABI to uapi/rdma/bnxt_re-abi.h
Link: https://lore.kernel.org/r/20210616202817.1185276-1-devesh.sharma@broadcom.com
Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Removed port validity check from ib_get_cached_subnet_prefix() as this
check is not needed because "port_num" is valid.
Link: https://lore.kernel.org/r/20210616154509.1047-2-anand.a.khoje@oracle.com
Suggested-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Anand Khoje <anand.a.khoje@oracle.com>
Signed-off-by: Haakon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Compilation with W=1 produces warnings similar to the below.
drivers/infiniband/ulp/ipoib/ipoib_main.c:320: warning: This comment
starts with '/**', but isn't a kernel-doc comment. Refer
Documentation/doc-guide/kernel-doc.rst
All such occurrences were found with the following one line
git grep -A 1 "\/\*\*" drivers/infiniband/
Link: https://lore.kernel.org/r/e57d5f4ddd08b7a19934635b44d6d632841b9ba7.1623823612.git.leonro@nvidia.com
Reviewed-by: Jack Wang <jinpu.wang@ionos.com> #rtrs
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Switch xrcd index allocation and release from hns own bitmap interface
to IDA interface.
Link: https://lore.kernel.org/r/1623325814-55737-7-git-send-email-liweihang@huawei.com
Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Switch pd index allocation and release from hns own bitmap interface
to IDA interface.
Link: https://lore.kernel.org/r/1623325814-55737-6-git-send-email-liweihang@huawei.com
Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Switch mtpt index allocation and release from hns own bitmap interface
to IDA interface.
Link: https://lore.kernel.org/r/1623325814-55737-5-git-send-email-liweihang@huawei.com
Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Round-robin (RR) is no longer used in the allocation of the bitmap table,
and all the function input parameters that use this mechanism are
BITMAP_NO_RR. The code that defines and uses the RR needs to be deleted.
Link: https://lore.kernel.org/r/1623325814-55737-4-git-send-email-liweihang@huawei.com
Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
hns_roce_bitmap_free_range() is only called inside hns_roce_bitmap_free(),
and the input parameter "cnt" is set to a constant 1. In addition, the
driver does not use alloc_range scenarios, so free_range does not need to
exist.
Link: https://lore.kernel.org/r/1623325814-55737-3-git-send-email-liweihang@huawei.com
Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Fill all QPC fileds with hr_reg_*() instead of roce_set_*(). SQPN is used
for HIP08 ES only, it should be removed.
Link: https://lore.kernel.org/r/1624262443-24528-6-git-send-email-liweihang@huawei.com
Signed-off-by: Lang Cheng <chenglang@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
In order to avoid to do bitwise operations on a boolean value, add a new
register interface to avoid sparse comlaint about "dubious: x & !y" when
calling hr_reg_write(ctx, field, !!val).
Fixes: dc50477440 ("RDMA/hns: Use new interface to set MPT related fields")
Fixes: 495c24808c ("RDMA/hns: Add XRC subtype in QPC and XRC type in SRQC")
Link: https://lore.kernel.org/r/1624262443-24528-4-git-send-email-liweihang@huawei.com
Signed-off-by: Lang Cheng <chenglang@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
GCC may reports an running time assert error when a value calculated from
ib_mtu_enum_to_int() is using as 'val' in FIELD_PREDP:
include/linux/compiler_types.h:328:38: error: call to
'__compiletime_assert_1524' declared with attribute error: FIELD_PREP:
value too large for the field
So a check is added about whether integer mtu from ib_mtu_enum_to_int() is
negative to avoid this warning.
Link: https://lore.kernel.org/r/1624262443-24528-3-git-send-email-liweihang@huawei.com
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
There is no need to use "!!" before "eq->eqe_size ==
HNS_ROCE_V3_EQE_SIZE", or sparse will complain about "dubious: x & !y".
Fixes: 782832f254 ("RDMA/hns: Simplify the function config_eqc()")
Link: https://lore.kernel.org/r/1624262443-24528-2-git-send-email-liweihang@huawei.com
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Relaxed Ordering is a capability that can only benefit users that support
it. All kernel ULPs should support Relaxed Ordering, as they are designed
to read data only after observing the CQE and use the DMA API correctly.
Hence, implicitly enable Relaxed Ordering by default for MR transfers in
kernel ULPs.
Link: https://lore.kernel.org/r/b7e820aab7402b8efa63605f4ea465831b3b1e5e.1623236426.git.leonro@nvidia.com
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Both of HIP08 and HIP09 require the extended doorbell information to be
cleared before being used.
Fixes: 6b63597d35 ("RDMA/hns: Add TSQ link table support")
Link: https://lore.kernel.org/r/1623392089-35639-1-git-send-email-liweihang@huawei.com
Signed-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Currently we only check device max_qp_wr limit for IO connection, but not
for service connection. We should check for both.
So save the max_qp_wr device limit in wr_limit, and use it for both IO
connections and service connections.
While at it, also remove an outdated comments.
Link: https://lore.kernel.org/r/20210614090337.29557-6-jinpu.wang@ionos.com
Suggested-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Gioh Kim <gi-oh.kim@ionos.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Those variables are passed to create_cq, create_qp, rtrs_iu_alloc and
rtrs_iu_free, so these *_size means the num of unit. And cq_size also
means number of cq element.
Also move the setting of cq_num to common path.
Link: https://lore.kernel.org/r/20210614090337.29557-5-jinpu.wang@ionos.com
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Reviewed-by: Md Haris Iqbal <haris.iqbal@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
When using rdma_rxe, post_one_recv() returns ENOMEM error due to the full
recv queue. This patch increase the number of WR for receive queue to
support all devices.
Link: https://lore.kernel.org/r/20210614090337.29557-4-jinpu.wang@ionos.com
Signed-off-by: Md Haris Iqbal <haris.iqbal@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Gioh Kim <gi-oh.kim@ionos.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
We use device limit max_send_sge, which is suboptimal for memory usage.
We don't need that much for User Con, 1 is enough. And for IO con,
sess->max_segments + 1 is enough
Link: https://lore.kernel.org/r/20210614090337.29557-3-jinpu.wang@ionos.com
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Currently rtrs when create_qp use a coarse numbers (bigger in general),
which leads to hardware create more resources which only waste memory with
no benefits.
For max_send_wr, we don't really need alway max_qp_wr size when creating
qp, reduce it to cq_size.
For max_recv_wr, cq_size is enough.
With the patch when sess_queue_depth=128, per session (2 paths) memory
consumption reduced from 188 MB to 65MB
When always_invalidate is enabled, we need send more wr, so treat it
special.
Fixes: 9cb8374804 ("RDMA/rtrs: server: main functionality")
Link: https://lore.kernel.org/r/20210614090337.29557-2-jinpu.wang@ionos.com
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Reviewed-by: Md Haris Iqbal <haris.iqbal@cloud.ionos.com>
Signed-off-by: Gioh Kim <gi-oh.kim@ionos.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Now that the port_groups data is being destroyed and managed by the core
code this restriction is no longer needed. All the ib_port_attrs are
compatible with the core's sysfs lifecycle.
When the main device is destroyed and moved to another namespace the
driver's port sysfs can be created/destroyed as well due to it now being a
simple attribute list.
Link: https://lore.kernel.org/r/afd8b676eace2821692d44489ff71856277c48d1.1623427137.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
init_port was only being used to register sysfs attributes against the
port kobject. Now that all users are creating static attribute_group's we
can simply set the attribute_group list in the ops and the core code can
just handle it directly.
This makes all the sysfs management quite straightforward and prevents any
driver from abusing the naked port kobject in future because no driver
code can access it.
Link: https://lore.kernel.org/r/114f68f3d921460eafe14cea5a80ca65d81729c3.1623427137.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
hfi1 should not be creating a mess of kobjects to attach to the port
kobject - this is all attributes. The proper API is to create an
attribute_group list and create it against the port's kobject.
Link: https://lore.kernel.org/r/cbe0ccb6175dd22274359b6ad803a37435a70e91.1623427137.git.leonro@nvidia.com
Tested-by: Mike Marciniszyn <mike.marciniszyn@cornelisnetworks.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
qib should not be creating a mess of kobjects to attach to the port
kobject - this is all attributes. The proper API is to create an
attribute_group list and create it against the port's kobject.
Link: https://lore.kernel.org/r/911e0031e1ed495b0006e8a6efec7b67a702cd5e.1623427137.git.leonro@nvidia.com
Tested-by: Mike Marciniszyn <mike.marciniszyn@cornelisnetworks.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
This code is trying to attach a list of counters grouped into 4 groups to
the ib_port sysfs. Instead of creating a bunch of kobjects simply express
everything naturally as an ib_port_attribute and add a single
attribute_groups list.
Remove all the naked kobject manipulations.
Link: https://lore.kernel.org/r/0d5a7241ee0fe66622de04fcbaafaf6a791d5c7c.1623427137.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Other things outside the core code are creating attributes against the
port. This patch exposes the basic machinery to do this.
The ib_port_attribute type allows creating groups of attributes attatched
to the port and comes with the usual machinery to do this.
Link: https://lore.kernel.org/r/5c4aeae57f6fa7c59a1d6d1c5506069516ae9bbf.1623427137.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
This call does nothing because the ib_port kobj is nested under a struct
device kobject and the dev_uevent_filter() function of the struct device
blocks uevents for any children kobj's that are not also struct devices.
A uevent for the struct device will be triggered after
ib_setup_port_attrs() returns which causes udev to pick up all the deep
"attributes" which are implemented as kobjects nested under a struct
device and assign them to the udev object for the struct device:
$ udevadm info -a /sys/class/infiniband/ibp0s9
ATTR{ports/1/counters/excessive_buffer_overrun_errors}=="0"
Link: https://lore.kernel.org/r/49231c92c7d4c60686de18f7e20932d0c82160ee.1623427137.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Instead of calling device_add_groups() add the group to the existing
groups array which is managed through device_add().
This requires setting up the hw_counters before device_add(), so it gets
split up from the already split port sysfs flow.
Move all the memory freeing to the release function.
Link: https://lore.kernel.org/r/666250d937b64f6fdf45da9e2dc0b6e5e4f7abd8.1623427137.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Use the same technique as gid_attrs now uses to manage the port
sysfs. Bundle everything into three allocations and use a single
sysfs_create_groups() to build everything in one shot.
All the memory is always freed in the kobj release function, removing most
of the error unwinding.
The gid_attr technique and the hw_counters are very similar, merge the two
together and combine the sysfs_create_group() call for hw_counters with
the single sysfs group setup.
Link: https://lore.kernel.org/r/b688f3340694c59f7b44b1bde40e25559ef43cf3.1623427137.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Instead of having an whole bunch of different allocations to create the
gid_attr kobjects reduce it to three, one for the kobj struct plus the
attributes, and one for the attribute list for each of the two
groups.
Move the freeing of all allocations to the release function.
Reorder the operations so all the allocations happen first then the
kobject & sysfs operations are last.
This removes the majority of the complicated error unwind since the
release function will always undo all the memory allocations. Freeing the
memory is also much simpler since there is a lot less of it.
Consolidate creating the "group of array indexes" pattern into one helper
function. Ensure kobject_del is used.
Link: https://lore.kernel.org/r/f4149d379db7178d37d11d75e3026bf550f818a1.1623427137.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The gid_attrs directory is a dedicated kobj nested under the port,
construct/destruct it with its own pair of functions for
understandability. This is much more readable than having it weirdly
inlined out of order into the add_port() function.
Link: https://lore.kernel.org/r/1c9434111b6770a7aef0e644a88a16eee7e325b8.1623427137.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
This code creates a 'struct hw_stats_attribute' for each sysfs entry that
contains a naked 'struct attribute' inside.
It then proceeds to attach this same structure to a 'struct device' kobj
and a 'struct ib_port' kobj. However, this violates the typing
requirements. 'struct device' requires the attribute to be a 'struct
device_attribute' and 'struct ib_port' requires the attribute to be
'struct port_attribute'.
This happens to work because the show/store function pointers in all three
structures happen to be at the same offset and happen to be nearly the
same signature. This means when container_of() was used to go between the
wrong two types it still managed to work.
However clang CFI detection notices that the function pointers have a
slightly different signature. As with show/store this was only working
because the device and port struct layouts happened to have the kobj at
the front.
Correct this by have two independent sets of data structures for the port
and device case. The two different attributes correctly include the
port/device_attribute struct and everything from there up is kept
split. The show/store function call chains start with device/port unique
functions that invoke a common show/store function pointer.
Link: https://lore.kernel.org/r/a8b3864b4e722aed3657512af6aa47dc3c5033be.1623427137.git.leonro@nvidia.com
Reported-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
It is much saner to store a pointer to the kobject structure that contains
the cannonical stats pointer than to copy the stats pointers into a public
structure.
Future patches will require the sysfs pointer for other purposes.
Link: https://lore.kernel.org/r/f90551dfd296cde1cb507bbef27cca9891d19871.1623427137.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
This is being used to implement both the port and device global stats,
which is causing some confusion in the drivers. For instance EFA and i40iw
both seem to be misusing the device stats.
Split it into two ops so drivers that don't support one or the other can
leave the op NULL'd, making the calling code a little simpler to
understand.
Link: https://lore.kernel.org/r/1955c154197b2a159adc2dc97266ddc74afe420c.1623427137.git.leonro@nvidia.com
Tested-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Implement invalidate MW and cleaned up invalidate MR operations.
Added code to perform remote invalidate for send with invalidate. Added
code to perform local invalidation. Deleted some blank lines in rxe_loc.h.
Link: https://lore.kernel.org/r/20210608042552.33275-9-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Add support for bind MW work requests from user space. Since rdma/core
does not support bind mw in ib_send_wr there is no way to support bind mw
in kernel space.
Added bind_mw local operation in rxe_req.c. Added bind_mw WR operation in
rxe_opcode.c. Added bind_mw WC in rxe_comp.c. Added additional fields to
rxe_mw in rxe_verbs.h. Added rxe_do_dealloc_mw() subroutine to cleanup an
mw when rxe_dealloc_mw is called. Added code to implement bind_mw
operation in rxe_mw.c
Link: https://lore.kernel.org/r/20210608042552.33275-8-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Simplify rxe_requester() by moving the local operations to a subroutine.
Add an error return for illegal send WR opcode. Moved next_index ahead of
rxe_run_task which fixed a small bug where work completions were delayed
until after the next wqe which was not the intended behavior. Let errors
return their own WC status. Previously all errors were reported as
protection errors which was incorrect. Changed the return of errors from
rxe_do_local_ops() to err: which causes an immediate completion. Without
this an error on a last WR may get lost. Changed fill_packet() to
finish_packet() which is more accurate.
Fixes: 8700e2e7c485 ("The software RoCE driver")
Link: https://lore.kernel.org/r/20210608042552.33275-7-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Rxe has two mask bits WR_LOCAL_MASK and WR_REG_MASK with WR_REG_MASK used
to indicate any local operation and WR_LOCAL_MASK unused. This patch
replaces both of these with one mask bit WR_LOCAL_OP_MASK which is
clearer.
Link: https://lore.kernel.org/r/20210608042552.33275-6-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>