forked from Minki/linux
2dacc1e57b
Patchces for the merge window: - Minor bug fixes in mlx5, mthca, pvrdma, rtrs, mlx4, hfi1, hns - Minor cleanups: coding style, useless includes and documentation - Reorganize how multicast processing works in rxe - Replace a red/black tree with xarray in rxe which improves performance - DSCP support and HW address handle re-use in irdma - Simplify the mailbox command handling in hns - Simplify iser now that FMR is eliminated -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAmI7d8gACgkQOG33FX4g mxq9TRAAm/3rEBtDyVan4Dm3mzq8xeJOTdYtQ29QNWZcwvhStNTo4c29jQJITZ9X Xd+m8VEPWzjkkZ4+hDFM9HhWjOzqDcCCFcbF88WDO0+P1M3HYYUE6AbIcSilAaT2 GIby3+kAnsTOAryrSzHLehYJKUYJuw5NRGsVAPY3r6hz3UECjNb/X1KTWhXaIfNy 2OTqFx5wMQ6hZO8e4a2Wz4bBYAYm95UfK+TgfRqUwhDLCDnkDELvHMPh9pXxJecG xzVg7W7Nv+6GWpUrqtM6W6YpriyjUOfbrtCJmBV3T/EdpiD6lZhKpkX23aLgu2Bi XoMM67aZ0i1Ft2trCIF8GbLaejG6epBkeKkUMMSozKqFP5DjhNbD2f3WAlMI15iW CcdyPS5re1ymPp66UlycTDSkN19wD8LfgQPLJCNM3EV8g8qs9LaKzEPWunBoZmw1 a+QX2zK8J07S21F8iq2sf/Qe1EDdmgOEOf7pmB/A5/PKqgXZPG7lYNYuhIKyFsdL mO+BqWWp0a953JJjsiutxyUCyUd8jtwwxRa0B66oqSQ1jO+xj6XDxjEL8ACuT8Iq 9wFJluTCN6lfxyV8mrSfweT0fQh/W/zVF7EJZHWdKXEzuADxMp9X06wtMQ82ea+k /wgN7DUpBZczRtlqaxt1VSkER75QwAMy/oSpA974+O120QCwLMQ= =aux1 -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma Pull rdma updates from Jason Gunthorpe: - Minor bug fixes in mlx5, mthca, pvrdma, rtrs, mlx4, hfi1, hns - Minor cleanups: coding style, useless includes and documentation - Reorganize how multicast processing works in rxe - Replace a red/black tree with xarray in rxe which improves performance - DSCP support and HW address handle re-use in irdma - Simplify the mailbox command handling in hns - Simplify iser now that FMR is eliminated * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (93 commits) RDMA/nldev: Prevent underflow in nldev_stat_set_counter_dynamic_doit() IB/iser: Fix error flow in case of registration failure IB/iser: Generalize map/unmap dma tasks IB/iser: Use iser_fr_desc as registration context IB/iser: Remove iser_reg_data_sg helper function RDMA/rxe: Use standard names for ref counting RDMA/rxe: Replace red-black trees by xarrays RDMA/rxe: Shorten pool names in rxe_pool.c RDMA/rxe: Move max_elem into rxe_type_info RDMA/rxe: Replace obj by elem in declaration RDMA/rxe: Delete _locked() APIs for pool objects RDMA/rxe: Reverse the sense of RXE_POOL_NO_ALLOC RDMA/rxe: Replace mr by rkey in responder resources RDMA/rxe: Fix ref error in rxe_av.c RDMA/hns: Use the reserved loopback QPs to free MR before destroying MPT RDMA/irdma: Add support for address handle re-use RDMA/qib: Fix typos in comments RDMA/mlx5: Fix memory leak in error flow for subscribe event routine Revert "RDMA/core: Fix ib_qp_usecnt_dec() called when error" RDMA/rxe: Remove useless argument for update_state() ... |
||
---|---|---|
.. | ||
Kconfig | ||
Makefile | ||
README | ||
rtrs-clt-stats.c | ||
rtrs-clt-sysfs.c | ||
rtrs-clt.c | ||
rtrs-clt.h | ||
rtrs-log.h | ||
rtrs-pri.h | ||
rtrs-srv-stats.c | ||
rtrs-srv-sysfs.c | ||
rtrs-srv.c | ||
rtrs-srv.h | ||
rtrs.c | ||
rtrs.h |
**************************** RDMA Transport (RTRS) **************************** RTRS (RDMA Transport) is a reliable high speed transport library which provides support to establish optimal number of connections between client and server machines using RDMA (InfiniBand, RoCE, iWarp) transport. It is optimized to transfer (read/write) IO blocks. In its core interface it follows the BIO semantics of providing the possibility to either write data from an sg list to the remote side or to request ("read") data transfer from the remote side into a given sg list. RTRS provides I/O fail-over and load-balancing capabilities by using multipath I/O (see "add_path" and "mp_policy" configuration entries in Documentation/ABI/testing/sysfs-class-rtrs-client). RTRS is used by the RNBD (RDMA Network Block Device) modules. ================== Transport protocol ================== Overview -------- An established connection between a client and a server is called rtrs session. A session is associated with a set of memory chunks reserved on the server side for a given client for rdma transfer. A session consists of multiple paths, each representing a separate physical link between client and server. Those are used for load balancing and failover. Each path consists of as many connections (QPs) as there are cpus on the client. When processing an incoming write or read request, rtrs client uses memory chunks reserved for him on the server side. Their number, size and addresses need to be exchanged between client and server during the connection establishment phase. Apart from the memory related information client needs to inform the server about the session name and identify each path and connection individually. On an established session client sends to server write or read messages. Server uses immediate field to tell the client which request is being acknowledged and for errno. Client uses immediate field to tell the server which of the memory chunks has been accessed and at which offset the message can be found. Module parameter always_invalidate is introduced for the security problem discussed in LPC RDMA MC 2019. When always_invalidate=Y, on the server side we invalidate each rdma buffer before we hand it over to RNBD server and then pass it to the block layer. A new rkey is generated and registered for the buffer after it returns back from the block layer and RNBD server. The new rkey is sent back to the client along with the IO result. The procedure is the default behaviour of the driver. This invalidation and registration on each IO causes performance drop of up to 20%. A user of the driver may choose to load the modules with this mechanism switched off (always_invalidate=N), if he understands and can take the risk of a malicious client being able to corrupt memory of a server it is connected to. This might be a reasonable option in a scenario where all the clients and all the servers are located within a secure datacenter. Connection establishment ------------------------ 1. Client starts establishing connections belonging to a path of a session one by one via attaching RTRS_MSG_CON_REQ messages to the rdma_connect requests. Those include uuid of the session and uuid of the path to be established. They are used by the server to find a persisting session/path or to create a new one when necessary. The message also contains the protocol version and magic for compatibility, total number of connections per session (as many as cpus on the client), the id of the current connection and the reconnect counter, which is used to resolve the situations where client is trying to reconnect a path, while server is still destroying the old one. 2. Server accepts the connection requests one by one and attaches RTRS_MSG_CONN_RSP messages to the rdma_accept. Apart from magic and protocol version, the messages include error code, queue depth supported by the server (number of memory chunks which are going to be allocated for that session) and the maximum size of one io, RTRS_MSG_NEW_RKEY_F flags is set when always_invalidate=Y. 3. After all connections of a path are established client sends to server the RTRS_MSG_INFO_REQ message, containing the name of the session. This message requests the address information from the server. 4. Server replies to the session info request message with RTRS_MSG_INFO_RSP, which contains the addresses and keys of the RDMA buffers allocated for that session. 5. Session becomes connected after all paths to be established are connected (i.e. steps 1-4 finished for all paths requested for a session) 6. Server and client exchange periodically heartbeat messages (empty rdma messages with an immediate field) which are used to detect a crash on remote side or network outage in an absence of IO. 7. On any RDMA related error or in the case of a heartbeat timeout, the corresponding path is disconnected, all the inflight IO are failed over to a healthy path, if any, and the reconnect mechanism is triggered. CLT SRV *for each connection belonging to a path and for each path: RTRS_MSG_CON_REQ -------------------> <------------------- RTRS_MSG_CON_RSP ... *after all connections are established: RTRS_MSG_INFO_REQ -------------------> <------------------- RTRS_MSG_INFO_RSP *heartbeat is started from both sides: -------------------> [RTRS_HB_MSG_IMM] [RTRS_HB_MSG_ACK] <------------------- [RTRS_HB_MSG_IMM] <------------------- -------------------> [RTRS_HB_MSG_ACK] IO path ------- * Write (always_invalidate=N) * 1. When processing a write request client selects one of the memory chunks on the server side and rdma writes there the user data, user header and the RTRS_MSG_RDMA_WRITE message. Apart from the type (write), the message only contains size of the user header. The client tells the server which chunk has been accessed and at what offset the RTRS_MSG_RDMA_WRITE can be found by using the IMM field. 2. When confirming a write request server sends an "empty" rdma message with an immediate field. The 32 bit field is used to specify the outstanding inflight IO and for the error code. CLT SRV usr_data + usr_hdr + rtrs_msg_rdma_write -----------------> [RTRS_IO_REQ_IMM] [RTRS_IO_RSP_IMM] <----------------- (id + errno) * Write (always_invalidate=Y) * 1. When processing a write request client selects one of the memory chunks on the server side and rdma writes there the user data, user header and the RTRS_MSG_RDMA_WRITE message. Apart from the type (write), the message only contains size of the user header. The client tells the server which chunk has been accessed and at what offset the RTRS_MSG_RDMA_WRITE can be found by using the IMM field, Server invalidate rkey associated to the memory chunks first, when it finishes, pass the IO to RNBD server module. 2. When confirming a write request server sends an "empty" rdma message with an immediate field. The 32 bit field is used to specify the outstanding inflight IO and for the error code. The new rkey is sent back using SEND_WITH_IMM WR, client When it recived new rkey message, it validates the message and finished IO after update rkey for the rbuffer, then post back the recv buffer for later use. CLT SRV usr_data + usr_hdr + rtrs_msg_rdma_write -----------------> [RTRS_IO_REQ_IMM] [RTRS_MSG_RKEY_RSP] <----------------- (RTRS_MSG_RKEY_RSP) [RTRS_IO_RSP_IMM] <----------------- (id + errno) * Read (always_invalidate=N)* 1. When processing a read request client selects one of the memory chunks on the server side and rdma writes there the user header and the RTRS_MSG_RDMA_READ message. This message contains the type (read), size of the user header, flags (specifying if memory invalidation is necessary) and the list of addresses along with keys for the data to be read into. 2. When confirming a read request server transfers the requested data first, attaches an invalidation message if requested and finally an "empty" rdma message with an immediate field. The 32 bit field is used to specify the outstanding inflight IO and the error code. CLT SRV usr_hdr + rtrs_msg_rdma_read --------------> [RTRS_IO_REQ_IMM] [RTRS_IO_RSP_IMM] <-------------- usr_data + (id + errno) or in case client requested invalidation: [RTRS_IO_RSP_IMM_W_INV] <-------------- usr_data + (INV) + (id + errno) * Read (always_invalidate=Y)* 1. When processing a read request client selects one of the memory chunks on the server side and rdma writes there the user header and the RTRS_MSG_RDMA_READ message. This message contains the type (read), size of the user header, flags (specifying if memory invalidation is necessary) and the list of addresses along with keys for the data to be read into. Server invalidate rkey associated to the memory chunks first, when it finishes, passes the IO to RNBD server module. 2. When confirming a read request server transfers the requested data first, attaches an invalidation message if requested and finally an "empty" rdma message with an immediate field. The 32 bit field is used to specify the outstanding inflight IO and the error code. The new rkey is sent back using SEND_WITH_IMM WR, client When it recived new rkey message, it validates the message and finished IO after update rkey for the rbuffer, then post back the recv buffer for later use. CLT SRV usr_hdr + rtrs_msg_rdma_read --------------> [RTRS_IO_REQ_IMM] [RTRS_IO_RSP_IMM] <-------------- usr_data + (id + errno) [RTRS_MSG_RKEY_RSP] <----------------- (RTRS_MSG_RKEY_RSP) or in case client requested invalidation: [RTRS_IO_RSP_IMM_W_INV] <-------------- usr_data + (INV) + (id + errno) ========================================= Contributors List(in alphabetical order) ========================================= Danil Kipnis <danil.kipnis@profitbricks.com> Fabian Holler <mail@fholler.de> Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Jack Wang <jinpu.wang@profitbricks.com> Kleber Souza <kleber.souza@profitbricks.com> Lutz Pogrell <lutz.pogrell@cloud.ionos.com> Milind Dumbare <Milind.dumbare@gmail.com> Roman Penyaev <roman.penyaev@profitbricks.com>