linux

Author	SHA1	Message	Date
Julian Wiedmann	db4ffdcef7	s390/qeth: don't replace a fully completed async TX buffer For TX buffers that require an additional async notification via QAOB, the TX completion code can now manage all the necessary processing if the notification has already occurred (or is occurring concurrently). In such cases we can avoid replacing the metadata that is associated with the buffer's slot on the ring, and just keep using the current one. As qeth_clear_output_buffer() will also handle any kmem cache-allocated memory that was mapped into the TX buffer, qeth_qdio_handle_aob() doesn't need to worry about it. While at it, also remove the unneeded forward declaration for qeth_init_qdio_out_buf(). Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-07 06:55:21 -08:00
Julian Wiedmann	0b8da8110b	s390/qeth: use dev->groups for common sysfs attributes All qeth devices have a minimum set of sysfs attributes, and non-OSN devices share a group of additional attributes. Depending on whether the device is forced to use a specific discipline, the device_type then specifies further attributes. Shift the common attributes into dev->groups, so that the device_type only contains the discipline-specific attributes. This avoids exposing the common attributes to the disciplines, and nicely cleans up our sysfs code. While replacing the qeth_l*_*_device_attributes() helpers, switch from sysfs_*_groups() to the more generic device_*_groups(). Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-07 06:55:21 -08:00
Julian Wiedmann	050663129a	s390/ccwgroup: use bus->dev_groups for bus-based sysfs attributes Bus drivers have their own way of describing the sysfs attributes that all devices on a bus should provide. Switch ccwgroup_attr_groups over to use bus->dev_groups, and thus free up dev->groups for usage by the ccwgroup device drivers. While adjusting the attribute naming, use ATTRIBUTE_GROUPS() to get rid of some boilerplate code. Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-07 06:55:21 -08:00
Julian Wiedmann	04ea30c857	s390/qeth: don't call INIT_LIST_HEAD() on iob's list entry INIT_LIST_HEAD() only needs to be called on actual list heads. While at it clarify the naming of the field. Suggested-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-07 06:55:21 -08:00
Reo Shiseki	353021588c	Bluetooth: fix typo in struct name Signed-off-by: Reo Shiseki <reoshiseki@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2020-12-07 16:51:22 +02:00
David S. Miller	af3f4a85d9	Merge branch 'mlxsw-Misc-updates' Ido Schimmel says: ==================== mlxsw: Misc updates This patchset contains miscellaneous patches we gathered in our queue. Some of them are dependencies of larger patchsets that I will submit later this cycle. Patches #1-#3 perform small non-functional changes in mlxsw. Patch #4 adds more extended ack messages in mlxsw. Patch #5 adds devlink parameters documentation for mlxsw. To be extended with more parameters this cycle. Patches #6-#7 perform small changes in forwarding selftests infrastructure. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-06 19:22:15 -08:00
Jiri Pirko	acde33bf73	mlxsw: spectrum_router: Reduce mlxsw_sp_ipip_fib_entry_op_gre4() It turned out that mlxsw_sp_ipip_fib_entry_op_gre4() does not need to figure out the IP address and virtual router ID. Those are exactly the same as in the fib_entry it is called for. So just use that and reduce mlxsw_sp_ipip_fib_entry_op_gre4() to only call mlxsw_sp_ipip_fib_entry_op_gre4_rtdp(), which makes the ipip decap op code similar to nve. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-06 19:22:14 -08:00
Petr Machata	f54d3c81b7	mlxsw: spectrum: Bump minimum FW version to xx.2008.2018 The indicated version fixes an issue whereby the MOMTE register would by default enable mirroring of ECN-marked traffic from all traffic classes, once the ECN mirroring was configured. This fix is necessary for offload of RED "ecn_mark" qevent. Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-06 19:22:14 -08:00
Ido Schimmel	9add5f1954	mlxsw: core_acl: Use an array instead of a struct with a zero-length array Suppresses the following coccinelle warning: drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.c:139:3-7: WARNING use flexible-array member instead Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-06 19:22:14 -08:00
Ido Schimmel	42c435a2ac	mlxsw: spectrum_mr: Use flexible-array member instead of zero-length array Suppresses the following coccinelle warning: drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.c:18:15-19: WARNING use flexible-array member instead Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-06 19:22:14 -08:00
Ido Schimmel	4834ad8079	mlxsw: core: Trace EMAD events Currently, mlxsw triggers the 'devlink:devlink_hwmsg' tracepoint whenever a request is sent to the device and whenever a response is received from it. However, the tracepoint is not triggered when an event (e.g., port up / down) is received from the device. Also trace EMAD events in order to log a more complete picture of all the exchanged hardware messages. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-06 19:22:14 -08:00
Ido Schimmel	23fb55526d	selftests: mlxsw: Test RIF's reference count when joining a LAG Test that the reference count of a router interface (RIF) configured for a LAG is incremented / decremented when ports join / leave the LAG. Use the offload indication on routes configured on the RIF to understand if it was created / destroyed. The test fails without the previous patch. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-06 19:22:14 -08:00
Ido Schimmel	31e1de4f12	mlxsw: spectrum: Apply RIF configuration when joining a LAG In case a router interface (RIF) is configured for a LAG, make sure its configuration is applied on the new LAG member. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-06 19:22:14 -08:00
Leon Romanovsky	04b222f957	RDMA/mlx5: Remove IB representors dead code Delete dead code. Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>	2020-12-06 07:43:54 +02:00
Leon Romanovsky	e87114022e	net/mlx5: Simplify eswitch mode check Provide mlx5_core device instead of "priv" pointer while checking eswitch mode. Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>	2020-12-06 07:43:54 +02:00
Leon Romanovsky	601c10c89c	net/mlx5: Delete custom device management logic After the conversion to the auxiliary bus, the custom device management logic is no longer needed; delete it. Signed-off-by: Leon Romanovsky <leonro@nvidia.com>	2020-12-06 07:43:54 +02:00
Leon Romanovsky	93f8244431	RDMA/mlx5: Convert mlx5_ib to use auxiliary bus The conversion to the auxiliary bus solves a long-standing issue with the existing mlx5_ib<->mlx5_core coupling, which required both modules to be in the initramfs if either of them was needed for boot. Signed-off-by: Leon Romanovsky <leonro@nvidia.com>	2020-12-06 07:43:50 +02:00
Leon Romanovsky	912cebf420	net/mlx5e: Connect ethernet part to auxiliary bus Reuse auxiliary bus to perform device management of the ethernet part of the mlx5 driver. Signed-off-by: Leon Romanovsky <leonro@nvidia.com>	2020-12-06 07:37:38 +02:00
Leon Romanovsky	74c9729dd8	vdpa/mlx5: Connect mlx5_vdpa to auxiliary bus Change module registration logic to use auxiliary bus instead of custom made mlx5 register interface. Signed-off-by: Leon Romanovsky <leonro@nvidia.com>	2020-12-06 07:32:05 +02:00
David S. Miller	4054eebf0f	Merge branch 'r8169-improve-rtl_rx-and-NUM_RX_DESC-handling' Heiner Kallweit says: ==================== r8169: improve rtl_rx and NUM_RX_DESC handling This series improves rtl_rx() and the handling of NUM_RX_DESC. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-05 16:29:21 -08:00
Heiner Kallweit	ed22a8ff06	r8169: make NUM_RX_DESC a signed int After recent changes there's no need any longer to define NUM_RX_DESC as an unsigned value. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-05 16:29:21 -08:00
Heiner Kallweit	2f53e9d7bc	r8169: improve rtl_rx There's no need to check min(budget, NUM_RX_DESC). First, budget (NAPI_POLL_WEIGHT = 64) is less than NUM_RX_DESC (256). More importantly, even in case of budget > NUM_RX_DESC we could safely continue processing descriptors as long as they are owned by the CPU. In addition, replace rx_left with a normal counter variable, which simplifies the code. Last but not least, there's no longer any need to pass the budget as a u32. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-05 16:29:21 -08:00
Colin Ian King	00649542f1	net: fix spelling mistake "wil" -> "will" in Kconfig There is a spelling mistake in the Kconfig help text. Fix it. Signed-off-by: Colin Ian King <colin.king@canonical.com> Link: https://lore.kernel.org/r/20201204194549.1153063-1-colin.king@canonical.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-05 15:17:19 -08:00
Jakub Kicinski	78d6bb584d	Merge tag 'batadv-next-pullrequest-20201204' of git://git.open-mesh.org/linux-merge Simon Wunderlich says: ==================== This cleanup patchset includes the following patches: - bump version strings, by Simon Wunderlich - update include for min/max helpers, by Sven Eckelmann - add infrastructure and netlink functions for routing algo selection, by Sven Eckelmann (2 patches) - drop deprecated debugfs and sysfs support and obsoleted functionality, by Sven Eckelmann (3 patches) - drop unused include in fragmentation.c, by Simon Wunderlich * tag 'batadv-next-pullrequest-20201204' of git://git.open-mesh.org/linux-merge: batman-adv: Drop unused soft-interface.h include in fragmentation.c batman-adv: Drop legacy code for auto deleting mesh interfaces batman-adv: Drop deprecated debugfs support batman-adv: Drop deprecated sysfs support batman-adv: Allow selection of routing algorithm over rtnetlink batman-adv: Prepare infrastructure for newlink settings batman-adv: Add new include for min/max helpers batman-adv: Start new development cycle ==================== Link: https://lore.kernel.org/r/20201204154631.21063-1-sw@simonwunderlich.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-05 15:08:06 -08:00
Arnd Bergmann	4560b2a3ec	enetc: Fix unused var build warning for CONFIG_OF When CONFIG_OF is disabled, there is a harmless warning about an unused variable: enetc_pf.c: In function 'enetc_phylink_create': enetc_pf.c:981:17: error: unused variable 'dev' [-Werror=unused-variable] Slightly rearrange the code to pass around the of_node as a function argument, which avoids the problem without hurting readability. Fixes: `71b77a7a27` ("enetc: Migrate to PHYLINK and PCS_LYNX") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Link: https://lore.kernel.org/r/20201204120800.17193-1-claudiu.manoil@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-05 14:56:23 -08:00
Jonathan Lemon	a7e1abad13	ptp: Add clock driver for the OpenCompute TimeCard. The OpenCompute time card is an atomic clock along with a GPS receiver that provides a Grandmaster clock source for a PTP enabled network. More information is available at http://www.timingcard.com/ Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Link: https://lore.kernel.org/r/20201204035128.2219252-2-jonathan.lemon@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-05 13:59:41 -08:00
Bongsu Jeon	bcd684aace	net/nfc/nci: Support NCI 2.x initial sequence Implement the NCI 2.x initial sequence to support NCI 2.x NFCC. Since NCI 2.0, CORE_RESET and CORE_INIT sequence have been changed. If NFCEE supports NCI 2.x, then NCI 2.x initial sequence will work. In NCI 1.0, Initial sequence and payloads are as below: (DH) (NFCC) | -- CORE_RESET_CMD --> | | <-- CORE_RESET_RSP -- | | -- CORE_INIT_CMD --> | | <-- CORE_INIT_RSP -- | CORE_RESET_RSP payloads are Status, NCI version, Configuration Status. CORE_INIT_CMD payloads are empty. CORE_INIT_RSP payloads are Status, NFCC Features, Number of Supported RF Interfaces, Supported RF Interface, Max Logical Connections, Max Routing table Size, Max Control Packet Payload Size, Max Size for Large Parameters, Manufacturer ID, Manufacturer Specific Information. In NCI 2.0, Initial Sequence and Parameters are as below: (DH) (NFCC) | -- CORE_RESET_CMD --> | | <-- CORE_RESET_RSP -- | | <-- CORE_RESET_NTF -- | | -- CORE_INIT_CMD --> | | <-- CORE_INIT_RSP -- | CORE_RESET_RSP payloads are Status. CORE_RESET_NTF payloads are Reset Trigger, Configuration Status, NCI Version, Manufacturer ID, Manufacturer Specific Information Length, Manufacturer Specific Information. CORE_INIT_CMD payloads are Feature1, Feature2. CORE_INIT_RSP payloads are Status, NFCC Features, Max Logical Connections, Max Routing Table Size, Max Control Packet Payload Size, Max Data Packet Payload Size of the Static HCI Connection, Number of Credits of the Static HCI Connection, Max NFC-V RF Frame Size, Number of Supported RF Interfaces, Supported RF Interfaces. Signed-off-by: Bongsu Jeon <bongsu.jeon@samsung.com> Link: https://lore.kernel.org/r/20201202223147.3472-1-bongsu.jeon@samsung.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-04 17:47:35 -08:00
Guillaume Nault	41fdfffd57	selftests: forwarding: Add MPLS L2VPN test Connect hosts H1 and H2 using two intermediate encapsulation routers (LER1 and LER2). These routers encapsulate traffic from the hosts, including the original Ethernet header, into MPLS. Use ping to test reachability between H1 and H2. Signed-off-by: Guillaume Nault <gnault@redhat.com> Link: https://lore.kernel.org/r/625f5c1aafa3a8085f8d3e082d680a82e16ffbaa.1606918980.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-04 17:44:06 -08:00
Tom Rix	0911d463b3	net: bna: remove trailing semicolon in macro definition The macro use will already have a semicolon. Clean up escaped newlines. Signed-off-by: Tom Rix <trix@redhat.com> Link: https://lore.kernel.org/r/20201202163622.3733506-1-trix@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-04 17:41:49 -08:00
Hoang Le	43fcd906d9	tipc: support 128bit node identity for peer removing We add support for removing a specific node that is down using its 128-bit node identifier, as an alternative to the legacy 32-bit node address. example: $tipc peer remove identity <1001002|16777777> Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au> Link: https://lore.kernel.org/r/20201203035045.4564-1-hoang.h.le@dektech.com.au Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-04 17:40:27 -08:00
Simon Horman	7f356166ae	nfp: Replace zero-length array with flexible-array member There is a regular need in the kernel to provide a way to declare having a dynamically sized set of trailing elements in a structure. Kernel code should always use "flexible array members"[1] for these cases. The older style of one-element or zero-length arrays should no longer be used[2]. [1] https://en.wikipedia.org/wiki/Flexible_array_member [2] https://www.kernel.org/doc/html/v5.9/process/deprecated.html#zero-length-and-one-element-arrays Cc: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: Louis Peens <louis.peens@netronome.com> Link: https://lore.kernel.org/r/20201204125601.24876-1-simon.horman@netronome.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-04 16:00:20 -08:00
Bongsu Jeon	4fb7b98c7b	nfc: s3fwrn5: skip the NFC bootloader mode If there isn't a proper NFC firmware image, Bootloader mode will be skipped. Signed-off-by: Bongsu Jeon <bongsu.jeon@samsung.com> Reviewed-by: Krzysztof Kozlowski <krzk@kernel.org> Link: https://lore.kernel.org/r/20201203225257.2446-1-bongsu.jeon@samsung.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-04 15:30:47 -08:00
Jakub Kicinski	43be3a3c65	Merge branch 'perf-optimizations-for-tcp-recv-zerocopy' Arjun Roy says: ==================== Perf. optimizations for TCP Recv. Zerocopy This patchset contains several optimizations for TCP Recv. Zerocopy. Summarized: 1. It is possible that a read payload is not exactly page aligned - that there may exist "straggler" bytes that we cannot map into the caller's address space cleanly. For this, we allow the caller to provide as argument a "hybrid copy buffer", turning getsockopt(TCP_ZEROCOPY_RECEIVE) into a "hybrid" operation that allows the caller to avoid a subsequent recvmsg() call to read the stragglers. 2. Similarly, for "small" read payloads that are either below the size of a page, or small enough that remapping pages is not a performance win - we allow the user to short-circuit the remapping operations entirely and simply copy into the buffer provided. Some of the patches in the middle of this set are refactors to support this "short-circuiting" optimization. 3. We allow the user to provide a hint that performing a page zap operation (and the accompanying TLB shootdown) may not be necessary, for the provided region that the kernel will attempt to map pages into. This allows us to avoid this expensive operation while holding the socket lock, which provides a significant performance advantage. With all of these changes combined, "medium" sized receive traffic (multiple tens to few hundreds of KB) see significant efficiency gains when using TCP receive zerocopy instead of regular recvmsg(). For example, with RPC-style traffic with 32KB messages, there is a roughly 15% efficiency improvement when using zerocopy. Without these changes, there is a roughly 60-70% efficiency reduction with such messages when employing zerocopy. ==================== Link: https://lore.kernel.org/r/20201202225349.935284-1-arjunroy.kdev@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-04 13:40:55 -08:00
Arjun Roy	94ab9eb9b2	net-zerocopy: Defer vm zap unless actually needed. Zapping pages is required only if we are calling vm_insert_page into a region where pages had previously been mapped. Receive zerocopy allows reusing such regions, and hitherto called zap_page_range() before calling vm_insert_page() in that range. zap_page_range() can also be triggered from userspace with madvise(MADV_DONTNEED). If userspace is configured to call this before reusing a segment, or if there was nothing mapped at this virtual address to begin with, we can avoid calling zap_page_range() under the socket lock. That said, if userspace does not do that, then we are still responsible for calling zap_page_range(). This patch adds a flag that the user can use to hint to the kernel that a zap is not required. If the flag is not set, or if an older user application does not have a flags field at all, then the kernel calls zap_page_range as before. Also, if the flag is set but a zap is still required, the kernel performs that zap as necessary. Thus incorrectly indicating that a zap can be avoided does not change the correctness of operation. It also increases the batchsize for vm_insert_pages and prefetches the page struct for the batch since we're about to bump the refcount. An alternative mechanism could be to not have a flag, assume by default a zap is not needed, and fall back to zapping if needed. However, this would harm performance for older applications for which a zap is necessary, and thus we implement it with an explicit flag so newer applications can opt in. When using RPC-style traffic with medium sized (tens of KB) RPCs, this change yields an efficiency improvement of about 30% for QPS/CPU usage. Signed-off-by: Arjun Roy <arjunroy@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-04 13:40:53 -08:00
Arjun Roy	0c3936d32f	net-zerocopy: Set zerocopy hint when data is copied Set zerocopy hint, even when falling back to copy, so that the pending data can be efficiently received using zerocopy when possible. Signed-off-by: Arjun Roy <arjunroy@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-04 13:40:53 -08:00
Arjun Roy	f21a3c4803	net-zerocopy: Introduce short-circuit small reads. Sometimes, we may call tcp receive zerocopy when inq is 0, or inq < PAGE_SIZE, or inq is generally small enough that it is cheaper to copy rather than remap pages. In these cases, we may want to either return early (inq=0) or attempt to use the provided copy buffer to simply copy the received data. This allows us to save both system call overhead and the latency of acquiring mmap_sem in read mode for cases where it would be useless to do so. This patchset enables this behaviour by: 1. Returning quickly if inq is 0. 2. Attempting to perform a regular copy if a hybrid copybuffer is provided and it is large enough to absorb all available bytes. 3. Return quickly if no such buffer was provided and there are less than PAGE_SIZE bytes available. For small RPC ping-pong workloads, normally we would have 1 getsockopt(), 1 recvmsg() and 1 sendmsg() call per RPC. With this change, we remove the recvmsg() call entirely, reducing the syscall overhead by about 33%. In testing with small (hundreds of bytes) RPC traffic, this yields a syscall reduction of about 33% and an efficiency gain of about 3-5% when defined as QPS/CPU Util. Signed-off-by: Arjun Roy <arjunroy@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-04 13:40:53 -08:00
Arjun Roy	936ced4157	net-zerocopy: Fast return if inq < PAGE_SIZE Sometimes, we may call tcp receive zerocopy when inq is 0, or inq < PAGE_SIZE, in which case we cannot remap pages. In this case, simply return the appropriate hint for regular copying without taking mmap_sem. Signed-off-by: Arjun Roy <arjunroy@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-04 13:40:53 -08:00
Arjun Roy	98917cf0d6	net-zerocopy: Refactor frag-is-remappable test. Refactor frag-is-remappable test for tcp receive zerocopy. This is part of a patch set that introduces short-circuited hybrid copies for small receive operations, which results in roughly 33% fewer syscalls for small RPC scenarios. Signed-off-by: Arjun Roy <arjunroy@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-04 13:40:52 -08:00
Arjun Roy	7fba5309ef	net-zerocopy: Refactor skb frag fast-forward op. Refactor skb frag fast-forwarding for tcp receive zerocopy. This is part of a patch set that introduces short-circuited hybrid copies for small receive operations, which results in roughly 33% fewer syscalls for small RPC scenarios. skb_advance_to_frag(), given a skb and an offset into the skb, iterates from the first frag for the skb until we're at the frag specified by the offset. Assuming the offset provided refers to how many bytes in the skb are already read, the returned frag points to the next frag we may read from, while offset_frag is set to the number of bytes from this frag that we have already read. If frag is not null and offset_frag is equal to 0, then we may be able to map this frag's page into the process address space with vm_insert_page(). However, if offset_frag is not equal to 0, then we cannot do so. Signed-off-by: Arjun Roy <arjunroy@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-04 13:40:52 -08:00
Arjun Roy	2cd8116184	net-tcp: Introduce tcp_recvmsg_locked(). Refactor tcp_recvmsg() by splitting it into locked and unlocked portions. Callers already holding the socket lock and not using ERRQUEUE/cmsg/busy polling can simply call tcp_recvmsg_locked(). This is in preparation for a short-circuit copy performed by TCP receive zerocopy for small (< PAGE_SIZE, or otherwise requested by the user) reads. Signed-off-by: Arjun Roy <arjunroy@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-04 13:40:52 -08:00
Arjun Roy	18fb76ed53	net-zerocopy: Copy straggler unaligned data for TCP Rx. zerocopy. When TCP receive zerocopy does not successfully map the entire requested space, it outputs a 'hint' that the caller should recvmsg(). Augment zerocopy to accept a user buffer that it tries to copy this hint into - if it is possible to copy the entire hint, it will do so. This elides a recvmsg() call for received traffic that isn't exactly page-aligned in size. This was tested with RPC-style traffic of arbitrary sizes. Normally, each received message required at least one getsockopt() call, and one recvmsg() call for the remaining unaligned data. With this change, almost all of the recvmsg() calls are eliminated, leading to a savings of about 25%-50% in number of system calls for RPC-style workloads. Signed-off-by: Arjun Roy <arjunroy@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-04 13:40:52 -08:00
Jakub Kicinski	4be986c824	Merge branch 'seg6-add-support-for-srv6-end-dt4-dt6-behavior' Andrea Mayer says: ==================== seg6: add support for SRv6 End.DT4/DT6 behavior This patchset provides support for the SRv6 End.DT4 and End.DT6 (VRF mode) behaviors. The SRv6 End.DT4 behavior is used to implement multi-tenant IPv4 L3 VPNs. It decapsulates the received packets and performs IPv4 routing lookup in the routing table of the tenant. The SRv6 End.DT4 Linux implementation leverages a VRF device in order to force the routing lookup into the associated routing table. The SRv6 End.DT4 behavior is defined in the SRv6 Network Programming [1]. The Linux kernel already offers an implementation of the SRv6 End.DT6 behavior which allows us to set up IPv6 L3 VPNs over SRv6 networks. This new implementation of DT6 is based on the same VRF infrastructure already exploited for implementing the SRv6 End.DT4 behavior. The aim of the new SRv6 End.DT6 in VRF mode consists in simplifying the construction of IPv6 L3 VPN services in the multi-tenant environment. Currently, the two SRv6 End.DT6 implementations (legacy and VRF mode) coexist seamlessly and can be chosen according to the context and the user preferences. - Patch 1 is needed to solve a pre-existing issue with tunneled packets when a sniffer is attached; - Patch 2 improves the management of the seg6local attributes used by the SRv6 behaviors; - Patch 3 adds support for optional attributes in SRv6 behaviors; - Patch 4 introduces two callbacks used for customizing the creation/destruction of a SRv6 behavior; - Patch 5 is the core patch that adds support for the SRv6 End.DT4 behavior; - Patch 6 introduces the VRF support for SRv6 End.DT6 behavior; - Patch 7 adds the selftest for SRv6 End.DT4 behavior; - Patch 8 adds the selftest for SRv6 End.DT6 (VRF mode) behavior. Regarding iproute2, the support for the new "vrftable" attribute, required by both SRv6 End.DT4 and End.DT6 (VRF mode) behaviors, is provided in a different patchset that will follow shortly. I would like to thank David Ahern for his support during the development of this patchset. [1] https://tools.ietf.org/html/draft-ietf-spring-srv6-network-programming ==================== Link: https://lore.kernel.org/r/20201202130517.4967-1-andrea.mayer@uniroma2.it Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-04 13:30:53 -08:00
Andrea Mayer	2bc035538e	selftests: add selftest for the SRv6 End.DT6 (VRF) behavior This selftest is designed for evaluating the new SRv6 End.DT6 (VRF) behavior used, in this example, for implementing IPv6 L3 VPN use cases. Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Signed-off-by: Paolo Lungaroni <paolo.lungaroni@cnit.it> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-04 13:30:51 -08:00
Andrea Mayer	2195444e09	selftests: add selftest for the SRv6 End.DT4 behavior This selftest is designed for evaluating the new SRv6 End.DT4 behavior used, in this example, for implementing IPv4 L3 VPN use cases. Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-04 13:30:51 -08:00
Andrea Mayer	20a081b798	seg6: add VRF support for SRv6 End.DT6 behavior SRv6 End.DT6 is defined in the SRv6 Network Programming [1]. The Linux kernel already offers an implementation of the SRv6 End.DT6 behavior which permits IPv6 L3 VPNs over SRv6 networks. This implementation is not particularly suitable in contexts where we need to deploy IPv6 L3 VPNs among different tenants which share the same network address schemes. The underlying problem lies in the fact that the current version of DT6 (called legacy DT6 from now on) needs a complex configuration to be applied on routers which requires ad-hoc routes and routing policy rules to ensure the correct isolation of tenants. Consequently, a new implementation of DT6 has been introduced with the aim of simplifying the construction of IPv6 L3 VPN services in the multi-tenant environment using SRv6 networks. To accomplish this task, we reused the same VRF infrastructure and SRv6 core components already exploited for implementing the SRv6 End.DT4 behavior. Currently the two End.DT6 implementations coexist seamlessly and can be used depending on the context and the user preferences. So, in order to support both versions of DT6 a new attribute (vrftable) has been introduced which allows us to differentiate the implementation of the behavior to be used. A SRv6 End.DT6 legacy behavior is still instantiated using a command like the following one: $ ip -6 route add 2001:db8::1 encap seg6local action End.DT6 table 100 dev eth0 To instantiate the SRv6 End.DT6 in VRF mode, the command is still pretty straightforward: $ ip -6 route add 2001:db8::1 encap seg6local action End.DT6 vrftable 100 dev eth0 Obviously, as in the case of SRv6 End.DT4, the VRF strict_mode parameter must be set (net.vrf.strict_mode=1) and the VRF associated with table 100 must exist. Please note that the instances of SRv6 End.DT6 legacy and End.DT6 VRF mode can coexist in the same system/configuration without problems. [1] https://tools.ietf.org/html/draft-ietf-spring-srv6-network-programming Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-04 13:30:51 -08:00
Andrea Mayer	664d6f8686	seg6: add support for the SRv6 End.DT4 behavior SRv6 End.DT4 is defined in the SRv6 Network Programming [1]. The SRv6 End.DT4 is used to implement IPv4 L3VPN use cases in multi-tenant environments. It decapsulates the received packets and performs an IPv4 routing lookup in the routing table of the tenant. The SRv6 End.DT4 Linux implementation leverages a VRF device in order to force the routing lookup into the associated routing table. To make the End.DT4 work properly, it must be guaranteed that the routing table used for routing lookup operations is bound to one and only one VRF during the tunnel creation. This constraint has to be enforced by enabling the VRF strict_mode sysctl parameter, i.e.: $ sysctl -wq net.vrf.strict_mode=1 At JANOG44, LINE Corporation presented their multi-tenant DC architecture using SRv6 [2]. In the slides, they reported that the Linux kernel lacks support for the SRv6 End.DT4 behavior. The SRv6 End.DT4 behavior can be instantiated using a command similar to the following: $ ip route add 2001:db8::1 encap seg6local action End.DT4 vrftable 100 dev eth0 We introduce the "vrftable" extension in iproute2 in a following patch. [1] https://tools.ietf.org/html/draft-ietf-spring-srv6-network-programming [2] https://speakerdeck.com/line_developers/line-data-center-networking-with-srv6 Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-04 13:30:50 -08:00
Andrea Mayer	cfdf64a034	seg6: add callbacks for customizing the creation/destruction of a behavior We introduce two callbacks used for customizing the creation/destruction of a SRv6 behavior. Such callbacks are defined in the new struct seg6_local_lwtunnel_ops and hereafter we provide a brief description of them: - build_state(...): used for calling the custom constructor of the behavior during its initialization phase and after all the attributes have been parsed successfully; - destroy_state(...): used for calling the custom destructor of the behavior before it is completely destroyed. Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-04 13:30:50 -08:00
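The callback pair described in this commit can be pictured with a small userspace model. This is only a sketch under stated assumptions: every `demo_*` name, the `table` field, and the validity rule in the constructor are hypothetical and are not the kernel's definitions; only the idea of an optional constructor run after attribute parsing and an optional destructor run before teardown comes from the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model only: all demo_* names are hypothetical, not kernel API. */
struct demo_state {
    int table; /* stands in for an already parsed attribute */
    int built; /* set by the custom constructor */
};

/* Both callbacks are optional; a behavior may leave them NULL. */
struct demo_behavior_ops {
    int  (*build_state)(struct demo_state *st);
    void (*destroy_state)(struct demo_state *st);
};

struct demo_behavior {
    const char *name;
    struct demo_behavior_ops ops;
};

/* Hypothetical constructor: rejects a non-positive table id. */
static int demo_dt4_build(struct demo_state *st)
{
    if (st->table <= 0)
        return -1;
    st->built = 1;
    return 0;
}

static void demo_dt4_destroy(struct demo_state *st)
{
    st->built = 0;
}

const struct demo_behavior demo_dt4 = {
    "demo-dt4", { demo_dt4_build, demo_dt4_destroy }
};
const struct demo_behavior demo_plain = {
    "demo-plain", { NULL, NULL }
};

/* Parse first (modelled by a plain assignment), then run the constructor. */
int demo_create(const struct demo_behavior *b, struct demo_state *st, int table)
{
    assert(b != NULL && st != NULL);
    st->table = table;
    st->built = 0;
    if (b->ops.build_state)
        return b->ops.build_state(st);
    return 0;
}

/* Run the custom destructor, if any, before the state goes away. */
void demo_destroy(const struct demo_behavior *b, struct demo_state *st)
{
    assert(b != NULL && st != NULL);
    if (b->ops.destroy_state)
        b->ops.destroy_state(st);
}
```

Keeping the callbacks optional means behaviors that need no extra setup, such as `demo_plain` here, pay no cost and need no stub functions.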
Andrea Mayer	0a3021f1d4	seg6: add support for optional attributes in SRv6 behaviors Before this patch, each SRv6 behavior specifies a set of required attributes that must be provided by the userspace application when such behavior is going to be instantiated. If at least one of the required attributes is not provided, the creation of the behavior fails. The SRv6 behavior framework lacks a way to manage optional attributes. By definition, an optional attribute for a SRv6 behavior is an attribute which may or may not be provided by the userspace. Therefore, if an optional attribute is missing (and thus not supplied by the user) the creation of the behavior goes ahead without any issue. This patch explicitly differentiates the required attributes from the optional attributes. In particular, each behavior can declare a set of required attributes and a set of optional ones. The semantics of the required attributes remain totally unaffected by this patch. The introduction of the optional attributes does NOT affect the backward compatibility of the existing SRv6 behaviors. It is essential to note that if an (optional or required) attribute is supplied to a SRv6 behavior which does not expect it, the behavior simply discards such attribute without generating any error or warning. This operating mode is the same before and after the introduction of the optional attributes extension. The optional attributes are one of the key components used to implement the SRv6 End.DT6 behavior based on the Virtual Routing and Forwarding (VRF) framework. The optional attributes make possible the coexistence of the already existing SRv6 End.DT6 implementation with the new SRv6 End.DT6 VRF-based implementation without breaking any backward compatibility. Further details on the SRv6 End.DT6 behavior (VRF mode) are reported in subsequent patches. 
From the userspace point of view, the support for optional attributes does NOT require any changes to userspace applications (e.g. iproute2) unless new attributes (required or optional) are needed. Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-04 13:30:50 -08:00
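The required/optional split described in this commit can be modelled with two bitmasks per behavior. The sketch below is a userspace illustration, not the kernel code: the `DEMO_ATTR_*` flags, `struct demo_behavior`, and `demo_validate()` are hypothetical names chosen for the example. It reproduces the three rules stated in the message: a missing required attribute fails, a missing optional attribute is fine, and an unexpected attribute is silently discarded.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical attribute flags, one bit per attribute. */
#define DEMO_ATTR_A (1u << 0)
#define DEMO_ATTR_B (1u << 1)
#define DEMO_ATTR_C (1u << 2)

/* Each behavior declares which attributes it requires and which it tolerates. */
struct demo_behavior {
    unsigned int required;
    unsigned int optional;
};

/*
 * Check the attributes supplied by userspace against a behavior.
 * Returns 0 and stores the accepted attributes in *accepted, or -1 when a
 * required attribute is missing. Attributes the behavior does not expect
 * are dropped without an error, as the commit message describes.
 */
int demo_validate(const struct demo_behavior *b, unsigned int supplied,
                  unsigned int *accepted)
{
    assert(b != NULL && accepted != NULL);

    if ((supplied & b->required) != b->required)
        return -1;

    *accepted = supplied & (b->required | b->optional);
    return 0;
}
```

For a behavior declared as `{ DEMO_ATTR_A, DEMO_ATTR_B }`, supplying only `DEMO_ATTR_A` succeeds, supplying `DEMO_ATTR_A | DEMO_ATTR_C` succeeds with `DEMO_ATTR_C` dropped, and supplying only `DEMO_ATTR_B` fails.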
Andrea Mayer	964adce526	seg6: improve management of behavior attributes Depending on the attribute (e.g. SEG6_LOCAL_SRH, SEG6_LOCAL_TABLE, etc.), the parse() callback performs some validity checks on the provided input and updates the tunnel state (slwt) with the result of the parsing operation. However, an attribute may also need to reserve some additional resources (e.g. allocating memory or setting up an eBPF program) in the parse() callback to complete the parsing operation. The parse() callbacks are invoked by parse_nla_action() for each attribute belonging to a specific behavior. Given a behavior with N attributes, if the parsing of the i-th attribute fails, parse_nla_action() returns immediately with an error. Nonetheless, the resources acquired during the parsing of the previous i-1 attributes are not freed by parse_nla_action(). Attributes which acquire resources must release them explicitly in both seg6_local_{build/destroy}_state(). However, adding a new attribute of this type requires changes to seg6_local_{build/destroy}_state() to release the resources correctly. The seg6local infrastructure still lacks a simple and structured way to release the resources acquired in the parse() operations. We introduced a new callback in the struct seg6_action_param named destroy(). This callback releases any resource which may have been acquired in the parse() counterpart. Each attribute may or may not implement the destroy() callback depending on whether it needs to free some acquired resources. The destroy() callback comes with several advantages: 1) we can have as many attributes as we want for a given behavior with no need to explicitly free the taken resources; 2) As in the case of seg6_local_build_state(), seg6_local_destroy_state() does not need to handle the release of resources directly. Indeed, it calls the destroy_attrs() function which is in charge of calling the destroy() callback for every set attribute. 
We do not need to patch seg6_local_{build/destroy}_state() anymore as we add new attributes; 3) the code is more readable and better structured. Indeed, all the information needed to handle a given attribute is contained in only one place; 4) it facilitates the integration with new features introduced in further patches. Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-04 13:30:50 -08:00
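The leak described in this commit, and the per-attribute destroy() remedy, can be illustrated with a userspace model. This is a sketch only: the `demo_*` names, the fixed table of three attributes, and the use of `malloc()` as the "acquired resource" are assumptions made for the example and do not mirror the kernel's data structures. The point it shows is that a failure while parsing attribute i triggers the destroy() callback of every attribute parsed before it.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_NATTRS 3

/* Hypothetical tunnel state: one resource slot per attribute. */
struct demo_state {
    void *res[DEMO_NATTRS];
    unsigned int parsed; /* bit i set once attribute i parsed successfully */
};

struct demo_param {
    int  (*parse)(struct demo_state *st, int idx, int input);
    void (*destroy)(struct demo_state *st, int idx); /* NULL if nothing to free */
};

/* Attribute that acquires a resource while parsing. */
static int demo_parse_alloc(struct demo_state *st, int idx, int input)
{
    if (input < 0)
        return -1; /* validity check fails before anything is acquired */
    st->res[idx] = malloc(16);
    return st->res[idx] ? 0 : -1;
}

/* Attribute that only validates its input. */
static int demo_parse_plain(struct demo_state *st, int idx, int input)
{
    (void)st;
    (void)idx;
    return input < 0 ? -1 : 0;
}

static void demo_destroy_alloc(struct demo_state *st, int idx)
{
    free(st->res[idx]);
    st->res[idx] = NULL;
}

static const struct demo_param demo_params[DEMO_NATTRS] = {
    { demo_parse_alloc, demo_destroy_alloc },
    { demo_parse_plain, NULL },
    { demo_parse_alloc, demo_destroy_alloc },
};

/* Call destroy() for every attribute that was parsed successfully. */
void demo_destroy_attrs(struct demo_state *st)
{
    assert(st != NULL);
    for (int i = 0; i < DEMO_NATTRS; i++) {
        if (!(st->parsed & (1u << i)))
            continue;
        if (demo_params[i].destroy)
            demo_params[i].destroy(st, i);
    }
    st->parsed = 0;
}

/* Parse all attributes; on the first failure, release what was acquired. */
int demo_parse_all(struct demo_state *st, const int inputs[DEMO_NATTRS])
{
    assert(st != NULL && inputs != NULL);
    st->parsed = 0;
    for (int i = 0; i < DEMO_NATTRS; i++)
        st->res[i] = NULL;

    for (int i = 0; i < DEMO_NATTRS; i++) {
        if (demo_params[i].parse(st, i, inputs[i]) < 0) {
            demo_destroy_attrs(st);
            return -1;
        }
        st->parsed |= 1u << i;
    }
    return 0;
}
```

Because the cleanup walks the same table that drives parsing, adding a fourth attribute in this model only requires a new table entry; neither `demo_parse_all()` nor `demo_destroy_attrs()` changes.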
Andrea Mayer	0489390882	vrf: add mac header for tunneled packets when sniffer is attached Before this patch, a sniffer attached to a VRF used as the receiving interface of L3 tunneled packets detects them as malformed packets and complains about them (e.g. tcpdump shows bogus packets). The reason is that a tunneled L3 packet does not carry any L2 information and when the VRF is set as the receiving interface of a decapsulated L3 packet, no mac header is currently set or valid. Therefore, this patch adds a MAC header to any packet which is directly received on the VRF interface ONLY IF: i) a sniffer is attached on the VRF and ii) the mac header is not set. In this case, the mac address of the VRF is copied in both the destination and the source address of the ethernet header. The protocol type is set either to IPv4 or IPv6, depending on which L3 packet is received. Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-04 13:30:50 -08:00
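The header layout described in this commit can be sketched in plain userspace C. The function and macro names below are hypothetical, and choosing the protocol type from the IP version nibble of the packet is an assumption made for this illustration (the commit message only says the type depends on which L3 packet is received). The EtherType values 0x0800 (IPv4) and 0x86DD (IPv6) are the standard ones.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_ETH_ALEN 6   /* bytes in a MAC address */
#define DEMO_ETH_HLEN 14  /* dst MAC + src MAC + EtherType */

/*
 * Fill an Ethernet header for a decapsulated L3 packet: the device MAC is
 * used as both destination and source, and the EtherType follows the IP
 * version found in the first nibble of the packet (assumption of this
 * sketch). Returns 0 on success, -1 if the packet is empty or not IPv4/IPv6.
 */
int demo_build_eth(uint8_t hdr[DEMO_ETH_HLEN],
                   const uint8_t mac[DEMO_ETH_ALEN],
                   const uint8_t *l3, size_t l3_len)
{
    uint16_t proto;

    assert(hdr != NULL && mac != NULL);
    if (l3 == NULL || l3_len == 0)
        return -1;

    switch (l3[0] >> 4) {
    case 4:
        proto = 0x0800; /* IPv4 */
        break;
    case 6:
        proto = 0x86DD; /* IPv6 */
        break;
    default:
        return -1;
    }

    memcpy(hdr, mac, DEMO_ETH_ALEN);                 /* destination */
    memcpy(hdr + DEMO_ETH_ALEN, mac, DEMO_ETH_ALEN); /* source */
    hdr[12] = (uint8_t)(proto >> 8);                 /* network byte order */
    hdr[13] = (uint8_t)(proto & 0xff);
    return 0;
}
```

A caller would run this only under the two conditions stated in the message, that is, when a sniffer is attached and no MAC header is set yet; those checks are outside the scope of this sketch.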

970016 Commits