linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-22 12:11:40 +00:00

Author	SHA1	Message	Date
Menglong Dong	37653a0b8a	net: ip: make fib_validate_source() support drop reasons In this commit, we make fib_validate_source() and __fib_validate_source() return -reason instead of errno on error. The return value of fib_validate_source can be -errno, 0, and 1. It's hard to make fib_validate_source() return drop reasons directly. The fib_validate_source() will return 1 if the scope of the source(revert) route is HOST. And the __mkroute_input() will mark the skb with IPSKB_DOREDIRECT in this case (combine with some other conditions). And then, a REDIRECT ICMP will be sent in ip_forward() if this flag exists. We can't pass this information to __mkroute_input if we make fib_validate_source() return drop reasons. Therefore, we introduce the wrapper fib_validate_source_reason() for fib_validate_source(), which will return the drop reasons on error. In the origin logic, LINUX_MIB_IPRPFILTER will be counted if fib_validate_source() return -EXDEV. And now, we need to adjust it by checking "reason == SKB_DROP_REASON_IP_RPFILTER". However, this will take effect only after the patch "net: ip: make ip_route_input_noref() return drop reasons", as we can't pass the drop reasons from fib_validate_source() to ip_rcv_finish_core() in this patch. Following new drop reasons are added in this patch: SKB_DROP_REASON_IP_LOCAL_SOURCE SKB_DROP_REASON_IP_INVALID_SOURCE Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-11-12 11:24:50 +01:00
Jakub Kicinski	544070db6c	Merge branch 'mlx5-esw-qos-refactor-and-shampo-cleanup' Tariq Toukan says: ==================== mlx5 esw qos refactor and SHAMPO cleanup This patchset for the mlx5 core and Eth drivers consists of 3 parts. First patch by Patrisious improves the E-switch mode change operation. The following 6 patches by Carolina introduce further refactoring for the QoS handling, to set the foundation for future extensions. In the following 5 patches by Dragos, we enhance the SHAMPO datapath flow by simplifying some logic, and cleaning up the implementation. ==================== Link: https://patch.msgid.link/20241107194357.683732-1-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 19:28:20 -08:00
Dragos Tatulea	ab4219db89	net/mlx5e: SHAMPO, Rework header allocation loop The current loop code was based on the assumption that there can be page leftovers from previous function calls. This patch changes the allocation loop to make it clearer how pages get allocated every MLX5E_SHAMPO_WQ_HEADER_PER_PAGE headers. This change has no functional implications. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107194357.683732-13-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 19:28:18 -08:00
Dragos Tatulea	945ca432bf	net/mlx5e: SHAMPO, Drop info array The info array is used to store a pointer to the dma address of the header and to the frag page. However, this array is not really required: - The frag page can be calculated from the header index frag page index = header index / headers per page. - The dma address can be calculated through a formula: dma page address + header offset. This series gets rid of the info array and uses the above formulas instead. The current_page_index was used in conjunction with the info array to store page fragment indices. This variable is dropped as well. There was no performance regression observed. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107194357.683732-12-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 19:28:18 -08:00
Dragos Tatulea	4f56868b71	net/mlx5e: SHAMPO, Change frag page setup order during allocation Now that the UMR allocation has been simplified, it is no longer possible to have a leftover page from a previous call to mlx5e_build_shampo_hd_umr(). This patch simplifies the code by switching the order of operations: first take the frag page and then increment the index. This is more straightforward and it also paves the way for dropping the info array. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107194357.683732-11-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 19:28:18 -08:00
Dragos Tatulea	1a4b588577	net/mlx5e: SHAMPO, Fix page_index calculation inconsistency When calculating the index for the next frag page slot, the divisor is incorrect: it should be the number of pages per queue not the number of headers per queue. This is currently harmless because frag pages are not used directly, but they are intermediated through the info array. But it needs to be fixed as an upcoming patch will get rid of the info array. This patch introduces a new pages per queue variable and plugs it in the formula. Now that this variable exists, additional code can be simplified in the SHAMPO initialization code. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107194357.683732-10-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 19:28:18 -08:00
Dragos Tatulea	8a0ee54027	net/mlx5e: SHAMPO, Simplify UMR allocation for headers Allocating page fragments for header data split is currently more complicated than it should be. That's because the number of KSM entries allocated is not aligned to the number of headers per page. This leads to having leftovers in the next allocation which require additional accounting and needlessly complicated code. This patch aligns (down) the number of KSM entries in the UMR WQE to the number of headers per page by: 1) Aligning the max number of entries allocated per UMR WQE (max_ksm_entries) to MLX5E_SHAMPO_WQ_HEADER_PER_PAGE. 2) Aligning the total number of free headers to MLX5E_SHAMPO_WQ_HEADER_PER_PAGE. ... and then it drops the extra accounting code from mlx5e_build_shampo_hd_umr(). Although the number of entries allocated per UMR WQE is slightly smaller due to aligning down, no performance impact was observed. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107194357.683732-9-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 19:28:18 -08:00
Carolina Jubran	be034baba8	net/mlx5: Make vport QoS enablement more flexible for future extensions Refactor esw_qos_vport_enable to support more generic configurations, allowing it to be reused for new vport node types in future patches. This refactor includes a new way to change the vport parent node by disabling the current setup and re-enabling it with the new parent. This change sets the foundation for adapting configuration based on the parent type in future patches. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107194357.683732-8-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 19:28:18 -08:00
Carolina Jubran	d67bfd10e6	net/mlx5: Integrate esw_qos_vport_enable logic into rate operations Fold the esw_qos_vport_enable function into operations for configuring maximum and minimum rates, simplifying QoS logic. This change consolidates enabling and updating the scheduling element configuration, streamlining how vport QoS is initialized and adjusted. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107194357.683732-7-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 19:28:17 -08:00
Carolina Jubran	663bc605d0	net/mlx5: Generalize scheduling element operations Introduce helper functions to create and destroy scheduling elements, allowing flexible configuration for different scheduling element types. The new helper functions streamline the process by centralizing error handling and logging through esw_qos_sched_elem_op_warn, which now accepts the operation type (create, destroy, or modify). The changes also adjust the esw_qos_vport_enable and mlx5_esw_qos_vport_disable functions to leverage the new generalized create/destroy helpers. The destroy functions now log errors with esw_warn without returning them. This prevents unnecessary error handling since the node was already destroyed and no further action is required from callers. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107194357.683732-6-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 19:28:17 -08:00
Carolina Jubran	cc4bb15ffa	net/mlx5: Refactor scheduling element configuration bitmasks Refactor esw_qos_sched_elem_config to set bitmasks only when max_rate or bw_share values change, allowing the function to configure nodes with only one of these parameters. This enables more flexible usage for nodes where only one parameter requires configuration. Remove scattered assignments and checks to centralize them within this function, removing the now redundant esw_qos_set_node_max_rate entirely. With this refactor, also remove the assignment of the vport scheduling node max rate to the parent max rate for unlimited vports (where max rate is set to zero), as firmware already handles this behavior. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107194357.683732-5-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 19:28:17 -08:00
Carolina Jubran	ac778fefed	net/mlx5: Generalize max_rate and min_rate setting for nodes Refactor max_rate and min_rate setting functions to operate on mlx5_esw_sched_node, allowing for generalized handling of both vports and nodes. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107194357.683732-4-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 19:28:17 -08:00
Carolina Jubran	5a73185765	net/mlx5: Simplify QoS normalization by removing error handling This change updates esw_qos_normalize_min_rate to not return errors, significantly simplifying the code. Normalization failures are software bugs, and it's unnecessary to handle them with rollback mechanisms. Instead, `esw_qos_update_sched_node_bw_share` and `esw_qos_normalize_min_rate` now return void, with any errors logged as warnings to indicate potential software issues. This approach avoids compensating for hidden bugs and removes error handling from all places that perform normalization, streamlining future patches. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107194357.683732-3-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 19:28:17 -08:00
Patrisious Haddad	ab85ebf437	net/mlx5: E-switch, refactor eswitch mode change The E-switch mode was previously updated before removing and re-adding the IB device, which could cause a temporary mismatch between the E-switch mode and the IB device configuration. To prevent this discrepancy, the IB device is now removed first, then the E-switch mode is updated, and finally, the IB device is re-added. This sequence ensures consistent alignment between the E-switch mode and the IB device whenever the mode changes, regardless of the new mode value. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107194357.683732-2-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 19:28:17 -08:00
Vladimir Vdovin	7d3f3b4367	net: ipv4: Cache pmtu for all packet paths if multipath enabled Check number of paths by fib_info_num_path(), and update_or_create_fnhe() for every path. Problem is that pmtu is cached only for the oif that has received icmp message "need to frag", other oifs will still try to use "default" iface mtu. An example topology showing the problem: \| host1 +---------+ \| dummy0 \| 10.179.20.18/32 mtu9000 +---------+ +-----------+----------------+ +---------+ +---------+ \| ens17f0 \| 10.179.2.141/31 \| ens17f1 \| 10.179.2.13/31 +---------+ +---------+ \| (all here have mtu 9000) \| +------+ +------+ \| ro1 \| 10.179.2.140/31 \| ro2 \| 10.179.2.12/31 +------+ +------+ \| \| ---------+------------+-------------------+------ \| +-----+ \| ro3 \| 10.10.10.10 mtu1500 +-----+ \| ======================================== some networks ======================================== \| +-----+ \| eth0\| 10.10.30.30 mtu9000 +-----+ \| host2 host1 have enabled multipath and sysctl net.ipv4.fib_multipath_hash_policy = 1: default proto static src 10.179.20.18 nexthop via 10.179.2.12 dev ens17f1 weight 1 nexthop via 10.179.2.140 dev ens17f0 weight 1 When host1 tries to do pmtud from 10.179.20.18/32 to host2, host1 receives at ens17f1 iface an icmp packet from ro3 that ro3 mtu=1500. And host1 caches it in nexthop exceptions cache. Problem is that it is cached only for the iface that has received icmp, and there is no way that ro3 will send icmp msg to host1 via another path. Host1 now have this routes to host2: ip r g 10.10.30.30 sport 30000 dport 443 10.10.30.30 via 10.179.2.12 dev ens17f1 src 10.179.20.18 uid 0 cache expires 521sec mtu 1500 ip r g 10.10.30.30 sport 30033 dport 443 10.10.30.30 via 10.179.2.140 dev ens17f0 src 10.179.20.18 uid 0 cache So when host1 tries again to reach host2 with mtu>1500, if packet flow is lucky enough to be hashed with oif=ens17f1 its ok, if oif=ens17f0 it blackholes and still gets icmp msgs from ro3 to ens17f1, until lucky day when ro3 will send it through another flow to ens17f0. Signed-off-by: Vladimir Vdovin <deliran@verdict.gg> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20241108093427.317942-1-deliran@verdict.gg Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 19:07:36 -08:00
Breno Leitao	43271bb5bf	net: netconsole: selftests: Check if netdevsim is available The netconsole selftest relies on the availability of the netdevsim module. To ensure the test can run correctly, we need to check if the netdevsim module is either loaded or built-in before proceeding. Update the netconsole selftest to check for the existence of the /sys/bus/netdevsim/new_device file before running the test. If the file is not found, the test is skipped with an explanation that the CONFIG_NETDEVSIM kernel config option may not be enabled. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20241108-netcon_selftest_deps-v1-1-1789cbf3adcd@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 19:05:48 -08:00
Jakub Kicinski	874ed898a2	Merge branch 'net-phylink-phylink_resolve-cleanups' Russell King says: ==================== net: phylink: phylink_resolve() cleanups This series does a bit of clean-up in phylink_resolve() to make the code a little easier to follow. Patch 1 moves the manual flow control setting in two of the switch cases to after the switch(). Patch 2 changes the MLO_AN_FIXED case to be a simple if() statement, reducing its indentation. Patch 3 changes the MLO_AN_PHY case to also be a simple if() statment, also reducing its indentation. Patch 4 does the same for the last case. Patch 5 reformats the code and comments for the reduced indentation, making it easier to read. ==================== Link: https://patch.msgid.link/Zy411lVWe2SikuOs@shell.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 19:05:16 -08:00
Russell King (Oracle)	bc08ce37d9	net: phylink: clean up phylink_resolve() Now that we have reduced the indentation level, clean up the code formatting. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/E1t9RQz-002Ff5-EA@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 19:05:14 -08:00
Russell King (Oracle)	d1a16dbbd8	net: phylink: remove switch() statement in resolve handling The switch() statement doesn't sit very well with the preceeding if() statements, so let's just convert everything to if()s. As a result of the two preceding commits, there is now only one case in the switch() statement. Remove the switch statement and reduce the code indentation. Code reformatting will be in the following commit. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/E1t9RQu-002Fez-AA@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 19:05:14 -08:00
Russell King (Oracle)	f0f46c2a3d	net: phylink: move MLO_AN_PHY resolve handling to if() statement The switch() statement doesn't sit very well with the preceeding if() statements, and results in excessive indentation that spoils code readability. Continue cleaning this up by converting the MLO_AN_PHY case to use an if() statmeent. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/E1t9RQp-002Fet-5W@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 19:05:14 -08:00
Russell King (Oracle)	92abfcb4ce	net: phylink: move MLO_AN_FIXED resolve handling to if() statement The switch() statement doesn't sit very well with the preceeding if() statements, and results in excessive indentation that spoils code readability. Begin cleaning this up by converting the MLO_AN_FIXED case to an if() statement. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/E1t9RQk-002Fen-1A@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 19:05:13 -08:00
Russell King (Oracle)	8cc5f4cb94	net: phylink: move manual flow control setting Move the handling of manual flow control configuration to a common location during resolve. We currently evaluate this for all but fixed links. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/E1t9RQe-002Feh-T1@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 19:05:13 -08:00
Jakub Kicinski	80b6f09475	Merge branch 'suspend-irqs-during-application-busy-periods' Joe Damato says: ==================== Suspend IRQs during application busy periods This series introduces a new mechanism, IRQ suspension, which allows network applications using epoll to mask IRQs during periods of high traffic while also reducing tail latency (compared to existing mechanisms, see below) during periods of low traffic. In doing so, this balances CPU consumption with network processing efficiency. Martin Karsten (CC'd) and I have been collaborating on this series for several months and have appreciated the feedback from the community on our RFC [1]. We've updated the cover letter and kernel documentation in an attempt to more clearly explain how this mechanism works, how applications can use it, and how it compares to existing mechanisms in the kernel. I briefly mentioned this idea at netdev conf 2024 (for those who were there) and Martin described this idea in an earlier paper presented at Sigmetrics 2024 [2]. ~ The short explanation (TL;DR) We propose adding a new napi config parameter: irq_suspend_timeout to help balance CPU usage and network processing efficiency when using IRQ deferral and napi busy poll. If this parameter is set to a non-zero value and a user application has enabled preferred busy poll on a busy poll context (via the EPIOCSPARAMS ioctl introduced in commit `18e2bf0edf` ("eventpoll: Add epoll ioctl for epoll_params")), then application calls to epoll_wait for that context will cause device IRQs and softirq processing to be suspended as long as epoll_wait successfully retrieves data from the NAPI. Each time data is retrieved, the irq_suspend_timeout is deferred. If/when network traffic subsides and epoll_wait returns no data, IRQ suspension is immediately reverted back to the existing napi_defer_hard_irqs and gro_flush_timeout mechanism which was introduced in commit `6f8b12d661` ("net: napi: add hard irqs deferral feature")). The irq_suspend_timeout serves as a safety mechanism. If userland takes a long time processing data, irq_suspend_timeout will fire and restart normal NAPI processing. For a more in depth explanation, please continue reading. ~ Comparison with existing mechanisms Interrupt mitigation can be accomplished in napi software, by setting napi_defer_hard_irqs and gro_flush_timeout, or via interrupt coalescing in the NIC. This can be quite efficient, but in both cases, a fixed timeout (or packet count) needs to be configured. However, a fixed timeout cannot effectively support both low- and high-load situations: At low load, an application typically processes a few requests and then waits to receive more input data. In this scenario, a large timeout will cause unnecessary latency. At high load, an application typically processes many requests before being ready to receive more input data. In this case, a small timeout will likely fire prematurely and trigger irq/softirq processing, which interferes with the application's execution. This causes overhead, most likely due to cache contention. While NICs attempt to provide adaptive interrupt coalescing schemes, these cannot properly take into account application-level processing. An alternative packet delivery mechanism is busy-polling, which results in perfect alignment of application processing and network polling. It delivers optimal performance (throughput and latency), but results in 100% cpu utilization and is thus inefficient for below-capacity workloads. We propose to add a new packet delivery mode that properly alternates between busy polling and interrupt-based delivery depending on busy and idle periods of the application. During a busy period, the system operates in busy-polling mode, which avoids interference. During an idle period, the system falls back to interrupt deferral, but with a small timeout to avoid excessive latencies. This delivery mode can also be viewed as an extension of basic interrupt deferral, but alternating between a small and a very large timeout. This delivery mode is efficient, because it avoids softirq execution interfering with application processing during busy periods. It can be used with blocking epoll_wait to conserve cpu cycles during idle periods. The effect of alternating between busy and idle periods is that performance (throughput and latency) is very close to full busy polling, while cpu utilization is lower and very close to interrupt mitigation. ~ Usage details IRQ suspension is introduced via a per-NAPI configuration parameter that controls the maximum time that IRQs can be suspended. Here's how it is intended to work: - The user application (or system administrator) uses the netdev-genl netlink interface to set the pre-existing napi_defer_hard_irqs and gro_flush_timeout NAPI config parameters to enable IRQ deferral. - The user application (or system administrator) sets the proposed irq_suspend_timeout parameter via the netdev-genl netlink interface to a larger value than gro_flush_timeout to enable IRQ suspension. - The user application issues the existing epoll ioctl to set the prefer_busy_poll flag on the epoll context. - The user application then calls epoll_wait to busy poll for network events, as it normally would. - If epoll_wait returns events to userland, IRQs are suspended for the duration of irq_suspend_timeout. - If epoll_wait finds no events and the thread is about to go to sleep, IRQ handling using napi_defer_hard_irqs and gro_flush_timeout is resumed. As long as epoll_wait is retrieving events, IRQs (and softirq processing) for the NAPI being polled remain disabled. When network traffic reduces, eventually a busy poll loop in the kernel will retrieve no data. When this occurs, regular IRQ deferral using gro_flush_timeout for the polled NAPI is re-enabled. Unless IRQ suspension is continued by subsequent calls to epoll_wait, it automatically times out after the irq_suspend_timeout timer expires. Regular deferral is also immediately re-enabled when the epoll context is destroyed. ~ Usage scenario The target scenario for IRQ suspension as packet delivery mode is a system that runs a dominant application with substantial network I/O. The target application can be configured to receive input data up to a certain batch size (via epoll_wait maxevents parameter) and this batch size determines the worst-case latency that application requests might experience. Because packet delivery is suspended during the target application's processing, the batch size also determines the worst-case latency of concurrent applications using the same RX queue(s). gro_flush_timeout should be set as small as possible, but large enough to make sure that a single request is likely not being interfered with. irq_suspend_timeout is largely a safety mechanism against misbehaving applications. It should be set large enough to cover the processing of an entire application batch, i.e., the factor between gro_flush_timeout and irq_suspend_timeout should roughly correspond to the maximum batch size that the target application would process in one go. ~ Important call out in the implementation - Enabling per epoll-context preferred busy poll will now effectively lead to a nonblocking iteration through napi_busy_loop, even when busy_poll_usecs is 0. See patch 4. ~ Benchmark configs & descriptions The changes were benchmarked with memcached [3] using the benchmarking tool mutilate [4]. To facilitate benchmarking, a small patch [5] was applied to memcached 1.6.29 to allow setting per-epoll context preferred busy poll and other settings via environment variables. Another small patch [6] was applied to libevent to enable full busy-polling. Multiple scenarios were benchmarked as described below and the scripts used for producing these results can be found on github [7] (note: all scenarios use NAPI-based traffic splitting via SO_INCOMING_ID by passing -N to memcached): - base: - no other options enabled - deferX: - set defer_hard_irqs to 100 - set gro_flush_timeout to X,000 - napibusy: - set defer_hard_irqs to 100 - set gro_flush_timeout to 200,000 - enable busy poll via the existing ioctl (busy_poll_usecs = 64, busy_poll_budget = 64, prefer_busy_poll = true) - fullbusy: - set defer_hard_irqs to 100 - set gro_flush_timeout to 5,000,000 - enable busy poll via the existing ioctl (busy_poll_usecs = 1000, busy_poll_budget = 64, prefer_busy_poll = true) - change memcached's nonblocking epoll_wait invocation (via libevent) to using a 1 ms timeout - suspend0: - set defer_hard_irqs to 0 - set gro_flush_timeout to 0 - set irq_suspend_timeout to 20,000,000 - enable busy poll via the existing ioctl (busy_poll_usecs = 0, busy_poll_budget = 64, prefer_busy_poll = true) - suspendX: - set defer_hard_irqs to 100 - set gro_flush_timeout to X,000 - set irq_suspend_timeout to 20,000,000 - enable busy poll via the existing ioctl (busy_poll_usecs = 0, busy_poll_budget = 64, prefer_busy_poll = true) ~ Benchmark results Tested on: Single socket AMD EPYC 7662 64-Core Processor Hyperthreading disabled 4 NUMA Zones (NPS=4) 16 CPUs per NUMA zone (64 cores total) 2 x Dual port 100gbps Mellanox Technologies ConnectX-5 Ex EN NIC The test machine is configured such that a single interface has 8 RX queues. The queues' IRQs and memcached are pinned to CPUs that are NUMA-local to the interface which is under test. The NIC's interrupt coalescing configuration is left at boot-time defaults. Results: Results are shown below. The mechanism added by this series is represented by the 'suspend' cases. Data presented shows a summary over nearly 10 runs of each test case [8] using the scripts on github [7]. For latency, the median is shown. For throughput and CPU utilization, the average is shown. The results also include cycles-per-query (cpq) and instruction-per-query (ipq) metrics, following the methodology proposed in [2], to augment the CPU utilization numbers, which could be skewed due to frequency scaling. We find that this does not appear to be the case as CPU utilization and low-level metrics show similar trends. These results were captured using the scripts on github [7] to illustrate how this approach compares with other pre-existing mechanisms. This data is not to be interpreted as scientific data captured in a fully isolated lab setting, but instead as best effort, illustrative information comparing and contrasting tradeoffs. The absolute QPS results shift between submissions, but the relative differences are equivalent. As patches are rebased, several factors likely influence overall performance. Compare: - Throughput (MAX) and latencies of base vs suspend. - CPU usage of napibusy and fullbusy during lower load (200K, 400K for example) vs suspend. - Latency of the defer variants vs suspend as timeout and load increases. - suspend0, which sets defer_hard_irqs and gro_flush_timeout to 0, has nearly the same performance as the base case (this is FAQ item #1). The overall takeaway is that the suspend variants provide a superior combination of high throughput, low latency, and low cpu utilization compared to all other variants. Each of the suspend variants works very well, but some fine-tuning between latency and cpu utilization is still possible by tuning the small timeout (gro_flush_timeout). Note: we've reorganized the results to make comparison among testcases with the same load easier. testcase load qps avglat 95%lat 99%lat cpu cpq ipq base 200K 199946 112 239 416 26 12973 11343 defer10 200K 199971 54 124 142 29 19412 17460 defer20 200K 199986 60 130 153 26 15644 14095 defer50 200K 200025 79 144 182 23 12122 11632 defer200 200K 199999 164 254 309 19 8923 9635 fullbusy 200K 199998 46 118 133 100 43658 23133 napibusy 200K 199983 100 237 277 56 24840 24716 suspend0 200K 200020 105 249 432 30 14264 11796 suspend10 200K 199950 53 123 141 32 19518 16903 suspend20 200K 200037 58 126 151 30 16426 14736 suspend50 200K 199961 73 136 177 26 13310 12633 suspend200 200K 199998 149 251 306 21 9566 10203 testcase load qps avglat 95%lat 99%lat cpu cpq ipq base 400K 400014 139 269 707 41 9476 9343 defer10 400K 400016 59 133 166 53 13991 12989 defer20 400K 399952 67 140 172 47 12063 11644 defer50 400K 400007 87 162 198 39 9384 9880 defer200 400K 399979 181 274 330 31 7089 8430 fullbusy 400K 399987 50 123 156 100 21827 16037 napibusy 400K 400014 76 222 272 83 18185 16529 suspend0 400K 400015 127 350 776 47 10699 9603 suspend10 400K 400023 57 129 164 54 13758 13178 suspend20 400K 400043 62 135 169 49 12071 11826 suspend50 400K 400071 76 149 186 42 10011 10301 suspend200 400K 399961 154 269 327 34 7827 8774 testcase load qps avglat 95%lat 99%lat cpu cpq ipq base 600K 599951 149 266 574 61 9265 8876 defer10 600K 600006 71 147 203 76 11866 10936 defer20 600K 600123 76 152 203 66 10430 10342 defer50 600K 600162 95 172 217 54 8526 9142 defer200 600K 599942 200 301 357 46 6977 8212 fullbusy 600K 599990 55 127 177 100 14551 13983 napibusy 600K 600035 63 160 250 96 13937 14140 suspend0 600K 599903 127 320 732 68 10166 8963 suspend10 600K 599908 63 137 192 69 10902 11100 suspend20 600K 599961 66 141 194 65 9976 10370 suspend50 600K 599973 80 159 204 57 8678 9381 suspend200 600K 600010 157 277 346 48 7133 8381 testcase load qps avglat 95%lat 99%lat cpu cpq ipq base 800K 800039 181 300 536 87 9585 8304 defer10 800K 800038 181 530 939 96 10564 8970 defer20 800K 800029 112 225 329 90 10056 8935 defer50 800K 799999 120 208 296 82 9234 8562 defer200 800K 800066 227 338 401 63 7117 8129 fullbusy 800K 800040 61 134 190 100 10913 12608 napibusy 800K 799944 64 141 214 99 10828 12588 suspend0 800K 799911 126 248 509 85 9346 8498 suspend10 800K 800006 69 143 200 83 9410 9845 suspend20 800K 800120 74 150 207 78 8786 9454 suspend50 800K 799989 87 168 224 71 7946 8833 suspend200 800K 799987 160 292 357 62 6923 8229 testcase load qps avglat 95%lat 99%lat cpu cpq ipq base 1000K 906879 4079 5751 6216 98 9496 7904 defer10 1000K 860849 3643 6274 6730 99 10040 8676 defer20 1000K 896063 3298 5840 6349 98 9620 8237 defer50 1000K 919782 2962 5513 5807 97 9284 7951 defer200 1000K 970941 3059 5348 5984 95 8593 7959 fullbusy 1000K 999950 70 150 207 100 8732 10777 napibusy 1000K 999996 78 154 223 100 8722 10656 suspend0 1000K 949706 2666 5770 6660 99 9071 8046 suspend10 1000K 1000024 80 160 220 92 8137 9035 suspend20 1000K 1000059 83 165 226 89 7850 8804 suspend50 1000K 999955 95 180 240 84 7411 8459 suspend200 1000K 999914 163 299 366 77 6833 8078 testcase load qps avglat 95%lat 99%lat cpu cpq ipq base MAX `1037654` 4184 5453 5810 100 8411 7938 defer10 MAX 905607 4840 6151 6380 100 9639 8431 defer20 MAX 986463 4455 5594 5796 100 8848 8110 defer50 MAX 1077030 4000 5073 5299 100 8104 7920 defer200 MAX 1040728 4152 5385 5765 100 8379 7849 fullbusy MAX 1247536 3518 3935 3984 100 6998 7930 napibusy MAX 1136310 3799 7756 9964 100 7670 7877 suspend0 MAX 1057509 4132 5724 6185 100 8253 7918 suspend10 MAX `1215147` 3580 3957 4041 100 7185 7944 suspend20 MAX 1216469 3576 3953 3988 100 7175 7950 suspend50 MAX 1215871 3577 3961 4075 100 7181 7949 suspend200 MAX 1216882 3556 3951 3988 100 7175 7955 ~ FAQ - Why is a new parameter needed? Does irq_suspend_timeout override gro_flush_timeout? Using the suspend mechanism causes the system to alternate between polling mode and irq-driven packet delivery. During busy periods, irq_suspend_timeout overrides gro_flush_timeout and keeps the system busy polling, but when epoll finds no events, the setting of gro_flush_timeout and napi_defer_hard_irqs determine the next step. There are essentially three possible loops for network processing and packet delivery: 1) hardirq -> softirq -> napi poll; basic interrupt delivery 2) timer -> softirq -> napi poll; deferred irq processing 3) epoll -> busy-poll -> napi poll; busy looping Loop 2 can take control from Loop 1, if gro_flush_timeout and napi_defer_hard_irqs are set. If gro_flush_timeout and napi_defer_hard_irqs are set, Loops 2 and 3 "wrestle" with each other for control. During busy periods, irq_suspend_timeout is used as timer in Loop 2, which essentially tilts this in favour of Loop 3. If gro_flush_timeout and napi_defer_hard_irqs are not set, Loop 3 cannot take control from Loop 1. Therefore, setting gro_flush_timeout and napi_defer_hard_irqs is the recommended usage, because otherwise setting irq_suspend_timeout might not have any discernible effect. This is shown in the results above: compare suspend0 with the base case. Note that the lack of napi_defer_hard_irqs and gro_flush_timeout produce similar results for both, which encourages the use of napi_defer_hard_irqs and gro_flush_timeout in addition to irq_suspend_timeout. - Can the new timeout value be threaded through the new epoll ioctl ? It is possible, but presents challenges for userspace. User applications must ensure that the file descriptors added to epoll contexts have the same NAPI ID to support busy polling. An epoll context is not permanently tied to any particular NAPI ID. So, a user application could decide to clear the file descriptors from the context and add a new set of file descriptors with a different NAPI ID to the context. Busy polling would work as expected, but the meaning of the suspend timeout becomes ambiguous because IRQs are not inherently associated with epoll contexts, but rather with the NAPI. The user program would need to reissue the ioctl to set the irq_suspend_timeout, but the napi_defer_hard_irqs and gro_flush_timeout settings would come from the NAPI's napi_config (which are set either by sysfs or by netlink). Such an interface seems awkard to use from a user perspective. Further, IRQs are related to NAPIs, which is why they are stored in the napi_config space. Putting the irq_suspend_timeout in the epoll context while other IRQ deferral mechanisms remain in the NAPI's napi_config space seems like an odd design choice. We've opted to keep all of the IRQ deferral parameters together and place the irq_suspend_timeout in napi_config. This has nice benefits for userspace: if a user app were to remove all file descriptors from an epoll context and add new file descriptors with a new NAPI ID, the correct suspend timeout for that NAPI ID would be used automatically without the user application needing to do anything (like re-issuing an ioctl, for example). All IRQ deferral related parameters are in one place and can all be set the same way: with netlink. - Can irq suspend be built by combining NIC coalescing and gro_flush_timeout ? No. The problem is that the long timeout must engage if and only if prefer-busy is active. When using NIC coalescing for the short timeout (without napi_defer_hard_irqs/gro_flush_timeout), an interrupt after an idle period will trigger softirq, which will run napi polling. At this point, prefer-busy is not active, so NIC interrupts would be re-enabled. Then it is not possible for the longer timeout to interject to switch control back to polling. In other words, only by using the software timer for the short timeout, it is possible to extend the timeout without having to reprogram the NIC timer or reach down directly and disable interrupts. Using gro_flush_timeout for the long timeout also has problems, for the same underlying reason. In the current napi implementation, gro_flush_timeout is not tied to prefer-busy. We'd either have to change that and in the process modify the existing deferral mechanism, or introduce a state variable to determine whether gro_flush_timeout is used as long timeout for irq suspend or whether it is used for its default purpose. In an earlier version, we did try something similar to the latter and made it work, but it ends up being a lot more convoluted than our current proposal. - Isn't it already possible to combine busy looping with irq deferral? Yes, in fact enabling irq deferral via napi_defer_hard_irqs and gro_flush_timeout is a precondition for prefer_busy_poll to have an effect. If the application also uses a tight busy loop with essentially nonblocking epoll_wait (accomplished with a very short timeout parameter), this is the fullbusy case shown in the results. An application using blocking epoll_wait is shown as the napibusy case in the results. It's a hybrid approach that provides limited latency benefits compared to the base case and plain irq deferral, but not as good as fullbusy or suspend. ~ Special thanks Several people were involved in earlier stages of the development of this mechanism whom we'd like to thank: - Peter Cai (CC'd), for the initial kernel patch and his contributions to the paper. - Mohammadamin Shafie (CC'd), for testing various versions of the kernel patch and providing helpful feedback. Thanks, Martin and Joe [1]: https://lore.kernel.org/netdev/20240812125717.413108-1-jdamato@fastly.com/ [2]: https://doi.org/10.1145/3626780 [3]: https://github.com/memcached/memcached/blob/master/doc/napi_ids.txt [4]: https://github.com/leverich/mutilate [5]: https://raw.githubusercontent.com/martinkarsten/irqsuspend/main/patches/memcached.patch [6]: https://raw.githubusercontent.com/martinkarsten/irqsuspend/main/patches/libevent.patch [7]: https://github.com/martinkarsten/irqsuspend [8]: https://github.com/martinkarsten/irqsuspend/tree/main/results v8: https://lore.kernel.org/20241108045337.292905-1-jdamato@fastly.com v7: https://lore.kernel.org/20241108023912.98416-1-jdamato@fastly.com v6: https://lore.kernel.org/20241104215542.215919-1-jdamato@fastly.com v5: https://lore.kernel.org/20241103052421.518856-1-jdamato@fastly.com v4: https://lore.kernel.org/20241102005214.32443-1-jdamato@fastly.com v3: https://lore.kernel.org/20241101004846.32532-1-jdamato@fastly.com v2: https://lore.kernel.org/20241021015311.95468-1-jdamato@fastly.com ==================== Link: https://patch.msgid.link/20241109050245.191288-1-jdamato@fastly.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 18:45:09 -08:00
Joe Damato	a90a91e24b	docs: networking: Describe irq suspension Describe irq suspension, the epoll ioctls, and the tradeoffs of using different gro_flush_timeout values. Signed-off-by: Joe Damato <jdamato@fastly.com> Co-developed-by: Martin Karsten <mkarsten@uwaterloo.ca> Signed-off-by: Martin Karsten <mkarsten@uwaterloo.ca> Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> Link: https://patch.msgid.link/20241109050245.191288-7-jdamato@fastly.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 18:45:06 -08:00
Joe Damato	347fcdc414	selftests: net: Add busy_poll_test Add an epoll busy poll test using netdevsim. This test is comprised of: - busy_poller (via busy_poller.c) - busy_poll_test.sh which loads netdevsim, sets up network namespaces, and runs busy_poller to receive data and socat to send data. The selftest tests two different scenarios: - busy poll (the pre-existing version in the kernel) - busy poll with suspend enabled (what this series adds) The data transmit is a 1MiB temporary file generated from /dev/urandom and the test is considered passing if the md5sum of the input file to socat matches the md5sum of the output file from busy_poller. netdevsim was chosen instead of veth due to netdevsim's support for netdev-genl. For now, this test uses the functionality that netdevsim provides. In the future, perhaps netdevsim can be extended to emulate device IRQs to more thoroughly test all pre-existing kernel options (like defer_hard_irqs) and suspend. Signed-off-by: Joe Damato <jdamato@fastly.com> Co-developed-by: Martin Karsten <mkarsten@uwaterloo.ca> Signed-off-by: Martin Karsten <mkarsten@uwaterloo.ca> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20241109050245.191288-6-jdamato@fastly.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 18:45:06 -08:00
Martin Karsten	8a6de2627f	eventpoll: Control irq suspension for prefer_busy_poll When events are reported to userland and prefer_busy_poll is set, irqs are temporarily suspended using napi_suspend_irqs. If no events are found and ep_poll would go to sleep, irq suspension is cancelled using napi_resume_irqs. Signed-off-by: Martin Karsten <mkarsten@uwaterloo.ca> Co-developed-by: Joe Damato <jdamato@fastly.com> Signed-off-by: Joe Damato <jdamato@fastly.com> Tested-by: Joe Damato <jdamato@fastly.com> Tested-by: Martin Karsten <mkarsten@uwaterloo.ca> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Link: https://patch.msgid.link/20241109050245.191288-5-jdamato@fastly.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 18:45:06 -08:00
Martin Karsten	ab5b28b007	eventpoll: Trigger napi_busy_loop, if prefer_busy_poll is set Setting prefer_busy_poll now leads to an effectively nonblocking iteration though napi_busy_loop, even when busy_poll_usecs is 0. Signed-off-by: Martin Karsten <mkarsten@uwaterloo.ca> Co-developed-by: Joe Damato <jdamato@fastly.com> Signed-off-by: Joe Damato <jdamato@fastly.com> Tested-by: Joe Damato <jdamato@fastly.com> Tested-by: Martin Karsten <mkarsten@uwaterloo.ca> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Link: https://patch.msgid.link/20241109050245.191288-4-jdamato@fastly.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 18:45:06 -08:00
Martin Karsten	3fcbecbdeb	net: Add control functions for irq suspension The napi_suspend_irqs routine bootstraps irq suspension by elongating the defer timeout to irq_suspend_timeout. The napi_resume_irqs routine effectively cancels irq suspension by forcing the napi to be scheduled immediately. Signed-off-by: Martin Karsten <mkarsten@uwaterloo.ca> Co-developed-by: Joe Damato <jdamato@fastly.com> Signed-off-by: Joe Damato <jdamato@fastly.com> Tested-by: Joe Damato <jdamato@fastly.com> Tested-by: Martin Karsten <mkarsten@uwaterloo.ca> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Link: https://patch.msgid.link/20241109050245.191288-3-jdamato@fastly.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 18:45:06 -08:00
Martin Karsten	5dc51ec86d	net: Add napi_struct parameter irq_suspend_timeout Add a per-NAPI IRQ suspension parameter, which can be get/set with netdev-genl. This patch doesn't change any behavior but prepares the code for other changes in the following commits which use irq_suspend_timeout as a timeout for IRQ suspension. Signed-off-by: Martin Karsten <mkarsten@uwaterloo.ca> Co-developed-by: Joe Damato <jdamato@fastly.com> Signed-off-by: Joe Damato <jdamato@fastly.com> Tested-by: Joe Damato <jdamato@fastly.com> Tested-by: Martin Karsten <mkarsten@uwaterloo.ca> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Link: https://patch.msgid.link/20241109050245.191288-2-jdamato@fastly.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 18:45:05 -08:00
Vadim Fedorenko	f0fe51a043	bnxt_en: add unlocked version of bnxt_refclk_read Serialization of PHC read with FW reset mechanism uses ptp_lock which also protects timecounter updates. This means we cannot grab it when called from bnxt_cc_read(). Let's move locking into different function. Fixes: `6c0828d00f` ("bnxt_en: replace PTP spinlock with seqlock") Signed-off-by: Vadim Fedorenko <vadfed@meta.com> Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Link: https://patch.msgid.link/20241107214917.2980976-1-vadfed@meta.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 17:32:29 -08:00
Jakub Kicinski	5ffb7ac02d	Merge branch 'rtnetlink-convert-rtnl_newlink-to-per-netns-rtnl' Kuniyuki Iwashima says: ==================== rtnetlink: Convert rtnl_newlink() to per-netns RTNL. Patch 1 - 3 removes __rtnl_link_unregister and protect link_ops by its dedicated mutex to move synchronize_srcu() out of RTNL scope. Patch 4 introduces struct rtnl_nets and helper functions to acquire multiple per-netns RTNL in rtnl_newlink(). Patch 5 - 8 are to prefetch the peer device's netns in rtnl_newlink(). Patch 9 converts rtnl_newlink() to per-netns RTNL. Patch 10 pushes RTNL down to rtnl_dellink() and rtnl_setlink(), but the conversion will not be completed unless we support cases with peer/upper/lower devices. I confirmed v3 survived ./rtnetlink.sh; rmmod netdevsim.ko; without lockdep splat. v3: https://lore.kernel.org/20241107022900.70287-1-kuniyu@amazon.com v2: https://lore.kernel.org/20241106022432.13065-1-kuniyu@amazon.com v1: https://lore.kernel.org/20241105020514.41963-1-kuniyu@amazon.com ==================== Link: https://patch.msgid.link/20241108004823.29419-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 17:26:54 -08:00
Kuniyuki Iwashima	636af13f21	rtnetlink: Register rtnl_dellink() and rtnl_setlink() with RTNL_FLAG_DOIT_PERNET_WIP. Currently, rtnl_setlink() and rtnl_dellink() cannot be fully converted to per-netns RTNL due to a lack of handling peer/lower/upper devices in different netns. For example, when we change a device in rtnl_setlink() and need to propagate that to its upper devices, we want to avoid acquiring all netns locks, for which we do not know the upper limit. The same situation happens when we remove a device. rtnl_dellink() could be transformed to remove a single device in the requested netns and delegate other devices to per-netns work, and rtnl_setlink() might be ? Until we come up with a better idea, let's use a new flag RTNL_FLAG_DOIT_PERNET_WIP for rtnl_dellink() and rtnl_setlink(). This will unblock converting RTNL users where such devices are not related. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20241108004823.29419-11-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 17:26:52 -08:00
Kuniyuki Iwashima	d91191ffe2	rtnetlink: Convert RTM_NEWLINK to per-netns RTNL. Now, we are ready to convert rtnl_newlink() to per-netns RTNL; rtnl_link_ops is protected by SRCU and netns is prefetched in rtnl_newlink(). Let's register rtnl_newlink() with RTNL_FLAG_DOIT_PERNET and push RTNL down as rtnl_nets_lock(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20241108004823.29419-10-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 17:26:52 -08:00
Kuniyuki Iwashima	fefd5d0821	netkit: Set IFLA_NETKIT_PEER_INFO to netkit_link_ops.peer_type. For per-netns RTNL, we need to prefetch the peer device's netns. Let's set rtnl_link_ops.peer_type and accordingly remove duplicated validation in ->newlink(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20241108004823.29419-9-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 17:26:52 -08:00
Kuniyuki Iwashima	6b84e558e9	vxcan: Set VXCAN_INFO_PEER to vxcan_link_ops.peer_type. For per-netns RTNL, we need to prefetch the peer device's netns. Let's set rtnl_link_ops.peer_type and accordingly remove duplicated validation in ->newlink(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20241108004823.29419-8-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 17:26:52 -08:00
Kuniyuki Iwashima	0eb87b02a7	veth: Set VETH_INFO_PEER to veth_link_ops.peer_type. For per-netns RTNL, we need to prefetch the peer device's netns. Let's set rtnl_link_ops.peer_type and accordingly remove duplicated validation in ->newlink(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20241108004823.29419-7-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 17:26:52 -08:00
Kuniyuki Iwashima	28690e5361	rtnetlink: Add peer_type in struct rtnl_link_ops. In ops->newlink(), veth, vxcan, and netkit call rtnl_link_get_net() with a net pointer, which is the first argument of ->newlink(). rtnl_link_get_net() could return another netns based on IFLA_NET_NS_PID and IFLA_NET_NS_FD in the peer device's attributes. We want to get it and fill rtnl_nets->nets[] in advance in rtnl_newlink() for per-netns RTNL. All of the three get the peer netns in the same way: 1. Call rtnl_nla_parse_ifinfomsg() 2. Call ops->validate() (vxcan doesn't have) 3. Call rtnl_link_get_net_tb() Let's add a new field peer_type to struct rtnl_link_ops and prefetch netns in the peer ifla to add it to rtnl_nets in rtnl_newlink(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20241108004823.29419-6-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 17:26:52 -08:00
Kuniyuki Iwashima	cbaaa6326b	rtnetlink: Introduce struct rtnl_nets and helpers. rtnl_newlink() needs to hold 3 per-netns RTNL: 2 for a new device and 1 for its peer. We will add rtnl_nets_lock() later, which performs the nested locking based on struct rtnl_nets, which has an array of struct net pointers. rtnl_nets_add() adds a net pointer to the array and sorts it so that rtnl_nets_lock() can simply acquire per-netns RTNL from array[0] to [2]. Before calling rtnl_nets_add(), get_net() must be called for the net, and rtnl_nets_destroy() will call put_net() for each. Let's apply the helpers to rtnl_newlink(). When CONFIG_DEBUG_NET_SMALL_RTNL is disabled, we do not call rtnl_net_lock() thus do not care about the array order, so rtnl_net_cmp_locks() returns -1 so that the loop in rtnl_nets_add() can be optimised to NOP. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20241108004823.29419-5-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 17:26:51 -08:00
Kuniyuki Iwashima	68297dbb96	rtnetlink: Remove __rtnl_link_register() link_ops is protected by link_ops_mutex and no longer needs RTNL, so we have no reason to have __rtnl_link_register() separately. Let's remove it and call rtnl_link_register() from ifb.ko and dummy.ko. Note that both modules' init() work on init_net only, so we need not export pernet_ops_rwsem and can use rtnl_net_lock() there. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20241108004823.29419-4-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 17:26:51 -08:00
Kuniyuki Iwashima	6b57ff21a3	rtnetlink: Protect link_ops by mutex. rtnl_link_unregister() holds RTNL and calls synchronize_srcu(), but rtnl_newlink() will acquire SRCU frist and then RTNL. Then, we need to unlink ops and call synchronize_srcu() outside of RTNL to avoid the deadlock. rtnl_link_unregister() rtnl_newlink() ---- ---- lock(rtnl_mutex); lock(&ops->srcu); lock(rtnl_mutex); sync(&ops->srcu); Let's move as such and add a mutex to protect link_ops. Now, link_ops is protected by its dedicated mutex and rtnl_link_register() no longer needs to hold RTNL. While at it, we move the initialisation of ops->dellink and ops->srcu out of the mutex scope. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20241108004823.29419-3-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 17:26:51 -08:00
Kuniyuki Iwashima	d5ec8d91f8	rtnetlink: Remove __rtnl_link_unregister(). rtnl_link_unregister() holds RTNL and calls __rtnl_link_unregister(), where we call synchronize_srcu() to wait inflight RTM_NEWLINK requests for per-netns RTNL. We put synchronize_srcu() in __rtnl_link_unregister() due to ifb.ko and dummy.ko. However, rtnl_newlink() will acquire SRCU before RTNL later in this series. Then, lockdep will detect the deadlock: rtnl_link_unregister() rtnl_newlink() ---- ---- lock(rtnl_mutex); lock(&ops->srcu); lock(rtnl_mutex); sync(&ops->srcu); To avoid the problem, we must call synchronize_srcu() before RTNL in rtnl_link_unregister(). As a preparation, let's remove __rtnl_link_unregister(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20241108004823.29419-2-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 17:26:51 -08:00
Heiner Kallweit	7a3bcd39ae	r8169: use helper r8169_mod_reg8_cond to simplify rtl_jumbo_config Use recently added helper r8169_mod_reg8_cond() to simplify jumbo mode configuration. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/3df1d484-a02e-46e7-8f75-db5b428e422e@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 17:14:06 -08:00
Jakub Kicinski	7762876fdb	Merge branch 'selftests-ncdevmem-add-ncdevmem-to-ksft' Stanislav Fomichev says: ==================== selftests: ncdevmem: Add ncdevmem to ksft The goal of the series is to simplify and make it possible to use ncdevmem in an automated way from the ksft python wrapper. ncdevmem is slowly mutated into a state where it uses stdout to print the payload and the python wrapper is added to make sure the arrived payload matches the expected one. ==================== Link: https://patch.msgid.link/20241107181211.3934153-1-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 17:01:08 -08:00
Stanislav Fomichev	80230864b7	selftests: ncdevmem: Add automated test Only RX side for now and small message to test the setup. In the future, we can extend it to TX side and to testing both sides with a couple of megs of data. make \ -C tools/testing/selftests \ TARGETS="drivers/hw/net" \ install INSTALL_PATH=~/tmp/ksft scp ~/tmp/ksft ${HOST}: scp ~/tmp/ksft ${PEER}: cfg+="NETIF=${DEV}\n" cfg+="LOCAL_V6=${HOST_IP}\n" cfg+="REMOTE_V6=${PEER_IP}\n" cfg+="REMOTE_TYPE=ssh\n" cfg+="REMOTE_ARGS=root@${PEER}\n" echo -e "$cfg" \| ssh root@${HOST} "cat > ksft/drivers/net/net.config" ssh root@${HOST} "cd ksft && ./run_kselftest.sh -t drivers/net:devmem.py" Reviewed-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Joe Damato <jdamato@fastly.com> Link: https://patch.msgid.link/20241107181211.3934153-13-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 17:01:05 -08:00
Stanislav Fomichev	be43a6b238	selftests: ncdevmem: Move ncdevmem under drivers/net/hw This is where all the tests that depend on the HW functionality live in and this is where the automated test is gonna be added in the next patch. Reviewed-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20241107181211.3934153-12-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 17:01:04 -08:00
Stanislav Fomichev	77f870a000	selftests: ncdevmem: Run selftest when none of the -s or -c has been provided This will be used as a 'probe' mode in the selftest to check whether the device supports the devmem or not. Use hard-coded queue layout (two last queues) and prevent user from passing custom -q and/or -t. Reviewed-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Joe Damato <jdamato@fastly.com> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20241107181211.3934153-11-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 17:01:04 -08:00
Stanislav Fomichev	d4ef05d211	selftests: ncdevmem: Remove hard-coded queue numbers Use single last queue of the device and probe it dynamically. Reviewed-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Joe Damato <jdamato@fastly.com> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20241107181211.3934153-10-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 17:01:04 -08:00
Stanislav Fomichev	798d822e5d	selftests: ncdevmem: Use YNL to enable TCP header split In the next patch the hard-coded queue numbers are gonna be removed. So introduce some initial support for ethtool YNL and use it to enable header split. Also, tcp-data-split requires latest ethtool which is unlikely to be present in the distros right now. (ideally, we should not shell out to ethtool at all). Reviewed-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Joe Damato <jdamato@fastly.com> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20241107181211.3934153-9-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 17:01:04 -08:00
Stanislav Fomichev	e3c09623a5	selftests: ncdevmem: Properly reset flow steering ntuple off/on might be not enough to do it on all NICs. Add a bunch of shell crap to explicitly remove the rules. Reviewed-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Joe Damato <jdamato@fastly.com> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20241107181211.3934153-8-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 17:01:04 -08:00
Stanislav Fomichev	933056357a	selftests: ncdevmem: Switch to AF_INET6 Use dualstack socket to support both v4 and v6. v4-mapped-v6 address can be used to do v4. Reviewed-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Joe Damato <jdamato@fastly.com> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20241107181211.3934153-7-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 17:01:04 -08:00

1 2 3 4 5 ...

1312060 Commits