linux

mirror of https://github.com/torvalds/linux.git synced 2024-12-03 17:41:22 +00:00

Author	SHA1	Message	Date
Lee Jones	257382c54e	ptp_pch: Remove unused function 'pch_ch_control_read()' Fixes the following W=1 kernel build warning(s): drivers/ptp/ptp_pch.c:182:5: warning: no previous prototype for ‘pch_ch_control_read’ [-Wmissing-prototypes] Cc: Richard Cochran <richardcochran@gmail.com> (maintainer:PTP HARDWARE CLOCK SUPPORT) Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Flavio Suligoi <f.suligoi@asem.it> Cc: netdev@vger.kernel.org Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-12 17:09:34 -08:00
Rafał Miłecki	a9349f08ec	net: dsa: bcm_sf2: setup BCM4908 internal crossbar On some SoCs (e.g. BCM4908, BCM631[345]8) SF2 has an integrated crossbar. It allows connecting its selected external ports to internal ports. It's used by vendors to handle custom Ethernet setups. BCM4908 has following 3x2 crossbar. On Asus GT-AC5300 rgmii is used for connecting external BCM53134S switch. GPHY4 is usually used for WAN port. More fancy devices use SerDes for 2.5 Gbps Ethernet. ┌──────────┐ SerDes ─── 0 ─┤ │ │ 3x2 ├─ 0 ─── switch port 7 GPHY4 ─── 1 ─┤ │ │ crossbar ├─ 1 ─── runner (accelerator) rgmii ─── 2 ─┤ │ └──────────┘ Use setup data based on DT info to configure BCM4908's switch port 7. Right now only GPHY and rgmii variants are supported. Handling SerDes can be implemented later. Signed-off-by: Rafał Miłecki <rafal@milecki.pl> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-12 17:06:37 -08:00
Rafał Miłecki	01488a0ccd	net: dsa: bcm_sf2: store PHY interface/mode in port structure It's needed later for proper switch / crossbar setup. Signed-off-by: Rafał Miłecki <rafal@milecki.pl> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-12 17:06:37 -08:00
Shubhankar Kuranagatti	6ad086009f	net: ipv4: route.c: Fix indentation of multi line comment. All comment lines inside the comment block have been aligned. Every line of comment starts with a * (uniformity in code). Signed-off-by: Shubhankar Kuranagatti <shubhankarvk@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-12 17:02:30 -08:00
Rafał Miłecki	12bb508bfe	net: broadcom: bcm4908_enet: support TX interrupt It appears that each DMA channel has its own interrupt and both rings can be configured (the same way) to handle interrupts. 1. Make ring interrupts code generic (make it operate on given ring) 2. Move napi to ring (so each has its own) 3. Make IRQ handler generic (match ring against received IRQ number) 4. Add (optional) support for TX interrupt Signed-off-by: Rafał Miłecki <rafal@milecki.pl> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-12 16:48:38 -08:00
Rafał Miłecki	ab4dda7a8c	dt-bindings: net: bcm4908-enet: add optional TX interrupt I discovered that hardware actually supports two interrupts, one per DMA channel (RX and TX). Signed-off-by: Rafał Miłecki <rafal@milecki.pl> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-12 16:48:38 -08:00
David S. Miller	26d2e0426a	Merge branch 'macb-fixed-link-fixes' Robert Hancock says: ==================== macb SGMII fixed-link fixes Some fixes to the macb driver for use in SGMII mode with a fixed-link (such as for chip-to-chip connectivity). ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-12 16:44:46 -08:00
Robert Hancock	e276e5e40e	net: macb: Disable PCS auto-negotiation for SGMII fixed-link mode When using a fixed-link configuration in SGMII mode, it's not really sensible to have auto-negotiation enabled since the link settings are fixed by definition. In other configurations, such as an SGMII connection to a PHY, it should generally be enabled. Signed-off-by: Robert Hancock <robert.hancock@calian.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-12 16:44:45 -08:00
Robert Hancock	8fab174b78	net: macb: poll for fixed link state in SGMII mode When using a fixed-link configuration with GEM in SGMII mode, such as for a chip-to-chip interconnect, the link state was always showing as established regardless of the actual connectivity state. We can monitor the pcs_link_state bit in the Network Status register to determine whether the PCS link state is actually up. Signed-off-by: Robert Hancock <robert.hancock@calian.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-12 16:44:45 -08:00
David S. Miller	c232f81b0a	mlx5-updates-2021-03-12 1) TC support for ICMP parameters 2) TC connection tracking with mirroring 3) A round of trivial fixups and cleanups -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAmBL+V4ACgkQSD+KveBX +j68tgf+Lzq6QkpGl4dpg87pjuKIwyykGVJ0NW/Yig+bNe9V3pbFCv785xymD16N ALcLKNtxUovE6p+Gy6JgzgBN7V9ZsM6bn564487ycYIqewz9c2KIQneJdGEUUDgi S8NkuCaLQst36Wl8gdiygApZmw8TAafOQYYzKOZ7HdZ+aRH+fIwbVABYreAYwjGa 2iz5yu1AEzG5/10bTEx69rQHY+309EPy4N4na/jo/tXRL3fWQSQz6hzEnoi3EjAU YsktBB72eBX/NogqLIHpSzzjPgm5IFRQbNSHyL71TH9KEr0KVpzJd9VCbaic6goL gQn0TqoE74X7r4ZPWP3AHcGOL9L5Rg== =L2PO -----END PGP SIGNATURE----- Merge tag 'mlx5-updates-2021-03-12' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-updates-2021-03-12 1) TC support for ICMP parameters 2) TC connection tracking with mirroring 3) A round of trivial fixups and cleanups ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-12 16:36:07 -08:00
Maor Dickman	a3222a2da0	net/mlx5e: Allow to match on ICMP parameters Support matching on ICMPv4/6 type and code parameters using misc3 section of match parameters. Signed-off-by: Maor Dickman <maord@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-03-12 15:29:34 -08:00
Paul Blakey	69e2916ebc	net/mlx5: CT: Add support for mirroring Add support for mirroring before the CT action by spliting the pre ct rule. Mirror outputs are done first on the tc chain,prio table rule (the fwd rule), which will then forward to a per port fwd table. On this fwd table, we insert the original pre ct rule that forwards to ct/ct nat table. Signed-off-by: Paul Blakey <paulb@mellanox.com> Signed-off-by: Maor Dickman <maord@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-03-12 15:29:33 -08:00
Alaa Hleihel	287e0df021	net/mlx5: Display the command index in command mailbox dump Multiple commands can be printed at the same time which can lead to wrong order of their lines in dmesg output. As a result, it's hard to match data dumps to the correct command or which command was fully dumped at some point. Fix this by displaying the corresponding command index, and also indicate when a command was fully dumped. Signed-off-by: Alaa Hleihel <alaa@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-03-12 15:29:33 -08:00
Arnd Bergmann	2119bda642	net/mlx5e: allocate 'indirection_rqt' buffer dynamically Increasing the size of the indirection_rqt array from 128 to 256 bytes pushed the stack usage of the mlx5e_hairpin_fill_rqt_rqns() function over the warning limit when building with clang and CONFIG_KASAN: drivers/net/ethernet/mellanox/mlx5/core/en_tc.c:970:1: error: stack frame size of 1180 bytes in function 'mlx5e_tc_add_nic_flow' [-Werror,-Wframe-larger-than=] Using dynamic allocation here is safe because the caller does the same, and it reduces the stack usage of the function to just a few bytes. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-03-12 15:29:32 -08:00
Tariq Toukan	e16cf9d754	net/mlx5e: Dump ICOSQ WQE descriptor on CQE with error events Dump the ICOSQ's WQE descriptor when a completion with error is received. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-03-12 15:29:32 -08:00
Maxim Mikityanskiy	991b265460	net/mlx5e: Use net_prefetchw instead of prefetchw in MPWQE TX datapath Commit `e20f0dbf20` ("net/mlx5e: RX, Add a prefetch command for small L1_CACHE_BYTES") switched to using net_prefetchw at all places in mlx5e. In the same time frame, commit `5af75c747e` ("net/mlx5e: Enhanced TX MPWQE for SKBs") added one more usage of prefetchw. When these two changes were merged, this new occurrence of prefetchw wasn't replaced with net_prefetchw. This commit fixes this last occurrence of prefetchw in mlx5e_tx_mpwqe_session_start, making the same change that was done in mlx5e_xdp_mpwqe_session_start. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-03-12 15:29:31 -08:00
Roi Dayan	bca08a9145	net/mlx5e: Remove redundant newline in NL_SET_ERR_MSG_MOD Fix the following coccicheck warnings: drivers/net/ethernet/mellanox/mlx5/core/devlink.c:145:29-66: WARNING avoid newline at end of message in NL_SET_ERR_MSG_MOD drivers/net/ethernet/mellanox/mlx5/core/devlink.c:140:29-77: WARNING avoid newline at end of message in NL_SET_ERR_MSG_MOD Signed-off-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-03-12 15:29:31 -08:00
Mark Zhang	093bd76469	net/mlx5: Read congestion counters from all ports when lag is active Read congestion counters from all ports in any lag mode rather than only in RoCE lag mode (e.g., VF lag). Signed-off-by: Mark Zhang <markzhang@nvidia.com> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-03-12 15:29:31 -08:00
Jiapeng Chong	7976092241	net/mlx5: remove unneeded semicolon Fix the following coccicheck warnings: ./drivers/net/ethernet/mellanox/mlx5/core/sf/devlink.c:495:2-3: Unneeded semicolon. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-03-12 15:29:30 -08:00
Junlin Yang	ad2c99ca75	net/mlx5: use kvfree() for memory allocated with kvzalloc() It is allocated with kvzalloc(), the corresponding release function should not be kfree(), use kvfree() instead. Generated by: scripts/coccinelle/api/kfree_mismatch.cocci Signed-off-by: Junlin Yang <yangjunlin@yulong.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-03-12 15:29:30 -08:00
Yevgeny Kliteynik	cc82a2e6c8	net/mlx5: DR, Add missing vhca_id consume from STEv1 The field source_eswitch_owner_vhca_id was not consumed in the same way as in STEv0. Added the missing set. Fixes: `10b6941864` ("net/mlx5: DR, Add HW STEv1 match logic") Signed-off-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Alex Vesker <valex@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-03-12 15:29:30 -08:00
Yevgeny Kliteynik	1412477882	net/mlx5: DR, Remove unneeded rx_decap_l3 function for STEv1 Remove the dr_ste_v1_set_rx_decap_l3 function that was replaced by another function - fixing a rebase error. Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Alex Vesker <valex@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-03-12 15:29:29 -08:00
Yevgeny Kliteynik	0142f09764	net/mlx5: DR, Fixed typo in STE v0 "reforamt" -> "reformat" Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Alex Vesker <valex@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-03-12 15:29:29 -08:00
Jonathan Neuschäfer	bfdfe7fc1b	docs: networking: phy: Improve placement of parenthesis "either" is outside the parentheses, so the matching "or" should be too. Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-12 12:29:11 -08:00
David S. Miller	5215206d8b	Merge branch 'tcp-delayed-completions' Eric Dumazet says: ==================== tcp: better deal with delayed TX completions Jakub and Neil reported an increase of RTO timers whenever TX completions are delayed a bit more (by increasing NIC TX coalescing parameters) While problems have been there forever, second patch might introduce some regressions so I prefer not backport them to stable releases before things settle. Many thanks to FB team for their help and tests. Few packetdrill tests need to be changed to reflect the improvements brought by this series. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-11 18:35:31 -08:00
Eric Dumazet	ac3959fd0d	tcp: remove obsolete check in __tcp_retransmit_skb() TSQ provides a nice way to avoid bufferbloat on individual socket, including retransmit packets. We can get rid of the old heuristic: /* Do not sent more than we queued. 1/4 is reserved for possible * copying overhead: fragmentation, tunneling, mangling etc. */ if (refcount_read(&sk->sk_wmem_alloc) > min_t(u32, sk->sk_wmem_queued + (sk->sk_wmem_queued >> 2), sk->sk_sndbuf)) return -EAGAIN; This heuristic was giving false positives according to Jakub, whenever TX completions are delayed above RTT. (Ack packets are processed by TCP stack before clones are orphaned/freed) Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Jakub Kicinski <kuba@kernel.org> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-11 18:35:31 -08:00
Eric Dumazet	a7abf3cd76	tcp: consider using standard rtx logic in tcp_rcv_fastopen_synack() Jakub reported Data included in a Fastopen SYN that had to be retransmit would have to wait for an RTO if TX completions are slow, even with prior fix. This is because tcp_rcv_fastopen_synack() does not use standard rtx logic, meaning TSQ handler exits early in tcp_tsq_write() because tp->lost_out == tp->retrans_out Lets make tcp_rcv_fastopen_synack() use standard rtx logic, by using tcp_mark_skb_lost() on the skb thats needs to be sent again. Not this raised a warning in tcp_fastretrans_alert() during my tests since we consider the data not being aknowledged by the receiver does not mean packet was lost on the network. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Jakub Kicinski <kuba@kernel.org> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-11 18:35:31 -08:00
Eric Dumazet	f4dae54e48	tcp: plug skb_still_in_host_queue() to TSQ Jakub and Neil reported an increase of RTO timers whenever TX completions are delayed a bit more (by increasing NIC TX coalescing parameters) Main issue is that TCP stack has a logic preventing a packet being retransmit if the prior clone has not yet been orphaned or freed. This logic came with commit `1f3279ae0c` ("tcp: avoid retransmits of TCP packets hanging in host queues") Thankfully, in the case skb_still_in_host_queue() detects the initial clone is still in flight, it can use TSQ logic that will eventually retry later, at the moment the clone is freed or orphaned. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Neil Spring <ntspring@fb.com> Reported-by: Jakub Kicinski <kuba@kernel.org> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-11 18:35:31 -08:00
Tong Zhang	8176f8c0f0	isdn: remove extra spaces in the header file fix some coding style issues in the isdn header Signed-off-by: Tong Zhang <ztong0001@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-11 18:23:55 -08:00
Hoang Huu Le	97bc84bbd4	tipc: clean up warnings detected by sparse This patch fixes the following warning from sparse: net/tipc/monitor.c:263:35: warning: incorrect type in assignment (different base types) net/tipc/monitor.c:263:35: expected unsigned int net/tipc/monitor.c:263:35: got restricted __be32 [usertype] [...] net/tipc/node.c:374:13: warning: context imbalance in 'tipc_node_read_lock' - wrong count at exit net/tipc/node.c:379:13: warning: context imbalance in 'tipc_node_read_unlock' - unexpected unlock net/tipc/node.c:384:13: warning: context imbalance in 'tipc_node_write_lock' - wrong count at exit net/tipc/node.c:389:13: warning: context imbalance in 'tipc_node_write_unlock_fast' - unexpected unlock net/tipc/node.c:404:17: warning: context imbalance in 'tipc_node_write_unlock' - unexpected unlock [...] net/tipc/crypto.c:1201:9: warning: incorrect type in initializer (different address spaces) net/tipc/crypto.c:1201:9: expected struct tipc_aead [noderef] __rcu __tmp net/tipc/crypto.c:1201:9: got struct tipc_aead [...] Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Hoang Huu Le <hoang.h.le@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-11 18:06:54 -08:00
Hoang Le	1980d37565	tipc: convert dest node's address to network order (struct tipc_link_info)->dest is in network order (__be32), so we must convert the value to network order before assigning. The problem detected by sparse: net/tipc/netlink_compat.c:699:24: warning: incorrect type in assignment (different base types) net/tipc/netlink_compat.c:699:24: expected restricted __be32 [usertype] dest net/tipc/netlink_compat.c:699:24: got int Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-11 18:06:54 -08:00
David S. Miller	1520929e26	Merge branch 'mlxsw-Implement-sampling-using-mirroring' Ido Schimmel says: ==================== mlxsw: Implement sampling using mirroring So far, sampling was implemented using a dedicated sampling mechanism that is available on all Spectrum ASICs. Spectrum-2 and later ASICs support sampling by mirroring packets to the CPU port with probability. This method has a couple of advantages compared to the legacy method: * Extra metadata per-packet: Egress port, egress traffic class, traffic class occupancy and end-to-end latency * Ability to sample packets on egress / per-flow as opposed to only ingress This series should not result in any user-visible changes and its aim is to convert Spectrum-2 and later ASICs to perform sampling by mirroring to the CPU port with probability. Future submissions will expose the additional metadata and enable sampling using more triggers (e.g., egress). Series overview: Patches #1-#3 extend the SPAN (mirroring) module to accept new parameters required for sampling. See individual commit messages for detailed explanation. Patch #4-#5 split sampling support between Spectrum-1 and later ASIC while still using the legacy method for all ASIC generations. Patch #6 converts Spectrum-2 and later ASICs to perform sampling by mirroring to the CPU port with probability. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-11 16:22:39 -08:00
Ido Schimmel	cf31190ae0	mlxsw: spectrum_matchall: Implement sampling using mirroring Spectrum-2 and later ASICs support sampling of packets by mirroring to the CPU with probability. There are several advantages compared to the legacy dedicated sampling mechanism: * Extra metadata per-packet: Egress port, egress traffic class, traffic class occupancy and end-to-end latency * Ability to sample packets on egress / per-flow Convert Spectrum-2 and later ASICs to perform sampling by mirroring to the CPU with probability. Subsequent patches will add support for egress / per-flow sampling and expose the extra metadata. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-11 16:22:39 -08:00
Ido Schimmel	34a277212c	mlxsw: spectrum_trap: Split sampling traps between ASICs Sampling of ingress packets is supported using a dedicated sampling mechanism on all Spectrum ASICs. However, Spectrum-2 and later ASICs support more sophisticated sampling by mirroring packets to the CPU. As a preparation for more advanced sampling configurations, split the trap configuration used for sampled packets between Spectrum-1 and later ASICs. This is needed since packets that are mirrored to the CPU are trapped via a different trap identifier compared to packets that are sampled using the dedicated sampling mechanism. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-11 16:22:39 -08:00
Ido Schimmel	20afb9bc48	mlxsw: spectrum_matchall: Split sampling support between ASICs Sampling of ingress packets is supported using a dedicated sampling mechanism on all Spectrum ASICs. However, Spectrum-2 and later ASICs support more sophisticated sampling by mirroring packets to the CPU. As a preparation for more advanced sampling configurations, split the sampling operations between Spectrum-1 and later ASICs. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-11 16:22:39 -08:00
Ido Schimmel	2dcbd9207b	mlxsw: spectrum_span: Add SPAN probability rate support Currently, every packet that matches a mirroring trigger (e.g., received packets, buffer dropped packets) is mirrored. Spectrum-2 and later ASICs support mirroring with probability, where every 1 in N matched packets is mirrored. Extend the API that creates the binding between the trigger and the SPAN agent with a probability rate parameter, which is an attribute of the trigger. Set it to '1' to maintain existing behavior. Subsequent patches will use it to perform more sophisticated sampling, by mirroring packets to the CPU with probability. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-11 16:22:39 -08:00
Ido Schimmel	fa3faeb7ae	mlxsw: reg: Extend mirroring registers with probability rate field The MPAR and MPAGR registers are used to configure the binding between the mirroring trigger (e.g., received packet) and the SPAN agent. Add probability rate field, which will allow us to support sampling by mirroring to the CPU. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-11 16:22:39 -08:00
Ido Schimmel	5c7659eba8	mlxsw: spectrum_span: Add SPAN session identifier support When packets are mirrored to the CPU, the trap identifier with which the packets are trapped is determined according to the session identifier of the SPAN agent performing the mirroring. Packets that are trapped for the same logical reason (e.g., buffer drops) should use the same session identifier. Currently, a single session is implicitly supported (identifier 0) and is used for packets that are mirrored to the CPU due to buffer drops (e.g., early drop). Subsequent patches are going to mirror packets to the CPU due to sampling, which will require a different session identifier. Prepare for that by making the session identifier an attribute of the SPAN agent. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-11 16:22:39 -08:00
David S. Miller	1bc61c9dd4	mlx5-updates-2021-03-11 Cleanups for mlx5 driver 1) Fix build warnings form Arnd and Vlad 2) Leon improves locking for driver load/unload flows 3) From Roi, Lockdep false dependency warning 4) Other trivial cleanups -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAmBKmyMACgkQSD+KveBX +j6n8ggAr45d0+MbXE8jAfLNS5lu7S6a5mTZ1nZhQPHn3ki9N3XnkOXW6H0MKIUP mLb5b1Q1MA1h8XcwJBeXjrsSVl5lJTaxK6guHkhtnrIZ/+7das6tYxry5pw7S6Yq LVzZD6gnyz/kFoF0pCPZtdYz5pX3EBy8j7fTlMDhP1hTgaSoru09HSSz2oqMuqkp Msrra9ba+qBnSJcB+nhRExygdQKiwiJKXNg1kmpCQcWoy1JlkUxgJrKx4Rq3ofVU +zaiZes63OZfrJhro5pvuJvxz0x3IcONL+ZL4C2YHDslsA05qpL7U3aRQbwThZHn NKg1dKX8nX0fdkIuMfBYcaeQo+GBow== =O521 -----END PGP SIGNATURE----- Merge tag 'mlx5-updates-2021-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== This series provides some cleanups to mlx5 driver For more information please see tag log below. Please pull and let me know if there is any problem. mlx5-updates-2021-03-11 Cleanups for mlx5 driver 1) Fix build warnings form Arnd and Vlad 2) Leon improves locking for driver load/unload flows 3) From Roi, Lockdep false dependency warning 4) Other trivial cleanups ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-11 16:17:43 -08:00
David S. Miller	2a0186a377	Merge branch 'nexthop-Resilient-next-hop-groups' Petr Machata says: ==================== nexthop: Resilient next-hop groups At this moment, there is only one type of next-hop group: an mpath group. Mpath groups implement the hash-threshold algorithm, described in RFC 2992[1]. To select a next hop, hash-threshold algorithm first assigns a range of hashes to each next hop in the group, and then selects the next hop by comparing the SKB hash with the individual ranges. When a next hop is removed from the group, the ranges are recomputed, which leads to reassignment of parts of hash space from one next hop to another. RFC 2992 illustrates it thus: +-------+-------+-------+-------+-------+ \| 1 \| 2 \| 3 \| 4 \| 5 \| +-------+-+-----+---+---+-----+-+-------+ \| 1 \| 2 \| 4 \| 5 \| +---------+---------+---------+---------+ Before and after deletion of next hop 3 under the hash-threshold algorithm. Note how next hop 2 gave up part of the hash space in favor of next hop 1, and 4 in favor of 5. While there will usually be some overlap between the previous and the new distribution, some traffic flows change the next hop that they resolve to. If a multipath group is used for load-balancing between multiple servers, this hash space reassignment causes an issue that packets from a single flow suddenly end up arriving at a server that does not expect them, which may lead to TCP reset. If a multipath group is used for load-balancing among available paths to the same server, the issue is that different latencies and reordering along the way causes the packets to arrive in the wrong order. Resilient hashing is a technique to address the above problem. Resilient next-hop group has another layer of indirection between the group itself and its constituent next hops: a hash table. The selection algorithm uses a straightforward modulo operation on the SKB hash to choose a hash table bucket, then reads the next hop that this bucket contains, and forwards traffic there. This indirection brings an important feature. In the hash-threshold algorithm, the range of hashes associated with a next hop must be continuous. With a hash table, mapping between the hash table buckets and the individual next hops is arbitrary. Therefore when a next hop is deleted the buckets that held it are simply reassigned to other next hops: +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ \|1\|1\|1\|1\|2\|2\|2\|2\|3\|3\|3\|3\|4\|4\|4\|4\|5\|5\|5\|5\| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ v v v v +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ \|1\|1\|1\|1\|2\|2\|2\|2\|1\|2\|4\|5\|4\|4\|4\|4\|5\|5\|5\|5\| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Before and after deletion of next hop 3 under the resilient hashing algorithm. When weights of next hops in a group are altered, it may be possible to choose a subset of buckets that are currently not used for forwarding traffic, and use those to satisfy the new next-hop distribution demands, keeping the "busy" buckets intact. This way, established flows are ideally kept being forwarded to the same endpoints through the same paths as before the next-hop group change. This patch set adds the implementation of resilient next-hop groups. In a nutshell, the algorithm works as follows. Each next hop has a number of buckets that it wants to have, according to its weight and the number of buckets in the hash table. In case of an event that might cause bucket allocation change, the numbers for individual next hops are updated, similarly to how ranges are updated for mpath group next hops. Following that, a new "upkeep" algorithm runs, and for idle buckets that belong to a next hop that is currently occupying more buckets than it wants (it is "overweight"), it migrates the buckets to one of the next hops that has fewer buckets than it wants (it is "underweight"). If, after this, there are still underweight next hops, another upkeep run is scheduled to a future time. Chances are there are not enough "idle" buckets to satisfy the new demands. The algorithm has knobs to select both what it means for a bucket to be idle, and for whether and when to forcefully migrate buckets if there keeps being an insufficient number of idle ones. To illustrate the usage, consider the following commands: # ip nexthop add id 1 via 192.0.2.2 dev dummy1 # ip nexthop add id 2 via 192.0.2.3 dev dummy1 # ip nexthop add id 10 group 1/2 type resilient \ buckets 8 idle_timer 60 unbalanced_timer 300 The last command creates a resilient next-hop group. It will have 8 buckets, each bucket will be considered idle when no traffic hits it for at least 60 seconds, and if the table remains out of balance for 300 seconds, it will be forcefully brought into balance. If not present in netlink message, the idle timer defaults to 120 seconds, and there is no unbalanced timer, meaning the group may remain unbalanced indefinitely. The value of 120 is the default in Cumulus implementation of resilient next-hop groups. To a degree the default is arbitrary, the only value that certainly does not make sense is 0. Therefore going with an existing deployed implementation is reasonable. Unbalanced time, i.e. how long since the last time that all nexthops had as many buckets as they should according to their weights, is reported when the group is dumped: # ip nexthop show id 10 id 10 group 1/2 type resilient buckets 8 idle_timer 60 unbalanced_timer 300 unbalanced_time 0 When replacing next hops or changing weights, if one does not specify some parameters, their value is left as it was: # ip nexthop replace id 10 group 1,2/2 type resilient # ip nexthop show id 10 id 10 group 1,2/2 type resilient buckets 8 idle_timer 60 unbalanced_timer 300 unbalanced_time 0 It is also possible to do a dump of individual buckets (and now you know why there were only 8 of them in the example above): # ip nexthop bucket show id 10 id 10 index 0 idle_time 5.59 nhid 1 id 10 index 1 idle_time 5.59 nhid 1 id 10 index 2 idle_time 8.74 nhid 2 id 10 index 3 idle_time 8.74 nhid 2 id 10 index 4 idle_time 8.74 nhid 1 id 10 index 5 idle_time 8.74 nhid 1 id 10 index 6 idle_time 8.74 nhid 1 id 10 index 7 idle_time 8.74 nhid 1 Note the two buckets that have a shorter idle time. Those are the ones that were migrated after the nexthop replace command to satisfy the new demand that nexthop 1 be given 6 buckets instead of 4. The patchset proceeds as follows: - Patches #1 and #2 are small refactoring patches. - Patch #3 adds a new flag to struct nh_group, is_multipath. This flag is meant to be set for all nexthop groups that in general have several nexthops from which they choose, and avoids a more expensive dispatch based on reading several flags, one for each nexthop group type. - Patch #4 contains defines of new UAPI attributes and the new next-hop group type. At this point, the nexthop code is made to bounce the new type. As the resilient hashing code is gradually added in the following patch sets, it will remain dead. The last patch will make it accessible. This patch also adds a suite of new messages related to next hop buckets. This approach was taken instead of overloading the information on the existing RTM_{NEW,DEL,GET}NEXTHOP messages for the following reasons. First, a next-hop group can contain a large number of next-hop buckets (4k is not unheard of). This imposes limits on the amount of information that can be encoded for each next-hop bucket given a netlink message is limited to 64k bytes. Second, while RTM_NEWNEXTHOPBUCKET is only used for notifications at this point, in the future it can be extended to provide user space with control over next-hop buckets configuration. - Patch #5 contains the meat of the resilient next-hop group support. - Patches #6 and #7 implement support for notifications towards the drivers. - Patch #8 adds an interface for the drivers to report resilient hash table bucket activity. Drivers will be able to report through this interface whether traffic is hitting a given bucket. - Patch #9 adds an interface for the drivers to report whether a given hash table bucket is offloaded or trapping traffic. - In patches #10, #11, #12 and #13, UAPI is implemented. This includes all the code necessary for creation of resilient groups, bucket dumping and getting, and bucket migration notifications. - In patch #14 the next-hop groups are finally made available. The overall plan is to contribute approximately the following patchsets: 1) Nexthop policy refactoring (already pushed) 2) Preparations for resilient next-hop groups (already pushed) 3) Implementation of resilient next-hop groups (this patchset) 4) Netdevsim offload plus a suite of selftests 5) Preparations for mlxsw offload of resilient next-hop groups 6) mlxsw offload including selftests Interested parties can look at the current state of the code at [2] and [3]. [1] https://tools.ietf.org/html/rfc2992 [2] https://github.com/idosch/linux/commits/submit/res_integ_v1 [3] https://github.com/idosch/iproute2/commits/submit/res_v1 ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-11 16:13:00 -08:00
Petr Machata	15e1dd5703	nexthop: Enable resilient next-hop groups Now that all the code is in place, stop rejecting requests to create resilient next-hop groups. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-11 16:13:00 -08:00
Petr Machata	0b4818aabc	nexthop: Notify userspace about bucket migrations Nexthop replacements et.al. are notified through netlink, but if a delayed work migrates buckets on the background, userspace will stay oblivious. Notify these as RTM_NEWNEXTHOPBUCKET events. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-11 16:13:00 -08:00
Petr Machata	187d4c6b97	nexthop: Add netlink handlers for bucket get Allow getting (but not setting) individual buckets to inspect the next hop mapped therein, idle time, and flags. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-11 16:13:00 -08:00
Petr Machata	8a1bbabb03	nexthop: Add netlink handlers for bucket dump Add a dump handler for resilient next hop buckets. When next-hop group ID is given, it walks buckets of that group, otherwise it walks buckets of all groups. It then dumps the buckets whose next hops match the given filtering criteria. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-11 16:13:00 -08:00
Petr Machata	a2601e2b1e	nexthop: Add netlink handlers for resilient nexthop groups Implement the netlink messages that allow creation and dumping of resilient nexthop groups. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-11 16:13:00 -08:00
Ido Schimmel	cfc15c1dbb	nexthop: Allow reporting activity of nexthop buckets The kernel periodically checks the idle time of nexthop buckets to determine if they are idle and can be re-populated with a new nexthop. When the resilient nexthop group is offloaded to hardware, the kernel will not see activity on nexthop buckets unless it is reported from hardware. Add a function that can be periodically called by device drivers to report activity on nexthop buckets after querying it from the underlying device. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-11 16:13:00 -08:00
Ido Schimmel	56ad5ba344	nexthop: Allow setting "offload" and "trap" indication of nexthop buckets Add a function that can be called by device drivers to set "offload" or "trap" indication on nexthop buckets following nexthop notifications and other changes such as a neighbour becoming invalid. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-11 16:12:59 -08:00
Petr Machata	7c37c7e004	nexthop: Implement notifiers for resilient nexthop groups Implement the following notifications towards drivers: - NEXTHOP_EVENT_REPLACE, when a resilient nexthop group is created. - NEXTHOP_EVENT_BUCKET_REPLACE any time there is a change in assignment of next hops to hash table buckets. That includes replacements, deletions, and delayed upkeep cycles. Some bucket notifications can be vetoed by the driver, to make it possible to propagate bucket busy-ness flags from the HW back to the algorithm. Some are however forced, e.g. if a next hop is deleted, all buckets that use this next hop simply must be migrated, whether the HW wishes so or not. - NEXTHOP_EVENT_RES_TABLE_PRE_REPLACE, before a resilient nexthop group is replaced. Usually the driver will get the bucket notifications as well, and could veto those. But in some cases, a bucket may not be migrated immediately, but during delayed upkeep, and that is too late to roll the transaction back. This notification allows the driver to take a look and veto the new proposed group up front, before anything is committed. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-11 16:12:59 -08:00
Ido Schimmel	b8f090d0be	nexthop: Add data structures for resilient group notifications Add data structures that will be used for in-kernel notifications about addition / deletion of a resilient nexthop group and about changes to a hash bucket within a resilient group. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-11 16:12:59 -08:00
Petr Machata	283a72a559	nexthop: Add implementation of resilient next-hop groups At this moment, there is only one type of next-hop group: an mpath group, which implements the hash-threshold algorithm. To select a next hop, hash-threshold algorithm first assigns a range of hashes to each next hop in the group, and then selects the next hop by comparing the SKB hash with the individual ranges. When a next hop is removed from the group, the ranges are recomputed, which leads to reassignment of parts of hash space from one next hop to another. While there will usually be some overlap between the previous and the new distribution, some traffic flows change the next hop that they resolve to. That causes problems e.g. as established TCP connections are reset, because the traffic is forwarded to a server that is not familiar with the connection. Resilient hashing is a technique to address the above problem. Resilient next-hop group has another layer of indirection between the group itself and its constituent next hops: a hash table. The selection algorithm uses a straightforward modulo operation to choose a hash bucket, and then reads the next hop that this bucket contains, and forwards traffic there. This indirection brings an important feature. In the hash-threshold algorithm, the range of hashes associated with a next hop must be continuous. With a hash table, mapping between the hash table buckets and the individual next hops is arbitrary. Therefore when a next hop is deleted the buckets that held it are simply reassigned to other next hops. When weights of next hops in a group are altered, it may be possible to choose a subset of buckets that are currently not used for forwarding traffic, and use those to satisfy the new next-hop distribution demands, keeping the "busy" buckets intact. This way, established flows are ideally kept being forwarded to the same endpoints through the same paths as before the next-hop group change. In a nutshell, the algorithm works as follows. Each next hop has a number of buckets that it wants to have, according to its weight and the number of buckets in the hash table. In case of an event that might cause bucket allocation change, the numbers for individual next hops are updated, similarly to how ranges are updated for mpath group next hops. Following that, a new "upkeep" algorithm runs, and for idle buckets that belong to a next hop that is currently occupying more buckets than it wants (it is "overweight"), it migrates the buckets to one of the next hops that has fewer buckets than it wants (it is "underweight"). If, after this, there are still underweight next hops, another upkeep run is scheduled to a future time. Chances are there are not enough "idle" buckets to satisfy the new demands. The algorithm has knobs to select both what it means for a bucket to be idle, and for whether and when to forcefully migrate buckets if there keeps being an insufficient number of idle buckets. There are three users of the resilient data structures. - The forwarding code accesses them under RCU, and does not modify them except for updating the time a selected bucket was last used. - Netlink code, running under RTNL, which may modify the data. - The delayed upkeep code, which may modify the data. This runs unlocked, and mutual exclusion between the RTNL code and the delayed upkeep is maintained by canceling the delayed work synchronously before the RTNL code touches anything. Later it restarts the delayed work if necessary. The RTNL code has to implement next-hop group replacement, next hop removal, etc. For removal, the mpath code uses a neat trick of having a backup next hop group structure, doing the necessary changes offline, and then RCU-swapping them in. However, the hash tables for resilient hashing are about an order of magnitude larger than the groups themselves (the size might be e.g. 4K entries), and it was felt that keeping two of them is an overkill. Both the primary next-hop group and the spare therefore use the same resilient table, and writers are careful to keep all references valid for the forwarding code. The hash table references next-hop group entries from the next-hop group that is currently in the primary role (i.e. not spare). During the transition from primary to spare, the table references a mix of both the primary group and the spare. When a next hop is deleted, the corresponding buckets are not set to NULL, but instead marked as empty, so that the pointer is valid and can be used by the forwarding code. The buckets are then migrated to a new next-hop group entry during upkeep. The only times that the hash table is invalid is the very beginning and very end of its lifetime. Between those points, it is always kept valid. This patch introduces the core support code itself. It does not handle notifications towards drivers, which are kept as if the group were an mpath one. It does not handle netlink either. The only bit currently exposed to user space is the new next-hop group type, and that is currently bounced. There is therefore no way to actually access this code. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-11 16:12:59 -08:00

1 2 3 4 5 ...

996683 Commits