linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-25 05:32:00 +00:00

Author	SHA1	Message	Date
Eric Dumazet	b01e1c0307	ipv6: fix possible race in __fib6_drop_pcpu_from() syzbot found a race in __fib6_drop_pcpu_from() [1] If compiler reads more than once (*ppcpu_rt), second read could read NULL, if another cpu clears the value in rt6_get_pcpu_route(). Add a READ_ONCE() to prevent this race. Also add rcu_read_lock()/rcu_read_unlock() because we rely on RCU protection while dereferencing pcpu_rt. [1] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000012: 0000 [#1] PREEMPT SMP KASAN PTI KASAN: null-ptr-deref in range [0x0000000000000090-0x0000000000000097] CPU: 0 PID: 7543 Comm: kworker/u8:17 Not tainted 6.10.0-rc1-syzkaller-00013-g2bfcfd584ff5 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/02/2024 Workqueue: netns cleanup_net RIP: 0010:__fib6_drop_pcpu_from.part.0+0x10a/0x370 net/ipv6/ip6_fib.c:984 Code: f8 48 c1 e8 03 80 3c 28 00 0f 85 16 02 00 00 4d 8b 3f 4d 85 ff 74 31 e8 74 a7 fa f7 49 8d bf 90 00 00 00 48 89 f8 48 c1 e8 03 <80> 3c 28 00 0f 85 1e 02 00 00 49 8b 87 90 00 00 00 48 8b 0c 24 48 RSP: 0018:ffffc900040df070 EFLAGS: 00010206 RAX: 0000000000000012 RBX: 0000000000000001 RCX: ffffffff89932e16 RDX: ffff888049dd1e00 RSI: ffffffff89932d7c RDI: 0000000000000091 RBP: dffffc0000000000 R08: 0000000000000005 R09: 0000000000000007 R10: 0000000000000001 R11: 0000000000000006 R12: ffff88807fa080b8 R13: fffffbfff1a9a07d R14: ffffed100ff41022 R15: 0000000000000001 FS: 0000000000000000(0000) GS:ffff8880b9200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b32c26000 CR3: 000000005d56e000 CR4: 00000000003526f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> __fib6_drop_pcpu_from net/ipv6/ip6_fib.c:966 [inline] fib6_drop_pcpu_from net/ipv6/ip6_fib.c:1027 [inline] fib6_purge_rt+0x7f2/0x9f0 net/ipv6/ip6_fib.c:1038 fib6_del_route net/ipv6/ip6_fib.c:1998 
[inline] fib6_del+0xa70/0x17b0 net/ipv6/ip6_fib.c:2043 fib6_clean_node+0x426/0x5b0 net/ipv6/ip6_fib.c:2205 fib6_walk_continue+0x44f/0x8d0 net/ipv6/ip6_fib.c:2127 fib6_walk+0x182/0x370 net/ipv6/ip6_fib.c:2175 fib6_clean_tree+0xd7/0x120 net/ipv6/ip6_fib.c:2255 __fib6_clean_all+0x100/0x2d0 net/ipv6/ip6_fib.c:2271 rt6_sync_down_dev net/ipv6/route.c:4906 [inline] rt6_disable_ip+0x7ed/0xa00 net/ipv6/route.c:4911 addrconf_ifdown.isra.0+0x117/0x1b40 net/ipv6/addrconf.c:3855 addrconf_notify+0x223/0x19e0 net/ipv6/addrconf.c:3778 notifier_call_chain+0xb9/0x410 kernel/notifier.c:93 call_netdevice_notifiers_info+0xbe/0x140 net/core/dev.c:1992 call_netdevice_notifiers_extack net/core/dev.c:2030 [inline] call_netdevice_notifiers net/core/dev.c:2044 [inline] dev_close_many+0x333/0x6a0 net/core/dev.c:1585 unregister_netdevice_many_notify+0x46d/0x19f0 net/core/dev.c:11193 unregister_netdevice_many net/core/dev.c:11276 [inline] default_device_exit_batch+0x85b/0xae0 net/core/dev.c:11759 ops_exit_list+0x128/0x180 net/core/net_namespace.c:178 cleanup_net+0x5b7/0xbf0 net/core/net_namespace.c:640 process_one_work+0x9fb/0x1b60 kernel/workqueue.c:3231 process_scheduled_works kernel/workqueue.c:3312 [inline] worker_thread+0x6c8/0xf70 kernel/workqueue.c:3393 kthread+0x2c1/0x3a0 kernel/kthread.c:389 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 Fixes: `d52d3997f8` ("ipv6: Create percpu rt6_info") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/r/20240604193549.981839-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-06-06 13:05:54 +02:00
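The single-read pattern this fix describes can be sketched in userspace as follows. This is a minimal illustration, not the kernel code: the `READ_ONCE()` macro here is a simplified stand-in built on a volatile access (GNU C `__typeof__`), `struct pcpu_route` and `drop_pcpu_from()` are hypothetical names, and the RCU read-side locking added by the commit is omitted.

```c
#include <stddef.h>

/* Userspace stand-in for the kernel's READ_ONCE(): the volatile access
 * forces the compiler to emit exactly one load of x. */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

struct pcpu_route {              /* hypothetical stand-in for struct rt6_info */
	int from;
};

/* Returns the route's 'from' field, or -1 if the per-cpu slot is empty. */
int drop_pcpu_from(struct pcpu_route **slot)
{
	/* Single load: the NULL check and the dereference below use the same
	 * value, even if another CPU clears *slot concurrently. */
	struct pcpu_route *rt = READ_ONCE(*slot);

	if (!rt)
		return -1;
	return rt->from;
}
```

Without the local copy, the compiler would be free to load `*slot` once for the check and again for the dereference, which is the window the syzbot report hit.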
Paolo Abeni	411c0ea696	Merge branch 'af_unix-fix-lockless-access-of-sk-sk_state-and-others-fields' Kuniyuki Iwashima says: ==================== af_unix: Fix lockless access of sk->sk_state and others fields. The patch 1 fixes a bug where SOCK_DGRAM's sk->sk_state is changed to TCP_CLOSE even if the socket is connect()ed to another socket. The rest of this series annotates lockless accesses to the following fields. * sk->sk_state * sk->sk_sndbuf * net->unx.sysctl_max_dgram_qlen * sk->sk_receive_queue.qlen * sk->sk_shutdown Note that with this series there is skb_queue_empty() left in unix_dgram_disconnected() that needs to be changed to lockless version, and unix_peer(other) access there should be protected by unix_state_lock(). This will require some refactoring, so another series will follow. Changes: v2: * Patch 1: Fix wrong double lock v1: https://lore.kernel.org/netdev/20240603143231.62085-1-kuniyu@amazon.com/ ==================== Link: https://lore.kernel.org/r/20240604165241.44758-1-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-06-06 12:57:18 +02:00
Kuniyuki Iwashima	efaf24e30e	af_unix: Annotate data-race of sk->sk_shutdown in sk_diag_fill(). While dumping sockets via UNIX_DIAG, we do not hold unix_state_lock(). Let's use READ_ONCE() to read sk->sk_shutdown. Fixes: `e4e541a848` ("sock-diag: Report shutdown for inet and unix sockets (v2)") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-06-06 12:57:15 +02:00
Kuniyuki Iwashima	5d915e584d	af_unix: Use skb_queue_len_lockless() in sk_diag_show_rqlen(). We can dump the socket queue length via UNIX_DIAG by specifying UDIAG_SHOW_RQLEN. If sk->sk_state is TCP_LISTEN, we return the recv queue length, but here we do not hold recvq lock. Let's use skb_queue_len_lockless() in sk_diag_show_rqlen(). Fixes: `c9da99e647` ("unix_diag: Fixup RQLEN extension report") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-06-06 12:57:15 +02:00
Kuniyuki Iwashima	83690b82d2	af_unix: Use skb_queue_empty_lockless() in unix_release_sock(). If the socket type is SOCK_STREAM or SOCK_SEQPACKET, unix_release_sock() checks the length of the peer socket's recvq under unix_state_lock(). However, unix_stream_read_generic() calls skb_unlink() after releasing the lock. Also, for SOCK_SEQPACKET, __skb_try_recv_datagram() unlinks skb without unix_state_lock(). Thus, unix_state_lock() does not protect qlen. Let's use skb_queue_empty_lockless() in unix_release_sock(). Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-06-06 12:57:15 +02:00
Kuniyuki Iwashima	45d872f0e6	af_unix: Use unix_recvq_full_lockless() in unix_stream_connect(). Once sk->sk_state is changed to TCP_LISTEN, it never changes. unix_accept() takes advantage of this characteristic; it does not hold the listener's unix_state_lock() and only acquires the recvq lock to pop one skb. It means unix_state_lock() does not prevent the queue length from changing in unix_stream_connect(). Thus, we need to use unix_recvq_full_lockless() to avoid a data-race. Now we remove unix_recvq_full() as no one uses it. Note that we can remove READ_ONCE() for sk->sk_max_ack_backlog in unix_recvq_full_lockless() because of the following reasons: (1) For SOCK_DGRAM, it is a written-once field in unix_create1() (2) For SOCK_STREAM and SOCK_SEQPACKET, it is changed under the listener's unix_state_lock() in unix_listen(), and we hold the lock in unix_stream_connect() Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-06-06 12:57:15 +02:00
Kuniyuki Iwashima	bd9f2d0573	af_unix: Annotate data-race of net->unx.sysctl_max_dgram_qlen. net->unx.sysctl_max_dgram_qlen is exposed as a sysctl knob and can be changed concurrently. Let's use READ_ONCE() in unix_create1(). Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-06-06 12:57:15 +02:00
Kuniyuki Iwashima	b0632e53e0	af_unix: Annotate data-races around sk->sk_sndbuf. sk_setsockopt() changes sk->sk_sndbuf under lock_sock(), but it's not used in af_unix.c. Let's use READ_ONCE() to read sk->sk_sndbuf in unix_writable(), unix_dgram_sendmsg(), and unix_stream_sendmsg(). Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-06-06 12:57:15 +02:00
Kuniyuki Iwashima	0aa3be7b3e	af_unix: Annotate data-races around sk->sk_state in UNIX_DIAG. While dumping AF_UNIX sockets via UNIX_DIAG, sk->sk_state is read locklessly. Let's use READ_ONCE() there. Note that the result could be inconsistent if the socket is dumped during the state change. This is common for other SOCK_DIAG and similar interfaces. Fixes: `c9da99e647` ("unix_diag: Fixup RQLEN extension report") Fixes: `2aac7a2cb0` ("unix_diag: Pending connections IDs NLA") Fixes: `45a96b9be6` ("unix_diag: Dumping all sockets core") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-06-06 12:57:15 +02:00
Kuniyuki Iwashima	af4c733b6b	af_unix: Annotate data-race of sk->sk_state in unix_stream_read_skb(). unix_stream_read_skb() is called from sk->sk_data_ready() context where unix_state_lock() is not held. Let's use READ_ONCE() there. Fixes: `77462de14a` ("af_unix: Add read_sock for stream socket types") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-06-06 12:57:14 +02:00
Kuniyuki Iwashima	8a34d4e8d9	af_unix: Annotate data-races around sk->sk_state in sendmsg() and recvmsg(). The following functions read sk->sk_state locklessly and proceed only if the state is TCP_ESTABLISHED. * unix_stream_sendmsg * unix_stream_read_generic * unix_seqpacket_sendmsg * unix_seqpacket_recvmsg Let's use READ_ONCE() there. Fixes: `a05d2ad1c1` ("af_unix: Only allow recv on connected seqpacket sockets.") Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-06-06 12:57:14 +02:00
Kuniyuki Iwashima	1b536948e8	af_unix: Annotate data-race of sk->sk_state in unix_accept(). Once sk->sk_state is changed to TCP_LISTEN, it never changes. unix_accept() takes the advantage and reads sk->sk_state without holding unix_state_lock(). Let's use READ_ONCE() there. Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-06-06 12:57:14 +02:00
Kuniyuki Iwashima	a9bf9c7dc6	af_unix: Annotate data-race of sk->sk_state in unix_stream_connect(). As a small optimisation, unix_stream_connect() prefetches the client's sk->sk_state without unix_state_lock() and checks if it's TCP_CLOSE. Later, sk->sk_state is checked again under unix_state_lock(). Let's use READ_ONCE() for the first check and TCP_CLOSE directly for the second check. Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-06-06 12:57:14 +02:00
Kuniyuki Iwashima	eb0718fb3e	af_unix: Annotate data-races around sk->sk_state in unix_write_space() and poll(). unix_poll() and unix_dgram_poll() read sk->sk_state locklessly and call unix_writable() which also reads sk->sk_state without holding unix_state_lock(). Let's use READ_ONCE() in unix_poll() and unix_dgram_poll() and pass it to unix_writable(). While at it, we remove TCP_SYN_SENT check in unix_dgram_poll() as that state does not exist for AF_UNIX socket since the code was added. Fixes: `1586a5877d` ("af_unix: do not report POLLOUT on listeners") Fixes: `3c73419c09` ("af_unix: fix 'poll for write'/ connected DGRAM sockets") Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-06-06 12:57:14 +02:00
Kuniyuki Iwashima	3a0f38eb28	af_unix: Annotate data-race of sk->sk_state in unix_inq_len(). ioctl(SIOCINQ) calls unix_inq_len() that checks sk->sk_state first and returns -EINVAL if it's TCP_LISTEN. Then, for SOCK_STREAM sockets, unix_inq_len() returns the number of bytes in recvq. However, unix_inq_len() does not hold unix_state_lock(), and the concurrent listen() might change the state after checking sk->sk_state. If the race occurs, 0 is returned for the listener, instead of -EINVAL, because the length of skb with embryo is 0. We could hold unix_state_lock() in unix_inq_len(), but it's overkill given the result is true for pre-listen() TCP_CLOSE state. So, let's use READ_ONCE() for sk->sk_state in unix_inq_len(). Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-06-06 12:57:14 +02:00
Kuniyuki Iwashima	942238f973	af_unix: Annotate data-races around sk->sk_state for writers. sk->sk_state is changed under unix_state_lock(), but it's read locklessly in many places. This patch adds WRITE_ONCE() on the writer side. We will add READ_ONCE() to the lockless readers in the following patches. Fixes: `83301b5367` ("af_unix: Set TCP_ESTABLISHED for datagram sockets too") Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-06-06 12:57:14 +02:00
Kuniyuki Iwashima	26bfb8b570	af_unix: Set sk->sk_state under unix_state_lock() for truly disconnected peer. When a SOCK_DGRAM socket connect()s to another socket, both sockets' sk->sk_state are changed to TCP_ESTABLISHED so that we can register them to BPF SOCKMAP. When the socket disconnects from the peer by connect(AF_UNSPEC), the state is set back to TCP_CLOSE. Then, the peer's state is also set to TCP_CLOSE, but the update is done locklessly and unconditionally. Let's say socket A connect()ed to B, B connect()ed to C, and A disconnects from B. After the first two connect()s, all three sockets' sk->sk_state are TCP_ESTABLISHED: $ ss -xa Netid State Recv-Q Send-Q Local Address:Port Peer Address:PortProcess u_dgr ESTAB 0 0 @A 641 * 642 u_dgr ESTAB 0 0 @B 642 * 643 u_dgr ESTAB 0 0 @C 643 * 0 And after the disconnect, B's state is TCP_CLOSE even though it's still connected to C and C's state is TCP_ESTABLISHED. $ ss -xa Netid State Recv-Q Send-Q Local Address:Port Peer Address:PortProcess u_dgr UNCONN 0 0 @A 641 * 0 u_dgr UNCONN 0 0 @B 642 * 643 u_dgr ESTAB 0 0 @C 643 * 0 In this case, we cannot register B to SOCKMAP. So, when a socket disconnects from the peer, we should not set TCP_CLOSE to the peer if the peer is connected to yet another socket, and this must be done under unix_state_lock(). Note that we use WRITE_ONCE() for sk->sk_state as there are many lockless readers. These data-races will be fixed in the following patches. Fixes: `83301b5367` ("af_unix: Set TCP_ESTABLISHED for datagram sockets too") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-06-06 12:57:14 +02:00
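The A/B/C scenario in this commit message can be modelled with a small state machine. The sketch below is an assumption-laden illustration: `struct dgram_sock`, `model_connect()` and `model_disconnect()` are hypothetical names, the two-value enum stands in for TCP_CLOSE and TCP_ESTABLISHED, and the locking that the real fix relies on is reduced to a comment.

```c
#include <stddef.h>

enum sk_state { ST_CLOSE, ST_ESTABLISHED };  /* models TCP_CLOSE / TCP_ESTABLISHED */

struct dgram_sock {                          /* hypothetical model, not struct sock */
	enum sk_state state;
	struct dgram_sock *peer;
};

/* connect(): both ends become ESTABLISHED, as the commit message describes. */
void model_connect(struct dgram_sock *sk, struct dgram_sock *other)
{
	sk->peer = other;
	sk->state = ST_ESTABLISHED;
	other->state = ST_ESTABLISHED;
}

/* connect(AF_UNSPEC): close sk, and close the old peer only if that peer
 * is not itself connected to yet another socket. */
void model_disconnect(struct dgram_sock *sk)
{
	struct dgram_sock *old_peer = sk->peer;

	sk->peer = NULL;
	sk->state = ST_CLOSE;

	/* In the kernel, this check and write happen under the old peer's
	 * unix_state_lock(); the lock is omitted in this model. */
	if (old_peer && !old_peer->peer)
		old_peer->state = ST_CLOSE;
}
```

With A connected to B and B connected to C, calling `model_disconnect()` on A leaves B established because B still has a peer, which is the behaviour the commit message asks for.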
Eric Dumazet	98aa546af5	inet: remove (struct uncached_list)->quarantine This list is used to transfer dst that are handled by rt_flush_dev() and rt6_uncached_list_flush_dev() out of the per-cpu lists. But the quarantine list is not used later. If we simply use list_del_init(&rt->dst.rt_uncached), this also removes the dst from per-cpu list. This patch also makes the future calls to rt_del_uncached_list() and rt6_uncached_list_del() faster, because no spinlock acquisition is needed anymore. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240604165150.726382-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-06-06 12:33:25 +02:00
Eric Dumazet	b4cb4a1391	net: use unrcu_pointer() helper Toke mentioned unrcu_pointer() existence, allowing to remove some of the ugly casts we have when using xchg() for rcu protected pointers. Also make inet_rcv_compat const. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Toke Høiland-Jørgensen <toke@redhat.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/r/20240604111603.45871-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-06-06 11:52:52 +02:00
Aleksandr Mishin	b0c9a26435	net: wwan: iosm: Fix tainted pointer delete in case of region creation fail In case of region creation fail in ipc_devlink_create_region(), previously created regions delete process starts from tainted pointer which actually holds error code value. Fix this bug by decreasing region index before delete. Found by Linux Verification Center (linuxtesting.org) with SVACE. Fixes: `4dcd183fbd` ("net: wwan: iosm: devlink registration") Signed-off-by: Aleksandr Mishin <amishin@t-argos.ru> Acked-by: Sergey Ryazanov <ryazanov.s.a@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240604082500.20769-1-amishin@t-argos.ru Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-06-06 10:15:14 +02:00
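The unwind rule described here (step the index back before destroying, because the failing slot holds an error value rather than a region) can be sketched as below. All names are hypothetical, and failure is modelled as a NULL return instead of the kernel's error-encoded pointer; this is not the iosm driver code.

```c
#include <stddef.h>

#define NREGIONS 4

static int live[NREGIONS];      /* 1 while region i exists */
static int destroyed;           /* how many regions were destroyed */

/* Hypothetical creator: fails (returns NULL) at index fail_at. */
static int *region_create(int i, int fail_at)
{
	if (i == fail_at)
		return NULL;
	live[i] = 1;
	return &live[i];
}

static void region_destroy(int *region)
{
	*region = 0;
	destroyed++;
}

/* Returns 0 on success, -1 if any creation failed (after unwinding). */
int create_regions(int *regions[NREGIONS], int fail_at)
{
	int i;

	for (i = 0; i < NREGIONS; i++) {
		regions[i] = region_create(i, fail_at);
		if (!regions[i]) {
			/* Step back first: slot i holds the failure value,
			 * not a region, so unwinding starts at i - 1. */
			for (i--; i >= 0; i--)
				region_destroy(regions[i]);
			return -1;
		}
	}
	return 0;
}
```

Starting the cleanup loop at `i` instead of `i - 1` would pass the failure value itself to the destroy function, which is the bug the commit describes.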
Paolo Abeni	59d0f48160	Merge branch 'improve-gbeth-performance-on-renesas-rz-g2l-and-related-socs' Paul Barker says: ==================== Improve GbEth performance on Renesas RZ/G2L and related SoCs This series aims to improve performance of the GbEth IP in the Renesas RZ/G2L SoC family and the RZ/G3S SoC, which use the ravb driver. Along the way, we do some refactoring and ensure that napi_complete_done() is used in accordance with the NAPI documentation for both GbEth and R-Car code paths. Much of the performance improvement comes from enabling SW IRQ Coalescing for all SoCs using the GbEth IP, and NAPI Threaded mode for single core SoCs using the GbEth IP. These can be enabled/disabled at runtime via sysfs, but our goal is to set sensible defaults which get good performance on the affected SoCs. The rest of the performance improvement comes from using a page pool to allocate RX buffers, and reducing the allocation size from >8kB to 2kB. The overall performance impact of this patch series seen in testing with iperf3 is as follows (see patches 5-7 for more detailed results): * RZ/G2L: * TCP TX: +1.8% bandwidth * TCP RX: +1% bandwidth at 47% less CPU load * UDP RX: +1% bandwidth at 26% less CPU load * RZ/G2UL: * TCP TX: +37% bandwidth * TCP RX: +43% bandwidth * UDP TX: -8% bandwidth * UDP RX: +32500% bandwidth (!) * RZ/G3S: * TCP TX: +25% bandwidth * TCP RX: +76% bandwidth * UDP TX: -9% bandwidth * UDP RX: +37900% bandwidth (!) * RZ/Five: * TCP TX: +18% bandwidth * TCP RX: +212% bandwidth * UDP TX: +2% bandwidth * UDP RX: +inf bandwidth (test no longer crashes) There is no significant impact on bandwidth or CPU load in testing on RZ/G2H or R-Car M3N. Fixing the crash in UDP RX testing for RZ/Five is a cumulative effect of patches 1, 2, 5 & 6 so this is very difficult to break out as a bugfix for backporting. Changes v4->v5: * Added Sergey's Reviewed-by tags. * Improved the commit message for patch 2/7. 
* Re-wrapped to 80 cols, except where this would significantly impact readability. * Use lower case `skb` consistently in comments. * Included <net/page_pool/types.h> in ravb.h. * Moved rx_buffer_size so it is in the same place in ravb_hw_info as rx_max_desc_use was previously. * Used reverse xmas tree ordering in variable declarations. * Split lines after binary operators, instead of before. * Factor subtraction of sizeof(__sum16) out of the if condition in ravb_rx_csum_gbeth(). * Add blank lines after variable declarations where needed. * Used goto instead of break to handle napi_build_skb() failure in ravb_rx_gbeth(). Break was incorrectly scoped to the surrounding switch statement, when it's the outer loop we really want to break out of. * Used continue instead of break to handle NULL priv->rx_1st_skb in ravb_rx_gbeth() as we may still be able to process further descriptors. * Unconditionally set priv->rx_1st_skb = NULL after processing a packet in ravb_rx_gbeth(). We don't need to check die_dt as this will be a no-op for single descriptor packets. * Moved napi_build_skb() call after dma_sync_single_for_cpu() in ravb_rx_rcar() to align the order of operations with ravb_rx_gbeth() and ensure the data is sync'd before it is accessed. * Moved zeroing of rx_buff->page to the end of packet processing in ravb_rx_rcar() to align the order of operations with ravb_rx_gbeth(). Changes v3->v4: * Dependency patches have merged so this is no longer an RFC. * Fixed update of stats->rx_packets. * Simplified refactoring following feedback from Niklas and Sergey. * Renamed needs_irq_coalesce -> coalesce_irqs. * Used a separate page pool for each RX queue. * Passed struct ravb_rx_desc to ravb_alloc_rx_buffer() so that we can simplify the calling function. * Explained the calculation of rx_desc->ds_cc. * Added handling of nonlinear SKBs in ravb_rx_csum_gbeth(). * Used Niklas' suggested commit message for patch 2/7. * Added Sergey's Reviewed-by tags to patches 5/7 and 6/7. 
Changes v2->v3: * Incorporated feedback on RFC v2 from Sergey. * Split out bugfixes and rebased. This changed the order of what was the first 5 patches of v2 and things look a little different so I've not picked up Reviewed-by tags from v2. * Further refactoring and tidy up of RX ring refill and ravb_rx_gbeth(). * Switched to using a page pool to allocate RX buffers. * Re-tested and provided updated performance figures. Changes v1->v2: * Marked as RFC as the series depends on unmerged patches. * Refactored R-Car code paths as well as GbEth code paths. * Updated references to the patches this series depends on. ==================== Link: https://lore.kernel.org/r/20240604072825.7490-1-paul.barker.ct@bp.renesas.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-06-06 10:00:02 +02:00
Paul Barker	966726324b	net: ravb: Allocate RX buffers via page pool This patch makes multiple changes that can't be separated: 1) Allocate plain RX buffers via a page pool instead of allocating SKBs, then use build_skb() when a packet is received. 2) For GbEth IP, reduce the RX buffer size to 2kB. 3) For GbEth IP, merge packets which span more than one RX descriptor as SKB fragments instead of copying data. Implementing (1) without (2) would require the use of an order-1 page pool (instead of an order-0 page pool split into page fragments) for GbEth. Implementing (2) without (3) would leave us no space to re-assemble packets which span more than one RX descriptor. Implementing (3) without (1) would not be possible as the network stack expects to use put_page() or page_pool_put_page() to free SKB fragments after an SKB is consumed. RX checksum offload support is adjusted to handle both linear and nonlinear (fragmented) packets. This patch gives the following improvements during testing with iperf3. * RZ/G2L: * TCP RX: same bandwidth at -43% CPU load (70% -> 40%) * UDP RX: same bandwidth at -17% CPU load (88% -> 74%) * RZ/G2UL: * TCP RX: +30% bandwidth (726Mbps -> 941Mbps) * UDP RX: +417% bandwidth (108Mbps -> 558Mbps) * RZ/G3S: * TCP RX: +64% bandwidth (562Mbps -> 920Mbps) * UDP RX: +420% bandwidth (90Mbps -> 468Mbps) * RZ/Five: * TCP RX: +217% bandwidth (145Mbps -> 459Mbps) * UDP RX: +470% bandwidth (20Mbps -> 114Mbps) There is no significant impact on bandwidth or CPU load in testing on RZ/G2H or R-Car M3N. Signed-off-by: Paul Barker <paul.barker.ct@bp.renesas.com> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-06-06 09:59:58 +02:00
Paul Barker	65c482bc22	net: ravb: Use NAPI threaded mode on 1-core CPUs with GbEth IP NAPI Threaded mode (along with the previously enabled SW IRQ Coalescing) is required to improve network stack performance for single core SoCs using the GbEth IP (currently the RZ/G2L SoC family and the RZ/G3S SoC). This patch gives the following improvements during testing with iperf3. * RZ/G2UL: * TCP TX: +32% bandwidth (638Mbps -> 841Mbps) * TCP RX: +8.8% bandwidth (667Mbps -> 726Mbps) * UDP RX: +104% bandwidth (53Mbps -> 108Mbps) * RZ/G3S: * TCP TX: +29% bandwidth (529Mbps -> 681Mbps) * UDP RX: +1290% bandwidth (6.46Mbps -> 90Mbps) * RZ/Five: * UDP RX: Test no longer crashes (0 -> 20 Mbps) This patch gives the following reductions in performance in the same testing: * RZ/G2UL: * UDP TX: -7.5% bandwidth (594Mbps -> 549Mbps) * RZ/G3S: * UDP TX: -5% bandwidth (625Mbps -> 594Mbps) These losses are considered acceptable given the benefits shown above. If UDP TX bandwidth must be maximised for a particular use case, NAPI threaded mode can be disabled at runtime via sysfs writes. The improvement of UDP RX bandwidth for the single core SoCs (RZ/G2UL & RZ/G3S) is particularly critical. Signed-off-by: Paul Barker <paul.barker.ct@bp.renesas.com> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-06-06 09:59:58 +02:00
Paul Barker	7b39c1814c	net: ravb: Enable SW IRQ Coalescing for GbEth Software IRQ Coalescing is required to improve network stack performance in the RZ/G2L SoC family and the RZ/G3S SoC, i.e. the SoCs which use the GbEth IP. This patch gives the following improvements during testing with iperf3: * RZ/G2L: * TCP RX: same bandwidth with -6% CPU load (76% -> 71%) * UDP RX: same bandwidth with -10% CPU load (99% -> 89%) * RZ/G2UL: * UDP RX: +4200% bandwidth (1.23Mbps -> 53Mbps) * RZ/G3S: * UDP RX: +425% bandwidth (1.23Mbps -> 6.46Mbps) The improvement of UDP RX bandwidth for the single core SoCs (RZ/G2UL & RZ/G3S) is particularly critical. Signed-off-by: Paul Barker <paul.barker.ct@bp.renesas.com> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-06-06 09:59:58 +02:00
Paul Barker	3ee43f09cb	net: ravb: Refactor GbEth RX code path We can reduce code duplication in ravb_rx_gbeth(). Signed-off-by: Paul Barker <paul.barker.ct@bp.renesas.com> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-06-06 09:59:57 +02:00
Paul Barker	37a01c12e9	net: ravb: Refactor RX ring refill To reduce code duplication, we add a new RX ring refill function which can handle both the initial RX ring population (which was split between ravb_ring_init() and ravb_ring_format()) and the RX ring refill after polling (in ravb_rx()). Signed-off-by: Paul Barker <paul.barker.ct@bp.renesas.com> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-06-06 09:59:57 +02:00
Paul Barker	b0e0e20dc6	net: ravb: Align poll function with NAPI docs Align ravb_poll() with the documentation in `Documentation/networking/kapi.rst` and `Documentation/networking/napi.rst`. The documentation says that we should prefer napi_complete_done() over napi_complete(), and using the former allows us to properly support busy polling. We should ensure that napi_complete_done() is only called if the work budget has not been exhausted, and we should only re-arm interrupts if it returns true. Signed-off-by: Paul Barker <paul.barker.ct@bp.renesas.com> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-06-06 09:59:57 +02:00
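The two rules stated in this commit message (complete only when the budget was not exhausted, and re-arm interrupts only when completion returns true) can be sketched with a userspace model. Everything here is a hypothetical stand-in: `struct model_napi`, `model_rx()`, `model_complete_done()` and `model_poll()` are illustrative names, and the completion result is a plain flag rather than the real napi_complete_done() logic.

```c
#include <stdbool.h>

struct model_napi {              /* hypothetical stand-in for driver state */
	int pending;             /* packets waiting in the RX ring */
	bool complete_ok;        /* result the completion stand-in reports */
	bool irq_enabled;
};

/* Process up to 'budget' packets; return how many were handled. */
static int model_rx(struct model_napi *n, int budget)
{
	int done = n->pending < budget ? n->pending : budget;

	n->pending -= done;
	return done;
}

/* Stand-in for napi_complete_done(), which returns a bool in the kernel. */
static bool model_complete_done(struct model_napi *n, int work_done)
{
	(void)work_done;
	return n->complete_ok;
}

int model_poll(struct model_napi *n, int budget)
{
	int work_done = model_rx(n, budget);

	/* If the budget was exhausted, stay in polling mode and do not
	 * complete; otherwise re-arm only when completion succeeded. */
	if (work_done < budget && model_complete_done(n, work_done))
		n->irq_enabled = true;

	return work_done;
}
```

Passing the budget by value and returning `work_done` also mirrors the simplification described in the "Simplify poll & receive functions" commit of the same series.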
Paul Barker	118e640af3	net: ravb: Simplify poll & receive functions We don't need to pass the work budget to ravb_rx() by reference, it's cleaner to pass this by value and return the amount of work done. This allows us to simplify the ravb_poll() function and use the common `work_done` variable name seen in other network drivers for consistency and ease of understanding. This is a pure refactor and should not affect behaviour. Signed-off-by: Paul Barker <paul.barker.ct@bp.renesas.com> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-06-06 09:59:57 +02:00
Jakub Kicinski	7da375e2c7	Merge branch 'net-mlx5e-shampo-enable-hw-gro-once-more' Tariq Toukan says: ==================== net/mlx5e: SHAMPO, Enable HW GRO once more This series enables hardware GRO for ConnectX-7 and newer NICs. SHAMPO stands for Split Header And Merge Payload Offload. The first part of the series contains important fixes and improvements. The second part reworks the HW GRO counters. Lastly, HW GRO is perf optimized and enabled. Here are the bandwidth numbers for a simple iperf3 test over a single rq where the application and irq are pinned to the same CPU: +---------+--------+--------+-----------+-------------+ \| streams \| SW GRO \| HW GRO \| Unit \| Improvement \| +---------+--------+--------+-----------+-------------+ \| 1 \| 36 \| 57 \| Gbits/sec \| 1.6 x \| \| 4 \| 34 \| 50 \| Gbits/sec \| 1.5 x \| \| 8 \| 31 \| 43 \| Gbits/sec \| 1.4 x \| +---------+--------+--------+-----------+-------------+ Benchmark details: VM based setup CPU: Intel(R) Xeon(R) Platinum 8380 CPU, 24 cores NIC: ConnectX-7 100GbE iperf3 and irq running on same CPU over a single receive queue ==================== Link: https://lore.kernel.org/r/20240603212219.1037656-1-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-05 20:20:47 -07:00
Dragos Tatulea	14ae2fd12b	net/mlx5e: SHAMPO, Coalesce skb fragments to page size When doing hardware GRO (SHAMPO), the driver puts each data payload of a packet from the wire into one skb fragment. TCP Zero-Copy expects page sized skb fragments to be able to do its page-flipping magic. With the current way of arranging fragments by the driver, only specific MTUs (page sized multiple + header size) will yield such page sized fragments in a high percentage. This change improves payload arrangement in the skb for hardware GRO by coalescing payloads into a single skb fragment when possible. To demonstrate the fix, running tcp_mmap with an MTU of 1500 yields: - Before: 0 % bytes mmap'ed - After : 81 % bytes mmap'ed More importantly, coalescing considerably improves the HW GRO performance. Here are the results for an iperf3 bandwidth benchmark: +---------+--------+--------+------------------------+-----------+ \| streams \| SW GRO \| HW GRO \| HW GRO with coalescing \| Unit \| \|---------+--------+--------+------------------------+-----------\| \| 1 \| 36 \| 42 \| 57 \| Gbits/sec \| \| 4 \| 34 \| 39 \| 50 \| Gbits/sec \| \| 8 \| 31 \| 35 \| 43 \| Gbits/sec \| +---------+--------+--------+------------------------+-----------+ Benchmark details: VM based setup CPU: Intel(R) Xeon(R) Platinum 8380 CPU, 24 cores NIC: ConnectX-7 100GbE iperf3 and irq running on same CPU over a single receive queue Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240603212219.1037656-15-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-05 20:20:46 -07:00
Yoray Zack	99be56171f	net/mlx5e: SHAMPO, Re-enable HW-GRO Add back HW-GRO to the reported features. As the current implementation of HW-GRO uses KSMs with a specific fixed buffer size (256B) to map its headers buffer, we reported the feature only if the NIC is supporting KSM and the minimum value for buffer size is below the requested one. iperf3 bandwidth comparison: +---------+--------+--------+-----------+ \| streams \| SW GRO \| HW GRO \| Unit \| \|---------+--------+--------+-----------\| \| 1 \| 36 \| 42 \| Gbits/sec \| \| 4 \| 34 \| 39 \| Gbits/sec \| \| 8 \| 31 \| 35 \| Gbits/sec \| +---------+--------+--------+-----------+ A downstream patch will add skb fragment coalescing which will improve performance considerably. Benchmark details: VM based setup CPU: Intel(R) Xeon(R) Platinum 8380 CPU, 24 cores NIC: ConnectX-7 100GbE iperf3 and irq running on same CPU over a single receive queue Signed-off-by: Yoray Zack <yorayz@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240603212219.1037656-14-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-05 20:20:46 -07:00
Yoray Zack	758191c9ea	net/mlx5e: SHAMPO, Use KSMs instead of KLMs KSM Mkey is KLM Mkey with a fixed buffer size. Due to this fact, it is a faster mechanism than KLM. SHAMPO feature used KLMs Mkeys for memory mappings of its headers buffer. As it used KLMs with the same buffer size for each entry, we can use KSMs instead. This commit changes the Mkeys that map the SHAMPO headers buffer from KLMs to KSMs. Signed-off-by: Yoray Zack <yorayz@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240603212219.1037656-13-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-05 20:20:46 -07:00
Tariq Toukan	e95c5b9e89	net/mlx5e: SHAMPO, Add header-only ethtool counters for header data split Count the number of header-only packets and bytes from SHAMPO. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240603212219.1037656-12-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-05 20:20:46 -07:00
Dragos Tatulea	16f448d47a	net/mlx5e: SHAMPO, Drop rx_gro_match_packets counter After modifying rx_gro_packets to be more accurate, the rx_gro_match_packets counter is redundant. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240603212219.1037656-11-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-05 20:20:46 -07:00
Dragos Tatulea	8f9eb8bb5c	net/mlx5e: SHAMPO, Make GRO counters more precise Don't count non GRO packets. A non GRO packet is a packet with a GRO cb count of 1. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240603212219.1037656-10-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-05 20:20:46 -07:00
Yoray Zack	f5a699e00f	net/mlx5e: SHAMPO, Skipping on duplicate flush of the same SHAMPO SKB SHAMPO SKB can be flushed in mlx5e_shampo_complete_rx_cqe(). If the SKB was flushed, rq->hw_gro_data->skb was also set to NULL. We can skip on flushing the SKB in mlx5e_shampo_flush_skb if rq->hw_gro_data->skb == NULL. Signed-off-by: Yoray Zack <yorayz@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240603212219.1037656-9-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-05 20:20:46 -07:00
Dragos Tatulea	d34d7d1973	net/mlx5e: SHAMPO, Specialize mlx5e_fill_skb_data() mlx5e_fill_skb_data() used to have multiple callers. But after the XDP multibuf refactoring from commit `2cb0e27d43` ("net/mlx5e: RX, Prepare non-linear striding RQ for XDP multi-buffer support") the SHAMPO code path is the only caller. Take advantage of this and specialize the function: - Drop the redundant check. - Assume that data_bcnt is > 0. This is needed in a downstream patch. Rename the function as well to make things clear. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Suggested-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240603212219.1037656-8-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-05 20:20:46 -07:00
Dragos Tatulea	e839ac9a89	net/mlx5e: SHAMPO, Simplify header page release in teardown The function that releases SHAMPO header pages (mlx5e_shampo_dealloc_hd) has some complicated logic that comes from the fact that it is called twice during teardown: 1) To release the posted header pages that didn't get any completions. 2) To release all remaining header pages. This flow is not necessary: all header pages can be released from the driver side in one go. Furthermore, the above flow is buggy. Taking the 8 headers per page example: 1) Release fragments 5-7. Page will be released. 2) Release remaining fragments 0-4. The bits in the header will indicate that the page needs releasing. But this is incorrect: page was released in step 1. This patch releases all header pages in one go. This simplifies the header page cleanup function. For consistency, the datapath header page release API (mlx5e_free_rx_shampo_hd_entry()) is used. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240603212219.1037656-7-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-05 20:20:46 -07:00
Dragos Tatulea	083dbb54c4	net/mlx5e: SHAMPO, Disable gso_size for non GRO packets When HW GRO is enabled, forwarding of packets is broken due to gso_size being set incorrectly on non GRO packets. Non GRO packets have a skb GRO count of 1. mlx5 always sets gso_size on the skb, even for non GRO packets. It leans on the fact that gso_size is normally reset in napi_gro_complete(). But this happens only for packets from GRO'able protocols (TCP/UDP) that have a gro_receive() handler. The problematic scenarios are: 1) Non GRO protocol packets are received, validate_xmit_skb() will drop them (see EPROTONOSUPPORT in skb_mac_gso_segment()). The fix for this case would be to not set gso_size at all for SHAMPO packets with header size 0. 2) Packets from a GRO'ed protocol (TCP) are received but immediately flushed because they are not GRO'able (TCP SYN for example). mlx5e_shampo_update_hdr(), which updates the remaining GRO state on the skb, is not called because skb GRO count is 1. The fix here would be to always call mlx5e_shampo_update_hdr(), regardless of skb GRO count. But this call is expensive. The unified fix for both cases is to reset gso_size before calling napi_gro_receive(). It is a change that is more effective (no call to mlx5e_shampo_update_hdr() necessary) and simple (smallest code footprint). Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240603212219.1037656-6-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-05 20:20:46 -07:00
Dragos Tatulea	a64bbd8c28	net/mlx5e: SHAMPO, Fix FCS config when HW GRO on For the following scenario: ethtool --features eth3 rx-gro-hw on ethtool --features eth3 rx-fcs on ethtool --features eth3 rx-fcs off ... there is a firmware error because the driver enables HW GRO first while FCS is still enabled. This patch fixes this by swapping the order of HW GRO and FCS for this specific case. Take LRO into consideration as well for consistency. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240603212219.1037656-5-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-05 20:20:46 -07:00
Dragos Tatulea	fba8334721	net/mlx5e: SHAMPO, Fix invalid WQ linked list unlink When all the strides in a WQE have been consumed, the WQE is unlinked from the WQ linked list (mlx5_wq_ll_pop()). For SHAMPO, it is possible to receive CQEs with 0 consumed strides for the same WQE even after the WQE is fully consumed and unlinked. This triggers an additional unlink for the same wqe which corrupts the linked list. Fix this scenario by accepting 0 sized consumed strides without unlinking the WQE again. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240603212219.1037656-4-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-05 20:20:46 -07:00
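The unlink rule described in the fba8334721 entry can be sketched in isolation. This is a minimal userspace illustration, not the driver's code: `struct wqe_state`, `wqe_handle_completion()`, and the `unlink_count` field are hypothetical names, and the real list removal (`mlx5_wq_ll_pop()`) is reduced to a flag and a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a multi-stride receive WQE. */
struct wqe_state {
	uint16_t num_strides;      /* total strides in the WQE */
	uint16_t consumed_strides; /* strides consumed so far */
	bool linked;               /* still on the WQ linked list? */
	unsigned int unlink_count; /* number of unlinks, kept only for illustration */
};

/*
 * Account for one completion that consumed `cstrides` strides.
 * Returns true if this call unlinked the WQE.
 */
static bool wqe_handle_completion(struct wqe_state *wqe, uint16_t cstrides)
{
	assert(wqe);

	/* A zero-stride completion makes no progress: accept it, never unlink on it. */
	if (cstrides == 0)
		return false;

	wqe->consumed_strides += cstrides;
	if (wqe->consumed_strides < wqe->num_strides)
		return false;

	/* Fully consumed: unlink exactly once. */
	wqe->linked = false;
	wqe->unlink_count++;
	return true;
}
```

In this sketch a late completion with 0 consumed strides returns early, so a WQE that was already fully consumed is not unlinked a second time.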
Dragos Tatulea	70bd03b89f	net/mlx5e: SHAMPO, Fix incorrect page release Under the following conditions: 1) No skb created yet 2) header_size == 0 (no SHAMPO header) 3) (header_index + 1) % MLX5E_SHAMPO_WQ_HEADER_PER_PAGE == 0 (this is the last page fragment of a SHAMPO header page) a new skb is formed with a page that is NOT a SHAMPO header page (it is a regular data page). Further down in the same function (mlx5e_handle_rx_cqe_mpwrq_shampo()), a SHAMPO header page from header_index is released. This is wrong and it leads to SHAMPO header pages being released more than once. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240603212219.1037656-3-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-05 20:20:41 -07:00
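Condition 3 of the 70bd03b89f entry tests whether a header index is the last fragment of its header page, which is a modular check on the index plus one. A minimal C sketch of that check follows; the helper name and the `per_page` parameter are hypothetical (the driver uses the constant MLX5E_SHAMPO_WQ_HEADER_PER_PAGE), and the value 8 in the usage note is taken from the "8 headers per page" example in the e839ac9a89 entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * True when header_index is the last header slot of its page.
 * per_page is the number of header slots per page (hypothetical parameter).
 */
static bool is_last_header_in_page(uint32_t header_index, uint32_t per_page)
{
	assert(per_page != 0);

	/* Parentheses matter: in C, '%' binds tighter than '+'. */
	return (header_index + 1) % per_page == 0;
}
```

With 8 headers per page, indices 7 and 15 are last-in-page while 0 and 8 are not.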
Tariq Toukan	4e92d24741	net/mlx5e: SHAMPO, Use net_prefetch API Let the SHAMPO functions use the net-specific prefetch API, similar to all other usages. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240603212219.1037656-2-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-05 20:12:58 -07:00
Jakub Kicinski	5899c88513	Merge branch 'intel-wired-lan-driver-updates-2024-05-29-ice-igc' Jacob Keller says: ==================== Intel Wired LAN Driver Updates 2024-05-29 (ice, igc) This series includes fixes for the ice driver as well as a fix for the igc driver. Jacob fixes two issues in the ice driver with reading the NVM for providing firmware data via devlink info. First, fix an off-by-one error when reading the Preserved Fields Area, resolving an infinite loop triggered on some NVMs which lack certain data in the NVM. Second, fix the reading of the NVM Shadow RAM on newer E830 and E825-C devices which have a variable sized CSS header rather than assuming this header is always the same fixed size as in the E810 devices. Larysa fixes three issues with the ice driver XDP logic that could occur if the number of queues is changed after enabling an XDP program. First, the af_xdp_zc_qps bitmap is removed and replaced by simpler logic to track whether queues are in zero-copy mode. Second, the reset and .ndo_bpf flows are distinguished to avoid potential races with a PF reset occurring simultaneously to .ndo_bpf callback from userspace. Third, the logic for mapping XDP queues to vectors is fixed so that XDP state is restored for XDP queues after a reconfiguration. Sasha fixes reporting of Energy Efficient Ethernet support via ethtool in the igc driver. v1: https://lore.kernel.org/r/20240530-net-2024-05-30-intel-net-fixes-v1-0-8b11c8c9bff8@intel.com ==================== Link: https://lore.kernel.org/r/20240603-net-2024-05-30-intel-net-fixes-v2-0-e3563aa89b0c@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-05 19:28:06 -07:00
Sasha Neftin	7d67d11fbe	igc: Fix Energy Efficient Ethernet support declaration The commit `01cf893bf0` ("net: intel: i40e/igc: Remove setting Autoneg in EEE capabilities") removed SUPPORTED_Autoneg field but left inappropriate ethtool_keee structure initialization. When "ethtool --show-eee <device>" (get_eee) is invoked, the 'ethtool_keee' structure was accidentally overridden. Remove the 'ethtool_keee' overriding and add EEE declaration as per IEEE specification that allows reporting Energy Efficient Ethernet capabilities. Examples: Before fix: ethtool --show-eee enp174s0 EEE settings for enp174s0: EEE status: not supported After fix: EEE settings for enp174s0: EEE status: disabled Tx LPI: disabled Supported EEE link modes: 100baseT/Full 1000baseT/Full 2500baseT/Full Fixes: `01cf893bf0` ("net: intel: i40e/igc: Remove setting Autoneg in EEE capabilities") Suggested-by: Dima Ruinskiy <dima.ruinskiy@intel.com> Signed-off-by: Sasha Neftin <sasha.neftin@intel.com> Tested-by: Naama Meir <naamax.meir@linux.intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20240603-net-2024-05-30-intel-net-fixes-v2-6-e3563aa89b0c@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-05 19:27:56 -07:00
Larysa Zaremba	f3df404425	ice: map XDP queues to vectors in ice_vsi_map_rings_to_vectors() ice_pf_dcb_recfg() re-maps queues to vectors with ice_vsi_map_rings_to_vectors(), which does not restore the previous state for XDP queues. This leads to no AF_XDP traffic after rebuild. Map XDP queues to vectors in ice_vsi_map_rings_to_vectors(). Also, move the code around, so XDP queues are mapped independently only through .ndo_bpf(). Fixes: `6624e780a5` ("ice: split ice_vsi_setup into smaller functions") Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Chandan Kumar Rout <chandanx.rout@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20240603-net-2024-05-30-intel-net-fixes-v2-5-e3563aa89b0c@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-05 19:27:56 -07:00
Larysa Zaremba	744d197162	ice: add flag to distinguish reset from .ndo_bpf in XDP rings config Commit `6624e780a5` ("ice: split ice_vsi_setup into smaller functions") has placed ice_vsi_free_q_vectors() after ice_destroy_xdp_rings() in the rebuild process. The behaviour of the XDP rings config functions is context-dependent, so the change of order has led to ice_destroy_xdp_rings() doing additional work and removing XDP prog, when it was supposed to be preserved. Also, dependency on the PF state reset flags creates an additional, fortunately less common problem: * PFR is requested e.g. by tx_timeout handler * .ndo_bpf() is asked to delete the program, calls ice_destroy_xdp_rings(), but reset flag is set, so rings are destroyed without deleting the program * ice_vsi_rebuild tries to delete non-existent XDP rings, because the program is still on the VSI * system crashes With a similar race, when requested to attach a program, ice_prepare_xdp_rings() can actually skip setting the program in the VSI and nevertheless report success. Instead of reverting to the old order of function calls, add an enum argument to both ice_prepare_xdp_rings() and ice_destroy_xdp_rings() in order to distinguish between calls from rebuild and .ndo_bpf(). Fixes: `efc2214b60` ("ice: Add support for XDP") Reviewed-by: Igor Bagnucki <igor.bagnucki@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Chandan Kumar Rout <chandanx.rout@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20240603-net-2024-05-30-intel-net-fixes-v2-4-e3563aa89b0c@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-05 19:27:56 -07:00
Larysa Zaremba	adbf5a4234	ice: remove af_xdp_zc_qps bitmap Referenced commit has introduced a bitmap to distinguish between ZC and copy-mode AF_XDP queues, because xsk_get_pool_from_qid() does not do this for us. The bitmap would be especially useful when restoring previous state after rebuild, if only it was not reallocated in the process. This leads to e.g. xdpsock dying after changing number of queues. Instead of preserving the bitmap during the rebuild, remove it completely and distinguish between ZC and copy-mode queues based on the presence of a device associated with the pool. Fixes: `e102db780e` ("ice: track AF_XDP ZC enabled queues in bitmap") Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Chandan Kumar Rout <chandanx.rout@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20240603-net-2024-05-30-intel-net-fixes-v2-3-e3563aa89b0c@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-05 19:27:56 -07:00
Jacob Keller	cfa747a66e	ice: fix reads from NVM Shadow RAM on E830 and E825-C devices The ice driver reads data from the Shadow RAM portion of the NVM during initialization, including data used to identify the NVM image and device, such as the ETRACK ID used to populate devlink dev info fw.bundle. Currently it is using a fixed offset defined by ICE_CSS_HEADER_LENGTH to compute the appropriate offset. This worked fine for E810 and E822 devices which both have CSS header length of 330 words. Other devices, including both E825-C and E830 devices have different sizes for their CSS header. The use of a hard coded value results in the driver reading from the wrong block in the NVM when attempting to access the Shadow RAM copy. This results in the driver reporting the fw.bundle as 0x0 in both the devlink dev info and ethtool -i output. The first E830 support was introduced by commit `ba20ecb1d1` ("ice: Hook up 4 E830 devices by adding their IDs") and the first E825-C support was introduced by commit `f64e189442` ("ice: introduce new E825C devices family"). The NVM actually contains the CSS header length embedded in it. Remove the hard coded value and replace it with logic to read the length from the NVM directly. This is more resilient against all existing and future hardware, vs looking up the expected values from a table. It ensures the driver will read from the appropriate place when determining the ETRACK ID value used for populating the fw.bundle_id and for reporting in ethtool -i. The CSS header length for both the active and inactive flash bank is stored in the ice_bank_info structure to avoid unnecessary duplicate work when accessing multiple words of the Shadow RAM. Both banks are read in the unlikely event that the header length is different for the NVM in the inactive bank, rather than being different only by the overall device family. 
Fixes: `ba20ecb1d1` ("ice: Hook up 4 E830 devices by adding their IDs") Co-developed-by: Paul Greenwalt <paul.greenwalt@intel.com> Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20240603-net-2024-05-30-intel-net-fixes-v2-2-e3563aa89b0c@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-05 19:27:55 -07:00
Jacob Keller	03e4a092be	ice: fix iteration of TLVs in Preserved Fields Area The ice_get_pfa_module_tlv() function iterates over the Type-Length-Value structures in the Preserved Fields Area (PFA) of the NVM. This is used by the driver to access data such as the Part Board Assembly identifier. The function uses simple logic to iterate over the PFA. First, the pointer to the PFA in the NVM is read. Then the total length of the PFA is read from the first word. A pointer to the first TLV is initialized, and a simple loop iterates over each TLV. The pointer is moved forward through the NVM until it exceeds the PFA area. The logic seems sound, but it is missing a key detail. The Preserved Fields Area length includes one additional final word. This is documented in the device data sheet as a dummy word which contains 0xFFFF. All NVMs have this extra word. If the driver tries to scan for a TLV that is not in the PFA, it will read past the size of the PFA. It reads and interprets the last dummy word of the PFA as a TLV with type 0xFFFF. It then reads the word following the PFA as a length. The PFA resides within the Shadow RAM portion of the NVM, which is relatively small. All of its offsets are within a 16-bit size. The PFA pointer and TLV pointer are stored by the driver as 16-bit values. In almost all cases, the word following the PFA will be such that interpreting it as a length will result in 16-bit arithmetic overflow. Once overflowed, the new next_tlv value is now below the maximum offset of the PFA. Thus, the driver will continue to iterate the data as TLVs. In the worst case, the driver hits on a sequence of reads which loop back to reading the same offsets in an endless loop. To fix this, we need to correct the loop iteration check to account for this extra word at the end of the PFA. This alone is sufficient to resolve the known cases of this issue in the field. 
However, it is plausible that an NVM could be misconfigured or have corrupt data which results in the same kind of overflow. Protect against this by using check_add_overflow when calculating both the maximum offset of the TLVs, and when calculating the next_tlv offset at the end of each loop iteration. This ensures that the driver will not get stuck in an infinite loop when scanning the PFA. Fixes: `e961b679fb` ("ice: add board identifier info to devlink .info_get") Co-developed-by: Paul Greenwalt <paul.greenwalt@intel.com> Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20240603-net-2024-05-30-intel-net-fixes-v2-1-e3563aa89b0c@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-05 19:27:55 -07:00
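The bounded TLV scan described in the 03e4a092be entry can be sketched in userspace C. This is an illustration under stated assumptions, not the ice driver's implementation: `struct nvm_image`, `read_word()`, `add_overflow_u16()` (a stand-in for the kernel's check_add_overflow) and `find_pfa_tlv()` are hypothetical names. The layout is also assumed: the word at `pfa_ptr` holds the PFA length in words, counting both the length word itself and the trailing dummy word, and each TLV is a type word, a length word, then `len` value words.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory stand-in for the Shadow RAM: an array of 16-bit words. */
struct nvm_image {
	const uint16_t *words;
	size_t nwords;
};

/* 16-bit add that reports overflow; userspace stand-in for check_add_overflow(). */
static bool add_overflow_u16(uint16_t a, uint16_t b, uint16_t *sum)
{
	uint32_t wide = (uint32_t)a + b;

	*sum = (uint16_t)wide;
	return wide > UINT16_MAX;
}

/* Bounds-checked word read; returns false if off is outside the image. */
static bool read_word(const struct nvm_image *nvm, uint16_t off, uint16_t *out)
{
	if (off >= nvm->nwords)
		return false;
	*out = nvm->words[off];
	return true;
}

/* Returns 0 and sets *tlv_off if wanted_type is found, -1 otherwise. */
static int find_pfa_tlv(const struct nvm_image *nvm, uint16_t pfa_ptr,
			uint16_t wanted_type, uint16_t *tlv_off)
{
	uint16_t pfa_len, max_tlv, next_tlv;

	assert(nvm && tlv_off);

	if (!read_word(nvm, pfa_ptr, &pfa_len) || pfa_len == 0)
		return -1;

	/* Exclude the trailing dummy word when computing the scan limit. */
	if (add_overflow_u16(pfa_ptr, (uint16_t)(pfa_len - 1), &max_tlv))
		return -1;
	if (add_overflow_u16(pfa_ptr, 1, &next_tlv))
		return -1;

	while (next_tlv < max_tlv) {
		uint16_t type, len, len_off;

		if (add_overflow_u16(next_tlv, 1, &len_off) ||
		    !read_word(nvm, next_tlv, &type) ||
		    !read_word(nvm, len_off, &len))
			return -1;

		if (type == wanted_type) {
			*tlv_off = next_tlv;
			return 0;
		}

		/* Skip type word, length word and value words; stop rather than wrap. */
		if (add_overflow_u16(next_tlv, 2, &next_tlv) ||
		    add_overflow_u16(next_tlv, len, &next_tlv))
			return -1;
	}

	return -1;
}
```

Because every offset update goes through the overflow-checked add, a corrupt length word ends the scan with an error instead of wrapping `next_tlv` back below the limit and looping forever.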


1280457 Commits