Menglong Dong says:
====================
net: skb: check the boundrary of skb drop reason
In the commit 1330b6ef33 ("skb: make drop reason booleanable"),
SKB_NOT_DROPPED_YET is added to the enum skb_drop_reason, which makes
the invalid drop reason SKB_NOT_DROPPED_YET can leak to the kfree_skb
tracepoint. Once this happen (it happened, as 4th patch says), it can
cause NULL pointer in drop monitor and result in kernel panic.
Therefore, check the boundrary of drop reason in both kfree_skb_reason
(2th patch) and drop monitor (1th patch) to prevent such case happens
again.
Meanwhile, fix the invalid drop reason passed to kfree_skb_reason() in
tcp_v4_rcv() and tcp_v6_rcv().
Changes since v2:
1/4 - don't reset the reason and print the debug warning only (Jakub
Kicinski)
4/4 - remove new lines between tags
Changes since v1:
- consider tcp_v6_rcv() in the 4th patch
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The 'drop_reason' that passed to kfree_skb_reason() in tcp_v4_rcv()
and tcp_v6_rcv() can be SKB_NOT_DROPPED_YET(0), as it is used as the
return value of tcp_inbound_md5_hash().
And it can panic the kernel with NULL pointer in
net_dm_packet_report_size() if the reason is 0, as drop_reasons[0]
is NULL.
Fixes: 1330b6ef33 ("skb: make drop reason booleanable")
Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The SKB_DR_OR() is used to set the drop reason to a value when it is
not set or specified yet. SKB_NOT_DROPPED_YET should also be considered
as not set.
Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sometimes, we may forget to reset skb drop reason to NOT_SPECIFIED after
we make it the return value of the functions with return type of enum
skb_drop_reason, such as tcp_inbound_md5_hash. Therefore, its value can
be SKB_NOT_DROPPED_YET(0), which is invalid for kfree_skb_reason().
So we check the range of drop reason in kfree_skb_reason() with
DEBUG_NET_WARN_ON_ONCE().
Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The 'reason' will be set to 'SKB_DROP_REASON_NOT_SPECIFIED' if it not
small that SKB_DROP_REASON_MAX in net_dm_packet_trace_kfree_skb_hit(),
but it can't avoid it to be 0, which is invalid and can cause NULL
pointer in drop_reasons.
Therefore, reset it to SKB_DROP_REASON_NOT_SPECIFIED when 'reason <= 0'.
Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Connect with O_NONBLOCK will not be completed immediately
and returns -EINPROGRESS. It is possible to use selector/poll
for completion by selecting the socket for writing. After select
indicates writability, a second connect function call will return
0 to indicate connected successfully as TCP does, but smc returns
-EISCONN. Use socket state for smc to indicate connect state, which
can help smc aligning the connect behaviour with TCP.
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet says:
====================
net: add annotations for sk->sk_bound_dev_if
While writes on sk->sk_bound_dev_if are protected by socket lock,
we have many lockless reads all over the places.
This is based on syzbot report found in the first patch changelog.
v2: inline ipv6 function only defined if IS_ENABLED(CONFIG_IPV6) (kernel bots)
Change the INET6_MATCH() to inet6_match(), this is no longer a macro.
Change INET_MATCH() to inet_match() (Olivier Hartkopp & Jakub Kicinski)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This is no longer a macro, but an inlined function.
INET_MATCH() -> inet_match()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Olivier Hartkopp <socketcan@hartkopp.net>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
INET6_MATCH() runs without holding a lock on the socket.
We probably need to annotate most reads.
This patch makes INET6_MATCH() an inline function
to ease our changes.
v2: inline function only defined if IS_ENABLED(CONFIG_IPV6)
Change the name to inet6_match(), this is no longer a macro.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use READ_ONCE() in paths not holding the socket lock.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk->sk_bound_dev_if can change under us, use READ_ONCE() annotation.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
inet_csk_bind_conflict() can access sk->sk_bound_dev_if for
unlocked sockets.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When reading listener sk->sk_bound_dev_if locklessly,
we must use READ_ONCE().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sock_bindtoindex_locked() needs to use WRITE_ONCE(sk->sk_bound_dev_if, val),
because other cpus/threads might locklessly read this field.
sock_getbindtodevice(), sock_getsockopt() need READ_ONCE()
because they run without socket lock held.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
inet_request_bound_dev_if() reads sk->sk_bound_dev_if twice
while listener socket is not locked.
Another cpu could change this field under us.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sctp_rcv() reads sk->sk_bound_dev_if twice while the socket
is not locked. Another cpu could change this field under us.
Fixes: 0fd9a65a76 ("[SCTP] Support SO_BINDTODEVICE socket option on incoming packets.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
UDP sendmsg() is lockless, and reads sk->sk_bound_dev_if while
this field can be changed by another thread.
Adds minimal annotations to avoid KCSAN splats for UDP.
Following patches will add more annotations to potential lockless readers.
BUG: KCSAN: data-race in __ip6_datagram_connect / udpv6_sendmsg
write to 0xffff888136d47a94 of 4 bytes by task 7681 on cpu 0:
__ip6_datagram_connect+0x6e2/0x930 net/ipv6/datagram.c:221
ip6_datagram_connect+0x2a/0x40 net/ipv6/datagram.c:272
inet_dgram_connect+0x107/0x190 net/ipv4/af_inet.c:576
__sys_connect_file net/socket.c:1900 [inline]
__sys_connect+0x197/0x1b0 net/socket.c:1917
__do_sys_connect net/socket.c:1927 [inline]
__se_sys_connect net/socket.c:1924 [inline]
__x64_sys_connect+0x3d/0x50 net/socket.c:1924
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x2b/0x50 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
read to 0xffff888136d47a94 of 4 bytes by task 7670 on cpu 1:
udpv6_sendmsg+0xc60/0x16e0 net/ipv6/udp.c:1436
inet6_sendmsg+0x5f/0x80 net/ipv6/af_inet6.c:652
sock_sendmsg_nosec net/socket.c:705 [inline]
sock_sendmsg net/socket.c:725 [inline]
____sys_sendmsg+0x39a/0x510 net/socket.c:2413
___sys_sendmsg net/socket.c:2467 [inline]
__sys_sendmmsg+0x267/0x4c0 net/socket.c:2553
__do_sys_sendmmsg net/socket.c:2582 [inline]
__se_sys_sendmmsg net/socket.c:2579 [inline]
__x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x2b/0x50 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
value changed: 0x00000000 -> 0xffffff9b
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 7670 Comm: syz-executor.3 Tainted: G W 5.18.0-rc1-syzkaller-dirty #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
I chose to not add Fixes: tag because race has minor consequences
and stable teams busy enough.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet says:
====================
tcp: BIG TCP implementation
This series implements BIG TCP as presented in netdev 0x15:
https://netdevconf.info/0x15/session.html?BIG-TCP
Jonathan Corbet made a nice summary: https://lwn.net/Articles/884104/
Standard TSO/GRO packet limit is 64KB
With BIG TCP, we allow bigger TSO/GRO packet sizes for IPv6 traffic.
Note that this feature is by default not enabled, because it might
break some eBPF programs assuming TCP header immediately follows IPv6 header.
While tcpdump recognizes the HBH/Jumbo header, standard pcap filters
are unable to skip over IPv6 extension headers.
Reducing number of packets traversing networking stack usually improves
performance, as shown on this experiment using a 100Gbit NIC, and 4K MTU.
'Standard' performance with current (74KB) limits.
for i in {1..10}; do ./netperf -t TCP_RR -H iroa23 -- -r80000,80000 -O MIN_LATENCY,P90_LATENCY,P99_LATENCY,THROUGHPUT|tail -1; done
77 138 183 8542.19
79 143 178 8215.28
70 117 164 9543.39
80 144 176 8183.71
78 126 155 9108.47
80 146 184 8115.19
71 113 165 9510.96
74 113 164 9518.74
79 137 178 8575.04
73 111 171 9561.73
Now enable BIG TCP on both hosts.
ip link set dev eth0 gro_max_size 185000 gso_max_size 185000
for i in {1..10}; do ./netperf -t TCP_RR -H iroa23 -- -r80000,80000 -O MIN_LATENCY,P90_LATENCY,P99_LATENCY,THROUGHPUT|tail -1; done
57 83 117 13871.38
64 118 155 11432.94
65 116 148 11507.62
60 105 136 12645.15
60 103 135 12760.34
60 102 134 12832.64
62 109 132 10877.68
58 82 115 14052.93
57 83 124 14212.58
57 82 119 14196.01
We see an increase of transactions per second, and lower latencies as well.
v7: adopt unsafe_memcpy() in mlx5 to avoid FORTIFY warnings.
v6: fix a compilation error for CONFIG_IPV6=n in
"net: allow gso_max_size to exceed 65536", reported by kernel bots.
v5: Replaced two patches (that were adding new attributes) with patches
from Alexander Duyck. Idea is to reuse existing gso_max_size/gro_max_size
v4: Rebased on top of Jakub series (Merge branch 'tso-gso-limit-split')
max_tso_size is now family independent.
v3: Fixed a typo in RFC number (Alexander)
Added Reviewed-by: tags from Tariq on mlx4/mlx5 parts.
v2: Removed the MAX_SKB_FRAGS change, this belongs to a different series.
Addressed feedback, for Alexander and nvidia folks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
mlx5 supports LSOv2.
IPv6 gro/tcp stacks insert a temporary Hop-by-Hop header
with JUMBO TLV for big packets.
We need to ignore/skip this HBH header when populating TX descriptor.
Note that ipv6_has_hopopt_jumbo() only recognizes very specific packet
layout, thus mlx5e_sq_xmit_wqe() is taking care of this layout only.
v7: adopt unsafe_memcpy() and MLX5_UNSAFE_MEMCPY_DISCLAIMER
v2: clear hopbyhop in mlx5e_tx_get_gso_ihs()
v4: fix compile error for CONFIG_MLX5_CORE_IPOIB=y
Signed-off-by: Coco Li <lixiaoyan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Cc: Saeed Mahameed <saeedm@nvidia.com>
Cc: Leon Romanovsky <leon@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
mlx4 supports LSOv2 just fine.
IPv6 stack inserts a temporary Hop-by-Hop header
with JUMBO TLV for big packets.
We need to ignore the HBH header when populating TX descriptor.
Tested:
Before: (not enabling bigger TSO/GRO packets)
ip link set dev eth0 gso_max_size 65536 gro_max_size 65536
netperf -H lpaa18 -t TCP_RR -T2,2 -l 10 -Cc -- -r 70000,70000
MIGRATED TCP REQUEST/RESPONSE TEST from ::0 (::) port 0 AF_INET6 to lpaa18.prod.google.com () port 0 AF_INET6 : first burst 0 : cpu bind
Local /Remote
Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem
Send Recv Size Size Time Rate local remote local remote
bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr
262144 540000 70000 70000 10.00 6591.45 0.86 1.34 62.490 97.446
262144 540000
After: (enabling bigger TSO/GRO packets)
ip link set dev eth0 gso_max_size 185000 gro_max_size 185000
netperf -H lpaa18 -t TCP_RR -T2,2 -l 10 -Cc -- -r 70000,70000
MIGRATED TCP REQUEST/RESPONSE TEST from ::0 (::) port 0 AF_INET6 to lpaa18.prod.google.com () port 0 AF_INET6 : first burst 0 : cpu bind
Local /Remote
Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem
Send Recv Size Size Time Rate local remote local remote
bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr
262144 540000 70000 70000 10.00 8383.95 0.95 1.01 54.432 57.584
262144 540000
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Acked-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Set the TSO driver limit to GSO_MAX_SIZE (512 KB).
This allows the admin/user to set a GSO limit up to this value.
ip link set dev veth10 gso_max_size 200000
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of simply forcing a 0 payload_len in IPv6 header,
implement RFC 2675 and insert a custom extension header.
Note that only TCP stack is currently potentially generating
jumbograms, and that this extension header is purely local,
it wont be sent on a physical link.
This is needed so that packet capture (tcpdump and friends)
can properly dissect these large packets.
Signed-off-by: Coco Li <lixiaoyan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow the gro_max_size to exceed a value larger than 65536.
There weren't really any external limitations that prevented this other
than the fact that IPv4 only supports a 16 bit length field. Since we have
the option of adding a hop-by-hop header for IPv6 we can allow IPv6 to
exceed this value and for IPv4 and non-TCP flows we can cap things at 65536
via a constant rather than relying on gro_max_size.
[edumazet] limit GRO_MAX_SIZE to (8 * 65535) to avoid overflows.
Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Following patch will add GRO_IPV6_MAX_SIZE, allowing gro to build
BIG TCP ipv6 packets (bigger than 64K).
This patch changes ipv6_gro_complete() to insert a HBH/jumbo header
so that resulting packet can go through IPv6/TCP stacks.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipv6 tcp and gro stacks will soon be able to build big TCP packets,
with an added temporary Hop By Hop header.
If GSO is involved for these large packets, we need to remove
the temporary HBH header before segmentation happens.
v2: perform HBH removal from ipv6_gso_segment() instead of
skb_segment() (Alexander feedback)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Following patches will need to add and remove local IPv6 jumbogram
options to enable BIG TCP.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
hystart_ack_delay() had the assumption that a TSO packet
would not be bigger than GSO_MAX_SIZE.
This will no longer be true.
We should use sk->sk_gso_max_size instead.
This reduces chances of spurious Hystart ACK train detections.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make sure we will not overflow shinfo->gso_segs
Minimal TCP MSS size is 8 bytes, and shinfo->gso_segs
is a 16bit field.
TCP_MIN_GSO_SIZE is currently defined in include/net/tcp.h,
it seems cleaner to not bring tcp details into include/linux/netdevice.h
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The code for gso_max_size was added originally to allow for debugging and
workaround of buggy devices that couldn't support TSO with blocks 64K in
size. The original reason for limiting it to 64K was because that was the
existing limits of IPv4 and non-jumbogram IPv6 length fields.
With the addition of Big TCP we can remove this limit and allow the value
to potentially go up to UINT_MAX and instead be limited by the tso_max_size
value.
So in order to support this we need to go through and clean up the
remaining users of the gso_max_size value so that the values will cap at
64K for non-TCPv6 flows. In addition we can clean up the GSO_MAX_SIZE value
so that 64K becomes GSO_LEGACY_MAX_SIZE and UINT_MAX will now be the upper
limit for GSO_MAX_SIZE.
v6: (edumazet) fixed a compile error if CONFIG_IPV6=n,
in a new sk_trim_gso_size() helper.
netif_set_tso_max_size() caps the requested TSO size
with GSO_MAX_SIZE.
Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
New netlink attributes IFLA_TSO_MAX_SIZE and IFLA_TSO_MAX_SEGS
are used to report to user-space the device TSO limits.
ip -d link sh dev eth1
...
tso_max_size 65536 tso_max_segs 65535
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Phil Edworthy says:
====================
Add Renesas RZ/V2M Ethernet support
The RZ/V2M Ethernet is very similar to R-Car Gen3 Ethernet-AVB, though
some small parts are the same as R-Car Gen2.
Other differences are:
* It has separate data (DI), error (Line 1) and management (Line 2) irqs
rather than one irq for all three.
* Instead of using the High-speed peripheral bus clock for gPTP, it has
a separate gPTP reference clock.
v4:
* Add clk_disable_unprepare() for gptp ref clk
v3:
* Really renamed irq_en_dis_regs to irq_en_dis this time
* Modified ravb_ptp_extts() to use irq_en_dis
* Added Reviewed-by tags
v2:
* Just net patches in this series
* Instead of reusing ch22 and ch24 interrupt names, use the proper names
* Renamed irq_en_dis_regs to irq_en_dis
* Squashed use of GIC reg versus GIE/GID and got rid of separate gptp_ptm_gic feature.
* Move err_mgmt_irqs code under multi_irqs
* Minor editing of the commit msgs
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
RZ/V2M Ethernet is very similar to R-Car Gen3 Ethernet-AVB, though
some small parts are the same as R-Car Gen2.
Other differences to R-Car Gen3 and Gen2 are:
* It has separate data (DI), error (Line 1) and management (Line 2) irqs
rather than one irq for all three.
* Instead of using the High-speed peripheral bus clock for gPTP, it has a
separate gPTP reference clock.
Signed-off-by: Phil Edworthy <phil.edworthy@renesas.com>
Reviewed-by: Biju Das <biju.das.jz@bp.renesas.com>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
RZ/V2M has a separate gPTP reference clock that is used when the
AVB-DMAC Mode Register (CCC) gPTP Clock Select (CSEL) bits are
set to "01: High-speed peripheral bus clock".
Therefore, add a feature that allows this clock to be used for
gPTP.
Signed-off-by: Phil Edworthy <phil.edworthy@renesas.com>
Reviewed-by: Biju Das <biju.das.jz@bp.renesas.com>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
R-Car has a combined interrupt line, ch22 = Line0_DiA | Line1_A | Line2_A.
RZ/V2M has separate interrupt lines for each of these, so add a feature
that allows the driver to get these interrupts and call the common handler.
Signed-off-by: Phil Edworthy <phil.edworthy@renesas.com>
Reviewed-by: Biju Das <biju.das.jz@bp.renesas.com>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, when the HW has a single interrupt, the driver uses the
GIC, TIC, RIC0 registers to enable and disable interrupts.
When the HW has multiple interrupts, it uses the GIE, GID, TIE, TID,
RIE0, RID0 registers.
However, other devices, e.g. RZ/V2M, have multiple irqs and only have
the GIC, TIC, RIC0 registers.
Therefore, split this into a separate feature.
Signed-off-by: Phil Edworthy <phil.edworthy@renesas.com>
Reviewed-by: Biju Das <biju.das.jz@bp.renesas.com>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
Document the Ethernet AVB IP found on RZ/V2M SoC.
It includes the Ethernet controller (E-MAC) and Dedicated Direct memory
access controller (DMAC) for transferring transmitted Ethernet frames
to and received Ethernet frames from respective storage areas in the
RAM at high speed.
The AVB-DMAC is compliant with IEEE 802.1BA, IEEE 802.1AS timing and
synchronization protocol, IEEE 802.1Qav real-time transfer, and the
IEEE 802.1Qat stream reservation protocol.
R-Car has a pair of combined interrupt lines:
ch22 = Line0_DiA | Line1_A | Line2_A
ch23 = Line0_DiB | Line1_B | Line2_B
Line0 for descriptor interrupts (which we call dia and dib).
Line1 for error related interrupts (which we call err_a and err_b).
Line2 for management and gPTP related interrupts (mgmt_a and mgmt_b).
RZ/V2M hardware has separate interrupt lines for each of these.
It has 3 clocks; the main AXI clock, the AMBA CHI (Coherent Hub
Interface) clock and a gPTP reference clock.
Signed-off-by: Phil Edworthy <phil.edworthy@renesas.com>
Reviewed-by: Biju Das <biju.das.jz@bp.renesas.com>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
This is v2 including deadlock fix in conntrack ecache rework
reported by Jakub Kicinski.
The following patchset contains Netfilter updates for net-next,
mostly updates to conntrack from Florian Westphal.
1) Add a dedicated list for conntrack event redelivery.
2) Include event redelivery list in conntrack dumps of dying type.
3) Remove per-cpu dying list for event redelivery, not used anymore.
4) Add netns .pre_exit to cttimeout to zap timeout objects before
synchronize_rcu() call.
5) Remove nf_ct_unconfirmed_destroy.
6) Add generation id for conntrack extensions for conntrack
timeout and helpers.
7) Detach timeout policy from conntrack on cttimeout module removal.
8) Remove __nf_ct_unconfirmed_destroy.
9) Remove unconfirmed list.
10) Remove unconditional local_bh_disable in init_conntrack().
11) Consolidate conntrack iterator nf_ct_iterate_cleanup().
12) Detect if ctnetlink listeners exist to short-circuit event
path early.
13) Un-inline nf_ct_ecache_ext_add().
14) Add nf_conntrack_events autodetect ctnetlink listener mode
and make it default.
15) Add nf_ct_ecache_exist() to check for event cache extension.
16) Extend flowtable reverse route lookup to include source, iif,
tos and mark, from Sven Auhagen.
17) Do not verify zero checksum UDP packets in nf_reject,
from Kevin Mitchell.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When ADQ queue groups (TCs) are created via tc mqprio command,
RSS contexts and associated RSS indirection tables are configured
automatically per TC based on the queue ranges specified for
each traffic class.
For ex:
tc qdisc add dev enp175s0f0 root mqprio num_tc 3 map 0 1 2 \
queues 2@0 8@2 4@10 hw 1 mode channel
will create 3 queue groups (TC 0-2) with queue ranges 2, 8 and 4
in 3 queue groups. Each queue group is associated with its
own RSS context and RSS indirection table.
Add support to expose RSS indirection tables for all ADQ queue
groups using ethtool RSS contexts interface.
ethtool -x enp175s0f0 context <tc-num>
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Sudheer Mogilappagari <sudheer.mogilappagari@intel.com>
Tested-by: Bharathi Sreenivas <bharathi.sreenivas@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://lore.kernel.org/r/20220512213249.3747424-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add the capability to map non-linear xdp frames in XDP_TX and ndo_xdp_xmit
callback.
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://lore.kernel.org/r/20220512212621.3746140-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Remove napi_weight statics which are set to 64 and never modified,
remnants of the out-of-tree napi_weight module param.
Acked-by: Edward Cree <ecree.xilinx@gmail.com>
Link: https://lore.kernel.org/r/20220512205603.1536771-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2022-05-13
1) Cleanups for the code behind the XFRM offload API. This is a
preparation for the extension of the API for policy offload.
From Leon Romanovsky.
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next:
xfrm: drop not needed flags variable in XFRM offload struct
net/mlx5e: Use XFRM state direction instead of flags
netdevsim: rely on XFRM state direction instead of flags
ixgbe: propagate XFRM offload state direction instead of flags
xfrm: store and rely on direction to construct offload flags
xfrm: rename xfrm_state_offload struct to allow reuse
xfrm: delete not used number of external headers
xfrm: free not used XFRM_ESP_NO_TRAILER flag
====================
Link: https://lore.kernel.org/r/20220513151218.4010119-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If CONFIG_PTP_1588_CLOCK=m and CONFIG_SFC_SIENA=y, the siena driver will fail to link:
drivers/net/ethernet/sfc/siena/ptp.o: In function `efx_ptp_remove_channel':
ptp.c:(.text+0xa28): undefined reference to `ptp_clock_unregister'
drivers/net/ethernet/sfc/siena/ptp.o: In function `efx_ptp_probe_channel':
ptp.c:(.text+0x13a0): undefined reference to `ptp_clock_register'
ptp.c:(.text+0x1470): undefined reference to `ptp_clock_unregister'
drivers/net/ethernet/sfc/siena/ptp.o: In function `efx_ptp_pps_worker':
ptp.c:(.text+0x1d29): undefined reference to `ptp_clock_event'
drivers/net/ethernet/sfc/siena/ptp.o: In function `efx_siena_ptp_get_ts_info':
ptp.c:(.text+0x301b): undefined reference to `ptp_clock_index'
To fix this build error, make SFC_SIENA depends on PTP_1588_CLOCK.
Reported-by: Hulk Robot <hulkci@huawei.com>
Fixes: d48523cb88 ("sfc: Copy shared files needed for Siena (part 2)")
Signed-off-by: Ren Zhijie <renzhijie2@huawei.com>
Acked-by: Martin Habets <habetsm.xilinx@gmail.com>
Link: https://lore.kernel.org/r/20220513012721.140871-1-renzhijie2@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The checksum is optional for UDP packets. However nf_reject would
previously require a valid checksum to elicit a response such as
ICMP_DEST_UNREACH.
Add some logic to nf_reject_verify_csum to determine if a UDP packet has
a zero checksum and should therefore not be verified.
Signed-off-by: Kevin Mitchell <kevmitch@arista.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When creating a flow table entry, the reverse route is looked
up based on the current packet.
There can be scenarios where the user creates a custom ip rule
to route the traffic differently.
In order to support those scenarios, the lookup needs to add
more information based on the current packet.
The patch adds multiple new information to the route lookup.
Signed-off-by: Sven Auhagen <sven.auhagen@voleatech.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The pointer check usually results in a 'false positive': its likely
that the ctnetlink module is loaded but no event monitoring is enabled.
After recent change to autodetect ctnetlink usage and only allocate
the ecache extension if a listener is active, check if the extension
is present on a given conntrack.
If its not there, there is nothing to report and calls to the
notification framework can be elided.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This adds the new nf_conntrack_events=2 mode and makes it the
default.
This leverages the earlier flag in struct net to allow to avoid
the event extension as long as no event listener is active in
the namespace.
This avoids, for most cases, allocation of ct->ext area.
A followup patch will take further advantage of this by avoiding
calls down into the event framework if the extension isn't present.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Only called when new ct is allocated or the extension isn't present.
This function will be extended, place this in the conntrack module
instead of inlining.
The callers already depend on nf_conntrack module.
Return value is changed to bool, noone used the returned pointer.
Make sure that the core drops the newly allocated conntrack
if the extension is requested but can't be added.
This makes it necessary to ifdef the section, as the stub
always returns false we'd drop every new conntrack if the
the ecache extension is disabled in kconfig.
Add from data path (xt_CT, nft_ct) is unchanged.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
At this time, every new conntrack gets the 'event cache extension'
enabled for it.
This is because the 'net.netfilter.nf_conntrack_events' sysctl defaults
to 1.
Changing the default to 0 means that commands that rely on the event
notification extension, e.g. 'conntrack -E' or conntrackd, stop working.
We COULD detect if there is a listener by means of
'nfnetlink_has_listeners()' and only add the extension if this is true.
The downside is a dependency from conntrack module to nfnetlink module.
This adds a different way: inc/dec a counter whenever a ctnetlink group
is being (un)subscribed and toggle a flag in struct net.
Next patches will take advantage of this and will only add the event
extension if the flag is set.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds a structure to collect all the context data that is
passed to the cleanup iterator.
struct nf_ct_iter_data {
struct net *net;
void *data;
u32 portid;
int report;
};
There is a netns field that allows to clean up conntrack entries
specifically owned by the specified netns.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>