This patch adds the dataplane hardware offload to the flowtable
infrastructure. Three new flags represent the hardware state of this
flow:
* FLOW_OFFLOAD_HW: This flow entry resides in the hardware.
* FLOW_OFFLOAD_HW_DYING: This flow entry has been scheduled to be remove
from hardware. This might be triggered by either packet path (via TCP
RST/FIN packet) or via aging.
* FLOW_OFFLOAD_HW_DEAD: This flow entry has been already removed from
the hardware, the software garbage collector can remove it from the
software flowtable.
This patch supports for:
* IPv4 only.
* Aging via FLOW_CLS_STATS, no packet and byte counter synchronization
at this stage.
This patch also adds the action callback that specifies how to convert
the flow entry into the flow_rule object that is passed to the driver.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the NFTA_FLOWTABLE_FLAGS attribute that allows users to
specify the NF_FLOWTABLE_HW_OFFLOAD flag. This patch also adds a new
setup interface for the flowtable type to perform the flowtable offload
block callback configuration.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
One conflict in the BPF samples Makefile, some fixes in 'net' whilst
we were converting over to Makefile.target rules in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
While looking at a syzbot KCSAN report [1], I found multiple
issues in this code :
1) fib6_nh->last_probe has an initial value of 0.
While probably okay on 64bit kernels, this causes an issue
on 32bit kernels since the time_after(jiffies, 0 + interval)
might be false ~24 days after boot (for HZ=1000)
2) The data-race found by KCSAN
I could use READ_ONCE() and WRITE_ONCE(), but we also can
take the opportunity of not piling-up too many rt6_probe_deferred()
works by using instead cmpxchg() so that only one cpu wins the race.
[1]
BUG: KCSAN: data-race in find_match / find_match
write to 0xffff8880bb7aabe8 of 8 bytes by interrupt on cpu 1:
rt6_probe net/ipv6/route.c:663 [inline]
find_match net/ipv6/route.c:757 [inline]
find_match+0x5bd/0x790 net/ipv6/route.c:733
__find_rr_leaf+0xe3/0x780 net/ipv6/route.c:831
find_rr_leaf net/ipv6/route.c:852 [inline]
rt6_select net/ipv6/route.c:896 [inline]
fib6_table_lookup+0x383/0x650 net/ipv6/route.c:2164
ip6_pol_route+0xee/0x5c0 net/ipv6/route.c:2200
ip6_pol_route_output+0x48/0x60 net/ipv6/route.c:2452
fib6_rule_lookup+0x3d6/0x470 net/ipv6/fib6_rules.c:117
ip6_route_output_flags_noref+0x16b/0x230 net/ipv6/route.c:2484
ip6_route_output_flags+0x50/0x1a0 net/ipv6/route.c:2497
ip6_dst_lookup_tail+0x25d/0xc30 net/ipv6/ip6_output.c:1049
ip6_dst_lookup_flow+0x68/0x120 net/ipv6/ip6_output.c:1150
inet6_csk_route_socket+0x2f7/0x420 net/ipv6/inet6_connection_sock.c:106
inet6_csk_xmit+0x91/0x1f0 net/ipv6/inet6_connection_sock.c:121
__tcp_transmit_skb+0xe81/0x1d60 net/ipv4/tcp_output.c:1169
tcp_transmit_skb net/ipv4/tcp_output.c:1185 [inline]
tcp_xmit_probe_skb+0x19b/0x1d0 net/ipv4/tcp_output.c:3735
read to 0xffff8880bb7aabe8 of 8 bytes by interrupt on cpu 0:
rt6_probe net/ipv6/route.c:657 [inline]
find_match net/ipv6/route.c:757 [inline]
find_match+0x521/0x790 net/ipv6/route.c:733
__find_rr_leaf+0xe3/0x780 net/ipv6/route.c:831
find_rr_leaf net/ipv6/route.c:852 [inline]
rt6_select net/ipv6/route.c:896 [inline]
fib6_table_lookup+0x383/0x650 net/ipv6/route.c:2164
ip6_pol_route+0xee/0x5c0 net/ipv6/route.c:2200
ip6_pol_route_output+0x48/0x60 net/ipv6/route.c:2452
fib6_rule_lookup+0x3d6/0x470 net/ipv6/fib6_rules.c:117
ip6_route_output_flags_noref+0x16b/0x230 net/ipv6/route.c:2484
ip6_route_output_flags+0x50/0x1a0 net/ipv6/route.c:2497
ip6_dst_lookup_tail+0x25d/0xc30 net/ipv6/ip6_output.c:1049
ip6_dst_lookup_flow+0x68/0x120 net/ipv6/ip6_output.c:1150
inet6_csk_route_socket+0x2f7/0x420 net/ipv6/inet6_connection_sock.c:106
inet6_csk_xmit+0x91/0x1f0 net/ipv6/inet6_connection_sock.c:121
__tcp_transmit_skb+0xe81/0x1d60 net/ipv4/tcp_output.c:1169
Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 18894 Comm: udevd Not tainted 5.4.0-rc3+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Fixes: cc3a86c802 ("ipv6: Change rt6_probe to take a fib6_nh")
Fixes: f547fac624 ("ipv6: rate-limit probes for neighbourless routes")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk->sk_ack_backlog can be read without any lock being held.
We need to use READ_ONCE()/WRITE_ONCE() to avoid load/store tearing
and/or potential KCSAN warnings.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The same code which recognizes ICMP error packets is duplicated several
times. Use the icmp_is_err() and icmpv6_is_err() helpers instead, which
do the same thing.
ip_multipath_l3_keys() and tcf_nat_act() didn't check for all the error types,
assume that they should instead.
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Faster jhash2() can be used instead of jhash(), since
IPv6 addresses have the needed alignment requirement.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
traceroute6 output can be confusing, in that it shows the address
that a router would use to reach the sender, rather than the address
the packet used to reach the router.
Consider this case:
------------------------ N2
| |
------ ------ N3 ----
| R1 | | R2 |------|H2|
------ ------ ----
| |
------------------------ N1
|
----
|H1|
----
where H1's default route is through R1, and R1's default route is
through R2 over N2.
traceroute6 from H1 to H2 shows R2's address on N1 rather than on N2.
The script below can be used to reproduce this scenario.
traceroute6 output without this patch:
traceroute to 2000:103::4 (2000:103::4), 30 hops max, 80 byte packets
1 2000:101::1 (2000:101::1) 0.036 ms 0.008 ms 0.006 ms
2 2000:101::2 (2000:101::2) 0.011 ms 0.008 ms 0.007 ms
3 2000:103::4 (2000:103::4) 0.013 ms 0.010 ms 0.009 ms
traceroute6 output with this patch:
traceroute to 2000:103::4 (2000:103::4), 30 hops max, 80 byte packets
1 2000:101::1 (2000:101::1) 0.056 ms 0.019 ms 0.006 ms
2 2000:102::2 (2000:102::2) 0.013 ms 0.008 ms 0.008 ms
3 2000:103::4 (2000:103::4) 0.013 ms 0.009 ms 0.009 ms
#!/bin/bash
#
# ------------------------ N2
# | |
# ------ ------ N3 ----
# | R1 | | R2 |------|H2|
# ------ ------ ----
# | |
# ------------------------ N1
# |
# ----
# |H1|
# ----
#
# N1: 2000:101::/64
# N2: 2000:102::/64
# N3: 2000:103::/64
#
# R1's host part of address: 1
# R2's host part of address: 2
# H1's host part of address: 3
# H2's host part of address: 4
#
# For example:
# the IPv6 address of R1's interface on N2 is 2000:102::1/64
#
# Nets are implemented by macvlan interfaces (bridge mode) over
# dummy interfaces.
#
# Create net namespaces
ip netns add host1
ip netns add host2
ip netns add rtr1
ip netns add rtr2
# Create nets
ip link add net1 type dummy; ip link set net1 up
ip link add net2 type dummy; ip link set net2 up
ip link add net3 type dummy; ip link set net3 up
# Add interfaces to net1, move them to their nemaspaces
ip link add link net1 dev host1net1 type macvlan mode bridge
ip link set host1net1 netns host1
ip link add link net1 dev rtr1net1 type macvlan mode bridge
ip link set rtr1net1 netns rtr1
ip link add link net1 dev rtr2net1 type macvlan mode bridge
ip link set rtr2net1 netns rtr2
# Add interfaces to net2, move them to their nemaspaces
ip link add link net2 dev rtr1net2 type macvlan mode bridge
ip link set rtr1net2 netns rtr1
ip link add link net2 dev rtr2net2 type macvlan mode bridge
ip link set rtr2net2 netns rtr2
# Add interfaces to net3, move them to their nemaspaces
ip link add link net3 dev rtr2net3 type macvlan mode bridge
ip link set rtr2net3 netns rtr2
ip link add link net3 dev host2net3 type macvlan mode bridge
ip link set host2net3 netns host2
# Configure interfaces and routes in host1
ip netns exec host1 ip link set lo up
ip netns exec host1 ip link set host1net1 up
ip netns exec host1 ip -6 addr add 2000:101::3/64 dev host1net1
ip netns exec host1 ip -6 route add default via 2000:101::1
# Configure interfaces and routes in rtr1
ip netns exec rtr1 ip link set lo up
ip netns exec rtr1 ip link set rtr1net1 up
ip netns exec rtr1 ip -6 addr add 2000:101::1/64 dev rtr1net1
ip netns exec rtr1 ip link set rtr1net2 up
ip netns exec rtr1 ip -6 addr add 2000:102::1/64 dev rtr1net2
ip netns exec rtr1 ip -6 route add default via 2000:102::2
ip netns exec rtr1 sysctl net.ipv6.conf.all.forwarding=1
# Configure interfaces and routes in rtr2
ip netns exec rtr2 ip link set lo up
ip netns exec rtr2 ip link set rtr2net1 up
ip netns exec rtr2 ip -6 addr add 2000:101::2/64 dev rtr2net1
ip netns exec rtr2 ip link set rtr2net2 up
ip netns exec rtr2 ip -6 addr add 2000:102::2/64 dev rtr2net2
ip netns exec rtr2 ip link set rtr2net3 up
ip netns exec rtr2 ip -6 addr add 2000:103::2/64 dev rtr2net3
ip netns exec rtr2 sysctl net.ipv6.conf.all.forwarding=1
# Configure interfaces and routes in host2
ip netns exec host2 ip link set lo up
ip netns exec host2 ip link set host2net3 up
ip netns exec host2 ip -6 addr add 2000:103::4/64 dev host2net3
ip netns exec host2 ip -6 route add default via 2000:103::2
# Ping host2 from host1
ip netns exec host1 ping6 -c5 2000:103::4
# Traceroute host2 from host1
ip netns exec host1 traceroute6 2000:103::4
# Delete nets
ip link del net3
ip link del net2
ip link del net1
# Delete namespaces
ip netns del rtr2
ip netns del rtr1
ip netns del host2
ip netns del host1
Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Original-patch-by: Honggang Xu <hxu@arista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The only slightly tricky merge conflict was the netdevsim because the
mutex locking fix overlapped a lot of driver reload reorganization.
The rest were (relatively) trivial in nature.
Signed-off-by: David S. Miller <davem@davemloft.net>
syzbot triggered struct net NULL deref in NF_HOOK_LIST:
RIP: 0010:NF_HOOK_LIST include/linux/netfilter.h:331 [inline]
RIP: 0010:ip6_sublist_rcv+0x5c9/0x930 net/ipv6/ip6_input.c:292
ipv6_list_rcv+0x373/0x4b0 net/ipv6/ip6_input.c:328
__netif_receive_skb_list_ptype net/core/dev.c:5274 [inline]
Reason:
void ipv6_list_rcv(struct list_head *head, struct packet_type *pt,
struct net_device *orig_dev)
[..]
list_for_each_entry_safe(skb, next, head, list) {
/* iterates list */
skb = ip6_rcv_core(skb, dev, net);
/* ip6_rcv_core drops skb -> NULL is returned */
if (skb == NULL)
continue;
[..]
}
/* sublist is empty -> curr_net is NULL */
ip6_sublist_rcv(&sublist, curr_dev, curr_net);
Before the recent change NF_HOOK_LIST did a list iteration before
struct net deref, i.e. it was a no-op in the empty list case.
List iteration now happens after *net deref, causing crash.
Follow the same pattern as the ip(v6)_list_rcv loop and add a list_empty
test for the final sublist dispatch too.
Cc: Edward Cree <ecree@solarflare.com>
Reported-by: syzbot+c54f457cad330e57e967@syzkaller.appspotmail.com
Fixes: ca58fbe06c ("netfilter: add and use nf_hook_slow_list()")
Signed-off-by: Florian Westphal <fw@strlen.de>
Tested-by: Leon Romanovsky <leonro@mellanox.com>
Tested-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The check for !md doens't really work for ip_tunnel_info_opts(info) which
only does info + 1. Also to avoid out-of-bounds access on info, it should
ensure options_len is not less than erspan_metadata in both erspan_xmit()
and ip6erspan_tunnel_xmit().
Fixes: 1a66a836da ("gre: add collect_md mode to ERSPAN tunnel")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter/IPVS updates for net-next,
more specifically:
* Updates for ipset:
1) Coding style fix for ipset comment extension, from Jeremy Sowden.
2) De-inline many functions in ipset, from Jeremy Sowden.
3) Move ipset function definition from header to source file.
4) Move ip_set_put_flags() to source, export it as a symbol, remove
inline.
5) Move range_to_mask() to the source file where this is used.
6) Move ip_set_get_ip_port() to the source file where this is used.
* IPVS selftests and netns improvements:
7) Two patches to speedup ipvs netns dismantle, from Haishuang Yan.
8) Three patches to add selftest script for ipvs, also from
Haishuang Yan.
* Conntrack updates and new nf_hook_slow_list() function:
9) Document ct ecache extension, from Florian Westphal.
10) Skip ct extensions from ctnetlink dump, from Florian.
11) Free ct extension immediately, from Florian.
12) Skip access to ecache extension from nf_ct_deliver_cached_events()
this is not correct as reported by Syzbot.
13) Add and use nf_hook_slow_list(), from Florian.
* Flowtable infrastructure updates:
14) Move priority to nf_flowtable definition.
15) Dynamic allocation of per-device hooks in flowtables.
16) Allow to include netdevice only once in flowtable definitions.
17) Rise maximum number of devices per flowtable.
* Netfilter hardware offload infrastructure updates:
18) Add nft_flow_block_chain() helper function.
19) Pass callback list to nft_setup_cb_call().
20) Add nft_flow_cls_offload_setup() helper function.
21) Remove rules for the unregistered device via netdevice event.
22) Support for multiple devices in a basechain definition at the
ingress hook.
22) Add nft_chain_offload_cmd() helper function.
23) Add nft_flow_block_offload_init() helper function.
24) Rewind in case of failing to bind multiple devices to hook.
25) Typo in IPv6 tproxy module description, from Norman Rasmussen.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Include <net/addrconf.h> for the missing declarations of
various functions. Fixes the following sparse warnings:
net/ipv6/addrconf_core.c:94:5: warning: symbol 'register_inet6addr_notifier' was not declared. Should it be static?
net/ipv6/addrconf_core.c💯5: warning: symbol 'unregister_inet6addr_notifier' was not declared. Should it be static?
net/ipv6/addrconf_core.c:106:5: warning: symbol 'inet6addr_notifier_call_chain' was not declared. Should it be static?
net/ipv6/addrconf_core.c:112:5: warning: symbol 'register_inet6addr_validator_notifier' was not declared. Should it be static?
net/ipv6/addrconf_core.c:118:5: warning: symbol 'unregister_inet6addr_validator_notifier' was not declared. Should it be static?
net/ipv6/addrconf_core.c:125:5: warning: symbol 'inet6addr_validator_notifier_call_chain' was not declared. Should it be static?
net/ipv6/addrconf_core.c:237:6: warning: symbol 'in6_dev_finish_destroy' was not declared. Should it be static?
Signed-off-by: Ben Dooks (Codethink) <ben.dooks@codethink.co.uk>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Thomas found that some forwarded packets would be stuck
in FQ packet scheduler because their skb->tstamp contained
timestamps far in the future.
We thought we addressed this point in commit 8203e2d844
("net: clear skb->tstamp in forwarding paths") but there
is still an issue when/if a packet needs to be fragmented.
In order to meet EDT requirements, we have to make sure all
fragments get the original skb->tstamp.
Note that this original skb->tstamp should be zero in
forwarding path, but might have a non zero value in
output path if user decided so.
Fixes: fb420d5d91 ("tcp/fq: move back to CLOCK_MONOTONIC")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Thomas Bartschies <Thomas.Bartschies@cvk.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit b0818f80c8.
Started seeing weird behavior after this patch especially in
the IPv6 code path. Haven't root caused it, but since this was
applied to net branch, taking a precautionary measure to revert
it and look / analyze those failures
Revert this now and I'll send a better fix after analysing / fixing
the weirdness observed.
CC: Eric Dumazet <edumazet@google.com>
CC: Wei Wang <weiwan@google.com>
CC: David S. Miller <davem@davemloft.net>
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While invalidating the dst, we assign backhole_netdev instead of
loopback device. However, this device does not have idev pointer
and hence no ip6_ptr even if IPv6 is enabled. Possibly this has
triggered the syzbot reported crash.
The syzbot report does not have reproducer, however, this is the
only device that doesn't have matching idev created.
Crash instruction is :
static inline bool ip6_ignore_linkdown(const struct net_device *dev)
{
const struct inet6_dev *idev = __in6_dev_get(dev);
return !!idev->cnf.ignore_routes_with_linkdown; <= crash
}
Also ipv6 always assumes presence of idev and never checks for it
being NULL (as does the above referenced code). So adding a idev
for the blackhole_netdev to avoid this class of crashes in the future.
Signed-off-by: David S. Miller <davem@davemloft.net>
There are few places where we fetch tp->write_seq while
this field can change from IRQ or other cpu.
We need to add READ_ONCE() annotations, and also make
sure write sides use corresponding WRITE_ONCE() to avoid
store-tearing.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are few places where we fetch tp->copied_seq while
this field can change from IRQ or other cpu.
We need to add READ_ONCE() annotations, and also make
sure write sides use corresponding WRITE_ONCE() to avoid
store-tearing.
Note that tcp_inq_hint() was already using READ_ONCE(tp->copied_seq)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are few places where we fetch tp->rcv_nxt while
this field can change from IRQ or other cpu.
We need to add READ_ONCE() annotations, and also make
sure write sides use corresponding WRITE_ONCE() to avoid
store-tearing.
Note that tcp_inq_hint() was already using READ_ONCE(tp->rcv_nxt)
syzbot reported :
BUG: KCSAN: data-race in tcp_poll / tcp_queue_rcv
write to 0xffff888120425770 of 4 bytes by interrupt on cpu 0:
tcp_rcv_nxt_update net/ipv4/tcp_input.c:3365 [inline]
tcp_queue_rcv+0x180/0x380 net/ipv4/tcp_input.c:4638
tcp_rcv_established+0xbf1/0xf50 net/ipv4/tcp_input.c:5616
tcp_v4_do_rcv+0x381/0x4e0 net/ipv4/tcp_ipv4.c:1542
tcp_v4_rcv+0x1a03/0x1bf0 net/ipv4/tcp_ipv4.c:1923
ip_protocol_deliver_rcu+0x51/0x470 net/ipv4/ip_input.c:204
ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
NF_HOOK include/linux/netfilter.h:305 [inline]
NF_HOOK include/linux/netfilter.h:299 [inline]
ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
dst_input include/net/dst.h:442 [inline]
ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
NF_HOOK include/linux/netfilter.h:305 [inline]
NF_HOOK include/linux/netfilter.h:299 [inline]
ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
__netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5004
__netif_receive_skb+0x37/0xf0 net/core/dev.c:5118
netif_receive_skb_internal+0x59/0x190 net/core/dev.c:5208
napi_skb_finish net/core/dev.c:5671 [inline]
napi_gro_receive+0x28f/0x330 net/core/dev.c:5704
receive_buf+0x284/0x30b0 drivers/net/virtio_net.c:1061
read to 0xffff888120425770 of 4 bytes by task 7254 on cpu 1:
tcp_stream_is_readable net/ipv4/tcp.c:480 [inline]
tcp_poll+0x204/0x6b0 net/ipv4/tcp.c:554
sock_poll+0xed/0x250 net/socket.c:1256
vfs_poll include/linux/poll.h:90 [inline]
ep_item_poll.isra.0+0x90/0x190 fs/eventpoll.c:892
ep_send_events_proc+0x113/0x5c0 fs/eventpoll.c:1749
ep_scan_ready_list.constprop.0+0x189/0x500 fs/eventpoll.c:704
ep_send_events fs/eventpoll.c:1793 [inline]
ep_poll+0xe3/0x900 fs/eventpoll.c:1930
do_epoll_wait+0x162/0x180 fs/eventpoll.c:2294
__do_sys_epoll_pwait fs/eventpoll.c:2325 [inline]
__se_sys_epoll_pwait fs/eventpoll.c:2311 [inline]
__x64_sys_epoll_pwait+0xcd/0x170 fs/eventpoll.c:2311
do_syscall_64+0xcf/0x2f0 arch/x86/entry/common.c:296
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 7254 Comm: syz-fuzzer Not tainted 5.3.0+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Both tcp_v4_err() and tcp_v6_err() do the following operations
while they do not own the socket lock :
fastopen = tp->fastopen_rsk;
snd_una = fastopen ? tcp_rsk(fastopen)->snt_isn : tp->snd_una;
The problem is that without appropriate barrier, the compiler
might reload tp->fastopen_rsk and trigger a NULL deref.
request sockets are protected by RCU, we can simply add
the missing annotations and barriers to solve the issue.
Fixes: 168a8f5805 ("tcp: TCP Fast Open Server - main code path")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip6erspan driver calls ether_setup(), after commit 61e84623ac
("net: centralize net_device min/max MTU checking"), the range
of mtu is [min_mtu, max_mtu], which is [68, 1500] by default.
It causes the dev mtu of the erspan device to not be greater
than 1500, this limit value is not correct for ip6erspan tap
device.
Fixes: 61e84623ac ("net: centralize net_device min/max MTU checking")
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Acked-by: William Tu <u9012063@gmail.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Rajendra reported a kernel panic when a link was taken down:
[ 6870.263084] BUG: unable to handle kernel NULL pointer dereference at 00000000000000a8
[ 6870.271856] IP: [<ffffffff8efc5764>] __ipv6_ifa_notify+0x154/0x290
<snip>
[ 6870.570501] Call Trace:
[ 6870.573238] [<ffffffff8efc58c6>] ? ipv6_ifa_notify+0x26/0x40
[ 6870.579665] [<ffffffff8efc98ec>] ? addrconf_dad_completed+0x4c/0x2c0
[ 6870.586869] [<ffffffff8efe70c6>] ? ipv6_dev_mc_inc+0x196/0x260
[ 6870.593491] [<ffffffff8efc9c6a>] ? addrconf_dad_work+0x10a/0x430
[ 6870.600305] [<ffffffff8f01ade4>] ? __switch_to_asm+0x34/0x70
[ 6870.606732] [<ffffffff8ea93a7a>] ? process_one_work+0x18a/0x430
[ 6870.613449] [<ffffffff8ea93d6d>] ? worker_thread+0x4d/0x490
[ 6870.619778] [<ffffffff8ea93d20>] ? process_one_work+0x430/0x430
[ 6870.626495] [<ffffffff8ea99dd9>] ? kthread+0xd9/0xf0
[ 6870.632145] [<ffffffff8f01ade4>] ? __switch_to_asm+0x34/0x70
[ 6870.638573] [<ffffffff8ea99d00>] ? kthread_park+0x60/0x60
[ 6870.644707] [<ffffffff8f01ae77>] ? ret_from_fork+0x57/0x70
[ 6870.650936] Code: 31 c0 31 d2 41 b9 20 00 08 02 b9 09 00 00 0
addrconf_dad_work is kicked to be scheduled when a device is brought
up. There is a race between addrcond_dad_work getting scheduled and
taking the rtnl lock and a process taking the link down (under rtnl).
The latter removes the host route from the inet6_addr as part of
addrconf_ifdown which is run for NETDEV_DOWN. The former attempts
to use the host route in __ipv6_ifa_notify. If the down event removes
the host route due to the race to the rtnl, then the BUG listed above
occurs.
Since the DAD sequence can not be aborted, add a check for the missing
host route in __ipv6_ifa_notify. The only way this should happen is due
to the previously mentioned race. The host route is created when the
address is added to an interface; it is only removed on a down event
where the address is kept. Add a warning if the host route is missing
AND the device is up; this is a situation that should never happen.
Fixes: f1705ec197 ("net: ipv6: Make address flushing on ifdown optional")
Reported-by: Rajendra Dendukuri <rajendra.dendukuri@broadcom.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit a3ce2a21bb.
Eric reported tests failings with commit. After digging into it,
the bottom line is that the DAD sequence is not to be messed with.
There are too many cases that are expected to proceed regardless
of whether a device is up.
Revert the patch and I will send a different solution for the
problem Rajendra reported.
Signed-off-by: David Ahern <dsahern@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since errors are propagated all the way up to the caller, propagate
possible extack of the caller all the way down to the notifier block
callback.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Unlike events for registered notifier, during the registration, the
errors that happened for the block being registered are not propagated
up to the caller. Make sure the error is propagated for FIB rules and
entries.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently all users of FIB notifier only cares about events in init_net.
Later in this patchset, users get interested in other namespaces too.
However, for every registered block user is interested only about one
namespace. Make the FIB notifier registration per-netns and avoid
unnecessary calls of notifier block for other namespaces.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Prior to this change an application sending <= 1MSS worth of data and
enabling UDP GSO would fail if the system had SW GSO enabled, but the
same send would succeed if HW GSO offload is enabled. In addition to this
inconsistency the error in the SW GSO case does not get back to the
application if sending out of a real device so the user is unaware of this
failure.
With this change we only perform GSO if the # of segments is > 1 even
if the application has enabled segmentation. I've also updated the
relevant udpgso selftests.
Fixes: bec1f6f697 ("udp: generate gso with UDP_SEGMENT")
Signed-off-by: Josh Hunt <johunt@akamai.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit dfec0ee22c ("udp: Record gso_segs when supporting UDP segmentation offload")
added gso_segs calculation, but incorrectly got sizeof() the pointer and
not the underlying data type. In addition let's fix the v6 case.
Fixes: bec1f6f697 ("udp: generate gso with UDP_SEGMENT")
Fixes: dfec0ee22c ("udp: Record gso_segs when supporting UDP segmentation offload")
Signed-off-by: Josh Hunt <johunt@akamai.com>
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This began with a syzbot report. syzkaller was injecting
IPv6 TCP SYN packets having a v4mapped source address.
After an unsuccessful 4-tuple lookup, TCP creates a request
socket (SYN_RECV) and calls reqsk_queue_hash_req()
reqsk_queue_hash_req() calls sk_ehashfn(sk)
At this point we have AF_INET6 sockets, and the heuristic
used by sk_ehashfn() to either hash the IPv4 or IPv6 addresses
is to use ipv6_addr_v4mapped(&sk->sk_v6_daddr)
For the particular spoofed packet, we end up hashing V4 addresses
which were not initialized by the TCP IPv6 stack, so KMSAN fired
a warning.
I first fixed sk_ehashfn() to test both source and destination addresses,
but then faced various problems, including user-space programs
like packetdrill that had similar assumptions.
Instead of trying to fix the whole ecosystem, it is better
to admit that we have a dual stack behavior, and that we
can not build linux kernels without V4 stack anyway.
The dual stack API automatically forces the traffic to be IPv4
if v4mapped addresses are used at bind() or connect(), so it makes
no sense to allow IPv6 traffic to use the same v4mapped class.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Remove the skb_ext_del from nf_reset, and renames it to a more
fitting nf_reset_ct(). Patch from Florian Westphal.
2) Fix deadlock in nft_connlimit between packet path updates and
the garbage collector.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Rajendra reported a kernel panic when a link was taken down:
[ 6870.263084] BUG: unable to handle kernel NULL pointer dereference at 00000000000000a8
[ 6870.271856] IP: [<ffffffff8efc5764>] __ipv6_ifa_notify+0x154/0x290
<snip>
[ 6870.570501] Call Trace:
[ 6870.573238] [<ffffffff8efc58c6>] ? ipv6_ifa_notify+0x26/0x40
[ 6870.579665] [<ffffffff8efc98ec>] ? addrconf_dad_completed+0x4c/0x2c0
[ 6870.586869] [<ffffffff8efe70c6>] ? ipv6_dev_mc_inc+0x196/0x260
[ 6870.593491] [<ffffffff8efc9c6a>] ? addrconf_dad_work+0x10a/0x430
[ 6870.600305] [<ffffffff8f01ade4>] ? __switch_to_asm+0x34/0x70
[ 6870.606732] [<ffffffff8ea93a7a>] ? process_one_work+0x18a/0x430
[ 6870.613449] [<ffffffff8ea93d6d>] ? worker_thread+0x4d/0x490
[ 6870.619778] [<ffffffff8ea93d20>] ? process_one_work+0x430/0x430
[ 6870.626495] [<ffffffff8ea99dd9>] ? kthread+0xd9/0xf0
[ 6870.632145] [<ffffffff8f01ade4>] ? __switch_to_asm+0x34/0x70
[ 6870.638573] [<ffffffff8ea99d00>] ? kthread_park+0x60/0x60
[ 6870.644707] [<ffffffff8f01ae77>] ? ret_from_fork+0x57/0x70
[ 6870.650936] Code: 31 c0 31 d2 41 b9 20 00 08 02 b9 09 00 00 0
addrconf_dad_work is kicked to be scheduled when a device is brought
up. There is a race between addrcond_dad_work getting scheduled and
taking the rtnl lock and a process taking the link down (under rtnl).
The latter removes the host route from the inet6_addr as part of
addrconf_ifdown which is run for NETDEV_DOWN. The former attempts
to use the host route in ipv6_ifa_notify. If the down event removes
the host route due to the race to the rtnl, then the BUG listed above
occurs.
This scenario does not occur when the ipv6 address is not kept
(net.ipv6.conf.all.keep_addr_on_down = 0) as addrconf_ifdown sets the
state of the ifp to DEAD. Handle when the addresses are kept by checking
IF_READY which is reset by addrconf_ifdown.
The 'dead' flag for an inet6_addr is set only under rtnl, in
addrconf_ifdown and it means the device is getting removed (or IPv6 is
disabled). The interesting cases for changing the idev flag are
addrconf_notify (NETDEV_UP and NETDEV_CHANGE) and addrconf_ifdown
(reset the flag). The former does not have the idev lock - only rtnl;
the latter has both. Based on that the existing dead + IF_READY check
can be moved to right after the rtnl_lock in addrconf_dad_work.
Fixes: f1705ec197 ("net: ipv6: Make address flushing on ifdown optional")
Reported-by: Rajendra Dendukuri <rajendra.dendukuri@broadcom.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Just put related code together to ease code reading: the memcpy() is
related to the nla_reserve().
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 174e23810c
("sk_buff: drop all skb extensions on free and skb scrubbing") made napi
recycle always drop skb extensions. The additional skb_ext_del() that is
performed via nf_reset on napi skb recycle is not needed anymore.
Most nf_reset() calls in the stack are there so queued skb won't block
'rmmod nf_conntrack' indefinitely.
This removes the skb_ext_del from nf_reset, and renames it to a more
fitting nf_reset_ct().
In a few selected places, add a call to skb_ext_reset to make sure that
no active extensions remain.
I am submitting this for "net", because we're still early in the release
cycle. The patch applies to net-next too, but I think the rename causes
needless divergence between those trees.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
ctl packets sent on behalf of TIME_WAIT sockets currently
have a zero skb->priority, which can cause various problems.
In this patch we :
- add a tw_priority field in struct inet_timewait_sock.
- populate it from sk->sk_priority when a TIME_WAIT is created.
- For IPv4, change ip_send_unicast_reply() and its two
callers to propagate tw_priority correctly.
ip_send_unicast_reply() no longer changes sk->sk_priority.
- For IPv6, make sure TIME_WAIT sockets pass their tw_priority
field to tcp_v6_send_response() and tcp_v6_send_ack().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can populate skb->priority for some ctl packets
instead of always using zero.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, ip6_xmit() sets skb->priority based on sk->sk_priority
This is not desirable for TCP since TCP shares the same ctl socket
for a given netns. We want to be able to send RST or ACK packets
with a non zero skb->priority.
This patch has no functional change.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 7d9e5f4221 removed references from certain dsts, but accounting
for this never translated down into the fib6 suppression code. This bug
was triggered by WireGuard users who use wg-quick(8), which uses the
"suppress-prefix" directive to ip-rule(8) for routing all of their
internet traffic without routing loops. The test case added here
causes the reference underflow by causing packets to evaluate a suppress
rule.
Fixes: 7d9e5f4221 ("ipv6: convert major tx path to use RT6_LOOKUP_F_DST_NOREF")
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Acked-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adjust indentation from spaces to tab (+optional two spaces) as in
coding style with command like:
$ sed -e 's/^ /\t/' -i */Kconfig
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Acked-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yi Ren reported an issue discovered by syzkaller, and bisected
to the cited commit.
Many thanks to Yi, this trivial patch does not reflect the patient
work that has been done.
Fixes: d64a1f574a ("ipv6: honor RT6_LOOKUP_F_DST_NOREF in rule lookup logic")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Wei Wang <weiwan@google.com>
Bisected-and-reported-by: Yi Ren <c4tren@gmail.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
In ip6erspan_tunnel_xmit(), if the skb will not be sent out, it has to
be freed on the tx_err path. Otherwise when deleting a netns, it would
cause dst/dev to leak, and dmesg shows:
unregister_netdevice: waiting for lo to become free. Usage count = 1
Fixes: ef7baf5e08 ("ip6_gre: add ip6 erspan collect_md mode")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
UDP reuseport groups can hold a mix unconnected and connected sockets.
Ensure that connections only receive all traffic to their 4-tuple.
Fast reuseport returns on the first reuseport match on the assumption
that all matches are equal. Only if connections are present, return to
the previous behavior of scoring all sockets.
Record if connections are present and if so (1) treat such connected
sockets as an independent match from the group, (2) only return
2-tuple matches from reuseport and (3) do not return on the first
2-tuple reuseport match to allow for a higher scoring match later.
New field has_conns is set without locks. No other fields in the
bitmap are modified at runtime and the field is only ever set
unconditionally, so an RMW cannot miss a change.
Fixes: e32ea7e747 ("soreuseport: fast reuseport UDP socket selection")
Link: http://lkml.kernel.org/r/CA+FuTSfRP09aJNYRt04SS6qj22ViiOEWaWmLAwX0psk8-PGNxw@mail.gmail.com
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Craig Gallek <kraig@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Enable setting skb->mark for UDP and RAW sockets using cmsg.
This is analogous to existing support for TOS, TTL, txtime, etc.
Packet sockets already support this as of commit c7d39e3263
("packet: support per-packet fwmark for af_packet sendmsg").
Similar to other fields, implement by
1. initialize the sockcm_cookie.mark from socket option sk_mark
2. optionally overwrite this in ip_cmsg_send/ip6_datagram_send_ctl
3. initialize inet_cork.mark from sockcm_cookie.mark
4. initialize each (usually just one) skb->mark from inet_cork.mark
Step 1 is handled in one location for most protocols by ipcm_init_sk
as of commit 351782067b ("ipv4: ipcm_cookie initializers").
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next:
1) Fix error path of nf_tables_updobj(), from Dan Carpenter.
2) Move large structure away from stack in the nf_tables offload
infrastructure, from Arnd Bergmann.
3) Move indirect flow_block logic to nf_tables_offload.
4) Support for synproxy objects, from Fernando Fernandez Mancera.
5) Support for fwd and dup offload.
6) Add __nft_offload_get_chain() helper, this implicitly fixes missing
mutex and check for offload flags in the indirect block support,
patch from wenxu.
7) Remove rules on device unregistration, from wenxu. This includes
two preparation patches to reuse nft_flow_offload_chain() and
nft_flow_offload_rule().
Large batch from Jeremy Sowden to make a second pass to the
CONFIG_HEADER_TEST support and a bit of housekeeping:
8) Missing include guard in conntrack label header, from Jeremy Sowden.
9) A few coding style errors: trailing whitespace, incorrect indent in
Kconfig, and semicolons at the end of function definitions.
10) Remove unused ipt_init() and ip6t_init() declarations.
11) Inline xt_hashlimit, ebt_802_3 and xt_physdev headers. They are
only used once.
12) Update include directive in several netfilter files.
13) Remove unused include/net/netfilter/ipv6/nf_conntrack_icmpv6.h.
14) Move nf_ip6_ext_hdr() to include/linux/netfilter_ipv6.h
15) Move several synproxy structure definitions to nf_synproxy.h
16) Move nf_bridge_frag_data structure to include/linux/netfilter_bridge.h
17) Clean up static inline definitions in nf_conntrack_ecache.h.
18) Replace defined(CONFIG...) || defined(CONFIG...MODULE) with IS_ENABLED(CONFIG...).
19) Missing inline function conditional definitions based on Kconfig
preferences in synproxy and nf_conntrack_timeout.
20) Update br_nf_pre_routing_ipv6() definition.
21) Move conntrack code in linux/skbuff.h to nf_conntrack headers.
22) Several patches to remove superfluous CONFIG_NETFILTER and
CONFIG_NF_CONNTRACK checks in headers, coming from the initial batch
support for CONFIG_HEADER_TEST for netfilter.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a struct definition function in nf_conntrack_bridge.h which is
not specific to conntrack and is used elswhere in netfilter. Move it
into netfilter_bridge.h.
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>