This avoids another inderect call per RX packet which save us around
20-40 ns.
Changelog:
v1 -> v2:
- Move declaraions to fib_rules.h to remove warnings
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Brian Vazquez <brianvv@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make sure not just the pointer itself but the whole range lies in
the user address space. For that pass the length and then use
the access_ok helper to do the check.
Fixes: 6d04fe15f7 ("net: optimize the sockptr_t for unified kernel/user address spaces")
Reported-by: David Laight <David.Laight@ACULAB.COM>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
sockptr_advance never properly worked. Replace it with _offset variants
of copy_from_sockptr and copy_to_sockptr.
Fixes: ba423fdaa5 ("net: add a new sockptr_t type")
Reported-by: Jason A. Donenfeld <Jason@zx2c4.com>
Reported-by: Ido Schimmel <idosch@idosch.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jason A. Donenfeld <Jason@zx2c4.com>
Tested-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This was accidentally removed in an unrelated commit.
Fixes: c2f12630c6 ("netfilter: switch nf_setsockopt to sockptr_t")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cited commit mistakenly copied provided option to 'val' instead of to
'mfc':
```
- if (copy_from_user(&mfc, optval, sizeof(mfc))) {
+ if (copy_from_sockptr(&val, optval, sizeof(val))) {
```
Fix this by copying the option to 'mfc'.
selftest router_multicast.sh before:
$ ./router_multicast.sh
smcroutectl: Unknown or malformed IPC message 'a' from client.
smcroutectl: failed removing multicast route, does not exist.
TEST: mcast IPv4 [FAIL]
Multicast not received on first host
TEST: mcast IPv6 [ OK ]
smcroutectl: Unknown or malformed IPC message 'a' from client.
smcroutectl: failed removing multicast route, does not exist.
TEST: RPF IPv4 [FAIL]
Multicast not received on first host
TEST: RPF IPv6 [ OK ]
selftest router_multicast.sh after:
$ ./router_multicast.sh
TEST: mcast IPv4 [ OK ]
TEST: mcast IPv6 [ OK ]
TEST: RPF IPv4 [ OK ]
TEST: RPF IPv6 [ OK ]
Fixes: 01ccb5b48f ("net/ipv4: switch ip_mroute_setsockopt to sockptr_t")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The UDP reuseport conflict was a little bit tricky.
The net-next code, via bpf-next, extracted the reuseport handling
into a helper so that the BPF sk lookup code could invoke it.
At the same time, the logic for reuseport handling of unconnected
sockets changed via commit efc6b6f6c3
which changed the logic to carry on the reuseport result into the
rest of the lookup loop if we do not return immediately.
This requires moving the reuseport_has_conns() logic into the callers.
While we are here, get rid of inline directives as they do not belong
in foo.c files.
The other changes were cases of more straightforward overlapping
modifications.
Signed-off-by: David S. Miller <davem@davemloft.net>
Extend the rfc 4884 read interface introduced for ipv4 in
commit eba75c587e ("icmp: support rfc 4884") to ipv6.
Add socket option SOL_IPV6/IPV6_RECVERR_RFC4884.
Changes v1->v2:
- make ipv6_icmp_error_rfc4884 static (file scope)
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The RFC 4884 spec is largely the same between IPv4 and IPv6.
Factor out the IPv4 specific parts in preparation for IPv6 support:
- icmp types supported
- icmp header size, and thus offset to original datagram start
- datagram length field offset in icmp(6)hdr.
- datagram length field word size: 4B for IPv4, 8B for IPv6.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1) Only accept packets with original datagram len field >= header len.
The extension header must start after the original datagram headers.
The embedded datagram len field is compared against the 128B minimum
stipulated by RFC 4884. It is unlikely that headers extend beyond
this. But as we know the exact header length, check explicitly.
2) Remove the check that datagram length must be <= 576B.
This is a send constraint. There is no value in testing this on rx.
Within private networks it may be known safe to send larger packets.
Process these packets.
This test was also too lax. It compared original datagram length
rather than entire icmp packet length. The stand-alone fix would be:
- if (hlen + skb->len > 576)
+ if (-skb_network_offset(skb) + skb->len > 576)
Fixes: eba75c587e ("icmp: support rfc 4884")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For architectures like x86 and arm64 we don't need the separate bit to
indicate that a pointer is a kernel pointer as the address spaces are
unified. That way the sockptr_t can be reduced to a union of two
pointers, which leads to nicer calling conventions.
The only caveat is that we need to check that users don't pass in kernel
address and thus gain access to kernel memory. Thus the USER_SOCKPTR
helper is replaced with a init_user_sockptr function that does this check
and returns an error if it fails.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rework the remaining setsockopt code to pass a sockptr_t instead of a
plain user pointer. This removes the last remaining set_fs(KERNEL_DS)
outside of architecture specific code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> [ieee802154]
Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass a sockptr_t to prepare for set_fs-less handling of the kernel
pointer from bpf-cgroup.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass a sockptr_t to prepare for set_fs-less handling of the kernel
pointer from bpf-cgroup.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass a sockptr_t to prepare for set_fs-less handling of the kernel
pointer from bpf-cgroup.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass a sockptr_t to prepare for set_fs-less handling of the kernel
pointer from bpf-cgroup.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass a sockptr_t to prepare for set_fs-less handling of the kernel
pointer from bpf-cgroup.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is mostly to prepare for cleaning up the callers, as bpfilter by
design can't handle kernel pointers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass a sockptr_t to prepare for set_fs-less handling of the kernel
pointer from bpf-cgroup.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass a sockptr_t to prepare for set_fs-less handling of the kernel
pointer from bpf-cgroup.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass a sockptr_t to prepare for set_fs-less handling of the kernel
pointer from bpf-cgroup.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previously TLP may send multiple probes of new data in one
flight. This happens when the sender is cwnd limited. After the
initial TLP containing new data is sent, the sender receives another
ACK that acks partial inflight. It may re-arm another TLP timer
to send more, if no further ACK returns before the next TLP timeout
(PTO) expires. The sender may send in theory a large amount of TLP
until send queue is depleted. This only happens if the sender sees
such irregular uncommon ACK pattern. But it is generally undesirable
behavior during congestion especially.
The original TLP design restrict only one TLP probe per inflight as
published in "Reducing Web Latency: the Virtue of Gentle Aggression",
SIGCOMM 2013. This patch changes TLP to send at most one probe
per inflight.
Note that if the sender is app-limited, TLP retransmits old data
and did not have this issue.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov says:
====================
pull-request: bpf-next 2020-07-21
The following pull-request contains BPF updates for your *net-next* tree.
We've added 46 non-merge commits during the last 6 day(s) which contain
a total of 68 files changed, 4929 insertions(+), 526 deletions(-).
The main changes are:
1) Run BPF program on socket lookup, from Jakub.
2) Introduce cpumap, from Lorenzo.
3) s390 JIT fixes, from Ilya.
4) teach riscv JIT to emit compressed insns, from Luke.
5) use build time computed BTF ids in bpf iter, from Yonghong.
====================
Purely independent overlapping changes in both filter.h and xdp.h
Signed-off-by: David S. Miller <davem@davemloft.net>
We can't use IS_UDPLITE to replace udp_sk->pcflag when UDPLITE_RECV_CC is
checked.
Fixes: b2bf1e2659 ("[UDP]: Clean up for IS_UDPLITE macro")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, SO_REUSEPORT does not work well if connected sockets are in a
UDP reuseport group.
Then reuseport_has_conns() returns true and the result of
reuseport_select_sock() is discarded. Also, unconnected sockets have the
same score, hence only does the first unconnected socket in udp_hslot
always receive all packets sent to unconnected sockets.
So, the result of reuseport_select_sock() should be used for load
balancing.
The noteworthy point is that the unconnected sockets placed after
connected sockets in sock_reuseport.socks will receive more packets than
others because of the algorithm in reuseport_select_sock().
index | connected | reciprocal_scale | result
---------------------------------------------
0 | no | 20% | 40%
1 | no | 20% | 20%
2 | yes | 20% | 0%
3 | no | 20% | 40%
4 | yes | 20% | 0%
If most of the sockets are connected, this can be a problem, but it still
works better than now.
Fixes: acdcecc612 ("udp: correct reuseport selection with connected sockets")
CC: Willem de Bruijn <willemb@google.com>
Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
One additional field btf_id is added to struct
bpf_ctx_arg_aux to store the precomputed btf_ids.
The btf_id is computed at build time with
BTF_ID_LIST or BTF_ID_LIST_GLOBAL macro definitions.
All existing bpf iterators are changed to used
pre-compute btf_ids.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200720163403.1393551-1-yhs@fb.com
Add setsockopt SOL_IP/IP_RECVERR_4884 to return the offset to an
extension struct if present.
ICMP messages may include an extension structure after the original
datagram. RFC 4884 standardized this behavior. It stores the offset
in words to the extension header in u8 icmphdr.un.reserved[1].
The field is valid only for ICMP types destination unreachable, time
exceeded and parameter problem, if length is at least 128 bytes and
entire packet does not exceed 576 bytes.
Return the offset to the start of the extension struct when reading an
ICMP error from the error queue, if it matches the above constraints.
Do not return the raw u8 field. Return the offset from the start of
the user buffer, in bytes. The kernel does not return the network and
transport headers, so subtract those.
Also validate the headers. Return the offset regardless of validation,
as an invalid extension must still not be misinterpreted as part of
the original datagram. Note that !invalid does not imply valid. If
the extension version does not match, no validation can take place,
for instance.
For backward compatibility, make this optional, set by setsockopt
SOL_IP/IP_RECVERR_RFC4884. For API example and feature test, see
github.com/wdebruij/kerneltools/blob/master/tests/recv_icmp_v2.c
For forward compatibility, reserve only setsockopt value 1, leaving
other bits for additional icmp extensions.
Changes
v1->v2:
- convert word offset to byte offset from start of user buffer
- return in ee_data as u8 may be insufficient
- define extension struct and object header structs
- return len only if constraints met
- if returning len, also validate
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Handle the few cases that need special treatment in-line using
in_compat_syscall(). This also removes all the now unused
compat_{get,set}sockopt methods.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Handle the few cases that need special treatment in-line using
in_compat_syscall().
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Factor out one helper each for setting the native and compat
version of the MCAST_MSFILTER option.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Factor out one helper each for setting the native and compat
version of the MCAST_MSFILTER option.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Factor out one helper each for getting the native and compat
version of the MCAST_MSFILTER option.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lift the in_compat_syscall() from the callers instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
All instances handle compat sockopts via in_compat_syscall() now, so
remove the compat_{get,set} methods as well as the
compat_nf_{get,set}sockopt wrappers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge the native and compat {get,set}sockopt handlers using
in_compat_syscall().
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge the native and compat {get,set}sockopt handlers using
in_compat_syscall().
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the compat handling to sock_common_{get,set}sockopt instead,
keyed of in_compat_syscall(). This allow to remove the now unused
->compat_{get,set}sockopt methods from struct proto_ops.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Following INET/TCP socket lookup changes, modify UDP socket lookup to let
BPF program select a receiving socket before searching for a socket by
destination address and port as usual.
Lookup of connected sockets that match packet 4-tuple is unaffected by this
change. BPF program runs, and potentially overrides the lookup result, only
if a 4-tuple match was not found.
Suggested-by: Marek Majkowski <marek@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200717103536.397595-9-jakub@cloudflare.com
Prepare for calling into reuseport from __udp4_lib_lookup as well.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200717103536.397595-8-jakub@cloudflare.com
Run a BPF program before looking up a listening socket on the receive path.
Program selects a listening socket to yield as result of socket lookup by
calling bpf_sk_assign() helper and returning SK_PASS code. Program can
revert its decision by assigning a NULL socket with bpf_sk_assign().
Alternatively, BPF program can also fail the lookup by returning with
SK_DROP, or let the lookup continue as usual with SK_PASS on return, when
no socket has been selected with bpf_sk_assign().
This lets the user match packets with listening sockets freely at the last
possible point on the receive path, where we know that packets are destined
for local delivery after undergoing policing, filtering, and routing.
With BPF code selecting the socket, directing packets destined to an IP
range or to a port range to a single socket becomes possible.
In case multiple programs are attached, they are run in series in the order
in which they were attached. The end result is determined from return codes
of all the programs according to following rules:
1. If any program returned SK_PASS and selected a valid socket, the socket
is used as result of socket lookup.
2. If more than one program returned SK_PASS and selected a socket,
last selection takes effect.
3. If any program returned SK_DROP, and no program returned SK_PASS and
selected a socket, socket lookup fails with -ECONNREFUSED.
4. If all programs returned SK_PASS and none of them selected a socket,
socket lookup continues to htable-based lookup.
Suggested-by: Marek Majkowski <marek@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200717103536.397595-5-jakub@cloudflare.com
Prepare for calling into reuseport from __inet_lookup_listener as well.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200717103536.397595-4-jakub@cloudflare.com
There are two existing SNMP counters, TCPDSACKRecv and TCPDSACKOfoRecv,
which are incremented depending on whether the DSACKed range is below
the cumulative ACK sequence number or not. Unfortunately, these both
implicitly assume each DSACK covers only one segment. This makes these
counters unusable for estimating spurious retransmit rates,
or real/non-spurious loss rate.
This patch introduces a new SNMP counter, TCPDSACKRecvSegs, which tracks
the estimated number of duplicate segments based on:
(DSACKed sequence range) / MSS. This counter is usable for estimating
spurious retransmit rates, or real/non-spurious loss rate.
Signed-off-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, while processing DSACK, we assume DSACK covers only one
segment. This leads to significant underestimation of DSACKs with
LRO/GRO. This patch fixes segment accounting with DSACK by estimating
segment count from DSACK sequence range / MSS.
Signed-off-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov says:
====================
pull-request: bpf-next 2020-07-13
The following pull-request contains BPF updates for your *net-next* tree.
We've added 36 non-merge commits during the last 7 day(s) which contain
a total of 62 files changed, 2242 insertions(+), 468 deletions(-).
The main changes are:
1) Avoid trace_printk warning banner by switching bpf_trace_printk to use
its own tracing event, from Alan.
2) Better libbpf support on older kernels, from Andrii.
3) Additional AF_XDP stats, from Ciara.
4) build time resolution of BTF IDs, from Jiri.
5) BPF_CGROUP_INET_SOCK_RELEASE hook, from Stanislav.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Simple fixes which require no deep knowledge of the code.
Cc: Paul Moore <paul@paul-moore.com>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 0c3d79bce4 ("tcp: reduce SYN-ACK
retrans for TCP_DEFER_ACCEPT") introduces syn_ack_recalc() which decides
if a minisock is held and a SYN+ACK is retransmitted or not.
If rskq_defer_accept is not zero in syn_ack_recalc(), max_retries always
has the same value because max_retries is overwritten by rskq_defer_accept
in reqsk_timer_handler().
This commit adds three changes:
- remove redundant non-zero check for rskq_defer_accept in
reqsk_timer_handler().
- remove max_retries from the arguments of syn_ack_recalc() and use
rskq_defer_accept instead.
- rename thresh to max_syn_ack_retries for readability.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
CC: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add an interface to report offloaded UDP ports via ethtool netlink.
Now that core takes care of tracking which UDP tunnel ports the NICs
are aware of we can quite easily export this information out to
user space.
The responsibility of writing the netlink dumps is split between
ethtool code and udp_tunnel_nic.c - since udp_tunnel module may
not always be loaded, yet we should always report the capabilities
of the NIC.
$ ethtool --show-tunnels eth0
Tunnel information for eth0:
UDP port table 0:
Size: 4
Types: vxlan
No entries
UDP port table 1:
Size: 4
Types: geneve, vxlan-gpe
Entries (1):
port 1230, vxlan-gpe
v4:
- back to v2, build fix is now directly in udp_tunnel.h
v3:
- don't compile ETHTOOL_MSG_TUNNEL_INFO_GET in if CONFIG_INET
not set.
v2:
- fix string set count,
- reorder enums in the uAPI,
- fix type of ETHTOOL_A_TUNNEL_UDP_TABLE_TYPES to bitset
in docs and comments.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cater to devices which:
(a) may want to sleep in the callbacks;
(b) only have IPv4 support;
(c) need all the programming to happen while the netdev is up.
Drivers attach UDP tunnel offload info struct to their netdevs,
where they declare how many UDP ports of various tunnel types
they support. Core takes care of tracking which ports to offload.
Use a fixed-size array since this matches what almost all drivers
do, and avoids a complexity and uncertainty around memory allocations
in an atomic context.
Make sure that tunnel drivers don't try to replay the ports when
new NIC netdev is registered. Automatic replays would mess up
reference counting, and will be removed completely once all drivers
are converted.
v4:
- use a #define NULL to avoid build issues with CONFIG_INET=n.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>