linux/net/ipv6
Daniel Borkmann 983695fa67 bpf: fix unconnected udp hooks
Intention of cgroup bind/connect/sendmsg BPF hooks is to act transparently
to applications as also stated in original motivation in 7828f20e37 ("Merge
branch 'bpf-cgroup-bind-connect'"). When recently integrating the latter
two hooks into Cilium to enable host based load-balancing with Kubernetes,
I ran into the issue that pods couldn't start up as DNS got broken. Kubernetes
typically sets up DNS as a service and is thus subject to load-balancing.

Upon further debugging, it turns out that the cgroupv2 sendmsg BPF hooks API
is currently insufficient and thus not usable as-is for standard applications
shipped with most distros. To break down the issue we ran into with a simple
example:

  # cat /etc/resolv.conf
  nameserver 147.75.207.207
  nameserver 147.75.207.208

For the purpose of a simple test, we set up above IPs as service IPs and
transparently redirect traffic to a different DNS backend server for that
node:

  # cilium service list
  ID   Frontend            Backend
  1    147.75.207.207:53   1 => 8.8.8.8:53
  2    147.75.207.208:53   1 => 8.8.8.8:53

The attached BPF program is basically selecting one of the backends if the
service IP/port matches on the cgroup hook. DNS breaks here, because the
hooks are not transparent enough to applications which have built-in msg_name
address checks:

  # nslookup 1.1.1.1
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
  [...]
  ;; connection timed out; no servers could be reached

  # dig 1.1.1.1
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
  [...]

  ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
  ;; global options: +cmd
  ;; connection timed out; no servers could be reached

For comparison, if none of the service IPs is used, and we tell nslookup
to use 8.8.8.8 directly it works just fine, of course:

  # nslookup 1.1.1.1 8.8.8.8
  1.1.1.1.in-addr.arpa	name = one.one.one.one.

In order to fix this and thus act more transparent to the application,
this needs reverse translation on recvmsg() side. A minimal fix for this
API is to add similar recvmsg() hooks behind the BPF cgroups static key
such that the program can track state and replace the current sockaddr_in{,6}
with the original service IP. From BPF side, this basically tracks the
service tuple plus socket cookie in an LRU map where the reverse NAT can
then be retrieved via map value as one example. Side-note: the BPF cgroups
static key should be converted to a per-hook static key in future.

Same example after this fix:

  # cilium service list
  ID   Frontend            Backend
  1    147.75.207.207:53   1 => 8.8.8.8:53
  2    147.75.207.208:53   1 => 8.8.8.8:53

Lookups work fine now:

  # nslookup 1.1.1.1
  1.1.1.1.in-addr.arpa    name = one.one.one.one.

  Authoritative answers can be found from:

  # dig 1.1.1.1

  ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
  ;; global options: +cmd
  ;; Got answer:
  ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 51550
  ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

  ;; OPT PSEUDOSECTION:
  ; EDNS: version: 0, flags:; udp: 512
  ;; QUESTION SECTION:
  ;1.1.1.1.                       IN      A

  ;; AUTHORITY SECTION:
  .                       23426   IN      SOA     a.root-servers.net. nstld.verisign-grs.com. 2019052001 1800 900 604800 86400

  ;; Query time: 17 msec
  ;; SERVER: 147.75.207.207#53(147.75.207.207)
  ;; WHEN: Tue May 21 12:59:38 UTC 2019
  ;; MSG SIZE  rcvd: 111

And from an actual packet level it shows that we're using the back end
server when talking via 147.75.207.20{7,8} front end:

  # tcpdump -i any udp
  [...]
  12:59:52.698732 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
  12:59:52.698735 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
  12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
  12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
  [...]

In order to be flexible and to have same semantics as in sendmsg BPF
programs, we only allow return codes in [1,1] range. In the sendmsg case
the program is called if msg->msg_name is present which can be the case
in both, connected and unconnected UDP.

The former only relies on the sockaddr_in{,6} passed via connect(2) if
passed msg->msg_name was NULL. Therefore, on recvmsg side, we act in similar
way to call into the BPF program whenever a non-NULL msg->msg_name was
passed independent of sk->sk_state being TCP_ESTABLISHED or not. Note
that for TCP case, the msg->msg_name is ignored in the regular recvmsg
path and therefore not relevant.

For the case of ip{,v6}_recv_error() paths, picked up via MSG_ERRQUEUE,
the hook is not called. This is intentional as it aligns with the same
semantics as in case of TCP cgroup BPF hooks right now. This might be
better addressed in future through a different bpf_attach_type such
that this case can be distinguished from the regular recvmsg paths,
for example.

Fixes: 1cedee13d2 ("bpf: Hooks for sys_sendmsg")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-06-06 16:53:12 -07:00
..
ila genetlink: optionally validate strictly/dumps 2019-04-27 17:07:22 -04:00
netfilter netfilter: x_tables: merge ip and ipv6 masquerade modules 2019-04-11 20:59:29 +02:00
addrconf_core.c ipv6: Pass fib6_result to fib lookups 2019-04-17 23:10:47 -07:00
addrconf.c netlink: make validation more configurable for future strictness 2019-04-27 17:07:21 -04:00
addrlabel.c netlink: make validation more configurable for future strictness 2019-04-27 17:07:21 -04:00
af_inet6.c net: rework SIOCGSTAMP ioctl handling 2019-04-19 14:07:40 -07:00
ah6.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2017-11-15 11:56:19 -08:00
anycast.c net/ipv6: compute anycast address hash only if dev is null 2018-11-08 17:04:43 -08:00
calipso.c ipv6: make ipv6_renew_options() interrupt/kernel safe 2018-07-05 20:15:26 +09:00
datagram.c net: Treat sock->sk_drops as an unsigned int when printing 2019-05-19 10:31:10 -07:00
esp6_offload.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2019-05-02 22:14:21 -04:00
esp6.c esp: Skip TX bytes accounting when sending from a request socket 2019-01-28 11:20:58 +01:00
exthdrs_core.c net: ipv6: Fix typo in ipv6_find_hdr() documentation 2018-05-07 23:50:27 -04:00
exthdrs_offload.c
exthdrs.c ipv6: make ipv6_renew_options() interrupt/kernel safe 2018-07-05 20:15:26 +09:00
fib6_notifier.c net: Add module reference to FIB notifiers 2017-09-01 20:33:42 -07:00
fib6_rules.c ipv6: Use result arg in fib_lookup_arg consistently 2019-04-23 21:53:11 -07:00
fou6.c fou, fou6: avoid uninit-value in gue_err() and gue6_err() 2019-03-08 15:19:53 -08:00
icmp.c ipv6: Add rate limit mask for ICMPv6 messages 2019-04-18 16:58:37 -07:00
inet6_connection_sock.c
inet6_hashtables.c net: tcp6: prefer listeners bound to an address 2018-12-14 15:55:20 -08:00
ip6_checksum.c net: udp: fix handling of CHECKSUM_COMPLETE packets 2018-10-24 14:18:16 -07:00
ip6_fib.c ipv6: prevent possible fib6 leaks 2019-05-16 12:21:00 -07:00
ip6_flowlabel.c ipv6/flowlabel: wait rcu grace period before put_pid() 2019-04-29 23:30:13 -04:00
ip6_gre.c net: ip6_gre: fix possible use-after-free in ip6erspan_rcv 2019-04-08 16:16:47 -07:00
ip6_icmp.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
ip6_input.c net: use indirect calls helpers at early demux stage 2019-05-05 10:38:04 -07:00
ip6_offload.c gso: validate gso_type on ipip style tunnels 2019-02-20 11:24:27 -08:00
ip6_offload.h
ip6_output.c neighbor: Add skip_cache argument to neigh_output 2019-04-08 15:22:41 -07:00
ip6_tunnel.c ip6_tunnel: Match to ARPHRD_TUNNEL6 for dev type 2019-04-02 13:19:34 -07:00
ip6_udp_tunnel.c net/ipv6/udp_tunnel: prefer SO_BINDTOIFINDEX over SO_BINDTODEVICE 2019-01-17 14:55:52 -08:00
ip6_vti.c xfrm: store xfrm_mode directly, not its address 2019-04-08 09:15:28 +02:00
ip6mr.c rhashtable: use bit_spin_locks to protect hash bucket. 2019-04-07 19:12:12 -07:00
ipcomp6.c
ipv6_sockglue.c net: ipv6: add socket option IPV6_ROUTER_ALERT_ISOLATE 2019-03-03 21:05:10 -08:00
Kconfig xfrm: make xfrm modes builtin 2019-04-08 09:15:17 +02:00
Makefile xfrm: make xfrm modes builtin 2019-04-08 09:15:17 +02:00
mcast_snoop.c net: remove unneeded switch fall-through 2019-02-21 13:48:00 -08:00
mcast.c bridge: join all-snoopers multicast address 2019-01-22 17:18:08 -08:00
mip6.c
ndisc.c net ipv6: Prevent neighbor add if protocol is disabled on device 2019-04-17 23:19:07 -07:00
netfilter.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next 2019-02-18 11:38:30 -08:00
output_core.c inet: switch IP ID generator to siphash 2019-03-27 14:29:26 -07:00
ping.c ipv6: fold sockcm_cookie into ipcm6_cookie 2018-07-07 10:58:49 +09:00
proc.c proc: introduce proc_create_net_single 2018-05-16 07:24:30 +02:00
protocol.c net: Add sysctl to toggle early demux for tcp and udp 2017-03-24 13:17:07 -07:00
raw.c net: rework SIOCGSTAMP ioctl handling 2019-04-19 14:07:40 -07:00
reassembly.c net: remove unused struct inet_frag_queue.fragments field 2019-02-26 08:27:05 -08:00
route.c ipv6: fix src addr routing with the exception table 2019-05-16 14:30:53 -07:00
seg6_hmac.c Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net 2018-07-03 10:29:26 +09:00
seg6_iptunnel.c netlink: make validation more configurable for future strictness 2019-04-27 17:07:21 -04:00
seg6_local.c netlink: make validation more configurable for future strictness 2019-04-27 17:07:21 -04:00
seg6.c genetlink: optionally validate strictly/dumps 2019-04-27 17:07:22 -04:00
sit.c vrf: sit mtu should not be updated when vrf netdev is the link 2019-05-07 12:19:19 -07:00
syncookies.c net/ipv4: disable SMC TCP option with SYN Cookies 2018-03-25 20:53:54 -04:00
sysctl_net_ipv6.c ipv6: sr: Compute flowlabel for outer IPv6 header of seg6 encap mode 2018-04-25 13:02:15 -04:00
tcp_ipv6.c net: use indirect calls helpers at early demux stage 2019-05-05 10:38:04 -07:00
tcpv6_offload.c net: use indirect call wrappers at GRO transport layer 2018-12-15 13:23:02 -08:00
tunnel6.c net: Convert protocol error handlers from void to int 2018-11-08 17:13:08 -08:00
udp_impl.h udp6: add missing rehash callback to udplite 2019-01-17 15:01:08 -08:00
udp_offload.c net: use indirect call wrappers at GRO transport layer 2018-12-15 13:23:02 -08:00
udp.c bpf: fix unconnected udp hooks 2019-06-06 16:53:12 -07:00
udplite.c udp6: add missing rehash callback to udplite 2019-01-17 15:01:08 -08:00
xfrm6_input.c net: use skb_sec_path helper in more places 2018-12-19 11:21:37 -08:00
xfrm6_output.c xfrm: store xfrm_mode directly, not its address 2019-04-08 09:15:28 +02:00
xfrm6_policy.c xfrm: remove decode_session indirection from afinfo_policy 2019-04-23 07:42:20 +02:00
xfrm6_protocol.c xfrm: remove unneeded export_symbols 2019-04-23 07:42:20 +02:00
xfrm6_state.c xfrm: remove VLA usage in __xfrm6_sort() 2018-04-26 07:51:48 +02:00
xfrm6_tunnel.c xfrm: clean up xfrm protocol checks 2019-03-26 08:35:36 +01:00