linux/net
Daniel Borkmann 983695fa67 bpf: fix unconnected udp hooks
The intention of the cgroup bind/connect/sendmsg BPF hooks is to act
transparently to applications, as also stated in the original motivation in
7828f20e37 ("Merge branch 'bpf-cgroup-bind-connect'"). When recently
integrating the latter two hooks into Cilium to enable host-based
load balancing with Kubernetes, I ran into the issue that pods couldn't
start up because DNS was broken. Kubernetes typically sets up DNS as a
service, so it is subject to load balancing.

Upon further debugging, it turns out that the cgroupv2 sendmsg BPF hooks API
is currently insufficient and thus not usable as-is for standard applications
shipped with most distros. To break down the issue we ran into, consider a
simple example:

  # cat /etc/resolv.conf
  nameserver 147.75.207.207
  nameserver 147.75.207.208

For the purpose of a simple test, we set up the above IPs as service IPs and
transparently redirect traffic to a different DNS backend server for that
node:

  # cilium service list
  ID   Frontend            Backend
  1    147.75.207.207:53   1 => 8.8.8.8:53
  2    147.75.207.208:53   1 => 8.8.8.8:53

The attached BPF program selects one of the backends if the service IP/port
matches on the cgroup hook. DNS breaks here because the hooks are not
transparent enough to applications that have built-in msg_name address
checks:

  # nslookup 1.1.1.1
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
  [...]
  ;; connection timed out; no servers could be reached

  # dig 1.1.1.1
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
  [...]

  ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
  ;; global options: +cmd
  ;; connection timed out; no servers could be reached
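
The "reply from unexpected source" errors above come from a client-side
comparison of the reply's source address against the server the query was
sent to. The following is a minimal sketch of such a check, not the actual
nslookup or dig code. The helper names make_addr() and
reply_source_matches() are hypothetical and exist only for illustration:

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>

/* Build an IPv4 socket address from dotted-quad text and a host-order port.
 * Hypothetical helper, used only to keep the example short. */
static struct sockaddr_in make_addr(const char *ip, uint16_t port)
{
	struct sockaddr_in sa;
	int ok;

	memset(&sa, 0, sizeof(sa));
	sa.sin_family = AF_INET;
	sa.sin_port = htons(port);
	ok = inet_pton(AF_INET, ip, &sa.sin_addr);
	assert(ok == 1);	/* the sketch only handles valid IPv4 text */
	(void)ok;
	return sa;
}

/* Returns 1 if the reply came from the address the query was sent to,
 * 0 otherwise. "from" stands for what recvfrom()/recvmsg() filled into
 * msg_name. */
static int reply_source_matches(struct sockaddr_in expected,
				struct sockaddr_in from)
{
	return expected.sin_family == from.sin_family &&
	       expected.sin_port == from.sin_port &&
	       expected.sin_addr.s_addr == from.sin_addr.s_addr;
}
```

With the sendmsg hook alone, the query goes to 8.8.8.8 while the client still
expects 147.75.207.207, so a check of this kind rejects every reply.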

For comparison, if none of the service IPs is used and we tell nslookup
to use 8.8.8.8 directly, it works just fine, of course:

  # nslookup 1.1.1.1 8.8.8.8
  1.1.1.1.in-addr.arpa	name = one.one.one.one.

In order to fix this and thus act more transparently to the application,
reverse translation is needed on the recvmsg() side. A minimal fix for this
API is to add similar recvmsg() hooks behind the BPF cgroups static key
such that the program can track state and replace the current sockaddr_in{,6}
with the original service IP. On the BPF side, one way to do this is to
track the service tuple plus socket cookie in an LRU map, from which the
reverse NAT can then be retrieved via the map value. Side note: the BPF
cgroups static key should be converted to a per-hook static key in the
future.
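
The state tracking described above can be sketched in plain userspace C.
This is an illustrative model only, not the BPF program used in Cilium: the
names addr4, rnat_table, on_sendmsg() and on_recvmsg() are hypothetical, a
small fixed-size array with round-robin eviction stands in for a BPF LRU
map, and addresses are kept in host byte order for simplicity:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* IPv4 address and port, both in host byte order to keep the sketch simple. */
struct addr4 {
	uint32_t ip;
	uint16_t port;
};

/* One reverse-NAT entry: (socket cookie, backend) -> original service. */
struct rnat_entry {
	int used;
	uint64_t cookie;
	struct addr4 backend;
	struct addr4 service;
};

#define RNAT_SLOTS 4

/* Tiny table standing in for an LRU map (assumption: round-robin eviction). */
struct rnat_table {
	struct rnat_entry slot[RNAT_SLOTS];
	size_t next;
};

static int addr4_eq(struct addr4 a, struct addr4 b)
{
	return a.ip == b.ip && a.port == b.port;
}

static struct rnat_entry *rnat_find(struct rnat_table *t, uint64_t cookie,
				    struct addr4 backend)
{
	for (size_t i = 0; i < RNAT_SLOTS; i++) {
		struct rnat_entry *e = &t->slot[i];

		if (e->used && e->cookie == cookie &&
		    addr4_eq(e->backend, backend))
			return e;
	}
	return NULL;
}

/* sendmsg side: if *dst is the service address, rewrite it to the backend
 * and remember the mapping. Always returns 1, the "allow" verdict. */
static int on_sendmsg(struct rnat_table *t, uint64_t cookie, struct addr4 *dst,
		      struct addr4 service, struct addr4 backend)
{
	struct rnat_entry *e;

	assert(t && dst);
	if (!addr4_eq(*dst, service))
		return 1;	/* not a service address, leave untouched */

	e = rnat_find(t, cookie, backend);
	if (!e) {
		/* No entry yet: take the next slot, evicting its old content. */
		e = &t->slot[t->next];
		t->next = (t->next + 1) % RNAT_SLOTS;
	}
	e->used = 1;
	e->cookie = cookie;
	e->backend = backend;
	e->service = service;
	*dst = backend;
	return 1;
}

/* recvmsg side: if (cookie, *src) is tracked, restore the service address. */
static int on_recvmsg(struct rnat_table *t, uint64_t cookie, struct addr4 *src)
{
	struct rnat_entry *e;

	assert(t && src);
	e = rnat_find(t, cookie, *src);
	if (e)
		*src = e->service;
	return 1;
}
```

Keying on the socket cookie in addition to the backend address keeps the
reverse translation scoped to the socket that sent the query; a socket that
talks to 8.8.8.8 directly sees its replies unmodified.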

Same example after this fix:

  # cilium service list
  ID   Frontend            Backend
  1    147.75.207.207:53   1 => 8.8.8.8:53
  2    147.75.207.208:53   1 => 8.8.8.8:53

Lookups work fine now:

  # nslookup 1.1.1.1
  1.1.1.1.in-addr.arpa    name = one.one.one.one.

  Authoritative answers can be found from:

  # dig 1.1.1.1

  ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
  ;; global options: +cmd
  ;; Got answer:
  ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 51550
  ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

  ;; OPT PSEUDOSECTION:
  ; EDNS: version: 0, flags:; udp: 512
  ;; QUESTION SECTION:
  ;1.1.1.1.                       IN      A

  ;; AUTHORITY SECTION:
  .                       23426   IN      SOA     a.root-servers.net. nstld.verisign-grs.com. 2019052001 1800 900 604800 86400

  ;; Query time: 17 msec
  ;; SERVER: 147.75.207.207#53(147.75.207.207)
  ;; WHEN: Tue May 21 12:59:38 UTC 2019
  ;; MSG SIZE  rcvd: 111

At the packet level, tcpdump shows that we're using the backend server
when talking via the 147.75.207.20{7,8} frontend:

  # tcpdump -i any udp
  [...]
  12:59:52.698732 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
  12:59:52.698735 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
  12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
  12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
  [...]

In order to be flexible and to have the same semantics as in sendmsg BPF
programs, we only allow return codes in the [1,1] range. In the sendmsg case,
the program is called if msg->msg_name is present, which can be the case
in both connected and unconnected UDP.

The former only relies on the sockaddr_in{,6} passed via connect(2) if
the passed msg->msg_name was NULL. Therefore, on the recvmsg side, we act
in a similar way and call into the BPF program whenever a non-NULL
msg->msg_name was passed, independent of sk->sk_state being TCP_ESTABLISHED
or not. Note that for the TCP case, msg->msg_name is ignored in the regular
recvmsg path and is therefore not relevant.

For the ip{,v6}_recv_error() paths, picked up via MSG_ERRQUEUE, the hook
is not called. This is intentional, as it aligns with the semantics of the
TCP cgroup BPF hooks right now. This might be better addressed in the
future through a different bpf_attach_type, for example, such that this
case can be distinguished from the regular recvmsg paths.
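
The rules above for when the UDP recvmsg hook runs, and which return codes
a program may use, can be summarized in a small model. This is a sketch of
the described semantics only, not kernel code; recvmsg_hook_runs() and
retcode_allowed() are hypothetical names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True if the recvmsg hook would run for a UDP recvmsg() call.
 *   msg_name:      caller-supplied address buffer, or NULL
 *   from_errqueue: true for the MSG_ERRQUEUE (ip{,v6}_recv_error()) path
 *   connected:     true if sk->sk_state is TCP_ESTABLISHED
 */
static bool recvmsg_hook_runs(const void *msg_name, bool from_errqueue,
			      bool connected)
{
	(void)connected;	/* deliberately ignored: both states are treated alike */

	if (from_errqueue)
		return false;	/* the error queue path never calls the hook */
	return msg_name != NULL;
}

/* True if ret lies in the permitted [1,1] range, i.e. only 1 is allowed. */
static bool retcode_allowed(int ret)
{
	assert(ret >= 0);	/* the sketch only considers non-negative codes */
	return ret >= 1 && ret <= 1;
}
```

For instance, a connected UDP socket that passes a non-NULL msg_name still
triggers the hook, while the same call reading from the error queue does not.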

Fixes: 1cedee13d2 ("bpf: Hooks for sys_sendmsg")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-06-06 16:53:12 -07:00
6lowpan 6lowpan: Off by one handling ->nexthdr 2019-04-23 19:09:58 +02:00
9p 9p/net: fix memory leak in p9_client_create 2019-03-13 11:50:04 +01:00
802
8021q vlan: Mark expected switch fall-through 2019-05-20 11:38:55 -07:00
appletalk Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2019-05-02 22:14:21 -04:00
atm net: atm: clean up a range check 2019-05-05 10:25:52 -07:00
ax25 net: ax25: fix misuse of %x 2019-04-21 10:37:26 -07:00
batman-adv This feature/cleanup patchset includes the following patches: 2019-05-09 09:44:17 -07:00
bluetooth Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2019-05-07 22:03:58 -07:00
bpf bpf: Introduce bpf sk local storage 2019-04-27 09:07:04 -07:00
bpfilter treewide: prefix header search paths with $(srctree)/ 2019-05-18 11:49:57 +09:00
bridge Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf 2019-05-13 08:55:15 -07:00
caif net: caif: fix the value of size argument of snprintf 2019-05-17 11:31:15 -07:00
can netlink: make validation more configurable for future strictness 2019-04-27 17:07:21 -04:00
ceph AFS fixes 2019-05-16 17:00:13 -07:00
core bpf: fix unconnected udp hooks 2019-06-06 16:53:12 -07:00
dcb netlink: make validation more configurable for future strictness 2019-04-27 17:07:21 -04:00
dccp net: dccp : proto: remove Unneeded variable "err" 2019-05-12 13:21:30 -07:00
decnet netlink: make validation more configurable for future strictness 2019-04-27 17:07:21 -04:00
dns_resolver dns_resolver: Allow used keys to be invalidated 2019-05-15 17:35:54 +01:00
dsa net: dsa: Initialize DSA_SKB_CB(skb)->deferred_xmit variable 2019-05-12 13:19:46 -07:00
ethernet net: ethernet: support of_get_mac_address new ERR_PTR error 2019-05-07 12:22:47 -07:00
hsr genetlink: optionally validate strictly/dumps 2019-04-27 17:07:22 -04:00
ieee802154 genetlink: optionally validate strictly/dumps 2019-04-27 17:07:22 -04:00
ife
ipv4 bpf: fix unconnected udp hooks 2019-06-06 16:53:12 -07:00
ipv6 bpf: fix unconnected udp hooks 2019-06-06 16:53:12 -07:00
iucv
kcm kcm: switch order of device registration to fix a crash 2019-04-01 14:59:20 -07:00
key xfrm: clean up xfrm protocol checks 2019-03-26 08:35:36 +01:00
l2tp Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2019-05-07 17:22:09 -07:00
l3mdev
lapb
llc llc: Check address length before reading address field 2019-04-12 10:25:03 -07:00
mac80211 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2019-05-02 22:14:21 -04:00
mac802154
mpls netlink: make validation more configurable for future strictness 2019-04-27 17:07:21 -04:00
ncsi genetlink: optionally validate strictly/dumps 2019-04-27 17:07:22 -04:00
netfilter net: replace CONFIG_DEBUG_KERNEL with CONFIG_DEBUG_MISC 2019-05-14 19:52:50 -07:00
netlabel genetlink: optionally validate strictly/dumps 2019-04-27 17:07:22 -04:00
netlink net: Treat sock->sk_drops as an unsigned int when printing 2019-05-19 10:31:10 -07:00
netrom net: rework SIOCGSTAMP ioctl handling 2019-04-19 14:07:40 -07:00
nfc genetlink: optionally validate strictly/dumps 2019-04-27 17:07:22 -04:00
nsh
openvswitch openvswitch: Replace removed NF_NAT_NEEDED with IS_ENABLED(CONFIG_NF_NAT) 2019-05-08 09:43:15 -07:00
packet packet: Fix error path in packet_init 2019-05-09 13:45:46 -07:00
phonet net: Treat sock->sk_drops as an unsigned int when printing 2019-05-19 10:31:10 -07:00
psample genetlink: optionally validate strictly/dumps 2019-04-27 17:07:22 -04:00
qrtr net: qrtr: Fix message type of outgoing packets 2019-05-20 20:50:31 -04:00
rds mm/gup: change GUP fast to use flags rather than a write 'bool' 2019-05-14 09:47:46 -07:00
rfkill *: convert stream-like files from nonseekable_open -> stream_open 2019-05-06 17:46:41 +03:00
rose Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2019-04-25 23:52:29 -04:00
rxrpc rxrpc: Allow the kernel to mark a call as being non-interruptible 2019-05-16 16:25:20 +01:00
sched net/sched: avoid double free on matchall reoffload 2019-05-08 16:34:58 -07:00
sctp Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2019-05-07 22:03:58 -07:00
smc 5.2 Merge Window pull request 2019-05-09 09:02:46 -07:00
strparser net: strparser: make it explicitly non-modular 2019-04-22 21:50:54 -07:00
sunrpc This pull consists mostly of nfsd container work: 2019-05-15 18:21:43 -07:00
switchdev
tipc tipc: fix modprobe tipc failed after switch order of device registration 2019-05-20 10:45:43 -07:00
tls net/tls: handle errors from padding_length() 2019-05-09 16:37:39 -07:00
unix datagram: remove rendundant 'peeked' argument 2019-04-08 09:51:54 -07:00
vmw_vsock vsock/virtio: Initialize core virtio vsock before registering the driver 2019-05-18 10:50:28 -07:00
wimax genetlink: optionally validate strictly/dumps 2019-04-27 17:07:22 -04:00
wireless Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2019-05-07 22:03:58 -07:00
x25 net: rework SIOCGSTAMP ioctl handling 2019-04-19 14:07:40 -07:00
xdp mm/gup: replace get_user_pages_longterm() with FOLL_LONGTERM 2019-05-14 09:47:45 -07:00
xfrm xfrm: ressurrect "Fix uninitialized memory read in _decode_session4" 2019-05-16 14:14:47 -07:00
compat.c net: rework SIOCGSTAMP ioctl handling 2019-04-19 14:07:40 -07:00
Kconfig net: devlink: select NET_DEVLINK from drivers 2019-03-24 14:55:31 -04:00
Makefile
socket.c net: fix kernel-doc warnings for socket.c 2019-05-19 10:33:22 -07:00
sysctl_net.c