linux/net/core
Daniel Borkmann f318903c0b bpf: Add netns cookie and enable it for bpf cgroup hooks
In Cilium we're mainly using BPF cgroup hooks today in order to implement
kube-proxy free Kubernetes service translation for ClusterIP, NodePort (*),
ExternalIP, and LoadBalancer as well as HostPort mapping [0] for all traffic
between Cilium managed nodes. While this works in its current shape and avoids
packet-level NAT for inter Cilium managed node traffic, there is one major
limitation we're facing today, that is, lack of netns awareness.

In Kubernetes, the concept of Pods (which hold one or multiple containers)
has been built around network namespaces, so while we can use the global scope
of attaching to root BPF cgroup hooks also to our advantage (e.g. for exposing
NodePort ports on loopback addresses), we also have the need to differentiate
between initial network namespaces and non-initial one. For example, ExternalIP
services mandate that non-local service IPs are not to be translated from the
host (initial) network namespace as one example. Right now, we have an ugly
work-around in place where non-local service IPs for ExternalIP services are
not xlated from connect() and friends BPF hooks but instead via less efficient
packet-level NAT on the veth tc ingress hook for Pod traffic.

On top of determining whether we're in initial or non-initial network namespace
we also have a need for a socket-cookie like mechanism for network namespaces
scope. Socket cookies have the nice property that they can be combined as part
of the key structure e.g. for BPF LRU maps without having to worry that the
cookie could be recycled. We are planning to use this for our sessionAffinity
implementation for services. Therefore, add a new bpf_get_netns_cookie() helper
which would resolve both use cases at once: bpf_get_netns_cookie(NULL) would
provide the cookie for the initial network namespace while passing the context
instead of NULL would provide the cookie from the application's network namespace.
We're using a hole, so no size increase; the assignment happens only once.
Therefore this allows for a comparison on initial namespace as well as regular
cookie usage as we have today with socket cookies. We could later on enable
this helper for other program types as well as we would see need.

  (*) Both externalTrafficPolicy={Local|Cluster} types
  [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/c47d2346982693a9cf9da0e12690453aded4c788.1585323121.git.daniel@iogearbox.net
2020-03-27 19:40:38 -07:00
..
bpf_sk_storage.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next 2020-02-29 15:53:35 -08:00
datagram.c net: datagram: drop 'destructor' argument from several helpers 2020-02-28 12:12:53 -08:00
datagram.h
dev_addr_lists.c net: remove unnecessary variables and callback 2019-10-24 14:53:49 -07:00
dev_ioctl.c net: Introduce peer to peer one step PTP time stamping. 2019-12-25 19:51:34 -08:00
dev.c Merge branch 'ct-offload' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux 2020-03-12 12:34:23 -07:00
devlink.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2020-03-12 22:34:48 -07:00
drop_monitor.c net: core: Replace zero-length array with flexible-array member 2020-02-28 12:08:37 -08:00
dst_cache.c
dst.c net: print proper warning on dst underflow 2019-09-26 09:05:56 +02:00
failover.c
fib_notifier.c net: fib_notifier: propagate extack down to the notifier block callback 2019-10-04 11:10:56 -07:00
fib_rules.c net: fib_rules: Correctly set table field when table number exceeds 8 bits 2020-02-16 18:38:24 -08:00
filter.c bpf: Add netns cookie and enable it for bpf cgroup hooks 2020-03-27 19:40:38 -07:00
flow_dissector.c bpf: Use bpf_prog_run_pin_on_cpu() at simple call sites. 2020-02-24 16:20:09 -08:00
flow_offload.c flow_offload: Add flow_match_ct to get rule ct match 2020-03-12 15:00:39 -07:00
gen_estimator.c net_sched: gen_estimator: extend packet counter to 64bit 2019-11-06 21:51:36 -08:00
gen_stats.c net_sched: add TCA_STATS_PKT64 attribute 2019-11-05 18:20:55 -08:00
gro_cells.c
hwbm.c
link_watch.c
lwt_bpf.c net: ipv6_stub: use ip6_dst_lookup_flow instead of ip6_dst_lookup 2019-12-04 12:27:13 -08:00
lwtunnel.c
Makefile ethtool: move to its own directory 2019-12-12 17:07:05 -08:00
neighbour.c net: neigh: remove unused NEIGH_SYSCTL_MS_JIFFIES_ENTRY 2020-02-20 10:02:23 -08:00
net_namespace.c bpf: Add netns cookie and enable it for bpf cgroup hooks 2020-03-27 19:40:38 -07:00
net-procfs.c net: procfs: use index hashlist instead of name hashlist 2019-10-01 14:47:19 -07:00
net-sysfs.c net-sysfs: add queue_change_owner() 2020-02-26 20:07:26 -08:00
net-sysfs.h net-sysfs: add netdev_change_owner() 2020-02-26 20:07:25 -08:00
net-traces.c
netclassid_cgroup.c cgroup, netclassid: periodically release file_lock on classid updating 2020-03-09 18:13:39 -07:00
netevent.c
netpoll.c net: fix skb use after free in netpoll 2019-08-27 20:52:02 -07:00
netprio_cgroup.c netprio: use css ID instead of cgroup ID 2019-11-12 08:18:03 -08:00
page_pool.c net: page_pool: API cleanup and comments 2020-02-20 10:09:25 -08:00
pktgen.c pktgen: Allow on loopback device 2020-03-10 15:44:59 -07:00
ptp_classifier.c
request_sock.c tcp: add rcu protection around tp->fastopen_rsk 2019-10-13 10:13:08 -07:00
rtnetlink.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2020-02-21 13:39:34 -08:00
scm.c y2038: socket: remove timespec reference in timestamping 2019-11-15 14:38:29 +01:00
secure_seq.c
skbuff.c net: core: Distribute switch variables for initialization 2020-02-20 10:00:19 -08:00
skmsg.c bpf: Use bpf_prog_run_pin_on_cpu() at simple call sites. 2020-02-24 16:20:09 -08:00
sock_diag.c sock: make cookie generation global instead of per netns 2019-08-09 13:14:46 -07:00
sock_map.c bpf: sockmap: Add UDP support 2020-03-09 22:34:58 +01:00
sock_reuseport.c net: Generate reuseport group ID on group creation 2020-02-21 22:29:45 +01:00
sock.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2020-03-12 22:34:48 -07:00
stream.c tcp: make sure EPOLLOUT wont be missed 2019-08-19 13:07:43 -07:00
sysctl_net_core.c net, sysctl: Fix compiler warning when only cBPF is present 2019-12-19 17:17:51 +01:00
timestamping.c net: Introduce a new MII time stamping interface. 2019-12-25 19:51:33 -08:00
tso.c net: Use skb accessors in network core 2019-07-22 20:47:56 -07:00
utils.c net: Fix skb->csum update in inet_proto_csum_replace16(). 2020-01-24 20:54:30 +01:00
xdp.c net: page_pool: API cleanup and comments 2020-02-20 10:09:25 -08:00