linux/net/core
Daniel Borkmann 8520e224f5 bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode
Fix cgroup v1 interference when non-root cgroup v2 BPF programs are used.
Back in the days, commit bd1060a1d6 ("sock, cgroup: add sock->sk_cgroup")
embedded per-socket cgroup information into sock->sk_cgrp_data and in order
to save 8 bytes in struct sock made both mutually exclusive, that is, when
cgroup v1 socket tagging (e.g. net_cls/net_prio) is used, then cgroup v2
falls back to the root cgroup in sock_cgroup_ptr() (&cgrp_dfl_root.cgrp).

The assumption made was "there is no reason to mix the two and this is in line
with how legacy and v2 compatibility is handled" as stated in bd1060a1d6.
However, with Kubernetes more widely supporting cgroups v2 as well nowadays,
this assumption no longer holds, and the possibility of the v1/v2 mixed mode
with the v2 root fallback being hit becomes a real security issue.

Many of the cgroup v2 BPF programs are also used for policy enforcement, just
to pick _one_ example, that is, to programmatically deny socket related system
calls like connect(2) or bind(2). A v2 root fallback would implicitly cause
a policy bypass for the affected Pods.

In production environments, we have recently seen this case due to various
circumstances: i) a different 3rd party agent and/or ii) a container runtime
such as [0] in the user's environment configuring legacy cgroup v1 net_cls
tags, which triggered implicitly mentioned root fallback. Another case is
Kubernetes projects like kind [1] which create Kubernetes nodes in a container
and also add cgroup namespaces to the mix, meaning programs which are attached
to the cgroup v2 root of the cgroup namespace get attached to a non-root
cgroup v2 path from init namespace point of view. And the latter's root is
out of reach for agents on a kind Kubernetes node to configure. Meaning, any
entity on the node setting cgroup v1 net_cls tag will trigger the bypass
despite cgroup v2 BPF programs attached to the namespace root.

Generally, this mutual exclusiveness does not hold anymore in today's user
environments and makes cgroup v2 usage from BPF side fragile and unreliable.
This fix adds proper struct cgroup pointer for the cgroup v2 case to struct
sock_cgroup_data in order to address these issues; this implicitly also fixes
the tradeoffs being made back then with regards to races and refcount leaks
as stated in bd1060a1d6, and removes the fallback, so that cgroup v2 BPF
programs always operate as expected.

  [0] https://github.com/nestybox/sysbox/
  [1] https://kind.sigs.k8s.io/

Fixes: bd1060a1d6 ("sock, cgroup: add sock->sk_cgroup")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/bpf/20210913230759.2313-1-daniel@iogearbox.net
2021-09-13 16:35:58 -07:00
..
bpf_sk_storage.c net: in_irq() cleanup 2021-08-13 14:09:19 -07:00
datagram.c udp: fix skb_copy_and_csum_datagram with odd segment sizes 2021-02-04 18:56:56 -08:00
datagram.h
dev_addr_lists.c net-next: When a bond have a massive amount of VLANs with IPv6 addresses, performance of changing link state, attaching a VRF, changing an IPv6 address, etc. go down dramtically. 2021-08-25 10:29:07 +01:00
dev_ioctl.c net: core: don't call SIOCBRADD/DELIF for non-bridge devices 2021-08-05 11:36:59 +01:00
dev.c net: in_irq() cleanup 2021-08-13 14:09:19 -07:00
devlink.c devlink: Clear whole devlink_flash_notify struct 2021-08-14 13:59:10 +01:00
drop_monitor.c net: Remove redundant if statements 2021-08-05 13:27:50 +01:00
dst_cache.c
dst.c net: Remove redundant if statements 2021-08-05 13:27:50 +01:00
failover.c
fib_notifier.c
fib_rules.c memcg: enable accounting for IP address and routing-related objects 2021-07-20 06:00:38 -07:00
filter.c bpf: tcp: Allow bpf-tcp-cc to call bpf_(get|set)sockopt 2021-08-25 17:40:35 -07:00
flow_dissector.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-07-31 09:14:46 -07:00
flow_offload.c net: Fix offloading indirect devices dependency on qdisc order creation 2021-08-19 13:19:30 +01:00
gen_estimator.c net_sched: gen_estimator: support large ewma log 2021-01-15 18:11:06 -08:00
gen_stats.c docs: networking: convert gen_stats.txt to ReST 2020-04-28 14:39:46 -07:00
gro_cells.c gro_cells: reduce number of synchronize_net() calls 2020-11-25 11:28:12 -08:00
hwbm.c
link_watch.c net: linkwatch: fix failure to restore device state across suspend/resume 2021-08-11 14:43:16 -07:00
lwt_bpf.c lwt_bpf: Replace preempt_disable() with migrate_disable() 2020-12-07 11:53:40 -08:00
lwtunnel.c netfilter: add netfilter hooks to SRv6 data plane 2021-08-30 01:51:36 +02:00
Makefile sock_map: Relax config dependency to CONFIG_NET 2021-07-15 18:17:49 -07:00
neighbour.c net: Support filtering interfaces on no master 2021-08-10 16:03:34 -07:00
net_namespace.c net: net_namespace: Optimize the code 2021-08-18 10:34:48 +01:00
net-procfs.c net: procfs: add seq_puts() statement for dev_mcast 2021-08-18 10:13:20 +01:00
net-sysfs.c net-sysfs: remove possible sleep from an RCU read-side critical section 2021-03-22 13:28:13 -07:00
net-sysfs.h net-sysfs: add netdev_change_owner() 2020-02-26 20:07:25 -08:00
net-traces.c tcp: add tracepoint for checksum errors 2021-05-14 15:26:03 -07:00
netclassid_cgroup.c bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode 2021-09-13 16:35:58 -07:00
netevent.c net: core: Correct function name netevent_unregister_notifier() in the kerneldoc 2021-03-28 17:56:56 -07:00
netpoll.c asm-generic/unaligned: Unify asm/unaligned.h around struct helper 2021-07-02 12:43:40 -07:00
netprio_cgroup.c bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode 2021-09-13 16:35:58 -07:00
page_pool.c page_pool: use relaxed atomic for release side accounting 2021-08-24 10:46:31 +01:00
pktgen.c pktgen: Remove fill_imix_distribution() CONFIG_XFRM dependency 2021-08-18 11:41:13 +01:00
ptp_classifier.c bpf: Refactor BPF_PROG_RUN into a function 2021-08-17 00:45:07 +02:00
request_sock.c
rtnetlink.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-08-26 17:57:57 -07:00
scm.c memcg: enable accounting for scm_fp_list objects 2021-07-20 06:00:38 -07:00
secure_seq.c crypto: lib/sha1 - remove unnecessary includes of linux/cryptohash.h 2020-05-08 15:32:17 +10:00
selftests.c net: selftests: add MTU test 2021-07-22 00:52:04 -07:00
skbuff.c net: in_irq() cleanup 2021-08-13 14:09:19 -07:00
skmsg.c bpf, sockmap: Fix memleak on ingress msg enqueue 2021-07-27 14:55:30 -07:00
sock_diag.c bpf, net: Rework cookie generator as per-cpu one 2020-09-30 11:50:35 -07:00
sock_map.c af_unix: Add unix_stream_proto for sockmap 2021-08-16 18:43:39 -07:00
sock_reuseport.c tcp: Add stats for socket migration. 2021-06-23 12:56:08 -07:00
sock.c sock: remove one redundant SKB_FRAG_PAGE_ORDER macro 2021-08-26 10:46:20 +01:00
stream.c
sysctl_net_core.c net: change netdev_unregister_timeout_secs min value to 1 2021-03-25 17:24:06 -07:00
timestamping.c net: Introduce a new MII time stamping interface. 2019-12-25 19:51:33 -08:00
tso.c net: tso: add UDP segmentation support 2020-06-18 20:46:23 -07:00
utils.c net: Fix skb->csum update in inet_proto_csum_replace16(). 2020-01-24 20:54:30 +01:00
xdp.c xdp: Move the rxq_info.mem clearing to unreg_mem_model() 2021-06-28 23:07:59 +02:00