linux

Daniel Borkmann 8520e224f5 bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode

Fix cgroup v1 interference when non-root cgroup v2 BPF programs are used.
Back in the day, commit bd1060a1d6 ("sock, cgroup: add sock->sk_cgroup")
embedded per-socket cgroup information into sock->sk_cgrp_data and in order
to save 8 bytes in struct sock made both mutually exclusive, that is, when
cgroup v1 socket tagging (e.g. net_cls/net_prio) is used, then cgroup v2
falls back to the root cgroup in sock_cgroup_ptr() (&cgrp_dfl_root.cgrp).

The assumption made was "there is no reason to mix the two and this is in line
with how legacy and v2 compatibility is handled" as stated in bd1060a1d6.
However, with Kubernetes more widely supporting cgroups v2 as well nowadays,
this assumption no longer holds, and the possibility of the v1/v2 mixed mode
with the v2 root fallback being hit becomes a real security issue.

Many of the cgroup v2 BPF programs are also used for policy enforcement, just
to pick _one_ example, that is, to programmatically deny socket related system
calls like connect(2) or bind(2). A v2 root fallback would implicitly cause
a policy bypass for the affected Pods.

In production environments, we have recently seen this case due to various
circumstances: i) a different 3rd party agent and/or ii) a container runtime
such as [0] in the user's environment configuring legacy cgroup v1 net_cls
tags, which implicitly triggered the mentioned root fallback. Another case is
Kubernetes projects like kind [1] which create Kubernetes nodes in a container
and also add cgroup namespaces to the mix, meaning programs which are attached
to the cgroup v2 root of the cgroup namespace get attached to a non-root
cgroup v2 path from init namespace point of view. And the latter's root is
out of reach for agents on a kind Kubernetes node to configure. Meaning, any
entity on the node setting cgroup v1 net_cls tag will trigger the bypass
despite cgroup v2 BPF programs attached to the namespace root.

Generally, this mutual exclusiveness does not hold anymore in today's user
environments and makes cgroup v2 usage from BPF side fragile and unreliable.
This fix adds a proper struct cgroup pointer for the cgroup v2 case to struct
sock_cgroup_data in order to address these issues; this implicitly also fixes
the tradeoffs being made back then with regard to races and refcount leaks
as stated in bd1060a1d6, and removes the fallback, so that cgroup v2 BPF
programs always operate as expected.
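
The difference between the two layouts can be illustrated with a small
user-space model. This is only a sketch under simplifying assumptions, not the
kernel code: struct old_skcd, struct new_skcd, the helper names, the low-bit
flag encoding and the example cgroup path are all hypothetical stand-ins for
struct sock_cgroup_data and sock_cgroup_ptr(). It shows how a cgroup v1
net_cls tag overwrites the shared word in the old layout, so the v2 lookup
falls back to the root, while the separate pointer in the new layout is
unaffected.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the kernel's struct cgroup; only a path is kept. */
struct cgroup {
	const char *path;
};

/* Models &cgrp_dfl_root.cgrp, the cgroup v2 root. */
static struct cgroup root_cgrp = { "/" };

/*
 * Simplified model of the old layout: a single 64-bit word holds either
 * a cgroup v2 pointer or cgroup v1 tag data. The low bit marks v1 data.
 */
struct old_skcd {
	uint64_t val;
};

static void old_set_cgroup(struct old_skcd *d, struct cgroup *cgrp)
{
	d->val = (uint64_t)(uintptr_t)cgrp;
}

/* Setting a v1 net_cls tag overwrites the v2 pointer. */
static void old_set_classid(struct old_skcd *d, uint32_t classid)
{
	d->val = ((uint64_t)classid << 32) | 1;
}

static struct cgroup *old_cgroup_ptr(const struct old_skcd *d)
{
	if (d->val & 1)		/* v1 data present: v2 falls back to root */
		return &root_cgrp;
	return d->val ? (struct cgroup *)(uintptr_t)d->val : &root_cgrp;
}

/* Simplified model of the new layout: v2 pointer and v1 tags coexist. */
struct new_skcd {
	struct cgroup *cgroup;	/* v2 */
	uint32_t classid;	/* v1 net_cls */
	uint16_t prioidx;	/* v1 net_prio */
};

static struct cgroup *new_cgroup_ptr(const struct new_skcd *d)
{
	return d->cgroup;
}

/* Walks through the mixed-mode case and prints what each model resolves. */
static void demo(void)
{
	struct cgroup pod = { "/kubepods/pod1" };	/* hypothetical path */
	struct old_skcd o = { 0 };
	struct new_skcd n = { &pod, 0, 0 };

	old_set_cgroup(&o, &pod);
	assert(old_cgroup_ptr(&o) == &pod);

	old_set_classid(&o, 0x100001);	/* some agent tags the socket via v1 */
	n.classid = 0x100001;

	printf("old: %s\n", old_cgroup_ptr(&o)->path);
	printf("new: %s\n", new_cgroup_ptr(&n)->path);
}
```

Calling demo() from a main() prints "old: /" and "new: /kubepods/pod1": in the
old model a BPF program attached to the pod's cgroup would no longer be
consulted for that socket, whereas in the new model it still is.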

  [0] https://github.com/nestybox/sysbox/
  [1] https://kind.sigs.k8s.io/

Fixes: bd1060a1d6 ("sock, cgroup: add sock->sk_cgroup")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/bpf/20210913230759.2313-1-daniel@iogearbox.net

2021-09-13 16:35:58 -07:00

bpf

bpf: Add oversize check before call kvcalloc()

2021-09-13 16:28:15 -07:00

cgroup

bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode

2021-09-13 16:35:58 -07:00

configs

drivers/char: remove /dev/kmem for good

2021-05-07 00:26:34 -07:00

debug

kernel: debug: Fix unreachable code in gdb_serial_stub()

2021-07-12 11:03:35 -05:00

dma

dma-mapping: handle vmalloc addresses in dma_common_{mmap,get_sgtable}

2021-07-16 11:30:26 +02:00

entry

tick/nohz: Only check for RCU deferred wakeup on user/guest entry when needed

2021-05-31 10:14:49 +02:00

events

Merge tag 'net-next-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

2021-08-31 16:43:06 -07:00

gcov

Kconfig: Introduce ARCH_WANTS_NO_INSTR and CC_HAS_NO_PROFILE_FN_ATTR

2021-06-22 11:07:18 -07:00

irq

Merge tag 'irq-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2021-08-30 14:38:37 -07:00

kcsan

kcsan: use u64 instead of cycles_t

2021-07-30 17:09:02 +02:00

livepatch

Merge tag 'livepatching-for-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching

2021-04-27 18:14:38 -07:00

locking

Merge tag 'locking-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2021-08-30 14:26:36 -07:00

power

Merge branches 'pm-pci', 'pm-sleep', 'pm-domains' and 'powercap'

2021-08-30 19:25:42 +02:00

printk

Merge tag 'printk-for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux

2021-06-29 12:07:18 -07:00

rcu

Merge tag 'locking-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2021-08-30 14:26:36 -07:00

sched

Merge tag 'pm-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

2021-08-31 13:21:58 -07:00

time

clocksource: Make clocksource watchdog test safe for slow-HZ systems

2021-08-28 17:01:32 +02:00

trace

Merge tag 'net-next-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

2021-08-31 16:43:06 -07:00

.gitignore

.gitignore: prefix local generated files with a slash

2021-05-02 00:43:35 +09:00

acct.c

kernel/acct.c: use #elif instead of #end and #elif

2020-12-15 22:46:15 -08:00

async.c

kernel/async.c: remove async_unregister_domain()

2021-05-07 00:26:33 -07:00

audit_fsnotify.c

audit_alloc_mark(): don't open-code ERR_CAST()

2021-02-23 10:25:27 -05:00

audit_tree.c

audit: move put_tree() to avoid trim_trees refcount underflow and UAF

2021-08-24 18:52:36 -04:00

audit_watch.c

fsnotify: generalize handle_inode_event()

2020-12-03 14:58:35 +01:00

audit.c

lsm: separate security_task_getsecid() into subjective and objective variants

2021-03-22 15:23:32 -04:00

audit.h

audit: add header protection to kernel/audit.h

2021-07-19 22:38:24 -04:00

auditfilter.c

lsm: separate security_task_getsecid() into subjective and objective variants

2021-03-22 15:23:32 -04:00

auditsc.c

audit: remove trailing spaces and tabs

2021-06-10 20:59:05 -04:00

backtracetest.c

…

bounds.c

…

capability.c

capability: handle idmapped mounts

2021-01-24 14:27:16 +01:00

cfi.c

cfi: Use rcu_read_{un}lock_sched_notrace

2021-08-11 13:11:12 -07:00

compat.c

…

configs.c

…

context_tracking.c

…

cpu_pm.c

PM: cpu: Make notifier chain use a raw_spinlock_t

2021-08-16 18:55:32 +02:00

cpu.c

cpu/hotplug: Add debug printks for hotplug callback failures

2021-08-10 18:31:32 +02:00

crash_core.c

kdump: use vmlinux_build_id to simplify

2021-07-08 11:48:22 -07:00

crash_dump.c

…

cred.c

ucounts: Increase ucounts reference counter before the security hook

2021-08-23 16:13:04 -05:00

delayacct.c

delayacct: Add sysctl to enable at runtime

2021-05-12 11:43:25 +02:00

dma.c

…

exec_domain.c

…

exit.c

io_uring: remove files pointer in cancellation functions

2021-08-23 13:10:37 -06:00

extable.c

…

fail_function.c

fault-injection: handle EI_ETYPE_TRUE

2020-12-15 22:46:19 -08:00

fork.c

Merge tag 'net-next-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

2021-08-31 16:43:06 -07:00

freezer.c

sched: Add get_current_state()

2021-06-18 11:43:08 +02:00

futex.c

futex: Prevent requeue_pi() lock nesting issue on RT

2021-08-17 19:05:59 +02:00

gen_kheaders.sh

kbuild: clean up ${quiet} checks in shell scripts

2021-05-27 04:01:50 +09:00

groups.c

groups: simplify struct group_info allocation

2021-02-26 09:41:03 -08:00

hung_task.c

Merge branch 'akpm' (patches from Andrew)

2021-07-02 12:08:10 -07:00

iomem.c

…

irq_work.c

irq_work: Make irq_work_queue() NMI-safe again

2021-06-10 10:00:08 +02:00

jump_label.c

jump_label: Fix jump_label_text_reserved() vs __init

2021-07-05 10:46:20 +02:00

kallsyms.c

module: add printk formats to add module build ID to stacktraces

2021-07-08 11:48:22 -07:00

kcmp.c

Merge branch 'exec-update-lock-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace

2020-12-15 19:36:48 -08:00

Kconfig.freezer

…

Kconfig.hz

…

Kconfig.locks

locking/rwlock: Provide RT variant

2021-08-17 17:50:51 +02:00

Kconfig.preempt

sched/core: Disable CONFIG_SCHED_CORE by default

2021-06-28 22:43:05 +02:00

kcov.c

…

kexec_core.c

kernel.h: split out panic and oops helpers

2021-07-01 11:06:04 -07:00

kexec_elf.c

…

kexec_file.c

kernel: kexec_file: fix error return code of kexec_calculate_store_digests()

2021-05-07 00:26:32 -07:00

kexec_internal.h

kexec: move machine_kexec_post_load() to public interface

2021-02-22 12:33:26 +00:00

kexec.c

…

kheaders.c

…

kmod.c

modules: add CONFIG_MODPROBE_PATH

2021-05-07 00:26:33 -07:00

kprobes.c

Merge tag 'locking-urgent-2021-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2021-07-11 11:06:09 -07:00

ksysfs.c

…

kthread.c

Merge branch 'akpm' (patches from Andrew)

2021-06-29 17:29:11 -07:00

latencytop.c

…

Makefile

kbuild: update config_data.gz only when the content of .config is changed

2021-05-02 00:43:35 +09:00

module_signature.c

module: harden ELF info handling

2021-01-19 10:24:45 +01:00

module_signing.c

module: harden ELF info handling

2021-01-19 10:24:45 +01:00

module-internal.h

…

module.c

module: add printk formats to add module build ID to stacktraces

2021-07-08 11:48:22 -07:00

notifier.c

notifier: Remove atomic_notifier_call_chain_robust()

2021-08-16 18:55:32 +02:00

nsproxy.c

Merge tag 'fixes-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

2020-12-14 16:40:27 -08:00

padata.c

padata: Remove repeated verbose license text

2021-08-27 16:30:18 +08:00

panic.c

kernel.h: split out panic and oops helpers

2021-07-01 11:06:04 -07:00

params.c

params: lift param_set_uint_minmax to common code

2021-08-16 14:42:22 +02:00

pid_namespace.c

Merge tag 'fixes-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

2020-12-14 16:40:27 -08:00

pid.c

kernel/pid.c: implement additional checks upon pidfd_create() parameters

2021-08-10 12:53:07 +02:00

profile.c

kernel: Initialize cpumask before parsing

2021-04-10 13:35:54 +02:00

ptrace.c

sched: Change task_struct::state

2021-06-18 11:43:09 +02:00

range.c

…

reboot.c

reboot: Add hardware protection power-off

2021-06-21 13:08:36 +01:00

regset.c

…

relay.c

relay: allow the use of const callback structs

2020-12-15 22:46:18 -08:00

resource_kunit.c

resource: provide meaningful MODULE_LICENSE() in test suite

2020-11-25 18:52:35 +01:00

resource.c

kernel/resource: fix return code check in __request_free_mem_region

2021-05-14 19:41:32 -07:00

rseq.c

rseq: Optimise rseq_get_rseq_cs() and clear_rseq_cs()

2021-04-14 18:04:09 +02:00

scftorture.c

scftorture: Avoid NULL pointer exception on early exit

2021-07-27 11:39:30 -07:00

scs.c

scs: switch to vmapped shadow stacks

2020-12-01 10:30:28 +00:00

seccomp.c

seccomp: Fix setting loaded filter count during TSYNC

2021-08-11 11:48:28 -07:00

signal.c

posix-cpu-timers: Assert task sighand is locked while starting cputime counter

2021-08-10 17:09:58 +02:00

smp.c

smp: Fix all kernel-doc warnings

2021-08-11 14:47:16 +02:00

smpboot.c

smpboot: Replace deprecated CPU-hotplug functions.

2021-08-10 14:57:42 +02:00

smpboot.h

…

softirq.c

genirq: Change force_irqthreads to a static key

2021-08-10 22:50:07 +02:00

stackleak.c

…

stacktrace.c

…

static_call.c

static_call: Fix static_call_text_reserved() vs __init

2021-07-05 10:46:33 +02:00

stop_machine.c

stop_machine: Add caller debug info to queue_stop_cpus_work

2021-03-23 16:01:58 +01:00

sys_ni.c

mm: introduce memfd_secret system call to create "secret" memory areas

2021-07-08 11:48:21 -07:00

sys.c

set_user: add capability check when rlimit(RLIMIT_NPROC) exceeds

2021-08-12 14:54:25 +02:00

sysctl-test.c

kernel/sysctl-test: Remove some casts which are no-longer required

2021-06-23 16:41:24 -06:00

sysctl.c

sysctl: introduce new proc handler proc_dobool

2021-08-17 11:47:53 -04:00

task_work.c

kasan: record task_work_add() call stack

2021-04-30 11:20:42 -07:00

taskstats.c

treewide: rename nla_strlcpy to nla_strscpy.

2020-11-16 08:08:54 -08:00

test_kprobes.c

…

torture.c

torture: Replace deprecated CPU-hotplug functions.

2021-08-10 10:48:07 -07:00

tracepoint.c

tracepoint: Use rcu get state and cond sync for static call updates

2021-08-06 10:54:41 -04:00

tsacct.c

…

ucount.c

ucounts: add missing data type changes

2021-08-09 15:45:02 -05:00

uid16.c

…

uid16.h

…

umh.c

kernel/umh.c: fix some spelling mistakes

2021-05-07 00:26:34 -07:00

up.c

Merge tag 'locking-urgent-2021-05-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2021-05-09 13:07:03 -07:00

user_namespace.c

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace

2021-06-28 20:39:26 -07:00

user-return-notifier.c

…

user.c

Reimplement RLIMIT_MEMLOCK on top of ucounts

2021-04-30 14:14:02 -05:00

usermode_driver.c

Merge branch 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

2021-07-03 11:41:14 -07:00

utsname_sysctl.c

…

utsname.c

…

watch_queue.c

watch_queue: rectify kernel-doc for init_watch()

2021-01-26 11:16:34 +00:00

watchdog_hld.c

…

watchdog.c

kernel: watchdog: modify the explanation related to watchdog thread

2021-06-29 10:53:46 -07:00

workqueue_internal.h

workqueue: Assign a color to barrier work items

2021-08-17 07:49:10 -10:00

workqueue.c

workqueue: Assign a color to barrier work items

2021-08-17 07:49:10 -10:00