linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-26 14:12:06 +00:00

Daniel Borkmann e420bed025 bpf: Add fd-based tcx multi-prog infra with link support

This work refactors and adds a lightweight extension ("tcx") to the tc BPF ingress and egress data path side for allowing BPF program management based on fds via bpf() syscall through the newly added generic multi-prog API. The main goal behind this work, which we also presented at LPC [0] last year and in a recent update at LSF/MM/BPF this year [3], is to support long-awaited BPF link functionality for tc BPF programs, which allows for a model of safe ownership and program detachment.

Given the rise in tc BPF users in cloud native environments, this becomes necessary to avoid hard-to-debug incidents either through stale leftover programs or 3rd party applications accidentally stepping on each other's toes. As a recap, a BPF link represents the attachment of a BPF program to a BPF hook point. The BPF link holds a single reference to keep the BPF program alive. Moreover, hook points do not reference a BPF link, only the application's fd or pinning does. A BPF link holds meta-data specific to attachment and implements operations for link creation, (atomic) BPF program update, detachment and introspection. The motivation for BPF links for tc BPF programs is multi-fold, for example:

- From Meta: "It's especially important for applications that are deployed fleet-wide and that don't "control" hosts they are deployed to. If such application crashes and no one notices and does anything about that, BPF program will keep running draining resources or even just, say, dropping packets. We at FB had outages due to such permanent BPF attachment semantics. With fd-based BPF link we are getting a framework, which allows safe, auto-detachable behavior by default, unless application explicitly opts in by pinning the BPF link." [1]

- From Cilium-side the tc BPF programs we attach to host-facing veth devices and phys devices build the core datapath for Kubernetes Pods, and they implement forwarding, load-balancing, policy, EDT-management, etc, within BPF. Currently there is no concept of 'safe' ownership, e.g. we've recently experienced hard-to-debug issues in a user's staging environment where another Kubernetes application using tc BPF attached to the same prio/handle of cls_bpf, accidentally wiping all Cilium-based BPF programs from underneath it. The goal is to establish a clear/safe ownership model via links which cannot accidentally be overridden. [0,2]

BPF links for tc can co-exist with non-link attachments, and the semantics are also in line with XDP links: BPF links cannot replace other BPF links, BPF links cannot replace non-BPF links, non-BPF links cannot replace BPF links and lastly only non-BPF links can replace non-BPF links. In case of Cilium, this would solve the mentioned issue of a safe ownership model, as 3rd party applications would not be able to accidentally wipe Cilium programs, even if they are not BPF link aware.

Earlier attempts [4] have tried to integrate BPF links into core tc machinery to solve cls_bpf, which has been intrusive to the generic tc kernel API with extensions only specific to cls_bpf and suboptimal/complex since cls_bpf could be wiped from the qdisc also. Locking a tc BPF program in place this way is getting into layering hacks given the two object models are vastly different.

We instead implemented the tcx (tc 'express') layer which is an fd-based tc BPF attach API, so that the BPF link implementation blends in naturally similar to other link types which are fd-based and without the need for changing core tc internal APIs. BPF programs for tc can then be successively migrated from classic cls_bpf to the new tc BPF link without needing to change the program's source code; changing just the BPF loader mechanics for attaching is sufficient.

For the current tc framework, there is no change in behavior with this change and neither does this change touch on tc core kernel APIs. The gist of this patch is that the ingress and egress hook have a lightweight, qdisc-less extension for BPF to attach its tc BPF programs, in other words, a minimal entry point for tc BPF. The name tcx has been suggested from discussion of earlier revisions of this work as a good fit, and to more easily differentiate between the classic cls_bpf attachment and the fd-based one.

For the ingress and egress tcx points, the device holds a cache-friendly array with program pointers which is separated from control plane (slow-path) data. Earlier versions of this work used priority to determine ordering and expression of dependencies similar to classic tc, but it was challenged that for something more future-proof a better user experience is required. Hence this resulted in the design and development of the generic attach/detach/query API for multi-progs. See prior patch with its discussion on the API design. tcx is the first user and later we plan to integrate also others, for example, one candidate is multi-prog support for XDP which would benefit and have the same 'look and feel' from API perspective.

The goal with tcx is to have maximum compatibility to existing tc BPF programs, so they don't need to be rewritten specifically. Compatibility to call into classic tcf_classify() is also provided in order to allow successive migration or both to cleanly co-exist where needed given it's all one logical tc layer and the tcx plus classic tc cls/act build one logical overall processing pipeline.

tcx supports the simplified return codes TCX_NEXT which is non-terminating (go to next program) and terminating ones with TCX_PASS, TCX_DROP, TCX_REDIRECT. The fd-based API is behind a static key, so that when unused the code is also not entered. The struct tcx_entry's program array is currently static, but could be made dynamic if necessary at some point in the future. The a/b pair swap design has been chosen so that for detachment there are no allocations which otherwise could fail.

The work has been tested with tc-testing selftest suite which all passes, as well as the tc BPF tests from the BPF CI, and also with Cilium's L4LB.

Thanks also to Nikolay Aleksandrov and Martin Lau for in-depth early reviews of this work.

[0] https://lpc.events/event/16/contributions/1353/
[1] https://lore.kernel.org/bpf/CAEf4BzbokCJN33Nw_kg82sO=xppXnKWEncGTWCTB9vGCmLB6pw@mail.gmail.com
[2] https://colocatedeventseu2023.sched.com/event/1Jo6O/tales-from-an-ebpf-programs-murder-mystery-hemanth-malla-guillaume-fournier-datadog
[3] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
[4] https://lore.kernel.org/bpf/20210604063116.234316-1-memxor@gmail.com

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230719140858.13224-3-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

2023-07-19 10:07:27 -07:00
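
The return-code semantics described in the commit message (TCX_NEXT continues to the next program, any other code terminates the run) can be illustrated with a small userspace sketch. This is not the kernel implementation: the function pointer type `prog_fn`, the helper `tcx_run_sketch()` and the three sample programs are hypothetical names made up for illustration, and a plain packet length stands in for the real packet context. Only the four TCX_* names come from the commit; their numeric values are assumed to match `enum tcx_action_base` in the kernel's uapi header.

```c
#include <stddef.h>

/* Numeric values assumed to match enum tcx_action_base in uapi/linux/bpf.h. */
enum tcx_action {
	TCX_NEXT     = -1, /* non-terminating: go to the next program */
	TCX_PASS     = 0,  /* terminating */
	TCX_DROP     = 2,  /* terminating */
	TCX_REDIRECT = 7,  /* terminating */
};

/* Hypothetical stand-in for an attached tc BPF program. */
typedef int (*prog_fn)(int pkt_len);

/* Sample programs (illustrative only). */
static int prog_count_only(int pkt_len) { (void)pkt_len; return TCX_NEXT; }
static int prog_drop_small(int pkt_len) { return pkt_len < 64 ? TCX_DROP : TCX_NEXT; }
static int prog_pass_all(int pkt_len)   { (void)pkt_len; return TCX_PASS; }

/*
 * Walk the program array in order and stop at the first terminating
 * verdict. If every program returns TCX_NEXT (or the array is empty),
 * TCX_NEXT is handed back to the caller; what the caller does with it
 * is outside the scope of this sketch.
 */
static int tcx_run_sketch(const prog_fn *progs, size_t nprogs, int pkt_len)
{
	int ret = TCX_NEXT;

	for (size_t i = 0; i < nprogs; i++) {
		ret = progs[i](pkt_len);
		if (ret != TCX_NEXT)
			break;
	}
	return ret;
}
```

With the array `{ prog_count_only, prog_drop_small, prog_pass_all }`, a 40-byte packet stops at the second program with TCX_DROP, while a 1500-byte packet falls through to the third and ends with TCX_PASS.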
..
preload	bpf: make preloaded map iterators to display map elements count	2023-07-06 12:42:25 -07:00
arraymap.c	bpf: return long from bpf_map_ops funcs	2023-03-22 15:11:30 -07:00
bloom_filter.c	bpf: Centralize permissions checks for all BPF map types	2023-06-19 14:04:04 +02:00
bpf_cgrp_storage.c	bpf: Teach verifier that certain helpers accept NULL pointer.	2023-04-04 16:57:16 -07:00
bpf_inode_storage.c	Networking changes for 6.4.	2023-04-26 16:07:23 -07:00
bpf_iter.c	bpf: implement numbers iterator	2023-03-08 16:19:51 -08:00
bpf_local_storage.c	bpf: Centralize permissions checks for all BPF map types	2023-06-19 14:04:04 +02:00
bpf_lru_list.c	bpf: Address KCSAN report on bpf_lru_list	2023-05-12 12:01:03 -07:00
bpf_lru_list.h	bpf: Address KCSAN report on bpf_lru_list	2023-05-12 12:01:03 -07:00
bpf_lsm.c	bpf: Fix the kernel crash caused by bpf_setsockopt().	2023-01-26 23:26:40 -08:00
bpf_struct_ops_types.h	bpf: Add dummy BPF STRUCT_OPS for test purpose	2021-11-01 14:10:00 -07:00
bpf_struct_ops.c	bpf: Centralize permissions checks for all BPF map types	2023-06-19 14:04:04 +02:00
bpf_task_storage.c	bpf: Teach verifier that certain helpers accept NULL pointer.	2023-04-04 16:57:16 -07:00
btf.c	for-netdev	2023-07-13 19:13:24 -07:00
cgroup_iter.c	cgroup: bpf: use cgroup_lock()/cgroup_unlock() wrappers	2023-03-17 12:07:13 -10:00
cgroup.c	bpf-next-for-netdev	2023-05-16 19:50:05 -07:00
core.c	bpf: Hide unused bpf_patch_call_args	2023-06-12 19:00:08 +02:00
cpumap.c	bpf: cpumap: Fix memory leak in cpu_map_update_elem	2023-07-11 19:57:03 -07:00
cpumask.c	bpf: Convert bpf_cpumask to bpf_mem_cache_free_rcu.	2023-07-12 23:45:23 +02:00
devmap.c	bpf: Centralize permissions checks for all BPF map types	2023-06-19 14:04:04 +02:00
disasm.c	bpf: Relicense disassembler as GPL-2.0-only OR BSD-2-Clause	2021-09-02 14:49:23 +02:00
disasm.h	bpf: Relicense disassembler as GPL-2.0-only OR BSD-2-Clause	2021-09-02 14:49:23 +02:00
dispatcher.c	bpf: Synchronize dispatcher update with bpf_dispatcher_xdp_func	2022-12-14 12:02:14 -08:00
hashtab.c	bpf: populate the per-cpu insertions/deletions counters for hashmaps	2023-07-06 12:42:25 -07:00
helpers.c	bpf: Add 'owner' field to bpf_{list,rb}_node	2023-07-18 17:23:10 -07:00
inode.c	bpf: Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands	2023-05-23 23:31:42 +02:00
Kconfig	bpf: Add fd-based tcx multi-prog infra with link support	2023-07-19 10:07:27 -07:00
link_iter.c	bpf: Add bpf_link iterator	2022-05-10 11:20:45 -07:00
local_storage.c	cgroup changes for v6.4-rc1	2023-04-29 10:05:22 -07:00
log.c	bpf: drop unnecessary user-triggerable WARN_ONCE in verifierl log	2023-05-16 22:34:50 -07:00
lpm_trie.c	bpf: Centralize permissions checks for all BPF map types	2023-06-19 14:04:04 +02:00
Makefile	bpf: Add fd-based tcx multi-prog infra with link support	2023-07-19 10:07:27 -07:00
map_in_map.c	bpf: Fix elem_size not being set for inner maps	2023-06-02 16:22:12 -07:00
map_in_map.h
map_iter.c	bpf: allow any program to use the bpf_map_sum_elem_count kfunc	2023-07-19 09:48:53 -07:00
memalloc.c	bpf: Add object leak check.	2023-07-12 23:45:23 +02:00
mmap_unlock_work.h	bpf: Introduce helper bpf_find_vma	2021-11-07 11:54:51 -08:00
mprog.c	bpf: Add generic attach/detach/query API for multi-progs	2023-07-19 10:07:27 -07:00
net_namespace.c	net: Add includes masked by netdevice.h including uapi/bpf.h	2021-12-29 20:03:05 -08:00
offload.c	bpf: netdev: init the offload table earlier	2023-05-15 07:07:41 -07:00
percpu_freelist.c	bpf: Initialize same number of free nodes for each pcpu_freelist	2022-11-11 12:05:14 -08:00
percpu_freelist.h
prog_iter.c
queue_stack_maps.c	bpf: Centralize permissions checks for all BPF map types	2023-06-19 14:04:04 +02:00
reuseport_array.c	bpf: Centralize permissions checks for all BPF map types	2023-06-19 14:04:04 +02:00
ringbuf.c	bpf: Remove unnecessary ring buffer size check	2023-07-05 14:09:45 +02:00
stackmap.c	bpf: Centralize permissions checks for all BPF map types	2023-06-19 14:04:04 +02:00
syscall.c	bpf: Add fd-based tcx multi-prog infra with link support	2023-07-19 10:07:27 -07:00
sysfs_btf.c
task_iter.c	bpf: keep a reference to the mm, in case the task is dead.	2022-12-28 14:11:48 -08:00
tcx.c	bpf: Add fd-based tcx multi-prog infra with link support	2023-07-19 10:07:27 -07:00
tnum.c
trampoline.c	bpf: Fix memleak due to fentry attach failure	2023-05-15 23:41:59 +02:00
verifier.c	bpf: consider CONST_PTR_TO_MAP as trusted pointer to struct bpf_map	2023-07-19 09:48:52 -07:00