linux

Daniel Borkmann e2e9b6541d cls_bpf: add initial eBPF support for programmable classifiers

This work extends the "classic" BPF programmable tc classifier by extending its scope also to native eBPF code!

This allows for user space to implement own custom, 'safe' C like classifiers (or whatever other frontend language LLVM et al may provide in future), that can then be compiled with the LLVM eBPF backend to an eBPF elf file. The result of this can be loaded into the kernel via iproute2's tc. In the kernel, they can be JITed on major archs and thus run in native performance.

Simple, minimal toy example to demonstrate the workflow:

  #include <linux/ip.h>
  #include <linux/if_ether.h>
  #include <linux/bpf.h>

  #include "tc_bpf_api.h"

  __section("classify")
  int cls_main(struct sk_buff *skb)
  {
    return (0x800 << 16) | load_byte(skb, ETH_HLEN + __builtin_offsetof(struct iphdr, tos));
  }

  char __license[] __section("license") = "GPL";

The classifier can then be compiled into eBPF opcodes and loaded via tc, for example:

  clang -O2 -emit-llvm -c cls.c -o - | llc -march=bpf -filetype=obj -o cls.o
  tc filter add dev em1 parent 1: bpf cls.o [...]

As it has been demonstrated, the scope can even reach up to a fully fledged flow dissector (similarly as in samples/bpf/sockex2_kern.c).

For tc, maps are allowed to be used, but from kernel context only, in other words, eBPF code can keep state across filter invocations. In future, we perhaps may reattach from a different application to those maps e.g., to read out collected statistics/state.

Similarly as in socket filters, we may extend functionality for eBPF classifiers over time depending on the use cases. For that purpose, cls_bpf programs are using BPF_PROG_TYPE_SCHED_CLS program type, so we can allow additional functions/accessors (e.g. an ABI compatible offset translation to skb fields/metadata). For an initial cls_bpf support, we allow the same set of helper functions as eBPF socket filters, but we could diverge at some point in time w/o problem.

I was wondering whether cls_bpf and act_bpf could share C programs, I can imagine that at some point, we introduce i) further common handlers for both (or even beyond their scope), and/or if truly needed ii) some restricted function space for each of them. Both can be abstracted easily through struct bpf_verifier_ops in future.

The context of cls_bpf versus act_bpf is slightly different though: a cls_bpf program will return a specific classid whereas act_bpf a drop/non-drop return code, latter may also in future mangle skbs. That said, we can surely have a "classify" and "action" section in a single object file, or considered mentioned constraint add a possibility of a shared section.

The workflow for getting native eBPF running from tc [1] is as follows: for f_bpf, I've added a slightly modified ELF parser code from Alexei's kernel sample, which reads out the LLVM compiled object, sets up maps (and dynamically fixes up map fds) if any, and loads the eBPF instructions all centrally through the bpf syscall.

The resulting fd from the loaded program itself is being passed down to cls_bpf, which looks up struct bpf_prog from the fd store, and holds reference, so that it stays available also after tc program lifetime. On tc filter destruction, it will then drop its reference.

Moreover, I've also added the optional possibility to annotate an eBPF filter with a name (e.g. path to object file, or something else if preferred) so that when tc dumps currently installed filters, some more context can be given to an admin for a given instance (as opposed to just the file descriptor number).

Last but not least, bpf_prog_get() and bpf_prog_put() needed to be exported, so that eBPF can be used from cls_bpf built as a module. Thanks to `60a3b2253c` ("net: bpf: make eBPF interpreter images read-only") I think this is of no concern since anything wanting to alter eBPF opcode after verification stage would crash the kernel.

[1] http://git.breakpoint.cc/cgit/dborkman/iproute2.git/log/?h=ebpf

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2015-03-01 14:05:19 -05:00
..
bpf	cls_bpf: add initial eBPF support for programmable classifiers	2015-03-01 14:05:19 -05:00
configs	x86: Add "make tinyconfig" to configure the tiniest possible kernel	2014-08-08 16:30:24 -07:00
debug	Surprising number of fixes this merge window :(	2015-01-23 06:40:36 +12:00
events	Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2015-02-16 14:58:12 -08:00
gcov	gcov: enable GCOV_PROFILE_ALL from ARCH Kconfigs	2014-12-13 12:42:51 -08:00
irq	Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2015-02-16 15:20:40 -08:00
livepatch	livepatch: add missing newline to error message	2015-02-06 21:28:35 +01:00
locking	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux	2015-02-11 17:42:32 -08:00
power	PM / sleep: Re-implement suspend-to-idle handling	2015-02-13 23:49:36 +01:00
printk	printk: correct timeout comment, neaten MODULE_PARM_DESC	2015-02-12 18:54:13 -08:00
rcu	rcu: use %*pb[l] to print bitmaps including cpumasks and nodemasks	2015-02-13 21:21:37 -08:00
sched	Suspend-to-idle timer quiescing support for v3.20-rc1	2015-02-17 14:17:51 -08:00
time	Suspend-to-idle timer quiescing support for v3.20-rc1	2015-02-17 14:17:51 -08:00
trace	tracing: use %*pb[l] to print bitmaps including cpumasks and nodemasks	2015-02-13 21:21:37 -08:00
.gitignore
acct.c	new fs_pin killing logics	2015-01-25 23:17:28 -05:00
async.c	kernel/async.c: switch to pr_foo()	2014-10-09 22:26:04 -04:00
audit_tree.c	fsnotify: unify inode and mount marks handling	2014-12-13 12:42:53 -08:00
audit_watch.c	audit: invalid op= values for rules	2014-09-23 16:37:53 -04:00
audit.c	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	2014-12-30 10:45:47 -08:00
audit.h	audit: replace getname()/putname() hacks with reference counters	2015-01-23 00:23:58 -05:00
auditfilter.c	Merge branch 'upstream' of git://git.infradead.org/users/pcmoore/audit	2015-02-11 20:07:47 -08:00
auditsc.c	Merge branch 'getname2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2015-02-17 15:27:47 -08:00
backtracetest.c
bounds.c	page-cgroup: get rid of NR_PCG_FLAGS	2014-08-08 15:57:18 -07:00
capability.c	CAPABILITIES: remove undefined caps from all processes	2014-07-24 21:53:47 +10:00
cgroup_freezer.c	cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes	2014-07-15 11:05:09 -04:00
cgroup.c	kernfs: remove KERNFS_STATIC_NAME	2015-02-13 21:21:36 -08:00
compat.c	all arches, signal: move restart_block to struct task_struct	2015-02-12 18:54:12 -08:00
configs.c
context_tracking.c	sched: stop the unbound recursion in preempt_schedule_context()	2014-10-28 10:46:05 +01:00
cpu_pm.c
cpu.c	hotplugcpu: Avoid deadlocks by waking active_writer	2015-01-06 11:01:14 -08:00
cpuset.c	cpuset: use %*pb[l] to print bitmaps including cpumasks and nodemasks	2015-02-13 21:21:37 -08:00
crash_dump.c	crash_dump: Make is_kdump_kernel() accessible from modules	2014-08-25 15:42:19 -07:00
cred.c
delayacct.c	delayacct: Remove braindamaged type conversions	2014-07-23 10:18:06 -07:00
dma.c
elfcore.c
exec_domain.c
exit.c	oom, PM: make OOM detection in the freezer path raceless	2015-02-11 17:06:03 -08:00
extable.c	ftrace/x86/extable: Add is_ftrace_trampoline() function	2014-11-19 15:25:26 -05:00
fork.c	mm: do not use mm->nr_pmds on !MMU configurations	2015-02-12 18:54:10 -08:00
freezer.c	freezer: remove obsolete comments in __thaw_task()	2014-10-21 23:44:20 +02:00
futex_compat.c
futex.c	all arches, signal: move restart_block to struct task_struct	2015-02-12 18:54:12 -08:00
groups.c	userns: Don't allow setgroups until a gid mapping has been setablished	2014-12-09 16:58:40 -06:00
hung_task.c
irq_work.c	percpu: Convert remaining __get_cpu_var uses in 3.18-rcX	2014-10-29 11:18:18 -04:00
jump_label.c
kallsyms.c	kernel/kallsyms.c: use __seq_open_private()	2014-10-14 02:18:16 +02:00
kcmp.c	kcmp: fix standard comparison bug	2014-09-10 15:42:12 -07:00
Kconfig.freezer
Kconfig.hz
Kconfig.locks	locking/mcs: Better differentiate between MCS variants	2015-01-14 15:07:32 +01:00
Kconfig.preempt
kexec.c	kexec: simplify conditional	2015-02-17 14:34:51 -08:00
kmod.c	usermodehelper: kill the kmod_thread_locker logic	2014-12-10 17:41:17 -08:00
kprobes.c	kprobes: makes kprobes/enabled works correctly for optimized kprobes.	2015-02-13 21:21:42 -08:00
ksysfs.c
kthread.c	kernel/kthread.c: partial revert of `81c98869fa` ("kthread: ensure locality of task_struct allocations")	2014-10-09 22:25:51 -04:00
latencytop.c
Makefile	Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security	2015-02-11 20:25:11 -08:00
module_signing.c
module-internal.h
module.c	kernel/module.c: do not inline do_init_module()	2015-02-17 14:34:53 -08:00
notifier.c	rcu: Make SRCU optional by using CONFIG_SRCU	2015-01-06 11:04:29 -08:00
nsproxy.c	bury struct proc_ns in fs/proc	2014-12-04 14:34:54 -05:00
padata.c	padata: use %*pb[l] to print bitmaps including cpumasks and nodemasks	2015-02-13 21:21:38 -08:00
panic.c	livepatch: kernel: add TAINT_LIVEPATCH	2014-12-22 15:40:48 +01:00
params.c	param: fix uninitialized read with CONFIG_DEBUG_LOCK_ALLOC	2015-01-20 11:38:31 +10:30
pid_namespace.c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2014-12-16 15:53:03 -08:00
pid.c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2014-12-16 15:53:03 -08:00
profile.c	profile: use %*pb[l] to print bitmaps including cpumasks and nodemasks	2015-02-13 21:21:38 -08:00
ptrace.c	ptrace: remove linux/compat.h inclusion under CONFIG_COMPAT	2015-02-17 14:34:51 -08:00
range.c	kernel: avoid overflow in cmp_range	2015-01-17 10:02:23 +13:00
reboot.c	kernel: add support for kernel restart handler call chain	2014-09-26 00:00:06 -07:00
relay.c
resource.c	resources: Move struct resource_list_entry from ACPI into resource core	2015-02-05 15:09:25 +01:00
seccomp.c	seccomp: cap SECCOMP_RET_ERRNO data to MAX_ERRNO	2015-02-17 14:34:55 -08:00
signal.c	signal: use current->state helpers	2015-02-17 14:34:51 -08:00
smp.c	Merge branch 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu	2014-10-15 07:48:18 +02:00
smpboot.c	smpboot: Add missing get_online_cpus() in smpboot_register_percpu_thread()	2015-01-23 11:33:51 +01:00
smpboot.h
softirq.c	Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2015-02-09 15:24:03 -08:00
stacktrace.c	stacktrace: introduce snprint_stack_trace for buffer output	2014-12-13 12:42:48 -08:00
stop_machine.c
sys_ni.c	syscalls: implement execveat() system call	2014-12-13 12:42:51 -08:00
sys.c	x86, mpx: Strictly enforce empty prctl() args	2015-01-22 21:11:06 +01:00
sysctl_binary.c	kernel: add panic_on_warn	2014-12-10 17:41:10 -08:00
sysctl.c	mm, hugetlb: remove unnecessary lower bound on sysctl handlers"?	2015-02-10 14:30:34 -08:00
system_certificates.S
system_keyring.c	KEYS: validate certificate trust only with builtin keys	2014-07-17 09:35:17 -04:00
task_work.c
taskstats.c	netlink: make nlmsg_end() and genlmsg_end() void	2015-01-18 01:03:45 -05:00
test_kprobes.c	kernel/test_kprobes.c: use current logging functions	2014-08-08 15:57:18 -07:00
torture.c	torture: Address race in module cleanup	2014-09-16 13:41:06 -07:00
tracepoint.c	tracing: syscall_regfunc() should not skip kernel threads	2014-06-21 00:15:26 -04:00
tsacct.c	sched: Make task->start_time nanoseconds based	2014-07-23 10:18:05 -07:00
uid16.c	groups: Consolidate the setgroups permission checks	2014-12-05 17:19:27 -06:00
up.c
user_namespace.c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace	2014-12-17 12:31:40 -08:00
user-return-notifier.c	scheduler: Replace __get_cpu_var with this_cpu_ptr	2014-08-26 13:45:45 -04:00
user.c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace	2014-12-17 12:31:40 -08:00
utsname_sysctl.c
utsname.c	copy address of proc_ns_ops into ns_common	2014-12-04 14:34:47 -05:00
watchdog.c	kernel/sched/clock.c: add another clock for use with the soft lockup watchdog	2015-02-12 18:54:13 -08:00
workqueue_internal.h
workqueue.c	workqueue: use %*pb[l] to format bitmaps including cpumasks and nodemasks	2015-02-13 21:21:37 -08:00