linux/kernel/trace
Daniel Borkmann 3b1efb196e bpf, maps: flush own entries on perf map release
The behavior of perf event arrays are quite different from all
others as they are tightly coupled to perf event fds, f.e. shown
recently by commit e03e7ee34f ("perf/bpf: Convert perf_event_array
to use struct file") to make refcounting on perf event more robust.
A remaining issue that the current code still has is that since
additions to the perf event array take a reference on the struct
file via perf_event_get() and are only released via fput() (that
cleans up the perf event eventually via perf_event_release_kernel())
when the element is either manually removed from the map from user
space or automatically when the last reference on the perf event
map is dropped. However, this leads us to dangling struct file's
when the map gets pinned after the application owning the perf
event descriptor exits, and since the struct file reference will
in such case only be manually dropped or via pinned file removal,
it leads to the perf event living longer than necessary, consuming
needlessly resources for that time.

Relations between perf event fds and bpf perf event map fds can be
rather complex. F.e. maps can act as demuxers among different perf
event fds that can possibly be owned by different threads and based
on the index selection from the program, events get dispatched to
one of the per-cpu fd endpoints. One perf event fd (or, rather a
per-cpu set of them) can also live in multiple perf event maps at
the same time, listening for events. Also, another requirement is
that perf event fds can get closed from application side after they
have been attached to the perf event map, so that on exit perf event
map will take care of dropping their references eventually. Likewise,
when such maps are pinned, the intended behavior is that a user
application does bpf_obj_get(), puts its fds in there and on exit
when fd is released, they are dropped from the map again, so the map
acts rather as connector endpoint. This also makes perf event maps
inherently different from program arrays as described in more detail
in commit c9da161c65 ("bpf: fix clearing on persistent program
array maps").

To tackle this, map entries are marked by the map struct file that
added the element to the map. And when the last reference to that map
struct file is released from user space, then the tracked entries
are purged from the map. This is okay, because new map struct files
instances resp. frontends to the anon inode are provided via
bpf_map_new_fd() that is called when we invoke bpf_obj_get_user()
for retrieving a pinned map, but also when an initial instance is
created via map_create(). The rest is resolved by the vfs layer
automatically for us by keeping reference count on the map's struct
file. Any concurrent updates on the map slot are fine as well, it
just means that perf_event_fd_array_release() needs to delete less
of its own entires.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 23:42:57 -07:00
..
blktrace.c blktrace: add missed mask name 2016-05-10 08:41:36 -06:00
bpf_trace.c bpf, maps: flush own entries on perf map release 2016-06-15 23:42:57 -07:00
ftrace.c Three more changes. 2016-05-22 19:40:39 -07:00
Kconfig tracing: Add 'hist' event trigger command 2016-04-19 12:16:14 -04:00
Makefile tracing: Add 'hist' event trigger command 2016-04-19 12:16:14 -04:00
power-traces.c cpufreq: schedutil: New governor based on scheduler utilization data 2016-04-02 01:09:12 +02:00
ring_buffer_benchmark.c ring_buffer: Remove unneeded smp_wmb() before wakeup of reader benchmark 2015-11-03 16:19:02 -05:00
ring_buffer.c ring-buffer: Prevent overflow of size in ring_buffer_resize() 2016-05-13 16:44:20 -04:00
rpm-traces.c
trace_benchmark.c tracing: Only benchmark the time tracepoints take if tracing is on 2015-11-02 13:34:58 -05:00
trace_benchmark.h tracing: Add tracepoint benchmark tracepoint 2014-05-29 22:49:54 -04:00
trace_branch.c tracing: Remove {start,stop}_branch_trace 2015-10-21 10:10:09 -04:00
trace_clock.c tracing: Export tracing clock functions 2015-05-12 15:56:57 -04:00
trace_entries.h tracing: %pF is only for function pointers 2015-03-25 08:57:22 -04:00
trace_event_perf.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2016-05-17 16:26:30 -07:00
trace_events_filter_test.h
trace_events_filter.c tracing: Use temp buffer when filtering events 2016-05-03 17:59:24 -04:00
trace_events_hist.c tracing: Add check for NULL event field when creating hist field 2016-04-26 09:40:29 -04:00
trace_events_trigger.c tracing: Add support for named triggers 2016-04-19 18:56:00 -04:00
trace_events.c This includes two new updates for the ftrace infrastructure. 2016-05-18 18:55:19 -07:00
trace_export.c tracing: ftrace_event_is_function() can return boolean 2015-11-02 14:28:05 -05:00
trace_functions_graph.c arch, ftrace: for KASAN put hard/soft IRQ entries into separate sections 2016-03-25 16:37:42 -07:00
trace_functions.c tracing: Make tracer_flags use the right set_flag callback 2016-03-08 11:19:08 -05:00
trace_irqsoff.c tracing: Remove redundant reset per-CPU buff in irqsoff tracer 2016-03-18 16:39:11 -04:00
trace_kdb.c tracing: Move trace_flags from global to a trace_array field 2015-09-30 15:22:55 -04:00
trace_kprobe.c perf: split perf_trace_buf_prepare into alloc and update parts 2016-04-07 21:04:26 -04:00
trace_mmiotrace.c kernel/...: convert pr_warning to pr_warn 2016-03-22 15:36:02 -07:00
trace_nop.c tracing: Fix typoes in code comment and printk in trace_nop.c 2016-03-08 11:23:57 -05:00
trace_output.c tracing: Record and show NMI state 2016-03-22 18:04:10 -04:00
trace_output.h tracing: Turn seq_print_user_ip() into a static function 2015-09-28 10:16:12 -04:00
trace_printk.c tracing: Fix trace_printk() to print when not using bprintk() 2016-03-22 18:02:40 -04:00
trace_probe.c kernel/...: convert pr_warning to pr_warn 2016-03-22 15:36:02 -07:00
trace_probe.h kernel/trace_probe: is_good_name can be boolean 2015-09-22 13:11:30 -04:00
trace_sched_switch.c sched/core: Fix trace_sched_switch() 2015-10-06 17:08:15 +02:00
trace_sched_wakeup.c Most of the changes are clean ups and small fixes. Some of them have 2015-11-06 13:30:20 -08:00
trace_selftest_dynamic.c
trace_selftest.c Seems that Peter Zijlstra added a new check that is making old 2014-10-12 07:28:55 -04:00
trace_seq.c tracing: use %*pb[l] to print bitmaps including cpumasks and nodemasks 2015-02-13 21:21:37 -08:00
trace_stack.c tracing, kasan: Silence Kasan warning in check_stack of stack_tracer 2016-02-19 12:36:44 -05:00
trace_stat.c kernel/...: convert pr_warning to pr_warn 2016-03-22 15:36:02 -07:00
trace_stat.h
trace_syscalls.c perf: split perf_trace_buf_prepare into alloc and update parts 2016-04-07 21:04:26 -04:00
trace_uprobe.c perf: split perf_trace_buf_prepare into alloc and update parts 2016-04-07 21:04:26 -04:00
trace.c tracing: Use temp buffer when filtering events 2016-05-03 17:59:24 -04:00
trace.h tracing: Use temp buffer when filtering events 2016-05-03 17:59:24 -04:00
tracing_map.c tracing: Handle tracing_map_alloc_elts() error path correctly 2016-04-26 09:40:30 -04:00
tracing_map.h tracing: Update some tracing_map constants and comments 2016-04-19 12:16:06 -04:00