Introduce xdp_update_skb_shared_info routine to update frags array
metadata in skb_shared_info data structure converting to a skb from
a xdp_buff or xdp_frame.
According to the current skb_shared_info architecture in
xdp_frame/xdp_buff and to the xdp frags support, there is
no need to run skb_add_rx_frag() and reset frags array converting the buffer
to a skb since the frag array will be in the same position for xdp_buff/xdp_frame
and for the skb, we just need to update memory metadata.
Introduce XDP_FLAGS_PF_MEMALLOC flag in xdp_buff_flags in order to mark
the xdp_buff or xdp_frame as under memory-pressure if pages of the frags array
are under memory pressure. Doing so we can avoid looping over all fragments in
xdp_update_skb_shared_info routine. The driver is expected to set the
flag constructing the xdp_buffer using xdp_buff_set_frag_pfmemalloc
utility routine.
Rely on xdp_update_skb_shared_info in __xdp_build_skb_from_frame routine
converting the non-linear xdp_frame to a skb after performing a XDP_REDIRECT.
Acked-by: Toke Hoiland-Jorgensen <toke@redhat.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://lore.kernel.org/r/bfd23fb8a8d7438724f7819c567cdf99ffd6226f.1642758637.git.lorenzo@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Relying on xdp frags bit, remove skb_shared_info structure
allocated on the stack in mvneta_rx_swbm routine and simplify
mvneta_swbm_add_rx_fragment accessing skb_shared_info in the
xdp_buff structure directly. There is no performance penalty in
this approach since mvneta_swbm_add_rx_fragment is run just
for xdp frags use-case.
Acked-by: Toke Hoiland-Jorgensen <toke@redhat.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://lore.kernel.org/r/45f050c094ccffce49d6bc5112939ed35250ba90.1642758637.git.lorenzo@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Update frags bit (XDP_FLAGS_HAS_FRAGS) in xdp_buff to notify
XDP/eBPF layer and XDP remote drivers if this is a "non-linear"
XDP buffer. Access skb_shared_info only if XDP_FLAGS_HAS_FRAGS flag
is set in order to avoid possible cache-misses.
Acked-by: Toke Hoiland-Jorgensen <toke@redhat.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://lore.kernel.org/r/c00a73097f8a35860d50dae4a36e6cc9ef7e172f.1642758637.git.lorenzo@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Introduce flags field in xdp_frame and xdp_buffer data structures
to define additional buffer features. At the moment the only
supported buffer feature is frags bit (XDP_FLAGS_HAS_FRAGS).
frags bit is used to specify if this is a linear buffer
(XDP_FLAGS_HAS_FRAGS not set) or a frags frame (XDP_FLAGS_HAS_FRAGS
set). In the latter case the driver is expected to initialize the
skb_shared_info structure at the end of the first buffer to link together
subsequent buffers belonging to the same frame.
Acked-by: Toke Hoiland-Jorgensen <toke@redhat.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://lore.kernel.org/r/e389f14f3a162c0a5bc6a2e1aa8dd01a90be117d.1642758637.git.lorenzo@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Introduce xdp_frags_size field in skb_shared_info data structure
to store xdp_buff/xdp_frame frame paged size (xdp_frags_size will
be used in xdp frags support). In order to not increase
skb_shared_info size we will use a hole due to skb_shared_info
alignment.
Acked-by: Toke Hoiland-Jorgensen <toke@redhat.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://lore.kernel.org/r/8a849819a3e0a143d540f78a3a5add76e17e980d.1642758637.git.lorenzo@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add test for querying progs attached to sockmap. we use an existing
libbpf query interface to query prog cnt before and after progs
attaching to sockmap and check whether the queried prog id is right.
Signed-off-by: Di Zhu <zhudi2@huawei.com>
Acked-by: Yonghong Song <yhs@fb.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/r/20220119014005.1209-2-zhudi2@huawei.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Right now there is no way to query whether BPF programs are
attached to a sockmap or not.
we can use the standard interface in libbpf to query, such as:
bpf_prog_query(mapFd, BPF_SK_SKB_STREAM_PARSER, 0, NULL, ...);
the mapFd is the fd of sockmap.
Signed-off-by: Di Zhu <zhudi2@huawei.com>
Acked-by: Yonghong Song <yhs@fb.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/r/20220119014005.1209-1-zhudi2@huawei.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Andrii Nakryiko says:
====================
Revamp existing low-level XDP APIs provided by libbpf to follow more
consistent naming (new APIs follow bpf_tc_xxx() approach where it makes
sense) and be extensible without ABI breakages (OPTS-based). See patch #1 for
details, remaining patches switch bpftool, selftests/bpf and samples/bpf to
new APIs.
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Switch to using new bpf_xdp_*() APIs across all selftests. Take
advantage of a more straightforward and user-friendly semantics of
old_prog_fd (0 means "don't care") in few places.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20220120061422.2710637-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Introduce 4 new netlink-based XDP APIs for attaching, detaching, and
querying XDP programs:
- bpf_xdp_attach;
- bpf_xdp_detach;
- bpf_xdp_query;
- bpf_xdp_query_id.
These APIs replace bpf_set_link_xdp_fd, bpf_set_link_xdp_fd_opts,
bpf_get_link_xdp_id, and bpf_get_link_xdp_info APIs ([0]). The latter
don't follow a consistent naming pattern and some of them use
non-extensible approaches (e.g., struct xdp_link_info which can't be
modified without breaking libbpf ABI).
The approach I took with these low-level XDP APIs is similar to what we
did with low-level TC APIs. There is a nice duality of bpf_tc_attach vs
bpf_xdp_attach, and so on. I left bpf_xdp_attach() to support detaching
when -1 is specified for prog_fd for generality and convenience, but
bpf_xdp_detach() is preferred due to clearer naming and associated
semantics. Both bpf_xdp_attach() and bpf_xdp_detach() accept the same
opts struct allowing to specify expected old_prog_fd.
While doing the refactoring, I noticed that old APIs require users to
specify opts with old_fd == -1 to declare "don't care about already
attached XDP prog fd" condition. Otherwise, FD 0 is assumed, which is
essentially never an intended behavior. So I made this behavior
consistent with other kernel and libbpf APIs, in which zero FD means "no
FD". This seems to be more in line with the latest thinking in BPF land
and should cause less user confusion, hopefully.
For querying, I left two APIs, both more generic bpf_xdp_query()
allowing to query multiple IDs and attach mode, but also
a specialization of it, bpf_xdp_query_id(), which returns only requested
prog_id. Uses of prog_id returning bpf_get_link_xdp_id() were so
prevalent across selftests and samples, that it seemed a very common use
case and using bpf_xdp_query() for doing it felt very cumbersome with
a highly branches if/else chain based on flags and attach mode.
Old APIs are scheduled for deprecation in libbpf 0.8 release.
[0] Closes: https://github.com/libbpf/libbpf/issues/309
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20220120061422.2710637-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Andrii Nakryiko says:
====================
Officially deprecate legacy BPF map definitions in libbpf. They've been slated
for deprecation for a while in favor of more powerful BTF-defined map
definitions and this patch set adds warnings and a way to enforce this in
libbpf through LIBBPF_STRICT_MAP_DEFINITIONS strict mode flag.
Selftests are fixed up and updated, BPF documentation is updated, bpftool's
strict mode usage is adjusted to avoid breaking users unnecessarily.
v1->v2:
- replace missed bpf_map_def case in Documentation/bpf/btf.rst (Alexei).
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Enact deprecation of legacy BPF map definition in SEC("maps") ([0]). For
the definitions themselves introduce LIBBPF_STRICT_MAP_DEFINITIONS flag
for libbpf strict mode. If it is set, error out on any struct
bpf_map_def-based map definition. If not set, libbpf will print out
a warning for each legacy BPF map to raise awareness that it goes away.
For any use of BPF_ANNOTATE_KV_PAIR() macro providing a legacy way to
associate BTF key/value type information with legacy BPF map definition,
warn through libbpf's pr_warn() error message (but don't fail BPF object
open).
BPF-side struct bpf_map_def is marked as deprecated. User-space struct
bpf_map_def has to be used internally in libbpf, so it is left
untouched. It should be enough for bpf_map__def() to be marked
deprecated to raise awareness that it goes away.
bpftool is an interesting case that utilizes libbpf to open BPF ELF
object to generate skeleton. As such, even though bpftool itself uses
full on strict libbpf mode (LIBBPF_STRICT_ALL), it has to relax it a bit
for BPF map definition handling to minimize unnecessary disruptions. So
opt-out of LIBBPF_STRICT_MAP_DEFINITIONS for bpftool. User's code that
will later use generated skeleton will make its own decision whether to
enforce LIBBPF_STRICT_MAP_DEFINITIONS or not.
There are few tests in selftests/bpf that are consciously using legacy
BPF map definitions to test libbpf functionality. For those, temporary
opt out of LIBBPF_STRICT_MAP_DEFINITIONS mode for the duration of those
tests.
[0] Closes: https://github.com/libbpf/libbpf/issues/272
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20220120060529.1890907-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Converted few remaining legacy BPF map definition to BTF-defined ones.
For the remaining two bpf_map_def-based legacy definitions that we want
to keep for testing purposes until libbpf 1.0 release, guard them in
pragma to suppres deprecation warnings which will be added in libbpf in
the next commit.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20220120060529.1890907-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
It's very easy to miss compilation warnings without -Werror, which is
not set for selftests. libbpf and bpftool are already strict about this,
so make selftests/bpf also treat compilation warnings as errors to catch
such regressions early.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20220120060529.1890907-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Some users have complained that selftests fail to build when
CONFIG_NF_CONNTRACK=m. It would be useful to allow building as long as
it is set to module or built-in, even though in case of building as
module, user would need to load it before running the selftest. Note
that this also allows building selftest when CONFIG_NF_CONNTRACK is
disabled.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220120164932.2798544-1-memxor@gmail.com
The bind_perm BPF selftest failed when port 111/tcp was already in use
during the test. To fix this, the test now runs in its own network name
space.
To use unshare, it is necessary to reorder the includes. The style of
the includes is adapted to be consistent with the other prog_tests.
v2: Replace deprecated CHECK macro with ASSERT_OK
Fixes: 8259fdeb30 ("selftests/bpf: Verify that rebinding to port < 1024 from BPF works")
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/551ee65533bb987a43f93d88eaf2368b416ccd32.1642518457.git.fmaurer@redhat.com
Lorenzo Bianconi says:
====================
Rely on ASSERT* macros and get rid of deprecated CHECK ones in xdp_bpf2bpf and
xdp_adjust_tail bpf selftests.
This is a preliminary series for XDP multi-frags support.
Changes since v1:
- run each ASSERT test separately
- drop unnecessary return statements
- drop unnecessary if condition in test_xdp_bpf2bpf()
====================
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
YiFei Zhu says:
====================
Right now, most cgroup hooks are best used for permission checks. They
can only reject a syscall with -EPERM, so a cause of a rejection, if
the rejected by eBPF cgroup hooks, is ambiguous to userspace.
Additionally, if the syscalls are implemented in eBPF, all permission
checks and the implementation has to happen within the same filter,
as programs executed later in the series of progs are unaware of the
return values return by the previous progs.
This patch series adds two helpers, bpf_get_retval and bpf_set_retval,
that allows hooks to get/set the return value of syscall to userspace.
This also allows later progs to retrieve retval set by previous progs.
For legacy programs that rejects a syscall without setting the retval,
for backwards compatibility, if a prog rejects without itself or a
prior prog setting retval to an -err, the retval is set by the kernel
to -EPERM.
For getsockopt hooks that has ctx->retval, this variable mirrors that
that accessed by the helpers.
Additionally, the following user-visible behavior for getsockopt
hooks has changed:
- If a prior filter rejected the syscall, it will be visible
in ctx->retval.
- Attempting to change the retval arbitrarily is now allowed and
will not cause an -EFAULT.
- If kernel rejects a getsockopt syscall before running the hooks,
the error will be visible in ctx->retval. Returning 0 from the
prog will not overwrite the error to -EPERM unless there is an
explicit call of bpf_set_retval(-EPERM)
Tests have been added in this series to test the behavior of the helper
with cgroup setsockopt getsockopt hooks.
Patch 1 changes the API of macros to prepare for the next patch and
should be a no-op.
Patch 2 moves ctx->retval to a struct pointed to by current
task_struct.
Patch 3 implements the helpers.
Patch 4 tests the behaviors of the helpers.
Patch 5 updates a test after the test broke due to the visible changes.
v1 -> v2:
- errno -> retval
- split one helper to get & set helpers
- allow retval to be set arbitrarily in the general case
- made the helper retval and context retval mirror each other
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The tests would break without this patch, because at one point it calls
getsockopt(fd, SOL_TCP, TCP_ZEROCOPY_RECEIVE, &buf, &optlen)
This getsockopt receives the kernel-set -EINVAL. Prior to this patch
series, the eBPF getsockopt hook's -EPERM would override kernel's
-EINVAL, however, after this patch series, return 0's automatic
-EPERM will not; the eBPF prog has to explicitly bpf_set_retval(-EPERM)
if that is wanted.
I also removed the explicit mentions of EPERM in the comments in the
prog.
Signed-off-by: YiFei Zhu <zhuyifei@google.com>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/4f20b77cb46812dbc2bdcd7e3fa87c7573bde55e.1639619851.git.zhuyifei@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The tests checks how different ways of interacting with the helpers
(getting retval, setting EUNATCH, EISCONN, and legacy reject
returning 0 without setting retval), produce different results in
both the setsockopt syscall and the retval returned by the helper.
A few more tests verify the interaction between the retval of the
helper and the retval in getsockopt context.
Signed-off-by: YiFei Zhu <zhuyifei@google.com>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/43ec60d679ae3f4f6fd2460559c28b63cb93cd12.1639619851.git.zhuyifei@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The helpers continue to use int for retval because all the hooks
are int-returning rather than long-returning. The return value of
bpf_set_retval is int for future-proofing, in case in the future
there may be errors trying to set the retval.
After the previous patch, if a program rejects a syscall by
returning 0, an -EPERM will be generated no matter if the retval
is already set to -err. This patch change it being forced only if
retval is not -err. This is because we want to support, for
example, invoking bpf_set_retval(-EINVAL) and return 0, and have
the syscall return value be -EINVAL not -EPERM.
For BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY, the prior behavior is
that, if the return value is NET_XMIT_DROP, the packet is silently
dropped. We preserve this behavior for backward compatibility
reasons, so even if an errno is set, the errno does not return to
caller. However, setting a non-err to retval cannot propagate so
this is not allowed and we return a -EFAULT in that case.
Signed-off-by: YiFei Zhu <zhuyifei@google.com>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/b4013fd5d16bed0b01977c1fafdeae12e1de61fb.1639619851.git.zhuyifei@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The retval value is moved to struct bpf_cg_run_ctx for ease of access
in different prog types with different context structs layouts. The
helper implementation (to be added in a later patch in the series) can
simply perform a container_of from current->bpf_ctx to retrieve
bpf_cg_run_ctx.
Unfortunately, there is no easy way to access the current task_struct
via the verifier BPF bytecode rewrite, aside from possibly calling a
helper, so a pointer to current task is added to struct bpf_sockopt_kern
so that the rewritten BPF bytecode can access struct bpf_cg_run_ctx with
an indirection.
For backward compatibility, if a getsockopt program rejects a syscall
by returning 0, an -EPERM will be generated, by having the
BPF_PROG_RUN_ARRAY_CG family macros automatically set the retval to
-EPERM. Unlike prior to this patch, this -EPERM will be visible to
ctx->retval for any other hooks down the line in the prog array.
Additionally, the restriction that getsockopt filters can only set
the retval to 0 is removed, considering that certain getsockopt
implementations may return optlen. Filters are now able to set the
value arbitrarily.
Signed-off-by: YiFei Zhu <zhuyifei@google.com>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/73b0325f5c29912ccea7ea57ec1ed4d388fc1d37.1639619851.git.zhuyifei@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Right now BPF_PROG_RUN_ARRAY and related macros return 1 or 0
for whether the prog array allows or rejects whatever is being
hooked. The caller of these macros then return -EPERM or continue
processing based on thw macro's return value. Unforunately this is
inflexible, since -EPERM is the only err that can be returned.
This patch should be a no-op; it prepares for the next patch. The
returning of the -EPERM is moved to inside the macros, so the outer
functions are directly returning what the macros returned if they
are non-zero.
Signed-off-by: YiFei Zhu <zhuyifei@google.com>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/788abcdca55886d1f43274c918eaa9f792a9f33b.1639619851.git.zhuyifei@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add a hashmap to map the string offsets from a source btf to the
string offsets from a target btf to reduce overheads.
btf__add_btf() calls btf__add_str() to add strings from a source to a
target btf. It causes many string comparisons, and it is a major
hotspot when adding a big btf. btf__add_str() uses strcmp() to check
if a hash entry is the right one. The extra hashmap here compares
offsets of strings, that are much cheaper. It remembers the results
of btf__add_str() for later uses to reduce the cost.
We are parallelizing BTF encoding for pahole by creating separated btf
instances for worker threads. These per-thread btf instances will be
added to the btf instance of the main thread by calling btf__add_str()
to deduplicate and write out. With this patch and -j4, the running
time of pahole drops to about 6.0s from 6.6s.
The following lines are the summary of 'perf stat' w/o the change.
6.668126396 seconds time elapsed
13.451054000 seconds user
0.715520000 seconds sys
The following lines are the summary w/ the change.
5.986973919 seconds time elapsed
12.939903000 seconds user
0.724152000 seconds sys
V4 fixes a bug of error checking against the pointer returned by
hashmap__new().
[v3] https://lore.kernel.org/bpf/20220118232053.2113139-1-kuifeng@fb.com/
[v2] https://lore.kernel.org/bpf/20220114193713.461349-1-kuifeng@fb.com/
Signed-off-by: Kui-Feng Lee <kuifeng@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220119180214.255634-1-kuifeng@fb.com
Currently the syscalls rst and subsequently man page are auto-generated
using function documentation present in bpf.h. If the documentation for the
syscall is missing or doesn't follow a specific format, then that syscall
is not dumped in the auto-generated rst.
This patch checks the number of syscalls documented within the header file
with those present as part of the enum bpf_cmd and raises an Exception if
they don't match. It is not needed with the currently documented upstream
syscalls, but can help in debugging when developing new syscalls when
there might be missing or misformatted documentation.
The function helper_number_check is moved to the Printer parent
class and renamed to elem_number_check as all the most derived children
classes are using this function now.
Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/20220119114442.1452088-3-usama.arif@bytedance.com
This enforce a minimal formatting consistency for the documentation. The
description and returns missing for a few helpers have also been added.
Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/20220119114442.1452088-2-usama.arif@bytedance.com
Both description and returns section will become mandatory
for helpers and syscalls in a later commit to generate man pages.
This commit also adds in the documentation that BPF_PROG_RUN is
an alias for BPF_PROG_TEST_RUN for anyone searching for the
syscall in the generated man pages.
Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220119114442.1452088-1-usama.arif@bytedance.com
`bpftool prog list` and other bpftool subcommands that show
BPF program names currently get them from bpf_prog_info.name.
That field is limited to 16 (BPF_OBJ_NAME_LEN) chars which leads
to truncated names since many progs have much longer names.
The idea of this change is to improve all bpftool commands that
output prog name so that bpftool uses info from BTF to print
program names if available.
It tries bpf_prog_info.name first and fall back to btf only if
the name is suspected to be truncated (has 15 chars length).
Right now `bpftool p show id <id>` returns capped prog name
<id>: kprobe name example_cap_cap tag 712e...
...
With this change it would return
<id>: kprobe name example_cap_capable tag 712e...
...
Note, other commands that print prog names (e.g. "bpftool
cgroup tree") are also addressed in this change.
Signed-off-by: Raman Shukhau <ramasha@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220119100255.1068997-1-ramasha@fb.com
The btf.h header included with libbpf contains inline helper functions to
check for various BTF kinds. These helpers directly reference the
BTF_KIND_* constants defined in the kernel header, and because the header
file is included in user applications, this happens in the user application
compile units.
This presents a problem if a user application is compiled on a system with
older kernel headers because the constants are not available. To avoid
this, add #defines of the constants directly in btf.h before using them.
Since the kernel header moved to an enum for BTF_KIND_*, the #defines can
shadow the enum values without any errors, so we only need #ifndef guards
for the constants that predates the conversion to enum. We group these so
there's only one guard for groups of values that were added together.
[0] Closes: https://github.com/libbpf/libbpf/issues/436
Fixes: 223f903e9c ("bpf: Rename BTF_KIND_TAG to BTF_KIND_DECL_TAG")
Fixes: 5b84bd1036 ("libbpf: Add support for BTF_KIND_TAG")
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: https://lore.kernel.org/bpf/20220118141327.34231-1-toke@redhat.com
Kuniyuki Iwashima says:
====================
Last year the commit afd20b9290 ("af_unix: Replace the big lock with
small locks.") landed on bpf-next. Now we can use a batching algorithm
for AF_UNIX bpf iter as TCP bpf iter.
Changelog:
- Add the 1st patch.
- Call unix_get_first() in .start()/.next() to always acquire a lock in
each iteration in the 2nd patch.
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The commit b8a58aa6fc ("af_unix: Cut unix_validate_addr() out of
unix_mkname().") moved the bound test part into unix_validate_addr().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Link: https://lore.kernel.org/r/20220113002849.4384-6-kuniyu@amazon.co.jp
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch adds a test for the batching and bpf_(get|set)sockopt in bpf
unix iter.
It does the following.
1. Creates an abstract UNIX domain socket
2. Call bpf_setsockopt()
3. Call bpf_getsockopt() and save the value
4. Call setsockopt()
5. Call getsockopt() and save the value
6. Compare the saved values
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Link: https://lore.kernel.org/r/20220113002849.4384-5-kuniyu@amazon.co.jp
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The commit 04c7820b77 ("bpf: tcp: Bpf iter batching and lock_sock")
introduces the batching algorithm to iterate TCP sockets with more
consistency.
This patch uses the same algorithm to iterate AF_UNIX sockets.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Link: https://lore.kernel.org/r/20220113002849.4384-3-kuniyu@amazon.co.jp
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently, unix_next_socket() is overloaded depending on the 2nd argument.
If it is NULL, unix_next_socket() returns the first socket in the hash. If
not NULL, it returns the next socket in the same hash list or the first
socket in the next non-empty hash list.
This patch refactors unix_next_socket() into two functions unix_get_first()
and unix_get_next(). unix_get_first() newly acquires a lock and returns
the first socket in the list. unix_get_next() returns the next socket in a
list or releases a lock and falls back to unix_get_first().
In the following patch, bpf iter holds entire sockets in a list and always
releases the lock before .show(). It always calls unix_get_first() to
acquire a lock in each iteration. So, this patch makes the change easier
to follow.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Link: https://lore.kernel.org/r/20220113002849.4384-2-kuniyu@amazon.co.jp
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Kumar Kartikeya says:
====================
This series adds unstable conntrack lookup helpers using BPF kfunc support. The
patch adding the lookup helper is based off of Maxim's recent patch to aid in
rebasing their series on top of this, all adjusted to work with module kfuncs [0].
[0]: https://lore.kernel.org/bpf/20211019144655.3483197-8-maximmi@nvidia.com
To enable returning a reference to struct nf_conn, the verifier is extended to
support reference tracking for PTR_TO_BTF_ID, and kfunc is extended with support
for working as acquire/release functions, similar to existing BPF helpers. kfunc
returning pointer (limited to PTR_TO_BTF_ID in the kernel) can also return a
PTR_TO_BTF_ID_OR_NULL now, typically needed when acquiring a resource can fail.
kfunc can also receive PTR_TO_CTX and PTR_TO_MEM (with some limitations) as
arguments now. There is also support for passing a mem, len pair as argument
to kfunc now. In such cases, passing pointer to unsized type (void) is also
permitted.
Please see individual commits for details.
Changelog:
----------
v7 -> v8:
v7: https://lore.kernel.org/bpf/20220111180428.931466-1-memxor@gmail.com
* Move enum btf_kfunc_hook to btf.c (Alexei)
* Drop verbose log for unlikely failure case in __find_kfunc_desc_btf (Alexei)
* Remove unnecessary barrier in register_btf_kfunc_id_set (Alexei)
* Switch macro in bpf_nf test to __always_inline function (Alexei)
v6 -> v7:
v6: https://lore.kernel.org/bpf/20220102162115.1506833-1-memxor@gmail.com
* Drop try_module_get_live patch, use flag in btf_module struct (Alexei)
* Add comments and expand commit message detailing why we have to concatenate
and sort vmlinux kfunc BTF ID sets (Alexei)
* Use bpf_testmod for testing btf_try_get_module race (Alexei)
* Use bpf_prog_type for both btf_kfunc_id_set_contains and
register_btf_kfunc_id_set calls (Alexei)
* In case of module set registration, directly assign set (Alexei)
* Add CONFIG_USERFAULTFD=y to selftest config
* Fix other nits
v5 -> v6:
v5: https://lore.kernel.org/bpf/20211230023705.3860970-1-memxor@gmail.com
* Fix for a bug in btf_try_get_module leading to use-after-free
* Drop *kallsyms_on_each_symbol loop, reinstate register_btf_kfunc_id_set (Alexei)
* btf_free_kfunc_set_tab now takes struct btf, and handles resetting tab to NULL
* Check return value btf_name_by_offset for param_name
* Instead of using tmp_set, use btf->kfunc_set_tab directly, and simplify cleanup
v4 -> v5:
v4: https://lore.kernel.org/bpf/20211217015031.1278167-1-memxor@gmail.com
* Move nf_conntrack helpers code to its own separate file (Toke, Pablo)
* Remove verifier callbacks, put btf_id_sets in struct btf (Alexei)
* Convert the in-kernel users away from the old API
* Change len__ prefix convention to __sz suffix (Alexei)
* Drop parent_ref_obj_id patch (Alexei)
v3 -> v4:
v3: https://lore.kernel.org/bpf/20211210130230.4128676-1-memxor@gmail.com
* Guard unstable CT helpers with CONFIG_DEBUG_INFO_BTF_MODULES
* Move addition of prog_test test kfuncs to selftest commit
* Move negative kfunc tests to test_verifier suite
* Limit struct nesting depth to 4, which should be enough for now
v2 -> v3:
v2: https://lore.kernel.org/bpf/20211209170929.3485242-1-memxor@gmail.com
* Fix build error for !CONFIG_BPF_SYSCALL (Patchwork)
RFC v1 -> v2:
v1: https://lore.kernel.org/bpf/20211030144609.263572-1-memxor@gmail.com
* Limit PTR_TO_MEM support to pointer to scalar, or struct with scalars (Alexei)
* Use btf_id_set for checking acquire, release, ret type null (Alexei)
* Introduce opts struct for CT helpers, move int err parameter to it
* Add l4proto as parameter to CT helper's opts, remove separate tcp/udp helpers
* Add support for mem, len argument pair to kfunc
* Allow void * as pointer type for mem, len argument pair
* Extend selftests to cover new additions to kfuncs
* Copy ref_obj_id to PTR_TO_BTF_ID dst_reg on btf_struct_access, test it
* Fix other misc nits, bugs, and expand commit messages
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Use the prog_test kfuncs to test the referenced PTR_TO_BTF_ID kfunc
support, and PTR_TO_CTX, PTR_TO_MEM argument passing support. Also
testing the various failure cases for invalid kfunc prototypes.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20220114163953.1455836-10-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This allows us to add tests (esp. negative tests) where we only want to
ensure the program doesn't pass through the verifier, and also verify
the error. The next commit will add the tests making use of this.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20220114163953.1455836-9-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This tests that we return errors as documented, and also that the kfunc
calls work from both XDP and TC hooks.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20220114163953.1455836-8-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This change adds conntrack lookup helpers using the unstable kfunc call
interface for the XDP and TC-BPF hooks. The primary usecase is
implementing a synproxy in XDP, see Maxim's patchset [0].
Export get_net_ns_by_id as nf_conntrack_bpf.c needs to call it.
This object is only built when CONFIG_DEBUG_INFO_BTF_MODULES is enabled.
[0]: https://lore.kernel.org/bpf/20211019144655.3483197-1-maximmi@nvidia.com
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20220114163953.1455836-7-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch adds verifier support for PTR_TO_BTF_ID return type of kfunc
to be a reference, by reusing acquire_reference_state/release_reference
support for existing in-kernel bpf helpers.
We make use of the three kfunc types:
- BTF_KFUNC_TYPE_ACQUIRE
Return true if kfunc_btf_id is an acquire kfunc. This will
acquire_reference_state for the returned PTR_TO_BTF_ID (this is the
only allow return value). Note that acquire kfunc must always return a
PTR_TO_BTF_ID{_OR_NULL}, otherwise the program is rejected.
- BTF_KFUNC_TYPE_RELEASE
Return true if kfunc_btf_id is a release kfunc. This will release the
reference to the passed in PTR_TO_BTF_ID which has a reference state
(from earlier acquire kfunc).
The btf_check_func_arg_match returns the regno (of argument register,
hence > 0) if the kfunc is a release kfunc, and a proper referenced
PTR_TO_BTF_ID is being passed to it.
This is similar to how helper call check uses bpf_call_arg_meta to
store the ref_obj_id that is later used to release the reference.
Similar to in-kernel helper, we only allow passing one referenced
PTR_TO_BTF_ID as an argument. It can also be passed in to normal
kfunc, but in case of release kfunc there must always be one
PTR_TO_BTF_ID argument that is referenced.
- BTF_KFUNC_TYPE_RET_NULL
For kfunc returning PTR_TO_BTF_ID, tells if it can be NULL, hence
force caller to mark the pointer not null (using check) before
accessing it. Note that taking into account the case fixed by commit
93c230e3f5 ("bpf: Enforce id generation for all may-be-null register type")
we assign a non-zero id for mark_ptr_or_null_reg logic. Later, if more
return types are supported by kfunc, which have a _OR_NULL variant, it
might be better to move this id generation under a common
reg_type_may_be_null check, similar to the case in the commit.
Referenced PTR_TO_BTF_ID is currently only limited to kfunc, but can be
extended in the future to other BPF helpers as well. For now, we can
rely on the btf_struct_ids_match check to ensure we get the pointer to
the expected struct type. In the future, care needs to be taken to avoid
ambiguity for reference PTR_TO_BTF_ID passed to release function, in
case multiple candidates can release same BTF ID.
e.g. there might be two release kfuncs (or kfunc and helper):
foo(struct abc *p);
bar(struct abc *p);
... such that both release a PTR_TO_BTF_ID with btf_id of struct abc. In
this case we would need to track the acquire function corresponding to
the release function to avoid type confusion, and store this information
in the register state so that an incorrect program can be rejected. This
is not a problem right now, hence it is left as an exercise for the
future patch introducing such a case in the kernel.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20220114163953.1455836-6-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
BPF helpers can associate two adjacent arguments together to pass memory
of certain size, using ARG_PTR_TO_MEM and ARG_CONST_SIZE arguments.
Since we don't use bpf_func_proto for kfunc, we need to leverage BTF to
implement similar support.
The ARG_CONST_SIZE processing for helpers is refactored into a common
check_mem_size_reg helper that is shared with kfunc as well. kfunc
ptr_to_mem support follows logic similar to global functions, where
verification is done as if pointer is not null, even when it may be
null.
This leads to a simple to follow rule for writing kfunc: always check
the argument pointer for NULL, except when it is PTR_TO_CTX. Also, the
PTR_TO_CTX case is also only safe when the helper expecting pointer to
program ctx is not exposed to other programs where same struct is not
ctx type. In that case, the type check will fall through to other cases
and would permit passing other types of pointers, possibly NULL at
runtime.
Currently, we require the size argument to be suffixed with "__sz" in
the parameter name. This information is then recorded in kernel BTF and
verified during function argument checking. In the future we can use BTF
tagging instead, and modify the kernel function definitions. This will
be a purely kernel-side change.
This allows us to have some form of backwards compatibility for
structures that are passed in to the kernel function with their size,
and allow variable length structures to be passed in if they are
accompanied by a size parameter.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20220114163953.1455836-5-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Completely remove the old code for check_kfunc_call to help it work
with modules, and also the callback itself.
The previous commit adds infrastructure to register all sets and put
them in vmlinux or module BTF, and concatenates all related sets
organized by the hook and the type. Once populated, these sets remain
immutable for the lifetime of the struct btf.
Also, since we don't need the 'owner' module anywhere when doing
check_kfunc_call, drop the 'btf_modp' module parameter from
find_kfunc_desc_btf.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20220114163953.1455836-4-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch prepares the kernel to support putting all kinds of kfunc BTF
ID sets in the struct btf itself. The various kernel subsystems will
make register_btf_kfunc_id_set call in the initcalls (for built-in code
and modules).
The 'hook' is one of the many program types, e.g. XDP and TC/SCHED_CLS,
STRUCT_OPS, and 'types' are check (allowed or not), acquire, release,
and ret_null (with PTR_TO_BTF_ID_OR_NULL return type).
A maximum of BTF_KFUNC_SET_MAX_CNT (32) kfunc BTF IDs are permitted in a
set of certain hook and type for vmlinux sets, since they are allocated
on demand, and otherwise set as NULL. Module sets can only be registered
once per hook and type, hence they are directly assigned.
A new btf_kfunc_id_set_contains function is exposed for use in verifier,
this new method is faster than the existing list searching method, and
is also automatic. It also lets other code not care whether the set is
unallocated or not.
Note that module code can only do single register_btf_kfunc_id_set call
per hook. This is why sorting is only done for in-kernel vmlinux sets,
because there might be multiple sets for the same hook and type that
must be concatenated, hence sorting them is required to ensure bsearch
in btf_id_set_contains continues to work correctly.
Next commit will update the kernel users to make use of this
infrastructure.
Finally, add __maybe_unused annotation for BTF ID macros for the
!CONFIG_DEBUG_INFO_BTF case, so that they don't produce warnings during
build time.
The previous patch is also needed to provide synchronization against
initialization for module BTF's kfunc_set_tab introduced here, as
described below:
The kfunc_set_tab pointer in struct btf is write-once (if we consider
the registration phase (comprised of multiple register_btf_kfunc_id_set
calls) as a single operation). In this sense, once it has been fully
prepared, it isn't modified, only used for lookup (from the verifier
context).
For btf_vmlinux, it is initialized fully during the do_initcalls phase,
which happens fairly early in the boot process, before any processes are
present. This also eliminates the possibility of bpf_check being called
at that point, thus relieving us of ensuring any synchronization between
the registration and lookup function (btf_kfunc_id_set_contains).
However, the case for module BTF is a bit tricky. The BTF is parsed,
prepared, and published from the MODULE_STATE_COMING notifier callback.
After this, the module initcalls are invoked, where our registration
function will be called to populate the kfunc_set_tab for module BTF.
At this point, BTF may be available to userspace while its corresponding
module is still intializing. A BTF fd can then be passed to verifier
using bpf syscall (e.g. for kfunc call insn).
Hence, there is a race window where verifier may concurrently try to
lookup the kfunc_set_tab. To prevent this race, we must ensure the
operations are serialized, or waiting for the __init functions to
complete.
In the earlier registration API, this race was alleviated as verifier
bpf_check_mod_kfunc_call didn't find the kfunc BTF ID until it was added
by the registration function (called usually at the end of module __init
function after all module resources have been initialized). If the
verifier made the check_kfunc_call before kfunc BTF ID was added to the
list, it would fail verification (saying call isn't allowed). The
access to list was protected using a mutex.
Now, it would still fail verification, but for a different reason
(returning ENXIO due to the failed btf_try_get_module call in
add_kfunc_call), because if the __init call is in progress the module
will be in the middle of MODULE_STATE_COMING -> MODULE_STATE_LIVE
transition, and the BTF_MODULE_LIVE flag for btf_module instance will
not be set, so the btf_try_get_module call will fail.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20220114163953.1455836-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>