Introduce types and configs for bpf programs that can be attached to
LSM hooks. The programs can be enabled by the config option
CONFIG_BPF_LSM.
Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Brendan Jackman <jackmanb@google.com>
Reviewed-by: Florent Revest <revest@google.com>
Reviewed-by: Thomas Garnier <thgarnie@google.com>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: James Morris <jamorris@linux.microsoft.com>
Link: https://lore.kernel.org/bpf/20200329004356.27286-2-kpsingh@chromium.org
This adds a test to exercise the new bpf_map__set_initial_value() function.
The test simply overrides the global data section with all zeroes, and
checks that the new value makes it into the kernel map on load.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200329132253.232541-2-toke@redhat.com
For internal maps (most notably the maps backing global variables), libbpf
uses an internal mmaped area to store the data after opening the object.
This data is subsequently copied into the kernel map when the object is
loaded.
This adds a function to set a new value for that data, which can be used to
before it is loaded into the kernel. This is especially relevant for RODATA
maps, since those are frozen on load.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200329132253.232541-1-toke@redhat.com
Fix a redefinition of 'net_gen_cookie' error that was overlooked
when net ns is not configured.
Fixes: f318903c0b ("bpf: Add netns cookie and enable it for bpf cgroup hooks")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Toke Høiland-Jørgensen says:
====================
This series adds support for atomically replacing the XDP program loaded on an
interface. This is achieved by means of a new netlink attribute that can specify
the expected previous program to replace on the interface. If set, the kernel
will compare this "expected fd" attribute with the program currently loaded on
the interface, and reject the operation if it does not match.
With this primitive, userspace applications can avoid stepping on each other's
toes when simultaneously updating the loaded XDP program.
Changelog:
v4:
- Switch back to passing FD instead of ID (Andrii)
- Rename flag to XDP_FLAGS_REPLACE (for consistency with other similar uses)
v3:
- Pass existing ID instead of FD (Jakub)
- Use opts struct for new libbpf function (Andrii)
v2:
- Fix checkpatch nits and add .strict_start_type to netlink policy (Jakub)
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This adds a new function to set the XDP fd while specifying the FD of the
program to replace, using the newly added IFLA_XDP_EXPECTED_FD netlink
parameter. The new function uses the opts struct mechanism to be extendable
in the future.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/158515700857.92963.7052131201257841700.stgit@toke.dk
While it is currently possible for userspace to specify that an existing
XDP program should not be replaced when attaching to an interface, there is
no mechanism to safely replace a specific XDP program with another.
This patch adds a new netlink attribute, IFLA_XDP_EXPECTED_FD, which can be
set along with IFLA_XDP_FD. If set, the kernel will check that the program
currently loaded on the interface matches the expected one, and fail the
operation if it does not. This corresponds to a 'cmpxchg' memory operation.
Setting the new attribute with a negative value means that no program is
expected to be attached, which corresponds to setting the UPDATE_IF_NOEXIST
flag.
A new companion flag, XDP_FLAGS_REPLACE, is also added to explicitly
request checking of the EXPECTED_FD attribute. This is needed for userspace
to discover whether the kernel supports the new attribute.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/bpf/158515700640.92963.3551295145441017022.stgit@toke.dk
Fix build warnings when building net/bpf/test_run.o with W=1 due
to missing prototype for bpf_fentry_test{1..6}.
Instead of declaring prototypes, turn off warnings with
__diag_{push,ignore,pop} as pointed out by Alexei.
Signed-off-by: Jean-Philippe Menil <jpmenil@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200327204713.28050-1-jpmenil@gmail.com
Fix a sharp edge in xsk_umem__create and xsk_socket__create. Almost all of
the members of the ring buffer structs are initialized, but the "cached_xxx"
variables are not all initialized. The caller is required to zero them.
This is needlessly dangerous. The results if you don't do it can be very bad.
For example, they can cause xsk_prod_nb_free and xsk_cons_nb_avail to return
values greater than the size of the queue. xsk_ring_cons__peek can return an
index that does not refer to an item that has been queued.
I have confirmed that without this change, my program misbehaves unless I
memset the ring buffers to zero before calling the function. Afterwards,
my program works without (or with) the memset.
Signed-off-by: Fletcher Dunn <fletcherd@valvesoftware.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/85f12913cde94b19bfcb598344701c38@valvesoftware.com
Daniel Borkmann says:
====================
This adds various straight-forward helper improvements and additions to BPF
cgroup based connect(), sendmsg(), recvmsg() and bind-related hooks which
would allow to implement more fine-grained policies and improve current load
balancer limitations we're seeing. For details please see individual patches.
I've tested them on Kubernetes & Cilium and also added selftests for the small
verifier extension. Thanks!
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
We already have the bpf_get_current_uid_gid() helper enabled, and
given we now have perf event RB output available for connect(),
sendmsg(), recvmsg() and bind-related hooks, add a trivial change
to enable bpf_get_current_pid_tgid() and bpf_get_current_comm()
as well.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/18744744ed93c06343be8b41edcfd858706f39d7.1585323121.git.daniel@iogearbox.net
Enable the bpf_get_current_cgroup_id() helper for connect(), sendmsg(),
recvmsg() and bind-related hooks in order to retrieve the cgroup v2
context which can then be used as part of the key for BPF map lookups,
for example. Given these hooks operate in process context 'current' is
always valid and pointing to the app that is performing mentioned
syscalls if it's subject to a v2 cgroup. Also with same motivation of
commit 7723628101 ("bpf: Introduce bpf_skb_ancestor_cgroup_id helper")
enable retrieval of ancestor from current so the cgroup id can be used
for policy lookups which can then forbid connect() / bind(), for example.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/d2a7ef42530ad299e3cbb245e6c12374b72145ef.1585323121.git.daniel@iogearbox.net
Today, Kubernetes is still operating on cgroups v1, however, it is
possible to retrieve the task's classid based on 'current' out of
connect(), sendmsg(), recvmsg() and bind-related hooks for orchestrators
which attach to the root cgroup v2 hook in a mixed env like in case
of Cilium, for example, in order to then correlate certain pod traffic
and use it as part of the key for BPF map lookups.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/555e1c69db7376c0947007b4951c260e1074efc3.1585323121.git.daniel@iogearbox.net
In Cilium we're mainly using BPF cgroup hooks today in order to implement
kube-proxy free Kubernetes service translation for ClusterIP, NodePort (*),
ExternalIP, and LoadBalancer as well as HostPort mapping [0] for all traffic
between Cilium managed nodes. While this works in its current shape and avoids
packet-level NAT for inter Cilium managed node traffic, there is one major
limitation we're facing today, that is, lack of netns awareness.
In Kubernetes, the concept of Pods (which hold one or multiple containers)
has been built around network namespaces, so while we can use the global scope
of attaching to root BPF cgroup hooks also to our advantage (e.g. for exposing
NodePort ports on loopback addresses), we also have the need to differentiate
between initial network namespaces and non-initial one. For example, ExternalIP
services mandate that non-local service IPs are not to be translated from the
host (initial) network namespace as one example. Right now, we have an ugly
work-around in place where non-local service IPs for ExternalIP services are
not xlated from connect() and friends BPF hooks but instead via less efficient
packet-level NAT on the veth tc ingress hook for Pod traffic.
On top of determining whether we're in initial or non-initial network namespace
we also have a need for a socket-cookie like mechanism for network namespaces
scope. Socket cookies have the nice property that they can be combined as part
of the key structure e.g. for BPF LRU maps without having to worry that the
cookie could be recycled. We are planning to use this for our sessionAffinity
implementation for services. Therefore, add a new bpf_get_netns_cookie() helper
which would resolve both use cases at once: bpf_get_netns_cookie(NULL) would
provide the cookie for the initial network namespace while passing the context
instead of NULL would provide the cookie from the application's network namespace.
We're using a hole, so no size increase; the assignment happens only once.
Therefore this allows for a comparison on initial namespace as well as regular
cookie usage as we have today with socket cookies. We could later on enable
this helper for other program types as well as we would see need.
(*) Both externalTrafficPolicy={Local|Cluster} types
[0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/c47d2346982693a9cf9da0e12690453aded4c788.1585323121.git.daniel@iogearbox.net
Currently, connect(), sendmsg(), recvmsg() and bind-related hooks
are all lacking perf event rb output in order to push notifications
or monitoring events up to user space. Back in commit a5a3a828cd
("bpf: add perf event notificaton support for sock_ops"), I've worked
with Sowmini to enable them for sock_ops where the context part is
not used (as opposed to skbs for example where the packet data can
be appended). Make the bpf_sockopt_event_output() helper generic and
enable it for mentioned hooks.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/69c39daf87e076b31e52473c902e9bfd37559124.1585323121.git.daniel@iogearbox.net
We currently make heavy use of the socket cookie in BPF's connect(),
sendmsg() and recvmsg() hooks for load-balancing decisions. However,
it is currently not enabled/implemented in BPF {post-}bind hooks
where it can later be used in combination for correlation in the tc
egress path, for example.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/e9d71f310715332f12d238cc650c1edc5be55119.1585323121.git.daniel@iogearbox.net
Factor out logic mapping expected program attach type to program type and
subsequent handling of program attach/detach. Also list out all supported
cgroup BPF program types explicitly to prevent accidental bugs once more
program types are added to a mapping. Do the same for prog_query API.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200325065746.640559-3-andriin@fb.com
Refactor cgroup attach/detach code to abstract away common operations
performed on all types of cgroup storages. This makes high-level logic more
apparent, plus allows to reuse more code across multiple functions.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200325065746.640559-2-andriin@fb.com
After changes to add update_reg_bounds after ALU ops and adding ALU32
bounds tracking the error message is changed in the 32-bit right shift
tests.
Test "#70/u bounds check after 32-bit right shift with 64-bit input FAIL"
now fails with,
Unexpected error message!
EXP: R0 invalid mem access
RES: func#0 @0
7: (b7) r1 = 2
8: R0_w=map_value(id=0,off=0,ks=8,vs=8,imm=0) R1_w=invP2 R10=fp0 fp-8_w=mmmmmmmm
8: (67) r1 <<= 31
9: R0_w=map_value(id=0,off=0,ks=8,vs=8,imm=0) R1_w=invP4294967296 R10=fp0 fp-8_w=mmmmmmmm
9: (74) w1 >>= 31
10: R0_w=map_value(id=0,off=0,ks=8,vs=8,imm=0) R1_w=invP0 R10=fp0 fp-8_w=mmmmmmmm
10: (14) w1 -= 2
11: R0_w=map_value(id=0,off=0,ks=8,vs=8,imm=0) R1_w=invP4294967294 R10=fp0 fp-8_w=mmmmmmmm
11: (0f) r0 += r1
math between map_value pointer and 4294967294 is not allowed
And test "#70/p bounds check after 32-bit right shift with 64-bit input
FAIL" now fails with,
Unexpected error message!
EXP: R0 invalid mem access
RES: func#0 @0
7: (b7) r1 = 2
8: R0_w=map_value(id=0,off=0,ks=8,vs=8,imm=0) R1_w=inv2 R10=fp0 fp-8_w=mmmmmmmm
8: (67) r1 <<= 31
9: R0_w=map_value(id=0,off=0,ks=8,vs=8,imm=0) R1_w=inv4294967296 R10=fp0 fp-8_w=mmmmmmmm
9: (74) w1 >>= 31
10: R0_w=map_value(id=0,off=0,ks=8,vs=8,imm=0) R1_w=inv0 R10=fp0 fp-8_w=mmmmmmmm
10: (14) w1 -= 2
11: R0_w=map_value(id=0,off=0,ks=8,vs=8,imm=0) R1_w=inv4294967294 R10=fp0 fp-8_w=mmmmmmmm
11: (0f) r0 += r1
last_idx 11 first_idx 0
regs=2 stack=0 before 10: (14) w1 -= 2
regs=2 stack=0 before 9: (74) w1 >>= 31
regs=2 stack=0 before 8: (67) r1 <<= 31
regs=2 stack=0 before 7: (b7) r1 = 2
math between map_value pointer and 4294967294 is not allowed
Before this series we did not trip the "math between map_value pointer..."
error because check_reg_sane_offset is never called in
adjust_ptr_min_max_vals(). Instead we have a register state that looks
like this at line 11*,
11: R0_w=map_value(id=0,off=0,ks=8,vs=8,
smin_value=0,smax_value=0,
umin_value=0,umax_value=0,
var_off=(0x0; 0x0))
R1_w=invP(id=0,
smin_value=0,smax_value=4294967295,
umin_value=0,umax_value=4294967295,
var_off=(0xfffffffe; 0x0))
R10=fp(id=0,off=0,
smin_value=0,smax_value=0,
umin_value=0,umax_value=0,
var_off=(0x0; 0x0)) fp-8_w=mmmmmmmm
11: (0f) r0 += r1
In R1 'smin_val != smax_val' yet we have a tnum_const as seen
by 'var_off(0xfffffffe; 0x0))' with a 0x0 mask. So we hit this check
in adjust_ptr_min_max_vals()
if ((known && (smin_val != smax_val || umin_val != umax_val)) ||
smin_val > smax_val || umin_val > umax_val) {
/* Taint dst register if offset had invalid bounds derived from
* e.g. dead branches.
*/
__mark_reg_unknown(env, dst_reg);
return 0;
}
So we don't throw an error here and instead only throw an error
later in the verification when the memory access is made.
The root cause in verifier without alu32 bounds tracking is having
'umin_value = 0' and 'umax_value = U64_MAX' from BPF_SUB which we set
when 'umin_value < umax_val' here,
if (dst_reg->umin_value < umax_val) {
/* Overflow possible, we know nothing */
dst_reg->umin_value = 0;
dst_reg->umax_value = U64_MAX;
} else { ...}
Later in adjust_calar_min_max_vals we previously did a
coerce_reg_to_size() which will clamp the U64_MAX to U32_MAX by
truncating to 32bits. But either way without a call to update_reg_bounds
the less precise bounds tracking will fall out of the alu op
verification.
After latest changes we now exit adjust_scalar_min_max_vals with the
more precise umin value, due to zero extension propogating bounds from
alu32 bounds into alu64 bounds and then calling update_reg_bounds.
This then causes the verifier to trigger an earlier error and we get
the error in the output above.
This patch updates tests to reflect new error message.
* I have a local patch to print entire verifier state regardless if we
believe it is a constant so we can get a full picture of the state.
Usually if tnum_is_const() then bounds are also smin=smax, etc. but
this is not always true and is a bit subtle. Being able to see these
states helps understand dataflow imo. Let me know if we want something
similar upstream.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/158507161475.15666.3061518385241144063.stgit@john-Precision-5820-Tower
Currently, for all op verification we call __red_deduce_bounds() and
__red_bound_offset() but we only call __update_reg_bounds() in bitwise
ops. However, we could benefit from calling __update_reg_bounds() in
BPF_ADD, BPF_SUB, and BPF_MUL cases as well.
For example, a register with state 'R1_w=invP0' when we subtract from
it,
w1 -= 2
Before coerce we will now have an smin_value=S64_MIN, smax_value=U64_MAX
and unsigned bounds umin_value=0, umax_value=U64_MAX. These will then
be clamped to S32_MIN, U32_MAX values by coerce in the case of alu32 op
as done in above example. However tnum will be a constant because the
ALU op is done on a constant.
Without update_reg_bounds() we have a scenario where tnum is a const
but our unsigned bounds do not reflect this. By calling update_reg_bounds
after coerce to 32bit we further refine the umin_value to U64_MAX in the
alu64 case or U32_MAX in the alu32 case above.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/158507151689.15666.566796274289413203.stgit@john-Precision-5820-Tower
Pull per op ALU logic into individual functions. We are about to add
u32 versions of each of these by pull them out the code gets a bit
more readable here and nicer in the next patch.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/158507149518.15666.15672349629329072411.stgit@john-Precision-5820-Tower
For each prog/btf load we allocate and free 16 megs of verifier buffer.
On production systems it doesn't really make sense because the
programs/btf have gone through extensive testing and (mostly) guaranteed
to successfully load.
Let's assume successful case by default and skip buffer allocation
on the first try. If there is an error, start with BPF_LOG_BUF_SIZE
and double it on each ENOSPC iteration.
v3:
* Return -ENOMEM when can't allocate log buffer (Andrii Nakryiko)
v2:
* Don't allocate the buffer at all on the first try (Andrii Nakryiko)
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200325195521.112210-1-sdf@google.com
Has been unused since commit ef99b02b23 ("libbpf: capture value in BTF
type info for BTF-defined map defs").
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200325113655.19341-1-tklauser@distanz.ch
It's a follow-up for discussion in [1].
drgn tool bpf_inspect.py was merged to drgn repo in [2]. Document it
in kernel tree to make BPF developers aware that the tool exists and
can help with getting BPF state unavailable via UAPI.
For now it's just one tool but the doc is written in a way that allows
to cover more tools in the future if needed.
Please refer to the doc itself for more details.
The patch was tested by `make htmldocs` and sanity-checking that
resulting html looks good.
v2 -> v3:
- two sections: "Description" and "Getting started" (Daniel);
- add examples in "Getting started" section (Daniel);
- add "Customization" section to show how tool can be customized.
v1 -> v2:
- better "BPF drgn tools" section (Alexei)
[1] https://lore.kernel.org/bpf/20200228201514.GB51456@rdna-mbp/T/#mefed65e8a98116bd5d07d09a570a3eac46724951
[2] https://github.com/osandov/drgn/pull/49
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200324185135.1431038-1-rdna@fb.com
The bpf_program__attach of libbpf(using bpf_link) is much more intuitive
than the previous method using ioctl.
bpf_program__attach_perf_event manages the enable of perf_event and
attach of BPF programs to it, so there's no neeed to do this
directly with ioctl.
In addition, bpf_link provides consistency in the use of API because it
allows disable (detach, destroy) for multiple events to be treated as
one bpf_link__destroy. Also, bpf_link__destroy manages the close() of
perf_event fd.
This commit refactors samples that attach the bpf program to perf_event
by using libbbpf instead of ioctl. Also the bpf_load in the samples were
removed and migrated to use libbbpf API.
Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200321100424.1593964-3-danieltimlee@gmail.com
To reduce the reliance of trace samples (trace*_user) on bpf_load,
move read_trace_pipe to trace_helpers. By moving this bpf_loader helper
elsewhere, trace functions can be easily migrated to libbbpf.
Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200321100424.1593964-2-danieltimlee@gmail.com
This patch adds test to exercise the bpf_sk_storage_get()
and bpf_sk_storage_delete() helper from the bpf_dctcp.c.
The setup and check on the sk_storage is done immediately
before and after the connect().
This patch also takes this chance to move the pthread_create()
after the connect() has been done. That will remove the need of
the "wait_thread" label.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200320152107.2169904-1-kafai@fb.com
This patch adds bpf_sk_storage_get() and bpf_sk_storage_delete()
helper to the bpf_tcp_ca's struct_ops. That would allow
bpf-tcp-cc to:
1) share sk private data with other bpf progs.
2) use bpf_sk_storage as a private storage for a bpf-tcp-cc
if the existing icsk_ca_priv is not big enough.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200320152101.2169498-1-kafai@fb.com
Clang's -Wmisleading-indentation warns about misleading indentations if
there's a mixture of spaces and tabs. Remove extraneous spaces.
Signed-off-by: Bill Wendling <morbo@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200320201510.217169-1-morbo@google.com
After commit f747632b60 ("bpf: sockmap: Move generic sockmap
hooks from BPF TCP"), tcp_bpf_recvmsg() is not used out of
tcp_bpf.c, so make it static and remove it from tcp.h. Also move
it to BPF_STREAM_PARSER #ifdef to fix unused function warnings.
Fixes: f747632b60 ("bpf: sockmap: Move generic sockmap hooks from BPF TCP")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200320023426.60684-3-yuehaibing@huawei.com
If BPF_STREAM_PARSER is not set, gcc warns:
net/ipv4/tcp_bpf.c:483:12: warning: 'tcp_bpf_sendpage' defined but not used [-Wunused-function]
net/ipv4/tcp_bpf.c:395:12: warning: 'tcp_bpf_sendmsg' defined but not used [-Wunused-function]
net/ipv4/tcp_bpf.c:13:13: warning: 'tcp_bpf_stream_read' defined but not used [-Wunused-function]
Moves the unused functions into the #ifdef CONFIG_BPF_STREAM_PARSER.
Fixes: f747632b60 ("bpf: sockmap: Move generic sockmap hooks from BPF TCP")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200320023426.60684-2-yuehaibing@huawei.com
This patch adds struct_ops support to the bpftool.
To recap a bit on the recent bpf_struct_ops feature on the kernel side:
It currently supports "struct tcp_congestion_ops" to be implemented
in bpf. At a high level, bpf_struct_ops is struct_ops map populated
with a number of bpf progs. bpf_struct_ops currently supports the
"struct tcp_congestion_ops". However, the bpf_struct_ops design is
generic enough that other kernel struct ops can be supported in
the future.
Although struct_ops is map+progs at a high lever, there are differences
in details. For example,
1) After registering a struct_ops, the struct_ops is held by the kernel
subsystem (e.g. tcp-cc). Thus, there is no need to pin a
struct_ops map or its progs in order to keep them around.
2) To iterate all struct_ops in a system, it iterates all maps
in type BPF_MAP_TYPE_STRUCT_OPS. BPF_MAP_TYPE_STRUCT_OPS is
the current usual filter. In the future, it may need to
filter by other struct_ops specific properties. e.g. filter by
tcp_congestion_ops or other kernel subsystem ops in the future.
3) struct_ops requires the running kernel having BTF info. That allows
more flexibility in handling other kernel structs. e.g. it can
always dump the latest bpf_map_info.
4) Also, "struct_ops" command is not intended to repeat all features
already provided by "map" or "prog". For example, if there really
is a need to pin the struct_ops map, the user can use the "map" cmd
to do that.
While the first attempt was to reuse parts from map/prog.c, it ended up
not a lot to share. The only obvious item is the map_parse_fds() but
that still requires modifications to accommodate struct_ops map specific
filtering (for the immediate and the future needs). Together with the
earlier mentioned differences, it is better to part away from map/prog.c.
The initial set of subcmds are, register, unregister, show, and dump.
For register, it registers all struct_ops maps that can be found in an
obj file. Option can be added in the future to specify a particular
struct_ops map. Also, the common bpf_tcp_cc is stateless (e.g.
bpf_cubic.c and bpf_dctcp.c). The "reuse map" feature is not
implemented in this patch and it can be considered later also.
For other subcmds, please see the man doc for details.
A sample output of dump:
[root@arch-fb-vm1 bpf]# bpftool struct_ops dump name cubic
[{
"bpf_map_info": {
"type": 26,
"id": 64,
"key_size": 4,
"value_size": 256,
"max_entries": 1,
"map_flags": 0,
"name": "cubic",
"ifindex": 0,
"btf_vmlinux_value_type_id": 18452,
"netns_dev": 0,
"netns_ino": 0,
"btf_id": 52,
"btf_key_type_id": 0,
"btf_value_type_id": 0
}
},{
"bpf_struct_ops_tcp_congestion_ops": {
"refcnt": {
"refs": {
"counter": 1
}
},
"state": "BPF_STRUCT_OPS_STATE_INUSE",
"data": {
"list": {
"next": 0,
"prev": 0
},
"key": 0,
"flags": 0,
"init": "void (struct sock *) bictcp_init/prog_id:138",
"release": "void (struct sock *) 0",
"ssthresh": "u32 (struct sock *) bictcp_recalc_ssthresh/prog_id:141",
"cong_avoid": "void (struct sock *, u32, u32) bictcp_cong_avoid/prog_id:140",
"set_state": "void (struct sock *, u8) bictcp_state/prog_id:142",
"cwnd_event": "void (struct sock *, enum tcp_ca_event) bictcp_cwnd_event/prog_id:139",
"in_ack_event": "void (struct sock *, u32) 0",
"undo_cwnd": "u32 (struct sock *) tcp_reno_undo_cwnd/prog_id:144",
"pkts_acked": "void (struct sock *, const struct ack_sample *) bictcp_acked/prog_id:143",
"min_tso_segs": "u32 (struct sock *) 0",
"sndbuf_expand": "u32 (struct sock *) 0",
"cong_control": "void (struct sock *, const struct rate_sample *) 0",
"get_info": "size_t (struct sock *, u32, int *, union tcp_cc_info *) 0",
"name": "bpf_cubic",
"owner": 0
}
}
}
]
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/20200318171656.129650-1-kafai@fb.com
The kernel struct_ops obj has kernel's func ptrs implemented by bpf_progs.
The bpf prog_id is stored as the value of the func ptr for introspection
purpose. In the latter patch, a struct_ops dump subcmd will be added
to introspect these func ptrs. It is desired to print the actual bpf
prog_name instead of only printing the prog_id.
Since struct_ops is the only usecase storing prog_id in the func ptr,
this patch adds a prog_id_as_func_ptr bool (default is false) to
"struct btf_dumper" in order not to mis-interpret the ptr value
for the other existing use-cases.
While printing a func_ptr as a bpf prog_name,
this patch also prefix the bpf prog_name with the ptr's func_proto.
[ Note that it is the ptr's func_proto instead of the bpf prog's
func_proto ]
It reuses the current btf_dump_func() to obtain the ptr's func_proto
string.
Here is an example from the bpf_cubic.c:
"void (struct sock *, u32, u32) bictcp_cong_avoid/prog_id:140"
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/20200318171650.129252-1-kafai@fb.com
A char[] is currently printed as an integer array.
This patch will print it as a string when
1) The array element type is an one byte int
2) The array element type has a BTF_INT_CHAR encoding or
the array element type's name is "char"
3) All characters is between (0x1f, 0x7f) and it is terminated
by a null character.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/20200318171643.129021-1-kafai@fb.com
This patch prints the enum's name if there is one found in
the array of btf_enum.
The commit 9eea984979 ("bpf: fix BTF verification of enums")
has details about an enum could have any power-of-2 size (up to 8 bytes).
This patch also takes this chance to accommodate these non 4 byte
enums.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/20200318171637.128862-1-kafai@fb.com
Simplify gen_btf logic to make it work with llvm-objcopy. The existing
'file format' and 'architecture' parsing logic is brittle and does not
work with llvm-objcopy/llvm-objdump.
'file format' output of llvm-objdump>=11 will match GNU objdump, but
'architecture' (bfdarch) may not.
.BTF in .tmp_vmlinux.btf is non-SHF_ALLOC. Add the SHF_ALLOC flag
because it is part of vmlinux image used for introspection. C code
can reference the section via linker script defined __start_BTF and
__stop_BTF. This fixes a small problem that previous .BTF had the
SHF_WRITE flag (objcopy -I binary -O elf* synthesized .data).
Additionally, `objcopy -I binary` synthesized symbols
_binary__btf_vmlinux_bin_start and _binary__btf_vmlinux_bin_stop (not
used elsewhere) are replaced with more commonplace __start_BTF and
__stop_BTF.
Add 2>/dev/null because GNU objcopy (but not llvm-objcopy) warns
"empty loadable segment detected at vaddr=0xffffffff81000000, is this intentional?"
We use a dd command to change the e_type field in the ELF header from
ET_EXEC to ET_REL so that lld will accept .btf.vmlinux.bin.o. Accepting
ET_EXEC as an input file is an extremely rare GNU ld feature that lld
does not intend to support, because this is error-prone.
The output section description .BTF in include/asm-generic/vmlinux.lds.h
avoids potential subtle orphan section placement issues and suppresses
--orphan-handling=warn warnings.
Fixes: df786c9b94 ("bpf: Force .BTF section start to zero when dumping from vmlinux")
Fixes: cb0cc635c7 ("powerpc: Include .BTF section")
Reported-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Fangrui Song <maskray@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Stanislav Fomichev <sdf@google.com>
Tested-by: Andrii Nakryiko <andriin@fb.com>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Link: https://github.com/ClangBuiltLinux/linux/issues/871
Link: https://lore.kernel.org/bpf/20200318222746.173648-1-maskray@google.com
Use PT_REGS_RC instead of PT_REGS_RET to get ret correctly.
Fixes: df8ff35311 ("libbpf: Merge selftests' bpf_trace_helpers.h into libbpf's bpf_tracing.h")
Signed-off-by: Wenbo Zhang <ethercflow@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200315083252.22274-1-ethercflow@gmail.com
Some tests and sub-tests are setting "custom" thread/process affinity and
don't reset it back. Instead of requiring each test to undo all this, ensure
that thread affinity is restored by test_progs test runner itself.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200314013932.4035712-3-andriin@fb.com
When specifying disjoint set of tests, test_progs doesn't set skipped test's
array elements to false. This leads to spurious execution of tests that should
have been skipped. Fix it by explicitly initializing them to false.
Fixes: 3a516a0a3a ("selftests/bpf: add sub-tests support for test_progs")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200314013932.4035712-2-andriin@fb.com
Previous attempt to make tcp_rtt more robust introduced a new race, in which
server_done might be set to true before server can actually accept any
connection. Fix this by unconditionally waiting for accept(). Given socket is
non-blocking, if there are any problems with client side, it should eventually
close listening FD and let server thread exit with failure.
Fixes: 4cd729fa02 ("selftests/bpf: Make tcp_rtt test more robust to failures")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200314013932.4035712-1-andriin@fb.com
Amazingly, some libc implementations don't call __NR_nanosleep syscall from
their nanosleep() APIs. Hammer it down with explicit syscall() call and never
get back to it again. Also simplify code for timespec initialization.
I verified that nanosleep is called w/ printk and in exactly same Linux image
that is used in Travis CI. So it should both sleep and call correct syscall.
v1->v2:
- math is too hard, fix usec -> nsec convertion (Martin);
- test_vmlinux has explicit nanosleep() call, convert that one as well.
Fixes: 4e1fd25d19 ("selftests/bpf: Fix usleep() implementation")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200314002743.3782677-1-andriin@fb.com
Remove unused len variable, which causes compilation warnings.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200314001834.3727680-1-andriin@fb.com
XDP-redirect is broken in this driver sfc. XDP_REDIRECT requires
tailroom for skb_shared_info when creating an SKB based on the
redirected xdp_frame (both in cpumap and veth).
The fix requires some initial explaining. The driver uses RX page-split
when possible. It reserves the top 64 bytes in the RX-page for storing
dma_addr (struct efx_rx_page_state). It also have the XDP recommended
headroom of XDP_PACKET_HEADROOM (256 bytes). As it doesn't reserve any
tailroom, it can still fit two standard MTU (1500) frames into one page.
The sizeof struct skb_shared_info in 320 bytes. Thus drivers like ixgbe
and i40e, reduce their XDP headroom to 192 bytes, which allows them to
fit two frames with max 1536 bytes into a 4K page (192+1536+320=2048).
The fix is to reduce this drivers headroom to 128 bytes and add the 320
bytes tailroom. This account for reserved top 64 bytes in the page, and
still fit two frame in a page for normal MTUs.
We must never go below 128 bytes of headroom for XDP, as one cacheline
is for xdp_frame area and next cacheline is reserved for metadata area.
Fixes: eb9a36be7f ("sfc: perform XDP processing on received packets")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rearrange the config files for remoteproc and IPA to fix their
interdependencies.
First, have CONFIG_QCOM_Q6V5_MSS select QCOM_Q6V5_IPA_NOTIFY so the
notification code is built regardless of whether IPA needs it.
Next, represent QCOM_IPA as being dependent on QCOM_Q6V5_MSS rather
than setting its value to match QCOM_Q6V5_COMMON (which is selected
by QCOM_Q6V5_MSS).
Drop all dependencies from QCOM_Q6V5_IPA_NOTIFY. The notification
code will be built whenever QCOM_Q6V5_MSS is set, and it has no other
dependencies.
Signed-off-by: Alex Elder <elder@linaro.org>
Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This path fixes the suspicious RCU usage warning reported by
kernel test robot.
net/kcm/kcmproc.c:#RCU-list_traversed_in_non-reader_section
There is no need to use list_for_each_entry_rcu() in
kcm_stats_seq_show() as the list is always traversed under
knet->mutex held.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove set but not used variables 'sw_comp_cons' and 'hw_comp_cons'
to fix gcc '-Wunused-but-set-variable' warning:
drivers/net/ethernet/qlogic/qede/qede_ethtool.c: In function qede_selftest_receive_traffic:
drivers/net/ethernet/qlogic/qede/qede_ethtool.c:1569:20:
warning: variable sw_comp_cons set but not used [-Wunused-but-set-variable]
drivers/net/ethernet/qlogic/qede/qede_ethtool.c: In function qede_selftest_receive_traffic:
drivers/net/ethernet/qlogic/qede/qede_ethtool.c:1569:6:
warning: variable hw_comp_cons set but not used [-Wunused-but-set-variable]
After removing 'hw_comp_cons',the memory barrier 'rmb()' and its comments become useless,
so remove them as well.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>