Commit Graph

998313 Commits

Author SHA1 Message Date
Alexei Starovoitov
89d69c5d0f Merge branch 'sockmap: introduce BPF_SK_SKB_VERDICT and support UDP'
Cong Wang says:

====================

From: Cong Wang <cong.wang@bytedance.com>

We have thousands of services connected to a daemon on every host
via AF_UNIX dgram sockets. After they are moved into a VM, we have to
add a proxy to forward these communications from the VM to the host,
because rewriting thousands of them is not practical. This proxy uses an
AF_UNIX socket connected to the services and a UDP socket to connect to
the host. It is inefficient because data is copied between kernel
space and user space twice, and we cannot use splice(), which only
supports TCP. Therefore, we want to use sockmap to do the splicing
without going to user space at all (after the initial setup).

Currently sockmap fully supports only TCP; UDP is partially supported,
as UDP sockets can only be added to a sockmap. This patchset, as the second
part of the original large patchset, extends sockmap with:
1) cross-protocol support with BPF_SK_SKB_VERDICT; 2) full UDP support.

On the high level, ->read_sock() is required for each protocol to support
sockmap redirection, and in order to do sock proto update, a new ops
->psock_update_sk_prot() is introduced, which is also required. And the
BPF ->recvmsg() is also needed to replace the original ->recvmsg() to
retrieve skmsg. To make life easier, we have to get rid of lock_sock()
in sk_psock_handle_skb(), otherwise we would have to implement
->sendmsg_locked() on top of ->sendmsg(), which is ugly.

Please see each patch for more details.

To see the big picture, the original patchset is available here:
https://github.com/congwang/linux/tree/sockmap
this patchset is also available:
https://github.com/congwang/linux/tree/sockmap2
---
v8: get rid of 'offset' in udp_read_sock()
    add checks for skb_verdict/stream_verdict conflict
    add two cleanup patches for sock_map_link()
    add a new test case

v7: use work_mutex to protect psock->work
    return err in udp_read_sock()
    add patch 6/13
    clean up test case

v6: get rid of sk_psock_zap_ingress()
    add rcu work patch

v5: use INDIRECT_CALL_2() for function pointers
    use ingress_lock to fix a race condition found by Jakub
    rename two helper functions

v4: get rid of lock_sock() in sk_psock_handle_skb()
    get rid of udp_sendmsg_locked()
    remove an empty line
    update cover letter

v3: export tcp/udp_update_proto()
    rename sk->sk_prot->psock_update_sk_prot()
    improve changelogs

v2: separate from the original large patchset
    rebase to the latest bpf-next
    split UDP test case
    move inet_csk_has_ulp() check to tcp_bpf.c
    clean up udp_read_sock()
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2021-04-01 10:56:15 -07:00
Cong Wang
8d7cb74f2c selftests/bpf: Add a test case for loading BPF_SK_SKB_VERDICT
This adds a test case to ensure BPF_SK_SKB_VERDICT and
BPF_SK_STREAM_VERDICT will never be attached at the same time.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210331023237.41094-17-xiyou.wangcong@gmail.com
2021-04-01 10:56:14 -07:00
Cong Wang
d6378af615 selftests/bpf: Add a test case for udp sockmap
Add a test case to ensure redirection between two UDP sockets works.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210331023237.41094-16-xiyou.wangcong@gmail.com
2021-04-01 10:56:14 -07:00
Cong Wang
122e6c79ef sock_map: Update sock type checks for UDP
Now that UDP supports sockmap and redirection, we can safely update
the sock type checks for it accordingly.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210331023237.41094-15-xiyou.wangcong@gmail.com
2021-04-01 10:56:14 -07:00
Cong Wang
1f5be6b3b0 udp: Implement udp_bpf_recvmsg() for sockmap
We have to implement udp_bpf_recvmsg() to replace the ->recvmsg()
to retrieve skmsg from ingress_msg.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210331023237.41094-14-xiyou.wangcong@gmail.com
2021-04-01 10:56:14 -07:00
Cong Wang
2bc793e327 skmsg: Extract __tcp_bpf_recvmsg() and tcp_bpf_wait_data()
Although these two functions are only used by TCP, they are not
specific to TCP at all; both operate on skmsg and ingress_msg,
so they fit in net/core/skmsg.c very well.

And we will need them for non-TCP, so rename and move them to
skmsg.c and export them to modules.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210331023237.41094-13-xiyou.wangcong@gmail.com
2021-04-01 10:56:14 -07:00
Cong Wang
d7f571188e udp: Implement ->read_sock() for sockmap
This is similar to tcp_read_sock(), except we do not need
to worry about connections, we just need to retrieve skb
from UDP receive queue.

Note, the return value of ->read_sock() is unused in
sk_psock_verdict_data_ready(), and UDP still does not
support splice() due to lack of ->splice_read(), so users
can not reach udp_read_sock() directly.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210331023237.41094-12-xiyou.wangcong@gmail.com
2021-04-01 10:56:14 -07:00
Cong Wang
8a59f9d1e3 sock: Introduce sk->sk_prot->psock_update_sk_prot()
Currently sockmap calls into each protocol to update the struct
proto and replace it. This certainly won't work when the protocol
is implemented as a module, for example, AF_UNIX.

Introduce a new ops sk->sk_prot->psock_update_sk_prot(), so each
protocol can implement its own way to replace the struct proto.
This also helps get rid of symbol dependencies on CONFIG_INET.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210331023237.41094-11-xiyou.wangcong@gmail.com
2021-04-01 10:56:14 -07:00
Cong Wang
a7ba4558e6 sock_map: Introduce BPF_SK_SKB_VERDICT
Reusing BPF_SK_SKB_STREAM_VERDICT is possible but its name is
confusing and more importantly we still want to distinguish them
from user-space. So we can just reuse the stream verdict code but
introduce a new type of eBPF program, skb_verdict. Users are not
allowed to attach stream_verdict and skb_verdict programs to the
same map.
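
The mutual exclusion described above can be pictured with a small
user-space model. This is only a sketch: the class, the slot names and
the choice of EBUSY as the error are assumptions made for illustration,
not the kernel's data structures or its exact return value.

```python
import errno


class SockMapProgs:
    """Toy stand-in for a map's verdict program slots (hypothetical)."""

    def __init__(self):
        self.stream_verdict = None
        self.skb_verdict = None

    def attach(self, kind, prog):
        """Fill one slot, refusing if the conflicting slot is in use."""
        if kind == "stream_verdict":
            if self.skb_verdict is not None:
                # Assumed error code; the point is that the attach fails.
                raise OSError(errno.EBUSY, "skb_verdict already attached")
            self.stream_verdict = prog
        elif kind == "skb_verdict":
            if self.stream_verdict is not None:
                raise OSError(errno.EBUSY, "stream_verdict already attached")
            self.skb_verdict = prog
        else:
            raise ValueError(f"unknown attach type: {kind}")
```

Attaching either program type alone succeeds; the second attach of the
other type on the same map object is the one that is rejected.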

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210331023237.41094-10-xiyou.wangcong@gmail.com
2021-04-01 10:56:14 -07:00
Cong Wang
b017055255 sock_map: Kill sock_map_link_no_progs()
Now we can fold sock_map_link_no_progs() into sock_map_link()
and get rid of sock_map_link_no_progs().

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210331023237.41094-9-xiyou.wangcong@gmail.com
2021-04-01 10:56:13 -07:00
Cong Wang
2004fdbd8a sock_map: Simplify sock_map_link() a bit
sock_map_link() passes down map progs, but it is confusing
to see both map progs and psock progs. Make the map progs
more obvious by retrieving it directly with sock_map_progs()
inside sock_map_link(). Now it is aligned with
sock_map_link_no_progs() too.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210331023237.41094-8-xiyou.wangcong@gmail.com
2021-04-01 10:56:13 -07:00
Cong Wang
190179f65b skmsg: Use GFP_KERNEL in sk_psock_create_ingress_msg()
This function is only called in process context.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210331023237.41094-7-xiyou.wangcong@gmail.com
2021-04-01 10:56:13 -07:00
Cong Wang
7786dfc41a skmsg: Use rcu work for destroying psock
The RCU callback sk_psock_destroy() only queues work psock->gc,
so we can just switch to rcu work to simplify the code.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210331023237.41094-6-xiyou.wangcong@gmail.com
2021-04-01 10:56:13 -07:00
Cong Wang
799aa7f98d skmsg: Avoid lock_sock() in sk_psock_backlog()
We do not have to lock the sock to avoid losing sk_socket;
instead, we can purge all the ingress queues when we close
the socket. Sending or receiving packets after orphaning
the socket makes no sense.

We do purge these queues when psock refcnt reaches zero but
here we want to purge them explicitly in sock_map_close().
There are also some nasty race conditions on testing bit
SK_PSOCK_TX_ENABLED and queuing/canceling the psock work,
we can expand psock->ingress_lock a bit to protect them too.

As noticed by John, we still have to lock the psock->work,
because the same work item could be running concurrently on
different CPUs.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210331023237.41094-5-xiyou.wangcong@gmail.com
2021-04-01 10:56:13 -07:00
Cong Wang
0739cd28f2 net: Introduce skb_send_sock() for sock_map
We only have skb_send_sock_locked(), which requires callers
to use lock_sock(). Introduce a variant, skb_send_sock(),
which locks on its own, so callers do not need to lock the
sock any more. This will save us from adding a ->sendmsg_locked
for each protocol.

To reuse the code, pass function pointers to __skb_send_sock()
and build skb_send_sock() and skb_send_sock_locked() on top.
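
The shape of that refactoring, one shared loop parameterized by the
send routine, can be sketched in user space. Everything below is a toy
model with hypothetical names (Sock, _send_all and so on); it does not
mirror the kernel signatures, it only shows how two thin wrappers can
share one body while differing in who takes the lock.

```python
import threading


class Sock:
    """Toy socket: a lock plus a record of what was 'sent'."""

    def __init__(self):
        self.lock = threading.Lock()
        self.sent = []


def sendmsg_locked(sk, data):
    # Precondition: the caller already holds sk.lock.
    sk.sent.append(data)
    return len(data)


def sendmsg_unlocked(sk, data):
    # Takes sk.lock itself, so the caller must not hold it.
    with sk.lock:
        sk.sent.append(data)
    return len(data)


def _send_all(sk, chunks, sendmsg):
    """Shared loop; only the sendmsg callable differs between variants."""
    total = 0
    for chunk in chunks:
        total += sendmsg(sk, chunk)
    return total


def send_sock_locked(sk, chunks):
    return _send_all(sk, chunks, sendmsg_locked)


def send_sock(sk, chunks):
    return _send_all(sk, chunks, sendmsg_unlocked)
```

A caller that already holds the lock uses send_sock_locked(); everyone
else calls send_sock() and never touches the lock directly.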

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210331023237.41094-4-xiyou.wangcong@gmail.com
2021-04-01 10:56:13 -07:00
Cong Wang
b01fd6e802 skmsg: Introduce a spinlock to protect ingress_msg
Currently we rely on lock_sock to protect ingress_msg, which
is too heavy for this. We can just use a spinlock to protect
this list, like we do for other skb queues.

__tcp_bpf_recvmsg() is still special because of peeking;
it still has to use lock_sock.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210331023237.41094-3-xiyou.wangcong@gmail.com
2021-04-01 10:56:13 -07:00
Cong Wang
37f0e514db skmsg: Lock ingress_skb when purging
Currently we purge the ingress_skb queue only when psock
refcnt goes down to 0, so locking the queue is not necessary,
but in order to be called during ->close, we have to lock it
here.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210331023237.41094-2-xiyou.wangcong@gmail.com
2021-04-01 10:56:13 -07:00
Stanislav Fomichev
e27bfefb21 tools/resolve_btfids: Fix warnings
* make eprintf static, used only in main.c
* initialize ret in eprintf
* remove unused *tmp

v3:
* remove another err (Song Liu)

v2:
* remove unused 'int err = -1'

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210329223143.3659983-1-sdf@google.com
2021-03-31 12:09:12 -07:00
KP Singh
63f8af0fc3 selftests/bpf: Add an option for a debug shell in vmtest.sh
The newly introduced -s command line option starts an interactive shell.
If a command is specified, the shell is started after the command
finishes executing. It's useful to have a shell especially when
debugging failing tests or developing new tests.

Since the user may terminate the VM forcefully, an extra "sync" is added
after the execution of the command to persist any logs from the command
into the log file.

Signed-off-by: KP Singh <kpsingh@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210323014752.3198283-1-kpsingh@kernel.org
2021-03-30 22:05:23 -07:00
Colin Ian King
235fc0e36d bpf: Remove redundant assignment of variable id
The variable id is being assigned a value that is never read, the
assignment is redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210326194348.623782-1-colin.king@canonical.com
2021-03-30 22:58:53 +02:00
He Fengqing
913d550376 bpf: Remove unused bpf_load_pointer
Remove unused bpf_load_pointer function in filter.h. The last user of it has
been removed with 24dea04767 ("bpf, x32: remove ld_abs/ld_ind").

Signed-off-by: He Fengqing <hefengqing@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210330024843.3479844-1-hefengqing@huawei.com
2021-03-30 22:37:06 +02:00
Alexei Starovoitov
2976706f34 Merge branch 'AF_XDP selftests improvements & bpf_link'
Maciej Fijalkowski says:

====================

Changes since v4 (all in patch 6):
- do not close potentially invalid bpf_link fd (Toke)
- fix misspelling in label (Toke)
- mask out XDP_FLAGS_UPDATE_IF_NOEXIST and XDP_FLAGS_REPLACE explicitly when
  creating bpf_link (Toke)

Changes since v3:
- do not unload netlink-based XDP prog when updating map elem failed and
  current socket was not the creator of XDP resources (Toke)
- pull out code paths based on prog_id value within __xsk_setup_xdp_prog
  so that teardown in case of error at any point is more clear

Changes since v2:
- fix c&p failure in veth's get_channels implementation (Magnus)
- provide backward compatibility if bpf_link is not supported (Andrii)
- check for a link type while looking up existing bpf_links (Andrii)

Changes since v1:
- selftests improvements and test case for bpf_link persistence itself
- do not unload netlink-based prog when --force flag is set (John)
- simplify return semantics in xsk_link_lookup (John)

v4: https://lore.kernel.org/bpf/20210326230938.49998-1-maciej.fijalkowski@intel.com/
v3: https://lore.kernel.org/bpf/20210322205816.65159-1-maciej.fijalkowski@intel.com/
v2: https://lore.kernel.org/bpf/20210311152910.56760-1-maciej.fijalkowski@intel.com/
v1: https://lore.kernel.org/bpf/20210215154638.4627-1-maciej.fijalkowski@intel.com/

--------------------------------------------------

This set is another approach towards addressing the below issue:

// load xdp prog and xskmap and add entry to xskmap at idx 10
$ sudo ./xdpsock -i ens801f0 -t -q 10

// add entry to xskmap at idx 11
$ sudo ./xdpsock -i ens801f0 -t -q 11

terminate one of the processes and another one is unable to work due to
the fact that the XDP prog was unloaded from interface.

Previous attempt was, to put it mildly, a bit broken, as there was no
synchronization between updates to additional map, as Bjorn pointed out.
See https://lore.kernel.org/netdev/20190603131907.13395-5-maciej.fijalkowski@intel.com/

In the meantime bpf_link was introduced and it seems that it can address
the issue of refcounting the XDP prog on interface.

Although the bpf_link is the meat of the set, selftests improvements are a
bigger part of it. Overall, we've been able to reduce the complexity of xsk
selftests by removing a bunch of synchronization resources and
simplifying logic and structs.

Last but not least, for multiqueue veth working with AF-XDP, ethtool's
get_channels API needs to be implemented, so it's also included in that
set.

Note also that in order to make it work, a commit from bpf tree:
veth: store queue_mapping independently of XDP prog presence
https://lore.kernel.org/bpf/20210303152903.11172-1-maciej.fijalkowski@intel.com/

is needed.

Thanks,
Maciej

Björn Töpel (3):
  selftests: xsk: remove thread attribute
  selftests: xsk: Remove mutex and condition variable
  selftests: xsk: Remove unused defines
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2021-03-30 09:24:40 -07:00
Björn Töpel
ae6b6a1780 selftests: xsk: Remove unused defines
Remove two unused defines.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210329224316.17793-18-maciej.fijalkowski@intel.com
2021-03-30 09:24:40 -07:00
Björn Töpel
96539f1c5e selftests: xsk: Remove mutex and condition variable
The usage of the condition variable is broken, and overkill. Replace it
with a pthread barrier.
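
For readers unfamiliar with barriers, here is a user-space sketch of the
idea, written with Python's threading.Barrier rather than the pthread
API the selftest uses. The function and event names are hypothetical,
and how the selftest places its barrier calls is not shown in this log;
the sketch only illustrates that a two-party barrier is enough to hold
back one side until the other has finished initializing.

```python
import threading


def run_rx_then_tx():
    """Start Tx and Rx threads; Tx proceeds only after Rx is ready."""
    events = []
    barrier = threading.Barrier(2)  # two parties: Rx and Tx

    def rx():
        events.append("rx ready")  # Rx-side initialization finished
        barrier.wait()             # release the waiting Tx side

    def tx():
        barrier.wait()             # blocks until Rx has reached the barrier
        events.append("tx start")

    threads = [threading.Thread(target=tx), threading.Thread(target=rx)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events
```

No mutex, predicate or wake-up call is needed, which is what makes the
barrier simpler than a condition variable for a one-shot rendezvous.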

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210329224316.17793-17-maciej.fijalkowski@intel.com
2021-03-30 09:24:40 -07:00
Björn Töpel
7651910257 selftests: xsk: Remove thread attribute
There is really no reason to have a non-default thread stack
size. Remove that.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210329224316.17793-16-maciej.fijalkowski@intel.com
2021-03-30 09:24:40 -07:00
Maciej Fijalkowski
27e1ca2525 selftests: xsk: Implement bpf_link test
Introduce a test that is supposed to verify the persistence of BPF
resources based on underlying bpf_link usage.

Test will:
1) create and bind two sockets on queue ids 0 and 1
2) run traffic on queue id 0
3) remove xsk sockets from queue 0 on both veth interfaces
4) run traffic on queue id 1

Running traffic successfully on queue id 1 means that BPF resources were
not removed in step 3).

In order to make it work, change the command that creates veth pair to
have the 4 queue pairs by default.

Introduce arrays of xsks and umems to the ifobject struct, but keep
pointers to single entities, so the rest of the logic around Rx/Tx can be
kept as-is.

For umem handling, double the size of mmapped space and split that
between the two sockets.

Rename also bidi_pass to a variable 'second_step' of a boolean type as
it's now used also for the test that is introduced here and it doesn't
have anything in common with bi-directional testing.

Drop opt_queue command line argument as it wasn't working before anyway.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210329224316.17793-15-maciej.fijalkowski@intel.com
2021-03-30 09:24:39 -07:00
Maciej Fijalkowski
34829eec3b veth: Implement ethtool's get_channels() callback
Libbpf's xsk part calls get_channels() API to retrieve the queue count
of the underlying driver so that XSKMAP is sized accordingly.

Implement that in veth so multi queue scenarios can work properly.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210329224316.17793-14-maciej.fijalkowski@intel.com
2021-03-30 09:24:39 -07:00
Maciej Fijalkowski
0464b1ed07 selftests: xsk: Remove sync_mutex_tx and atomic var
Although thread_common_ops() are called in both Tx and Rx threads,
testapp_validate() will not spawn Tx thread until Rx thread signals that
it has finished its initialization via condition variable.

Therefore, locking in thread_common_ops is not needed and furthermore Tx
thread does not have to spin on atomic variable.

Note that this simplification wouldn't be possible if there would still
be a common worker thread.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210329224316.17793-13-maciej.fijalkowski@intel.com
2021-03-30 09:24:39 -07:00
Maciej Fijalkowski
9445f8c765 selftests: xsk: Refactor teardown/bidi test cases and testapp_validate
Currently, there is a testapp_sockets() that acts like a wrapper around
testapp_validate() and it is called for bidi and teardown test types.
Other test types call testapp_validate() directly.

Split testapp_sockets() into two separate functions so a bunch of bidi
specific logic can be moved there and out of testapp_validate() itself.

Introduce a function pointer to the ifobject struct, which will be used
for assigning the Rx/Tx function that is run by the worker thread. Let's
also have global ifobject Rx/Tx pointers so it's easier to swap the
vectors on a second run of a bi-directional test. Thread creation is now
easy to follow.

switching_notify variable is useless, info about vector switch can be
printed based on bidi_pass state.

Last but not least, init/destroy synchronization variables only once,
not per each test.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210329224316.17793-12-maciej.fijalkowski@intel.com
2021-03-30 09:24:39 -07:00
Maciej Fijalkowski
99f9bcb657 selftests: xsk: Remove Tx synchronization resources
Tx thread needs to be started after the Rx side is fully initialized so
that packets are not xmitted until xsk Rx socket is ready to be used.

It can be observed that atomic variable spinning_tx is not checked from
Rx side in any way, so thread_common_ops can be modified to only address
the spinning_rx. This means that spinning_tx can be removed altogether.

signal_tx_condition is never utilized, so simply remove it.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210329224316.17793-11-maciej.fijalkowski@intel.com
2021-03-30 09:24:39 -07:00
Maciej Fijalkowski
9866bcd663 selftests: xsk: Split worker thread
Let's have separate Tx/Rx worker threads instead of one common
thread packed with Tx/Rx specific checks.

Move mmap for umem buffer space and a switch_namespace() call to
thread_common_ops.

This also allows for a bunch of simplifications that are the subject of
the next commits. The final result will be a code base that is much
easier to follow.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210329224316.17793-10-maciej.fijalkowski@intel.com
2021-03-30 09:24:39 -07:00
Maciej Fijalkowski
ef92807897 selftests: xsk: Remove thread for netns switch
Currently, there is a dedicated thread for following remote ns operations:
- grabbing the ifindex of the interface moved to remote netns
- removing xdp prog from that interface

With bpf_link usage in place, this can be simply omitted, so remove
mentioned thread, as BPF resources will be managed by bpf_link itself,
so there's no further need for creating the thread that will switch to
remote netns and do the cleanup.

Keep most of the logic for switching the ns, though, but make
switch_namespace() return the fd so that it will be possible to close it
at the process termination time. Get rid of logic around making sure
that it's possible to switch ns in validate_interfaces().

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210329224316.17793-9-maciej.fijalkowski@intel.com
2021-03-30 09:24:39 -07:00
Maciej Fijalkowski
c9d27c9e8d samples: bpf: Do not unload prog within xdpsock
With the introduction of bpf_link in xsk's libbpf part, there's no
further need for explicit unload of prog on xdpsock's termination. When
process dies, the bpf_link's refcount will be decremented and resources
will be unloaded/freed under the hood in case when there are no more
active users.

While at it, don't dump stats on error path.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210329224316.17793-8-maciej.fijalkowski@intel.com
2021-03-30 09:24:39 -07:00
Maciej Fijalkowski
10397994d3 libbpf: xsk: Use bpf_link
Currently, if there are multiple xdpsock instances running on a single
interface and one of the instances is terminated, the rest of
them are left in an inoperable state because the XDP prog has been
unloaded from the interface.

Consider the scenario below:

// load xdp prog and xskmap and add entry to xskmap at idx 10
$ sudo ./xdpsock -i ens801f0 -t -q 10

// add entry to xskmap at idx 11
$ sudo ./xdpsock -i ens801f0 -t -q 11

terminate one of the processes and another one is unable to work due to
the fact that the XDP prog was unloaded from interface.

To address that, step away from setting bpf prog in favour of bpf_link.
This means that refcounting of BPF resources will be done automatically
by bpf_link itself.
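
The lifetime rule can be illustrated with a toy reference-counting
model. This is a sketch under stated assumptions: XdpLink, get() and
put() are hypothetical names, and the interface is just a dict. It is
not the libbpf or kernel API; it only shows why one process exiting no
longer breaks the other.

```python
class XdpLink:
    """Toy link: keeps a program attached while any holder remains."""

    def __init__(self, iface, prog):
        self.iface = iface
        self.refs = 0
        iface["xdp_prog"] = prog  # attach on creation

    def get(self):
        """A process (e.g. one xdpsock instance) takes a reference."""
        self.refs += 1
        return self

    def put(self):
        """A holder goes away; detach only when the last one is gone."""
        self.refs -= 1
        if self.refs == 0:
            self.iface["xdp_prog"] = None
```

With two holders, the first put() leaves the program attached and only
the second put() detaches it, so the surviving process keeps working.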

Provide backward compatibility by checking if the underlying system is
bpf_link capable. Do this by looking up/creating a bpf_link on the loopback
device. If it fails in any way, stick with the netlink-based XDP prog.
Otherwise, use bpf_link-based logic.

When setting up BPF resources during xsk socket creation, check whether
bpf_link for a given ifindex already exists via set of calls to
bpf_link_get_next_id -> bpf_link_get_fd_by_id -> bpf_obj_get_info_by_fd
and comparing the ifindexes from bpf_link and xsk socket.

For case where resources exist but they are not AF_XDP related, bail out
and ask user to remove existing prog and then retry.

Lastly, do a bit of refactoring within __xsk_setup_xdp_prog and pull out
existing code branches based on prog_id value onto separate functions
that are responsible for resource initialization if prog_id was 0 and
for looking up existing resources for non-zero prog_id, as that implies that
XDP program is present on the underlying net device. This in turn makes
it easier to follow, especially the teardown part of both branches.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210329224316.17793-7-maciej.fijalkowski@intel.com
2021-03-30 09:24:38 -07:00
Maciej Fijalkowski
aa2d61c154 selftests: xsk: Simplify frame traversal in dumping thread
Store offsets to each layer in separate variables rather than compute
them every single time.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210329224316.17793-6-maciej.fijalkowski@intel.com
2021-03-30 09:24:38 -07:00
Maciej Fijalkowski
965d2cb0f6 selftests: xsk: Remove inline keyword from source file
Follow the kernel coding style guidelines and let the compiler
decide about inlining.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210329224316.17793-5-maciej.fijalkowski@intel.com
2021-03-30 09:24:38 -07:00
Maciej Fijalkowski
7519c387e6 selftests: xsk: Remove unused function
Probably it was ported from xdpsock but is not used anywhere.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210329224316.17793-4-maciej.fijalkowski@intel.com
2021-03-30 09:24:38 -07:00
Maciej Fijalkowski
e623bfdef7 selftests: xsk: Remove struct ifaceconfigobj
ifaceconfigobj is not really useful, it is possible to keep the
functionality and simplify the code.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210329224316.17793-3-maciej.fijalkowski@intel.com
2021-03-30 09:24:38 -07:00
Maciej Fijalkowski
9f33df73a9 selftests: xsk: Don't call worker_pkt_dump() for stats test
For TEST_TYPE_STATS, worker_pkt_validate() that places frames onto
pkt_buf is not called. Therefore, when dump mode is set, don't call
worker_pkt_dump() for mentioned test type, so that it won't crash on
pkt_buf() access.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210329224316.17793-2-maciej.fijalkowski@intel.com
2021-03-30 09:24:38 -07:00
Andrii Nakryiko
05d817031f libbpf: Fix memory leak when emitting final btf_ext
Free temporary allocated memory used to construct finalized .BTF.ext data.
Found by Coverity static analysis on libbpf's Github repo.

Fixes: 8fd27bf69b ("libbpf: Add BPF static linker BTF and BTF.ext support")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210327042502.969745-1-andrii@kernel.org
2021-03-30 07:38:36 -07:00
Alexei Starovoitov
b83fd195c2 Merge branch 'bpf: Update doc about calling kernel function'
Martin KaFai says:

====================

This set updates the document about the bpf program calling kernel
function.  In particular, the updates concern the clang requirement
in selftests and the fact that kfunc-call is not an ABI.
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2021-03-30 07:31:01 -07:00
Martin KaFai Lau
2ba4badca9 bpf: selftests: Update clang requirement in README.rst for testing kfunc call
This patch updates the README.rst to specify the clang requirement
to compile the bpf selftests that call kernel function.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210330054156.2933804-1-kafai@fb.com
2021-03-30 07:31:01 -07:00
Martin KaFai Lau
5bdca94ff3 bpf: Update bpf_design_QA.rst to clarify the kfunc call is not ABI
This patch updates bpf_design_QA.rst to clarify that the kernel
function callable by bpf program is not an ABI.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210330054150.2933542-1-kafai@fb.com
2021-03-30 07:31:01 -07:00
Martin KaFai Lau
7aae231ac9 bpf: tcp: Limit calling some tcp cc functions to CONFIG_DYNAMIC_FTRACE
pahole currently only generates the btf_id for external functions and
ftrace-able functions.  Some functions in the bpf_tcp_ca_kfunc_ids
are static (e.g. cubictcp_init).  Thus, unless CONFIG_DYNAMIC_FTRACE
is set, btf_ids for those functions will not be generated and the
compilation fails during resolve_btfids.

This patch limits those functions to CONFIG_DYNAMIC_FTRACE.  I will
address the pahole generation in a followup and then remove the
CONFIG_DYNAMIC_FTRACE limitation.

Fixes: e78aea8b21 ("bpf: tcp: Put some tcp cong functions in allowlist for bpf-tcp-cc")
Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
Reported-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210329221357.834438-1-kafai@fb.com
2021-03-29 18:42:43 -07:00
Martin KaFai Lau
21cfd2db9f bpf: tcp: Fix an error in the bpf_tcp_ca_kfunc_ids list
There is a typo in the bbr function, s/even/event/.
This patch fixes it.

Fixes: e78aea8b21 ("bpf: tcp: Put some tcp cong functions in allowlist for bpf-tcp-cc")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210329003213.2274210-1-kafai@fb.com
2021-03-28 18:17:08 -07:00
Atul Gopinathan
7e32a09fdc bpf: tcp: Remove comma which is causing build error
Currently, building the bpf-next source with CONFIG_BPF_SYSCALL
enabled causes a compilation error:

"net/ipv4/bpf_tcp_ca.c:209:28: error: expected identifier or '(' before
',' token"

Fix this by removing an unnecessary comma.

Fixes: e78aea8b21 ("bpf: tcp: Put some tcp cong functions in allowlist for bpf-tcp-cc")
Reported-by: syzbot+0b74d8ec3bf0cc4e4209@syzkaller.appspotmail.com
Signed-off-by: Atul Gopinathan <atulgopinathan@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210328120515.113895-1-atulgopinathan@gmail.com
2021-03-28 11:23:55 -07:00
Alexei Starovoitov
fddbf4b6dc Merge branch 'bpf: Support calling kernel function'
Martin KaFai says:

====================

This series adds support for bpf programs calling kernel functions.

The use case included in this set is to allow bpf-tcp-cc to directly
call some tcp-cc helper functions (e.g. "tcp_cong_avoid_ai()").  Those
functions have already been used by some kernel tcp-cc implementations.

This set will also allow the bpf-tcp-cc program to directly call the
kernel tcp-cc implementation.  For example, a bpf_dctcp may only want to
implement its own dctcp_cwnd_event() and reuse other dctcp_*() directly
from the kernel tcp_dctcp.c instead of reimplementing (or
copy-and-pasting) them.

The tcp-cc kernel functions mentioned above will be white listed
for the struct_ops bpf-tcp-cc programs to use in a later patch.
The white listed functions are not bound to a fixed ABI contract.
Those functions have already been used by the existing kernel tcp-cc.
If any of them changes, both in-tree and out-of-tree kernel tcp-cc
implementations have to be changed.  The same goes for the struct_ops
bpf-tcp-cc programs which have to be adjusted accordingly.

Please see the individual patches for details.

v2:
- Patch 2 in v1 is removed.  No need to support extern func in kernel.
  Changed libbpf to adjust the .ksyms datasec for extern func
  in patch 11. (Andrii)
- Name change: btf_check_func_arg_match() and btf_check_subprog_arg_match()
  in patch 2. (Andrii)
- Always set unreliable on any error in patch 2 since it does not
  matter. (Andrii)
- s/kern_func/kfunc/ and s/descriptor/desc/ in this set. (Andrii)
- Remove some unnecessary changes in disasm.h and disasm.c
  in patch 3.  In particular, no need to change the function
  signature in bpf_insn_revmap_call_t.  Also, removed the changes
  in print_bpf_insn().
- Fixed an issue in check_kfunc_call() when the calling kernel function
  returns a pointer in patch 3.  Added a selftest.
- Adjusted the verifier selftests due to the changes in the verifier log
  in patch 3.
- Fixed a comparison issue in kfunc_desc_cmp_by_imm() in patch 3. (Andrii)
- Name change: is_ldimm64_insn(),
  new helper: is_call_insn() in patch 10 (Andrii)
- Move btf_func_linkage() from btf.h to libbpf.c in patch 11. (Andrii)
- Fixed the linker error when CONFIG_BPF_SYSCALL is not defined.
  Moved the check_kfunc_call from filter.c to test_run.c in patch 14.
  (kernel test robot)
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2021-03-26 20:41:52 -07:00
Martin KaFai Lau
7bd1590d4e bpf: selftests: Add kfunc_call test
This patch adds a few kernel functions, bpf_kfunc_call_test*(), for the
selftest's test_run purpose.  They will be allowed for tc_cls progs.

The selftest calling the kernel function bpf_kfunc_call_test*()
is also added in this patch.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210325015252.1551395-1-kafai@fb.com
2021-03-26 20:41:52 -07:00
Martin KaFai Lau
78e60bbbe8 bpf: selftests: Bpf_cubic and bpf_dctcp calling kernel functions
This patch removes the bpf implementation of tcp_slow_start()
and tcp_cong_avoid_ai().  Instead, it directly uses the kernel
implementation.

It also replaces the bpf_cubic_undo_cwnd implementation by directly
calling tcp_reno_undo_cwnd().  bpf_dctcp also directly calls
tcp_reno_cong_avoid() instead.
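
For orientation, the behavior of the two helpers that the selftest now
takes from the kernel can be modeled in userspace roughly as below.  This
is a simplified sketch and not the kernel source: struct cc_state,
model_slow_start() and model_cong_avoid_ai() are hypothetical names, and
only the congestion-window fields these helpers touch are kept.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the cwnd-related fields of struct tcp_sock. */
struct cc_state {
	uint32_t snd_cwnd;       /* congestion window, in packets */
	uint32_t snd_cwnd_cnt;   /* ACK credits accumulated toward the next increase */
	uint32_t snd_cwnd_clamp; /* upper bound for snd_cwnd */
	uint32_t snd_ssthresh;   /* slow start threshold */
};

static uint32_t min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

/*
 * Slow start: grow cwnd by the number of ACKed packets, but not past
 * ssthresh.  Returns the ACKs left over once ssthresh is reached.
 * Assumes the caller is in slow start (snd_cwnd <= snd_ssthresh).
 */
uint32_t model_slow_start(struct cc_state *s, uint32_t acked)
{
	uint32_t cwnd = min_u32(s->snd_cwnd + acked, s->snd_ssthresh);

	acked -= cwnd - s->snd_cwnd;
	s->snd_cwnd = min_u32(cwnd, s->snd_cwnd_clamp);
	return acked;
}

/*
 * Additive increase: cwnd grows by one packet for every w ACKed packets.
 */
void model_cong_avoid_ai(struct cc_state *s, uint32_t w, uint32_t acked)
{
	assert(w > 0); /* w is used as a divisor below */

	/* Credits accumulated at a higher w are applied gently first. */
	if (s->snd_cwnd_cnt >= w) {
		s->snd_cwnd_cnt = 0;
		s->snd_cwnd++;
	}

	s->snd_cwnd_cnt += acked;
	if (s->snd_cwnd_cnt >= w) {
		uint32_t delta = s->snd_cwnd_cnt / w;

		s->snd_cwnd_cnt -= delta * w;
		s->snd_cwnd += delta;
	}
	s->snd_cwnd = min_u32(s->snd_cwnd, s->snd_cwnd_clamp);
}
```

A caller in this model would first pass the ACK count to
model_slow_start() and hand any leftover ACKs to model_cong_avoid_ai()
with w set to the current window.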

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210325015246.1551062-1-kafai@fb.com
2021-03-26 20:41:52 -07:00
Martin KaFai Lau
39cd9e0f67 bpf: selftests: Rename bictcp to bpf_cubic
Following a similar change in the kernel, this patch gives the proper
name to the bpf cubic.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210325015240.1550074-1-kafai@fb.com
2021-03-26 20:41:51 -07:00