bpf, sockmap: convert to generic sk_msg interface
Add a generic sk_msg layer, and convert current sockmap and later
kTLS over to make use of it. While sk_buff handles network packet
representation from netdevice up to socket, sk_msg handles data
representation from application to socket layer.
This means that sk_msg framework spans across ULP users in the
kernel, and enables features such as introspection or filtering
of data with the help of BPF programs that operate on this data
structure.
Latter becomes in particular useful for kTLS where data encryption
is deferred into the kernel, and as such enabling the kernel to
perform L7 introspection and policy based on BPF for TLS connections
where the record is being encrypted after BPF has run and came to
a verdict. In order to get there, first step is to transform open
coding of scatter-gather list handling into a common core framework
that subsystems can use.
The code itself has been split and refactored into three bigger
pieces: i) the generic sk_msg API which deals with managing the
scatter gather ring, providing helpers for walking and mangling,
transferring application data from user space into it, and preparing
it for BPF pre/post-processing, ii) the plain sock map itself
where sockets can be attached to or detached from; these bits
are independent of i) which can now be used also without sock
map, and iii) the integration with plain TCP as one protocol
to be used for processing L7 application data (later this could
e.g. also be extended to other protocols like UDP). The semantics
are the same with the old sock map code and therefore no change
of user facing behavior or APIs. While pursuing this work it
also helped finding a number of bugs in the old sockmap code
that we've fixed already in earlier commits. The test_sockmap
kselftest suite passes through fine as well.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-13 00:45:58 +00:00
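As a rough illustration of the "scatter gather ring" and the walking helpers this commit message describes, here is a minimal user-space sketch. Every name in it (demo_ring, demo_next, demo_push, demo_total) and the slot count are hypothetical and are not the kernel's sk_msg API; it only shows how start/end indices can wrap around a fixed array of elements.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_SLOTS 8u	/* hypothetical capacity, not the kernel's value */

struct demo_ring {
	unsigned int start;		/* first element holding data */
	unsigned int end;		/* one past the last element */
	size_t length[DEMO_SLOTS];	/* byte count per element */
};

/* Advance an index by one slot, wrapping at the end of the array. */
static unsigned int demo_next(unsigned int i)
{
	assert(i < DEMO_SLOTS);
	return (i + 1) % DEMO_SLOTS;
}

/* Append one element; one slot stays free so full and empty differ. */
static int demo_push(struct demo_ring *r, size_t len)
{
	if (demo_next(r->end) == r->start)
		return -1;
	r->length[r->end] = len;
	r->end = demo_next(r->end);
	return 0;
}

/* Walk from start to end, following the wrap, and sum the lengths. */
static size_t demo_total(const struct demo_ring *r)
{
	size_t total = 0;
	unsigned int i;

	for (i = r->start; i != r->end; i = demo_next(i))
		total += r->length[i];
	return total;
}
```

Starting such a ring at slot 6 and pushing three elements leaves end at slot 1, which is the wrap-around case the walking helper has to handle.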
// SPDX-License-Identifier: GPL-2.0
/* Copyright (c) 2017 - 2018 Covalent IO, Inc. http://covalent.io */

#include <linux/skmsg.h>
#include <linux/filter.h>
#include <linux/bpf.h>
#include <linux/init.h>
#include <linux/wait.h>

#include <net/inet_common.h>
bpf: sk_msg, sock{map|hash} redirect through ULP
A sockmap program that redirects through a kTLS ULP enabled socket
will not work correctly because the ULP layer is skipped. This
fixes the behavior to call through the ULP layer on redirect to
ensure any operations required on the data stream at the ULP layer
continue to be applied.
To do this we add an internal flag MSG_SENDPAGE_NOPOLICY to avoid
calling the BPF layer on a redirected message. This is
required to avoid calling the BPF layer multiple times (possibly
recursively) which is not the current/expected behavior without
ULPs. In the future we may add a redirect flag if users _do_
want the policy applied again but this would need to work for both
ULP and non-ULP sockets and be opt-in to avoid breaking existing
programs.
Also to avoid polluting the flag space with an internal flag we
reuse the flag space overlapping MSG_SENDPAGE_NOPOLICY with
MSG_WAITFORONE. Here WAITFORONE is specific to recv path and
SENDPAGE_NOPOLICY is only used for sendpage hooks. The last thing
to verify is user space API is masked correctly to ensure the flag
can not be set by user. (Note this needs to be true regardless
because we have internal flags already in-use that user space
should not be able to set). But for completeness we have two UAPI
paths into sendpage, sendfile and splice.
In the sendfile case the function do_sendfile() zeroes flags,
./fs/read_write.c:
    static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
                               size_t count, loff_t max)
    {
      ...
      fl = 0;
    #if 0
      /*
       * We need to debate whether we can enable this or not. The
       * man page documents EAGAIN return for the output at least,
       * and the application is arguably buggy if it doesn't expect
       * EAGAIN on a non-blocking file descriptor.
       */
      if (in.file->f_flags & O_NONBLOCK)
        fl = SPLICE_F_NONBLOCK;
    #endif
      file_start_write(out.file);
      retval = do_splice_direct(in.file, &pos, out.file, &out_pos, count, fl);
    }
In the splice case the pipe_to_sendpage "actor" is used which
masks flags with SPLICE_F_MORE.
./fs/splice.c:
    static int pipe_to_sendpage(struct pipe_inode_info *pipe,
                                struct pipe_buffer *buf, struct splice_desc *sd)
    {
      ...
      more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0;
      ...
    }
This confirms what we expect: internal flags are in fact internal
to the socket side.
Fixes: d3b18ad31f93 ("tls: add bpf support to sk_msg handling")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-12-20 19:35:35 +00:00
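The flag reuse argued for above can be sketched in user space as follows. This is only an illustration under stated assumptions: the DEMO_ constants and both helper functions are hypothetical names, and the numeric values are chosen for the example rather than taken from the kernel headers. The point is that a send-only internal flag may share a bit with a recv-only flag, provided the user-facing entry points never pass that bit through.

```c
#include <assert.h>

/* Illustrative values only; the names are hypothetical. */
#define DEMO_SPLICE_F_MORE		0x04u
#define DEMO_MSG_MORE			0x8000u
#define DEMO_MSG_WAITFORONE		0x10000u /* meaningful on recv only */
#define DEMO_MSG_SENDPAGE_NOPOLICY	0x10000u /* meaningful on sendpage only */

/* The two flags deliberately alias the same bit. */
static void demo_check_alias(void)
{
	assert(DEMO_MSG_WAITFORONE == DEMO_MSG_SENDPAGE_NOPOLICY);
}

/*
 * Mirrors the masking quoted from pipe_to_sendpage(): whatever the
 * caller passed, only "more data follows" survives the translation.
 */
static unsigned int demo_splice_to_msg_flags(unsigned int splice_flags)
{
	return (splice_flags & DEMO_SPLICE_F_MORE) ? DEMO_MSG_MORE : 0;
}

/* The policy is skipped only when the internal flag is present. */
static int demo_policy_should_run(unsigned int msg_flags)
{
	return !(msg_flags & DEMO_MSG_SENDPAGE_NOPOLICY);
}
```

Even if a caller sets every bit in the splice flags, the translated value contains at most the "more" bit, so the policy still runs; only an in-kernel caller that sets the internal flag directly suppresses it.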
#include <net/tls.h>

int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,
2018-10-16 18:08:04 +00:00
		      struct msghdr *msg, int len, int flags)
{
	struct iov_iter *iter = &msg->msg_iter;
	int peek = flags & MSG_PEEK;
	struct sk_msg *msg_rx;
bpf, sockmap: Fix partial copy_page_to_iter so progress can still be made
If copy_page_to_iter() fails or even partially completes, but with fewer
bytes copied than expected we currently reset sg.start and return EFAULT.
This proves problematic if we already copied data into the user buffer
before we return an error. Because we leave the copied data in the user
buffer and fail to unwind the scatterlist so kernel side believes data
has been copied and user side believes data has _not_ been received.
Expected behavior should be to return the number of bytes copied and then
on the next read return the error, assuming it's still there. This
can happen if we have a copy length spanning multiple scatterlist elements
and one or more complete before the error is hit.
The error is rare enough, though, that in my normal testing with server-side
programs, such as nginx, httpd, envoy, etc., I have never seen it. The
only reliable way to reproduce that I've found is to stream movies over
my browser for a day or so and wait for it to hang. Not very scientific,
but with a few extra WARN_ON()s in the code the bug was obvious.
When we review the errors from copy_page_to_iter() it seems we are hitting
a page fault from copy_page_to_iter_iovec() where the code checks
fault_in_pages_writeable(buf, copy) where buf is the user buffer. It
also seems typical server applications don't hit this case.
The other way to try and reproduce this is run the sockmap selftest tool
test_sockmap with data verification enabled, but it doesn't reproduce the
fault. Perhaps we can trigger this case artificially somehow from the
test tools. I haven't sorted out a way to do that yet though.
Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/160556566659.73229.15694973114605301063.stgit@john-XPS-13-9370
2020-11-16 22:27:46 +00:00
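The "return what was copied, report the error on the next read" behavior described above can be modeled in user space roughly as follows. This is a sketch, not the kernel code: demo_elem, demo_copy, and demo_recv are hypothetical names, and the fault_after budget is an invented stand-in for a copy that stops early because of a page fault.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct demo_elem {
	const char *data;	/* next unread byte of this element */
	size_t length;		/* bytes still unread */
};

/* Hypothetical faulting copy: moves at most *budget bytes, then stops. */
static size_t demo_copy(char *dst, const char *src, size_t n, size_t *budget)
{
	size_t done = n < *budget ? n : *budget;

	memcpy(dst, src, done);
	*budget -= done;
	return done;
}

static long demo_recv(struct demo_elem *elems, size_t nelems,
		      char *buf, size_t len, size_t fault_after)
{
	size_t copied = 0, budget = fault_after, i = 0;

	assert(buf != NULL);
	while (i < nelems && copied < len) {
		size_t want = elems[i].length, got;

		if (!want) {		/* element fully consumed earlier */
			i++;
			continue;
		}
		if (copied + want > len)
			want = len - copied;
		got = demo_copy(buf + copied, elems[i].data, want, &budget);
		if (!got)	/* report progress first, the error next time */
			return copied ? (long)copied : -EFAULT;
		/* Consume only the bytes that actually reached the buffer. */
		elems[i].data += got;
		elems[i].length -= got;
		copied += got;
	}
	return (long)copied;
}
```

With two five-byte elements and a copy that stops after seven bytes, the first call returns 7 and leaves three bytes queued; a following call that cannot copy anything returns -EFAULT, and a later successful call delivers the remaining three bytes.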
	int i, copied = 0;

	msg_rx = list_first_entry_or_null(&psock->ingress_msg,
					  struct sk_msg, list);

	while (copied != len) {
		struct scatterlist *sge;

		if (unlikely(!msg_rx))
			break;

		i = msg_rx->sg.start;
		do {
			struct page *page;
			int copy;

			sge = sk_msg_elem(msg_rx, i);
			copy = sge->length;
			page = sg_page(sge);
			if (copied + copy > len)
				copy = len - copied;
			copy = copy_page_to_iter(page, sge->offset, copy, iter);
			if (!copy)
				return copied ? copied : -EFAULT;

			copied += copy;
			if (likely(!peek)) {
				sge->offset += copy;
				sge->length -= copy;
2020-11-16 22:28:06 +00:00
				if (!msg_rx->skb)
					sk_mem_uncharge(sk, copy);
				msg_rx->sg.size -= copy;

				if (!sge->length) {
					sk_msg_iter_var_next(i);
					if (!msg_rx->skb)
						put_page(page);
				}
			} else {
				/* Lets not optimize peek case if copy_page_to_iter
				 * didn't copy the entire length lets just break.
				 */
				if (copy != sge->length)
					return copied;
				sk_msg_iter_var_next(i);
			}

			if (copied == len)
				break;
		} while (i != msg_rx->sg.end);

		if (unlikely(peek)) {
2020-06-05 08:46:25 +00:00
			if (msg_rx == list_last_entry(&psock->ingress_msg,
						      struct sk_msg, list))
				break;
			msg_rx = list_next_entry(msg_rx, list);
			continue;
		}

		msg_rx->sg.start = i;
		if (!sge->length && msg_rx->sg.start == msg_rx->sg.end) {
			list_del(&msg_rx->list);
			if (msg_rx->skb)
				consume_skb(msg_rx->skb);
			kfree(msg_rx);
		}
		msg_rx = list_first_entry_or_null(&psock->ingress_msg,
						  struct sk_msg, list);
	}

	return copied;
}
EXPORT_SYMBOL_GPL(__tcp_bpf_recvmsg);

static int bpf_tcp_ingress(struct sock *sk, struct sk_psock *psock,
			   struct sk_msg *msg, u32 apply_bytes, int flags)
{
	bool apply = apply_bytes;
	struct scatterlist *sge;
	u32 size, copied = 0;
	struct sk_msg *tmp;
	int i, ret = 0;

	tmp = kzalloc(sizeof(*tmp), __GFP_NOWARN | GFP_KERNEL);
	if (unlikely(!tmp))
		return -ENOMEM;

	lock_sock(sk);
	tmp->sg.start = msg->sg.start;
	i = msg->sg.start;
	do {
		sge = sk_msg_elem(msg, i);
		size = (apply && apply_bytes < sge->length) ?
			apply_bytes : sge->length;
		if (!sk_wmem_schedule(sk, size)) {
			if (!copied)
				ret = -ENOMEM;
			break;
		}

		sk_mem_charge(sk, size);
		sk_msg_xfer(tmp, msg, i, size);
		copied += size;
		if (sge->length)
			get_page(sk_msg_page(tmp, i));
		sk_msg_iter_var_next(i);
		tmp->sg.end = i;
		if (apply) {
			apply_bytes -= size;
			if (!apply_bytes)
				break;
		}
	} while (i != msg->sg.end);

	if (!ret) {
		msg->sg.start = i;
		sk_psock_queue_msg(psock, tmp);
2018-12-20 19:35:33 +00:00
		sk_psock_data_ready(sk, psock);
	} else {
		sk_msg_free(sk, tmp);
		kfree(tmp);
	}

	release_sock(sk);
	return ret;
}

static int tcp_bpf_push(struct sock *sk, struct sk_msg *msg, u32 apply_bytes,
			int flags, bool uncharge)
{
	bool apply = apply_bytes;
	struct scatterlist *sge;
	struct page *page;
	int size, ret = 0;
	u32 off;

	while (1) {
bpf: sk_msg, sock{map|hash} redirect through ULP
A sockmap program that redirects through a kTLS ULP enabled socket
will not work correctly because the ULP layer is skipped. This
fixes the behavior to call through the ULP layer on redirect to
ensure any operations required on the data stream at the ULP layer
continue to be applied.
To do this we add an internal flag MSG_SENDPAGE_NOPOLICY to avoid
calling the BPF layer on a redirected message. This is
required to avoid calling the BPF layer multiple times (possibly
recursively) which is not the current/expected behavior without
ULPs. In the future we may add a redirect flag if users _do_
want the policy applied again but this would need to work for both
ULP and non-ULP sockets and be opt-in to avoid breaking existing
programs.
Also to avoid polluting the flag space with an internal flag we
reuse the flag space overlapping MSG_SENDPAGE_NOPOLICY with
MSG_WAITFORONE. Here WAITFORONE is specific to recv path and
SENDPAGE_NOPOLICY is only used for sendpage hooks. The last thing
to verify is user space API is masked correctly to ensure the flag
can not be set by user. (Note this needs to be true regardless
because we have internal flags already in-use that user space
should not be able to set). But for completeness we have two UAPI
paths into sendpage, sendfile and splice.
In the sendfile case the function do_sendfile() zeroes flags,
./fs/read_write.c:
    static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
                               size_t count, loff_t max)
    {
      ...
      fl = 0;
    #if 0
      /*
       * We need to debate whether we can enable this or not. The
       * man page documents EAGAIN return for the output at least,
       * and the application is arguably buggy if it doesn't expect
       * EAGAIN on a non-blocking file descriptor.
       */
      if (in.file->f_flags & O_NONBLOCK)
        fl = SPLICE_F_NONBLOCK;
    #endif
      file_start_write(out.file);
      retval = do_splice_direct(in.file, &pos, out.file, &out_pos, count, fl);
    }
In the splice case the pipe_to_sendpage "actor" is used which
masks flags with SPLICE_F_MORE.
./fs/splice.c:
    static int pipe_to_sendpage(struct pipe_inode_info *pipe,
                                struct pipe_buffer *buf, struct splice_desc *sd)
    {
      ...
      more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0;
      ...
    }
This confirms what we expect: internal flags are in fact internal
to the socket side.
Fixes: d3b18ad31f93 ("tls: add bpf support to sk_msg handling")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-12-20 19:35:35 +00:00
		bool has_tx_ulp;

2018-10-13 00:45:58 +00:00
		sge = sk_msg_elem(msg, msg->sg.start);
		size = (apply && apply_bytes < sge->length) ?
			apply_bytes : sge->length;
		off = sge->offset;
		page = sg_page(sge);

		tcp_rate_check_app_limited(sk);
retry:
2018-12-20 19:35:35 +00:00
		has_tx_ulp = tls_sw_has_ctx_tx(sk);
		if (has_tx_ulp) {
			flags |= MSG_SENDPAGE_NOPOLICY;
			ret = kernel_sendpage_locked(sk,
						     page, off, size, flags);
		} else {
			ret = do_tcp_sendpages(sk, page, off, size, flags);
		}

2018-10-13 00:45:58 +00:00
		if (ret <= 0)
			return ret;
		if (apply)
			apply_bytes -= ret;
		msg->sg.size -= ret;
		sge->offset += ret;
		sge->length -= ret;
		if (uncharge)
			sk_mem_uncharge(sk, ret);
		if (ret != size) {
			size -= ret;
			off += ret;
			goto retry;
		}
		if (!sge->length) {
			put_page(page);
			sk_msg_iter_next(msg, start);
			sg_init_table(sge, 1);
			if (msg->sg.start == msg->sg.end)
				break;
		}
		if (apply && !apply_bytes)
			break;
	}

	return 0;
}
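
The push loop in tcp_bpf_push() can be easier to follow as a user-space model. The following is a minimal sketch, not kernel code: `struct ring`, `RING_SLOTS`, `fake_send()` and `ring_push()` are hypothetical names invented for illustration, and `fake_send()` stands in for the transport by accepting at most `max_chunk` bytes per call (which must be greater than zero). It mirrors the same bookkeeping: cap each element by `apply_bytes`, retry on partial sends, advance the ring start when an element is drained, and stop once `apply_bytes` is used up.

```c
#include <assert.h>

#define RING_SLOTS 8 /* hypothetical ring size for this sketch */

struct elem {
	unsigned int offset;
	unsigned int length;
};

struct ring {
	struct elem data[RING_SLOTS];
	unsigned int start, end; /* start == end means the ring is empty */
	unsigned int size;       /* total bytes still queued */
};

/* Hypothetical transport: accepts at most max_chunk bytes per call. */
static unsigned int fake_send(unsigned int want, unsigned int max_chunk)
{
	return want < max_chunk ? want : max_chunk;
}

/* Push up to apply_bytes bytes (0 means "everything"); returns bytes sent. */
static unsigned int ring_push(struct ring *r, unsigned int apply_bytes,
			      unsigned int max_chunk)
{
	int apply = apply_bytes != 0;
	unsigned int sent = 0;

	while (r->start != r->end) {
		struct elem *e = &r->data[r->start];
		unsigned int size = (apply && apply_bytes < e->length) ?
				    apply_bytes : e->length;

		/* Partial sends are retried until this slice is done. */
		while (size) {
			unsigned int ret = fake_send(size, max_chunk);

			assert(ret > 0 && ret <= size);
			if (apply)
				apply_bytes -= ret;
			r->size -= ret;
			e->offset += ret;
			e->length -= ret;
			size -= ret;
			sent += ret;
		}
		if (!e->length) /* element drained: advance ring start */
			r->start = (r->start + 1) % RING_SLOTS;
		if (apply && !apply_bytes)
			break;
	}
	return sent;
}
```

With two elements of 100 and 50 bytes, `ring_push(&r, 120, 64)` drains the first element in two partial sends and leaves the second element at offset 20 with 30 bytes remaining.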

static int tcp_bpf_push_locked(struct sock *sk, struct sk_msg *msg,
			       u32 apply_bytes, int flags, bool uncharge)
{
	int ret;

	lock_sock(sk);
	ret = tcp_bpf_push(sk, msg, apply_bytes, flags, uncharge);
	release_sock(sk);
	return ret;
}

int tcp_bpf_sendmsg_redir(struct sock *sk, struct sk_msg *msg,
			  u32 bytes, int flags)
{
	bool ingress = sk_msg_to_ingress(msg);
	struct sk_psock *psock = sk_psock_get(sk);
	int ret;

	if (unlikely(!psock)) {
		sk_msg_free(sk, msg);
		return 0;
	}
	ret = ingress ? bpf_tcp_ingress(sk, psock, msg, bytes, flags) :
			tcp_bpf_push_locked(sk, msg, bytes, flags, false);
	sk_psock_put(sk, psock);
	return ret;
}
EXPORT_SYMBOL_GPL(tcp_bpf_sendmsg_redir);

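The redirect commit message argues that user space cannot set MSG_SENDPAGE_NOPOLICY because the splice path translates only SPLICE_F_MORE into MSG_MORE. A minimal user-space sketch of that argument follows. The `K_`-prefixed constants are local copies of values from the kernel headers of that era, redefined here so the sketch builds outside the kernel, and `splice_to_msg_flags()` and `nopolicy_reachable()` are hypothetical helper names, not kernel functions.

```c
#include <assert.h>

/* Local copies of kernel constants (assumed values, see lead-in). */
#define K_SPLICE_F_MORE          0x04u
#define K_MSG_MORE               0x8000u
#define K_MSG_WAITFORONE         0x10000u /* recv path only */
#define K_MSG_SENDPAGE_NOPOLICY  0x10000u /* sendpage internal, same bit */

/* Models the translation quoted from pipe_to_sendpage(). */
static unsigned int splice_to_msg_flags(unsigned int splice_flags)
{
	return (splice_flags & K_SPLICE_F_MORE) ? K_MSG_MORE : 0;
}

/* Returns 1 if any splice flag value below limit yields the internal bit. */
static int nopolicy_reachable(unsigned int limit)
{
	unsigned int f;

	for (f = 0; f < limit; f++) {
		if (splice_to_msg_flags(f) & K_MSG_SENDPAGE_NOPOLICY)
			return 1;
	}
	return 0;
}
```

Because the translation can only ever produce 0 or MSG_MORE, `nopolicy_reachable()` returns 0 for any range of inputs, which is the property the commit message relies on.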
2020-03-20 02:34:25 +00:00
#ifdef CONFIG_BPF_STREAM_PARSER
static bool tcp_bpf_stream_read(const struct sock *sk)
{
	struct sk_psock *psock;
	bool empty = true;

	rcu_read_lock();
	psock = sk_psock(sk);
	if (likely(psock))
		empty = list_empty(&psock->ingress_msg);
	rcu_read_unlock();
	return !empty;
}

2020-03-20 02:34:26 +00:00
static int tcp_bpf_wait_data(struct sock *sk, struct sk_psock *psock,
			     int flags, long timeo, int *err)
{
	DEFINE_WAIT_FUNC(wait, woken_wake_function);
	int ret = 0;

2020-06-10 10:19:43 +00:00
	if (sk->sk_shutdown & RCV_SHUTDOWN)
		return 1;

2020-03-20 02:34:26 +00:00
	if (!timeo)
		return ret;

	add_wait_queue(sk_sleep(sk), &wait);
	sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk);
	ret = sk_wait_event(sk, &timeo,
			    !list_empty(&psock->ingress_msg) ||
			    !skb_queue_empty(&sk->sk_receive_queue), &wait);
	sk_clear_bit(SOCKWQ_ASYNC_WAITDATA, sk);
	remove_wait_queue(sk_sleep(sk), &wait);
	return ret;
}

static int tcp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
			   int nonblock, int flags, int *addr_len)
{
	struct sk_psock *psock;
	int copied, ret;

2020-04-26 03:35:15 +00:00
	if (unlikely(flags & MSG_ERRQUEUE))
		return inet_recv_error(sk, msg, len, addr_len);

2020-03-20 02:34:26 +00:00
	psock = sk_psock_get(sk);
	if (unlikely(!psock))
		return tcp_recvmsg(sk, msg, len, nonblock, flags, addr_len);
	if (!skb_queue_empty(&sk->sk_receive_queue) &&
2020-04-26 03:35:15 +00:00
	    sk_psock_queue_empty(psock)) {
		sk_psock_put(sk, psock);
2020-03-20 02:34:26 +00:00
		return tcp_recvmsg(sk, msg, len, nonblock, flags, addr_len);
2020-04-26 03:35:15 +00:00
	}
2020-03-20 02:34:26 +00:00
	lock_sock(sk);
msg_bytes_ready:
	copied = __tcp_bpf_recvmsg(sk, psock, msg, len, flags);
	if (!copied) {
		int data, err = 0;
		long timeo;

		timeo = sock_rcvtimeo(sk, nonblock);
		data = tcp_bpf_wait_data(sk, psock, flags, timeo, &err);
		if (data) {
			if (!sk_psock_queue_empty(psock))
				goto msg_bytes_ready;
			release_sock(sk);
			sk_psock_put(sk, psock);
			return tcp_recvmsg(sk, msg, len, nonblock, flags, addr_len);
		}
		if (err) {
			ret = err;
			goto out;
		}
		copied = -EAGAIN;
	}
	ret = copied;
out:
	release_sock(sk);
	sk_psock_put(sk, psock);
	return ret;
}

2018-10-13 00:45:58 +00:00
static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock,
				struct sk_msg *msg, int *copied, int flags)
{
2019-11-27 20:16:41 +00:00
	bool cork = false, enospc = sk_msg_full(msg);
2018-10-13 00:45:58 +00:00
	struct sock *sk_redir;
2018-11-26 22:16:17 +00:00
	u32 tosend, delta = 0;
2018-10-13 00:45:58 +00:00
	int ret;

more_data:
2018-11-26 22:16:17 +00:00
	if (psock->eval == __SK_NONE) {
		/* Track delta in msg size to add/subtract it on SK_DROP from
		 * returned to user copied size. This ensures user doesn't
		 * get a positive return code with msg_cut_data and SK_DROP
		 * verdict.
		 */
		delta = msg->sg.size;
2018-10-13 00:45:58 +00:00
		psock->eval = sk_psock_msg_verdict(sk, psock, msg);
2020-01-11 06:12:06 +00:00
		delta -= msg->sg.size;
2018-11-26 22:16:17 +00:00
	}
2018-10-13 00:45:58 +00:00

	if (msg->cork_bytes &&
	    msg->cork_bytes > msg->sg.size && !enospc) {
		psock->cork_bytes = msg->cork_bytes - msg->sg.size;
		if (!psock->cork) {
			psock->cork = kzalloc(sizeof(*psock->cork),
					      GFP_ATOMIC | __GFP_NOWARN);
			if (!psock->cork)
				return -ENOMEM;
		}
		memcpy(psock->cork, msg, sizeof(*msg));
		return 0;
	}

	tosend = msg->sg.size;
	if (psock->apply_bytes && psock->apply_bytes < tosend)
		tosend = psock->apply_bytes;

	switch (psock->eval) {
	case __SK_PASS:
		ret = tcp_bpf_push(sk, msg, tosend, flags, true);
		if (unlikely(ret)) {
			*copied -= sk_msg_free(sk, msg);
			break;
		}
		sk_msg_apply_bytes(psock, tosend);
		break;
	case __SK_REDIRECT:
		sk_redir = psock->sk_redir;
		sk_msg_apply_bytes(psock, tosend);
		if (psock->cork) {
			cork = true;
			psock->cork = NULL;
		}
		sk_msg_return(sk, msg, tosend);
		release_sock(sk);
		ret = tcp_bpf_sendmsg_redir(sk_redir, msg, tosend, flags);
		lock_sock(sk);
		if (unlikely(ret < 0)) {
			int free = sk_msg_free_nocharge(sk, msg);

			if (!cork)
				*copied -= free;
		}
		if (cork) {
			sk_msg_free(sk, msg);
			kfree(msg);
			msg = NULL;
			ret = 0;
		}
		break;
	case __SK_DROP:
	default:
		sk_msg_free_partial(sk, msg, tosend);
		sk_msg_apply_bytes(psock, tosend);
2018-11-26 22:16:17 +00:00
		*copied -= (tosend + delta);
2018-10-13 00:45:58 +00:00
		return -EACCES;
	}

	if (likely(!ret)) {
		if (!psock->apply_bytes) {
			psock->eval = __SK_NONE;
			if (psock->sk_redir) {
				sock_put(psock->sk_redir);
				psock->sk_redir = NULL;
			}
		}
		if (msg &&
		    msg->sg.data[msg->sg.start].page_link &&
		    msg->sg.data[msg->sg.start].length)
			goto more_data;
	}
	return ret;
}
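
The `delta` bookkeeping on the SK_DROP path is compact, so a worked user-space model may help. This is only a sketch under stated assumptions: `copied_after_drop()` is a hypothetical helper, `size_before` and `size_after` stand for `msg->sg.size` before and after the verdict program ran, and the unsigned wrap-around of `delta` when the program grows the message is modeled with `uint32_t`, as `u32` does in the kernel.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Returns the value of *copied after a drop verdict.
 * apply_bytes == 0 means the verdict covers the whole message.
 */
static int copied_after_drop(int copied, uint32_t size_before,
			     uint32_t size_after, uint32_t apply_bytes)
{
	/* Wraps around when the program made the message larger. */
	uint32_t delta = size_before - size_after;
	uint32_t tosend = size_after;

	if (apply_bytes && apply_bytes < tosend)
		tosend = apply_bytes;

	/* tosend + delta wraps back to the pre-verdict byte count. */
	return (int)((uint32_t)copied - (tosend + delta));
}
```

For example, if 100 bytes were copied from user space and the program cut 20 of them before returning SK_DROP, `tosend` is 80 and `delta` is 20, so the adjusted count is 0 and the caller reports the error instead of a positive length. Without `delta` the result would have been 20.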

static int tcp_bpf_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
{
	struct sk_msg tmp, *msg_tx = NULL;
	int copied = 0, err = 0;
	struct sk_psock *psock;
	long timeo;
2019-08-08 00:03:59 +00:00
	int flags;

	/* Don't let internal do_tcp_sendpages() flags through */
	flags = (msg->msg_flags & ~MSG_SENDPAGE_DECRYPTED);
	flags |= MSG_NO_SHARED_FRAGS;
2018-10-13 00:45:58 +00:00

	psock = sk_psock_get(sk);
	if (unlikely(!psock))
		return tcp_sendmsg(sk, msg, size);

	lock_sock(sk);
	timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
	while (msg_data_left(msg)) {
		bool enospc = false;
		u32 copy, osize;

		if (sk->sk_err) {
			err = -sk->sk_err;
			goto out_err;
		}

		copy = msg_data_left(msg);
		if (!sk_stream_memory_free(sk))
			goto wait_for_sndbuf;
		if (psock->cork) {
			msg_tx = psock->cork;
		} else {
			msg_tx = &tmp;
			sk_msg_init(msg_tx);
		}

		osize = msg_tx->sg.size;
		err = sk_msg_alloc(sk, msg_tx, msg_tx->sg.size + copy, msg_tx->sg.end - 1);
		if (err) {
			if (err != -ENOSPC)
				goto wait_for_memory;
			enospc = true;
			copy = msg_tx->sg.size - osize;
		}

		err = sk_msg_memcopy_from_iter(sk, &msg->msg_iter, msg_tx,
					       copy);
		if (err < 0) {
			sk_msg_trim(sk, msg_tx, osize);
			goto out_err;
		}

		copied += copy;
		if (psock->cork_bytes) {
			if (size > psock->cork_bytes)
				psock->cork_bytes = 0;
			else
				psock->cork_bytes -= size;
			if (psock->cork_bytes && !enospc)
				goto out_err;
			/* All cork bytes are accounted, rerun the prog. */
			psock->eval = __SK_NONE;
			psock->cork_bytes = 0;
		}

		err = tcp_bpf_send_verdict(sk, psock, msg_tx, &copied, flags);
		if (unlikely(err < 0))
			goto out_err;
		continue;
wait_for_sndbuf:
		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
wait_for_memory:
		err = sk_stream_wait_memory(sk, &timeo);
		if (err) {
			if (msg_tx && msg_tx != psock->cork)
				sk_msg_free(sk, msg_tx);
			goto out_err;
		}
	}
out_err:
	if (err < 0)
		err = sk_stream_error(sk, msg->msg_flags, err);
	release_sock(sk);
	sk_psock_put(sk, psock);
	return copied ? copied : err;
}
|
|
|
|
|
|
|
|
static int tcp_bpf_sendpage(struct sock *sk, struct page *page, int offset,
			    size_t size, int flags)
{
	struct sk_msg tmp, *msg = NULL;
	int err = 0, copied = 0;
	struct sk_psock *psock;
	bool enospc = false;

	psock = sk_psock_get(sk);
	if (unlikely(!psock))
		return tcp_sendpage(sk, page, offset, size, flags);

	lock_sock(sk);
	if (psock->cork) {
		msg = psock->cork;
	} else {
		msg = &tmp;
		sk_msg_init(msg);
	}

	/* Catch case where ring is full and sendpage is stalled. */
	if (unlikely(sk_msg_full(msg)))
		goto out_err;

	sk_msg_page_add(msg, page, size, offset);
	sk_mem_charge(sk, size);
	copied = size;
	if (sk_msg_full(msg))
		enospc = true;
	if (psock->cork_bytes) {
		if (size > psock->cork_bytes)
			psock->cork_bytes = 0;
		else
			psock->cork_bytes -= size;
		if (psock->cork_bytes && !enospc)
			goto out_err;
		/* All cork bytes are accounted, rerun the prog. */
		psock->eval = __SK_NONE;
		psock->cork_bytes = 0;
	}

	err = tcp_bpf_send_verdict(sk, psock, msg, &copied, flags);
out_err:
	release_sock(sk);
	sk_psock_put(sk, psock);
	return copied ? copied : err;
}

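Both tcp_bpf_sendmsg() and tcp_bpf_sendpage() apply the same cork accounting before they call tcp_bpf_send_verdict(). The following is a minimal user-space sketch of that decision only, not kernel code. The names `struct cork_state` and `cork_account()` are hypothetical and stand in for the `cork_bytes` field of the psock and the inline branch in the two functions; the real code additionally resets `psock->eval` to `__SK_NONE` when the program is to be rerun.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the cork_bytes field of struct sk_psock. */
struct cork_state {
	uint32_t cork_bytes;	/* bytes the BPF program still wants buffered */
};

/*
 * Account `size` newly queued bytes against an outstanding cork request.
 * Returns true when the verdict program should run now, false when the
 * caller should return early and wait for more data.
 */
static bool cork_account(struct cork_state *st, uint32_t size, bool ring_full)
{
	assert(st != NULL);

	if (!st->cork_bytes)
		return true;		/* nothing corked: run right away */

	if (size > st->cork_bytes)
		st->cork_bytes = 0;
	else
		st->cork_bytes -= size;

	if (st->cork_bytes && !ring_full)
		return false;		/* still short and room left: keep buffering */

	/* Request satisfied, or the ring is full and cannot buffer more. */
	st->cork_bytes = 0;
	return true;
}
```

With a request of 100 bytes, a 40-byte send leaves 60 bytes outstanding and returns false; a following 60-byte send clears the request and returns true. A full ring forces a run even if bytes are still outstanding.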
enum {
	TCP_BPF_IPV4,
	TCP_BPF_IPV6,
	TCP_BPF_NUM_PROTS,
};

enum {
	TCP_BPF_BASE,
	TCP_BPF_TX,
	TCP_BPF_NUM_CFGS,
};

static struct proto *tcpv6_prot_saved __read_mostly;
static DEFINE_SPINLOCK(tcpv6_prot_lock);
static struct proto tcp_bpf_prots[TCP_BPF_NUM_PROTS][TCP_BPF_NUM_CFGS];

static void tcp_bpf_rebuild_protos(struct proto prot[TCP_BPF_NUM_CFGS],
				   struct proto *base)
{
	prot[TCP_BPF_BASE] = *base;
	prot[TCP_BPF_BASE].unhash = sock_map_unhash;
	prot[TCP_BPF_BASE].close = sock_map_close;
	prot[TCP_BPF_BASE].recvmsg = tcp_bpf_recvmsg;
	prot[TCP_BPF_BASE].stream_memory_read = tcp_bpf_stream_read;

	prot[TCP_BPF_TX] = prot[TCP_BPF_BASE];
	prot[TCP_BPF_TX].sendmsg = tcp_bpf_sendmsg;
	prot[TCP_BPF_TX].sendpage = tcp_bpf_sendpage;
}

static void tcp_bpf_check_v6_needs_rebuild(struct proto *ops)
{
	if (unlikely(ops != smp_load_acquire(&tcpv6_prot_saved))) {
		spin_lock_bh(&tcpv6_prot_lock);
		if (likely(ops != tcpv6_prot_saved)) {
			tcp_bpf_rebuild_protos(tcp_bpf_prots[TCP_BPF_IPV6], ops);
			smp_store_release(&tcpv6_prot_saved, ops);
		}
		spin_unlock_bh(&tcpv6_prot_lock);
	}
}

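tcp_bpf_check_v6_needs_rebuild() follows a double-checked pattern: an acquire load on the fast path, a re-check under the lock, and a release store that publishes the pointer only after the derived table has been written. The following user-space sketch illustrates that pattern with C11 atomics. It is an analogy under stated assumptions, not the kernel implementation: `struct ops`, `saved_ops`, `derived`, `rebuild_count` and the `atomic_flag` spinlock are hypothetical stand-ins for `struct proto`, `tcpv6_prot_saved`, the IPv6 row of `tcp_bpf_prots` and `tcpv6_prot_lock`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct ops { int id; };			/* hypothetical stand-in for struct proto */

static _Atomic(struct ops *) saved_ops;	/* role of tcpv6_prot_saved */
static atomic_flag rebuild_lock = ATOMIC_FLAG_INIT;	/* role of the spinlock */
static struct ops derived;		/* role of the rebuilt proto row */
static int rebuild_count;		/* for illustration: counts rebuilds */

static void check_needs_rebuild(struct ops *ops)
{
	assert(ops != NULL);

	/* Fast path: the acquire load pairs with the release store below. */
	if (atomic_load_explicit(&saved_ops, memory_order_acquire) == ops)
		return;

	/* Simple spinlock: spin until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(&rebuild_lock,
						 memory_order_acquire))
		;

	/* Re-check under the lock: another thread may have rebuilt already. */
	if (atomic_load_explicit(&saved_ops, memory_order_relaxed) != ops) {
		derived = *ops;
		rebuild_count++;
		/* Publish only after the derived copy is fully written. */
		atomic_store_explicit(&saved_ops, ops, memory_order_release);
	}

	atomic_flag_clear_explicit(&rebuild_lock, memory_order_release);
}
```

Calling `check_needs_rebuild()` twice with the same pointer rebuilds once; the second call returns on the fast path without taking the lock.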
static int __init tcp_bpf_v4_build_proto(void)
{
	tcp_bpf_rebuild_protos(tcp_bpf_prots[TCP_BPF_IPV4], &tcp_prot);
	return 0;
}
core_initcall(tcp_bpf_v4_build_proto);

static int tcp_bpf_assert_proto_ops(struct proto *ops)
{
	/* In order to avoid retpoline, we make assumptions when we call
	 * into ops if e.g. a psock is not present. Make sure they are
	 * indeed valid assumptions.
	 */
	return ops->recvmsg == tcp_recvmsg &&
	       ops->sendmsg == tcp_sendmsg &&
	       ops->sendpage == tcp_sendpage ? 0 : -ENOTSUPP;
}

struct proto *tcp_bpf_get_proto(struct sock *sk, struct sk_psock *psock)
{
	int family = sk->sk_family == AF_INET6 ? TCP_BPF_IPV6 : TCP_BPF_IPV4;
	int config = psock->progs.msg_parser ? TCP_BPF_TX : TCP_BPF_BASE;

	if (sk->sk_family == AF_INET6) {
		if (tcp_bpf_assert_proto_ops(psock->sk_proto))
			return ERR_PTR(-EINVAL);

		tcp_bpf_check_v6_needs_rebuild(psock->sk_proto);
	}

	return &tcp_bpf_prots[family][config];
}

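tcp_bpf_rebuild_protos() and tcp_bpf_get_proto() together form a small two-dimensional dispatch table: each address family has a BASE entry that overrides the receive side, and a TX entry that copies BASE and additionally overrides the send side. The following stand-alone sketch illustrates that layout. All names in it (`struct handlers`, `rebuild()`, `select_handlers()`, the `plain_*` and `hooked_*` functions) are hypothetical simplifications and are not part of the kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for struct proto. */
typedef int (*io_fn)(int len);
struct handlers {
	io_fn sendmsg;
	io_fn recvmsg;
};

/* Dummy handlers; distinct return values keep them distinguishable. */
static int plain_sendmsg(int len)  { return len; }
static int plain_recvmsg(int len)  { return len + 1; }
static int hooked_sendmsg(int len) { return len + 2; }
static int hooked_recvmsg(int len) { return len + 3; }

enum { FAM_IPV4, FAM_IPV6, NUM_FAMS };
enum { CFG_BASE, CFG_TX, NUM_CFGS };

static struct handlers table[NUM_FAMS][NUM_CFGS];

/* Derive both configurations of one family from its base handlers. */
static void rebuild(struct handlers row[NUM_CFGS], const struct handlers *base)
{
	assert(base != NULL);

	row[CFG_BASE] = *base;
	row[CFG_BASE].recvmsg = hooked_recvmsg;	/* BASE hooks receive only */

	row[CFG_TX] = row[CFG_BASE];
	row[CFG_TX].sendmsg = hooked_sendmsg;	/* TX also hooks send */
}

/* Pick the entry by family and by whether a TX program is attached. */
static const struct handlers *select_handlers(bool is_ipv6, bool has_tx_prog)
{
	int family = is_ipv6 ? FAM_IPV6 : FAM_IPV4;
	int config = has_tx_prog ? CFG_TX : CFG_BASE;

	return &table[family][config];
}
```

After `rebuild()` has run for a family, a socket without a TX program keeps the plain send handler but gets the hooked receive handler, while a socket with a TX program gets both hooked handlers.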
/* If a child got cloned from a listening socket that had tcp_bpf
 * protocol callbacks installed, we need to restore the callbacks to
 * the default ones because the child does not inherit the psock state
 * that tcp_bpf callbacks expect.
 */
void tcp_bpf_clone(const struct sock *sk, struct sock *newsk)
{
	int family = sk->sk_family == AF_INET6 ? TCP_BPF_IPV6 : TCP_BPF_IPV4;
	struct proto *prot = newsk->sk_prot;

	if (prot == &tcp_bpf_prots[family][TCP_BPF_BASE])
		newsk->sk_prot = sk->sk_prot_creator;
}
#endif /* CONFIG_BPF_STREAM_PARSER */