forked from Minki/linux
cc8889ae82
Documentation for this feature was missing from the patchset. Copied a lot from the netdev 2.1 paper, addressing some small interface changes since then. Changes v1 -> v2 - change email discussion URL format - clarify that u32 counter is per-syscall, unsigned and wraps after UINT_MAX calls - describe errno on send failure specific to MSG_ZEROCOPY - a few very minor rewordings Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
258 lines
8.6 KiB
ReStructuredText
258 lines
8.6 KiB
ReStructuredText
|
|
============
|
|
MSG_ZEROCOPY
|
|
============
|
|
|
|
Intro
|
|
=====
|
|
|
|
The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
|
|
The feature is currently implemented for TCP sockets.
|
|
|
|
|
|
Opportunity and Caveats
|
|
-----------------------
|
|
|
|
Copying large buffers between user process and kernel can be
|
|
expensive. Linux supports various interfaces that eschew copying,
|
|
such as sendpage and splice. The MSG_ZEROCOPY flag extends the
|
|
underlying copy avoidance mechanism to common socket send calls.
|
|
|
|
Copy avoidance is not a free lunch. As implemented, with page pinning,
|
|
it replaces per byte copy cost with page accounting and completion
|
|
notification overhead. As a result, MSG_ZEROCOPY is generally only
|
|
effective at writes over around 10 KB.
|
|
|
|
Page pinning also changes system call semantics. It temporarily shares
|
|
the buffer between process and network stack. Unlike with copying, the
|
|
process cannot immediately overwrite the buffer after system call
|
|
return without possibly modifying the data in flight. Kernel integrity
|
|
is not affected, but a buggy program can possibly corrupt its own data
|
|
stream.
|
|
|
|
The kernel returns a notification when it is safe to modify data.
|
|
Converting an existing application to MSG_ZEROCOPY is not always as
|
|
trivial as just passing the flag, then.
|
|
|
|
|
|
More Info
|
|
---------
|
|
|
|
Much of this document was derived from a longer paper presented at
|
|
netdev 2.1. For more in-depth information see that paper and talk,
|
|
the excellent reporting over at LWN.net or read the original code.
|
|
|
|
paper, slides, video
|
|
https://netdevconf.org/2.1/session.html?debruijn
|
|
|
|
LWN article
|
|
https://lwn.net/Articles/726917/
|
|
|
|
patchset
|
|
[PATCH net-next v4 0/9] socket sendmsg MSG_ZEROCOPY
|
|
http://lkml.kernel.org/r/20170803202945.70750-1-willemdebruijn.kernel@gmail.com
|
|
|
|
|
|
Interface
|
|
=========
|
|
|
|
Passing the MSG_ZEROCOPY flag is the most obvious step to enable copy
|
|
avoidance, but not the only one.
|
|
|
|
Socket Setup
|
|
------------
|
|
|
|
The kernel is permissive when applications pass undefined flags to the
|
|
send system call. By default it simply ignores these. To avoid enabling
|
|
copy avoidance mode for legacy processes that accidentally already pass
|
|
this flag, a process must first signal intent by setting a socket option:
|
|
|
|
::
|
|
|
|
if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)))
|
|
error(1, errno, "setsockopt zerocopy");
|
|
|
|
|
|
Transmission
|
|
------------
|
|
|
|
The change to send (or sendto, sendmsg, sendmmsg) itself is trivial.
|
|
Pass the new flag.
|
|
|
|
::
|
|
|
|
ret = send(fd, buf, sizeof(buf), MSG_ZEROCOPY);
|
|
|
|
A zerocopy failure will return -1 with errno ENOBUFS. This happens if
|
|
the socket option was not set, the socket exceeds its optmem limit or
|
|
the user exceeds its ulimit on locked pages.
|
|
|
|
|
|
Mixing copy avoidance and copying
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
Many workloads have a mixture of large and small buffers. Because copy
|
|
avoidance is more expensive than copying for small packets, the
|
|
feature is implemented as a flag. It is safe to mix calls with the flag
|
|
with those without.
|
|
|
|
|
|
Notifications
|
|
-------------
|
|
|
|
The kernel has to notify the process when it is safe to reuse a
|
|
previously passed buffer. It queues completion notifications on the
|
|
socket error queue, akin to the transmit timestamping interface.
|
|
|
|
The notification itself is a simple scalar value. Each socket
|
|
maintains an internal unsigned 32-bit counter. Each send call with
|
|
MSG_ZEROCOPY that successfully sends data increments the counter. The
|
|
counter is not incremented on failure or if called with length zero.
|
|
The counter counts system call invocations, not bytes. It wraps after
|
|
UINT_MAX calls.
|
|
|
|
|
|
Notification Reception
|
|
~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
The below snippet demonstrates the API. In the simplest case, each
|
|
send syscall is followed by a poll and recvmsg on the error queue.
|
|
|
|
Reading from the error queue is always a non-blocking operation. The
|
|
poll call is there to block until an error is outstanding. It will set
|
|
POLLERR in its output flags. That flag does not have to be set in the
|
|
events field. Errors are signaled unconditionally.
|
|
|
|
::
|
|
|
|
pfd.fd = fd;
|
|
pfd.events = 0;
|
|
if (poll(&pfd, 1, -1) != 1 || pfd.revents & POLLERR == 0)
|
|
error(1, errno, "poll");
|
|
|
|
ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
|
|
if (ret == -1)
|
|
error(1, errno, "recvmsg");
|
|
|
|
read_notification(msg);
|
|
|
|
The example is for demonstration purpose only. In practice, it is more
|
|
efficient to not wait for notifications, but read without blocking
|
|
every couple of send calls.
|
|
|
|
Notifications can be processed out of order with other operations on
|
|
the socket. A socket that has an error queued would normally block
|
|
other operations until the error is read. Zerocopy notifications have
|
|
a zero error code, however, to not block send and recv calls.
|
|
|
|
|
|
Notification Batching
|
|
~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
Multiple outstanding packets can be read at once using the recvmmsg
|
|
call. This is often not needed. In each message the kernel returns not
|
|
a single value, but a range. It coalesces consecutive notifications
|
|
while one is outstanding for reception on the error queue.
|
|
|
|
When a new notification is about to be queued, it checks whether the
|
|
new value extends the range of the notification at the tail of the
|
|
queue. If so, it drops the new notification packet and instead increases
|
|
the range upper value of the outstanding notification.
|
|
|
|
For protocols that acknowledge data in-order, like TCP, each
|
|
notification can be squashed into the previous one, so that no more
|
|
than one notification is outstanding at any one point.
|
|
|
|
Ordered delivery is the common case, but not guaranteed. Notifications
|
|
may arrive out of order on retransmission and socket teardown.
|
|
|
|
|
|
Notification Parsing
|
|
~~~~~~~~~~~~~~~~~~~~
|
|
|
|
The below snippet demonstrates how to parse the control message: the
|
|
read_notification() call in the previous snippet. A notification
|
|
is encoded in the standard error format, sock_extended_err.
|
|
|
|
The level and type fields in the control data are protocol family
|
|
specific, IP_RECVERR or IPV6_RECVERR.
|
|
|
|
Error origin is the new type SO_EE_ORIGIN_ZEROCOPY. ee_errno is zero,
|
|
as explained before, to avoid blocking read and write system calls on
|
|
the socket.
|
|
|
|
The 32-bit notification range is encoded as [ee_info, ee_data]. This
|
|
range is inclusive. Other fields in the struct must be treated as
|
|
undefined, bar for ee_code, as discussed below.
|
|
|
|
::
|
|
|
|
struct sock_extended_err *serr;
|
|
struct cmsghdr *cm;
|
|
|
|
cm = CMSG_FIRSTHDR(msg);
|
|
if (cm->cmsg_level != SOL_IP &&
|
|
cm->cmsg_type != IP_RECVERR)
|
|
error(1, 0, "cmsg");
|
|
|
|
serr = (void *) CMSG_DATA(cm);
|
|
if (serr->ee_errno != 0 ||
|
|
serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
|
|
error(1, 0, "serr");
|
|
|
|
printf("completed: %u..%u\n", serr->ee_info, serr->ee_data);
|
|
|
|
|
|
Deferred copies
|
|
~~~~~~~~~~~~~~~
|
|
|
|
Passing flag MSG_ZEROCOPY is a hint to the kernel to apply copy
|
|
avoidance, and a contract that the kernel will queue a completion
|
|
notification. It is not a guarantee that the copy is elided.
|
|
|
|
Copy avoidance is not always feasible. Devices that do not support
|
|
scatter-gather I/O cannot send packets made up of kernel generated
|
|
protocol headers plus zerocopy user data. A packet may need to be
|
|
converted to a private copy of data deep in the stack, say to compute
|
|
a checksum.
|
|
|
|
In all these cases, the kernel returns a completion notification when
|
|
it releases its hold on the shared pages. That notification may arrive
|
|
before the (copied) data is fully transmitted. A zerocopy completion
|
|
notification is not a transmit completion notification, therefore.
|
|
|
|
Deferred copies can be more expensive than a copy immediately in the
|
|
system call, if the data is no longer warm in the cache. The process
|
|
also incurs notification processing cost for no benefit. For this
|
|
reason, the kernel signals if data was completed with a copy, by
|
|
setting flag SO_EE_CODE_ZEROCOPY_COPIED in field ee_code on return.
|
|
A process may use this signal to stop passing flag MSG_ZEROCOPY on
|
|
subsequent requests on the same socket.
|
|
|
|
|
|
Implementation
|
|
==============
|
|
|
|
Loopback
|
|
--------
|
|
|
|
Data sent to local sockets can be queued indefinitely if the receive
|
|
process does not read its socket. Unbound notification latency is not
|
|
acceptable. For this reason all packets generated with MSG_ZEROCOPY
|
|
that are looped to a local socket will incur a deferred copy. This
|
|
includes looping onto packet sockets (e.g., tcpdump) and tun devices.
|
|
|
|
|
|
Testing
|
|
=======
|
|
|
|
More realistic example code can be found in the kernel source under
|
|
tools/testing/selftests/net/msg_zerocopy.c.
|
|
|
|
Be cognizant of the loopback constraint. The test can be run between
|
|
a pair of hosts. But if run between a local pair of processes, for
|
|
instance when run with msg_zerocopy.sh between a veth pair across
|
|
namespaces, the test will not show any improvement. For testing, the
|
|
loopback restriction can be temporarily relaxed by making
|
|
skb_orphan_frags_rx identical to skb_orphan_frags.
|