Add internal part of send zerocopy notifications. There are two main
structures, the first one is struct io_notif, which carries inside
struct ubuf_info and maps 1:1 to it. io_uring will be binding a number
of zerocopy send requests to it and ask to complete (aka flush) it. When
flushed and all attached requests and skbs complete, it'll generate one
and only one CQE. There are intended to be passed into the network layer
as struct msghdr::msg_ubuf.
The second concept is notification slots. The userspace will be able to
register an array of slots and subsequently addressing them by the index
in the array. Slots are independent of each other. Each slot can have
only one notifier at a time (called active notifier) but many notifiers
during the lifetime. When active, a notifier not going to post any
completion but the userspace can attach requests to it by specifying
the corresponding slot while issueing send zc requests. Eventually, the
userspace will want to "flush" the notifier losing any way to attach
new requests to it, however it can use the next atomatically added
notifier of this slot or of any other slot.
When the network layer is done with all enqueued skbs attached to a
notifier and doesn't need the specified in them user data, the flushed
notifier will post a CQE.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3ecf54c31a85762bf679b0a432c9f43ecf7e61cc.1657643355.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
* for-5.20/io_uring: (716 commits)
io_uring: ensure REQ_F_ISREG is set async offload
net: fix compat pointer in get_compat_msghdr()
io_uring: Don't require reinitable percpu_ref
io_uring: fix types in io_recvmsg_multishot_overflow
io_uring: Use atomic_long_try_cmpxchg in __io_account_mem
io_uring: support multishot in recvmsg
net: copy from user before calling __get_compat_msghdr
net: copy from user before calling __copy_msghdr
io_uring: support 0 length iov in buffer select in compat
io_uring: fix multishot ending when not polled
io_uring: add netmsg cache
io_uring: impose max limit on apoll cache
io_uring: add abstraction around apoll cache
io_uring: move apoll cache to poll.c
io_uring: consolidate hash_locked io-wq handling
io_uring: clear REQ_F_HASH_LOCKED on hash removal
io_uring: don't race double poll setting REQ_F_ASYNC_DATA
io_uring: don't miss setting REQ_F_DOUBLE_POLL
io_uring: disable multishot recvmsg
io_uring: only trace one of complete or overflow
...
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge prep net series for io_uring tx zc from the Jakub's tree.
* 'io_uring-zerocopy-send' of git://git.kernel.org/pub/scm/linux/kernel/git/kuba/linux:
net: fix uninitialised msghdr->sg_from_iter
tcp: support externally provided ubufs
ipv6/udp: support externally provided ubufs
ipv4/udp: support externally provided ubufs
net: introduce __skb_fill_page_desc_noacc
net: introduce managed frags infrastructure
net: Allow custom iter handler in msghdr
skbuff: carry external ubuf_info in msghdr
skbuff: add SKBFL_DONT_ORPHAN flag
skbuff: don't mix ubuf_info from different sources
ipv6: avoid partial copy for zc
ipv4: avoid partial copy for zc
If we're offloading requests directly to io-wq because IOSQE_ASYNC was
set in the sqe, we can miss hashing writes appropriately because we
haven't set REQ_F_ISREG yet. This can cause a performance regression
with buffered writes, as io-wq then no longer correctly serializes writes
to that file.
Ensure that we set the flags in io_prep_async_work(), which will cause
the io-wq work item to be hashed appropriately.
Fixes: 584b0180f0 ("io_uring: move read/write file prep state into actual opcode handler")
Link: https://lore.kernel.org/io-uring/20220608080054.GB22428@xsang-OptiPlex-9020/
Reported-by: kernel test robot <oliver.sang@intel.com>
Tested-by: Yin Fengwei <fengwei.yin@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A previous change enabled external users to copy the data before
calling __get_compat_msghdr(), but didn't modify get_compat_msghdr() or
__io_compat_recvmsg_copy_hdr() to take that into account. They are both
stil passing in the __user pointer rather than the copied version.
Ensure we pass in the kernel struct, not the pointer to the user data.
Link: https://lore.kernel.org/all/46439555-644d-08a1-7d66-16f8f9a320f0@samsung.com/
Fixes: 1a3e4e94a1b9 ("net: copy from user before calling __get_compat_msghdr")
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The commit 8bb649ee1d ("io_uring: remove ring quiesce for
io_uring_register") removed the worklow relying on reinit/resurrection
of the percpu_ref, hence, initialization with that requested is a relic.
This is based on code review, this causes no real bug (and theoretically
can't). Technically it's a revert of commit 214828962d ("io_uring:
initialize percpu refcounters using PERCU_REF_ALLOW_REINIT") but since
the flag omission is now justified, I'm not making this a revert.
Fixes: 8bb649ee1d ("io_uring: remove ring quiesce for io_uring_register")
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_recvmsg_multishot_overflow had incorrect types on non x64 system.
But also it had an unnecessary INT_MAX check, which could just be done
by changing the type of the accumulator to int (also simplifying the
casts).
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Fixes: a8b38c4ce724 ("io_uring: support multishot in recvmsg")
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220715130252.610639-1-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use atomic_long_try_cmpxchg instead of
atomic_long_cmpxchg (*ptr, old, new) == old in __io_account_mem.
x86 CMPXCHG instruction returns success in ZF flag, so this
change saves a compare after cmpxchg (and related move
instruction in front of cmpxchg).
Also, atomic_long_try_cmpxchg implicitly assigns old *ptr value
to "old" when cmpxchg fails, enabling further code simplifications.
No functional change intended.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Similar to multishot recv, this will require provided buffers to be
used. However recvmsg is much more complex than recv as it has multiple
outputs. Specifically flags, name, and control messages.
Support this by introducing a new struct io_uring_recvmsg_out with 4
fields. namelen, controllen and flags match the similar out fields in
msghdr from standard recvmsg(2), payloadlen is the length of the payload
following the header.
This struct is placed at the start of the returned buffer. Based on what
the user specifies in struct msghdr, the next bytes of the buffer will be
name (the next msg_namelen bytes), and then control (the next
msg_controllen bytes). The payload will come at the end. The return value
in the CQE is the total used size of the provided buffer.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220714110258.1336200-4-dylany@fb.com
[axboe: style fixups, see link]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
this is in preparation for multishot receive from io_uring, where it needs
to have access to the original struct user_msghdr.
functionally this should be a no-op.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220714110258.1336200-3-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
this is in preparation for multishot receive from io_uring, where it needs
to have access to the original struct user_msghdr.
functionally this should be a no-op.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220714110258.1336200-2-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Match up work done in "io_uring: allow iov_len = 0 for recvmsg and buffer
select", but for compat code path.
Fixes: a68caad69ce5 ("io_uring: allow iov_len = 0 for recvmsg and buffer select")
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220708181838.1495428-3-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If multishot is not actually polling then return IOU_OK rather than the
result.
If the result was > 0 this will confuse things further up the callstack
which expect a return <= 0.
Fixes: 1300ebb20286 ("io_uring: multishot recv")
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220708181838.1495428-2-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For recvmsg/sendmsg, if they don't complete inline, we currently need
to allocate a struct io_async_msghdr for each request. This is a
somewhat large struct.
Hook up sendmsg/recvmsg to use the io_alloc_cache. This reduces the
alloc + free overhead considerably, yielding 4-5% of extra performance
running netbench.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Caches like this tend to grow to the peak size, and then never get any
smaller. Impose a max limit on the size, to prevent it from growing too
big.
A somewhat randomly chosen 512 is the max size we'll allow the cache
to get. If a batch of frees come in and would bring it over that, we
simply start kfree'ing the surplus.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation for adding limits, and one more user, abstract out the
core bits of the allocation+free cache.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Just as with io_poll_double_prepare() setting REQ_F_DOUBLE_POLL, we can
race with the first poll entry when setting REQ_F_ASYNC_DATA. Move it
under io_poll_double_prepare().
Fixes: a18427bb2d9b ("io_uring: optimise submission side poll_refs")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/df6920f509c11115aa2bce8b34dc5fdb0eb98920.1657203020.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
recvmsg has semantics that do not make it trivial to extend to
multishot. Specifically it has user pointers and returns data in the
original parameter. In order to make this API useful these will need to be
somehow included with the provided buffers.
For now remove multishot for recvmsg as it is not useful.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220704140106.200167-1-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In overflow we see a duplcate line in the trace, and in some cases 3
lines (if initial io_post_aux_cqe fails).
Instead just trace once for each CQE
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-13-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Support multishot receive for io_uring.
Typical server applications will run a loop where for each recv CQE it
requeues another recv/recvmsg.
This can be simplified by using the existing multishot functionality
combined with io_uring's provided buffers.
The API is to add the IORING_RECV_MULTISHOT flag to the SQE. CQEs will
then be posted (with IORING_CQE_F_MORE flag set) when data is available
and is read. Once an error occurs or the socket ends, the multishot will
be removed and a completion without IORING_CQE_F_MORE will be posted.
The benefit to this is that the recv is much more performant.
* Subsequent receives are queued up straight away without requiring the
application to finish a processing loop.
* If there are more data in the socket (sat the provided buffer size is
smaller than the socket buffer) then the data is immediately
returned, improving batching.
* Poll is only armed once and reused, saving CPU cycles
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-11-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
On overflow, multishot poll can still complete with the IORING_CQE_F_MORE
flag set.
If in the meantime the user clears a CQE and a the poll was cancelled then
the poll will post a CQE without the IORING_CQE_F_MORE (and likely result
-ECANCELED).
However when processing the application will encounter the non-overflow
CQE which indicates that there will be no more events posted. Typical
userspace applications would free memory associated with the poll in this
case.
It will then subsequently receive the earlier CQE which has overflowed,
which breaks the contract given by the IORING_CQE_F_MORE flag.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-9-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Some use cases of io_post_aux_cqe would not want to overflow as is, but
might want to change the flags/result. For example multishot receive
requires in order CQE, and so if there is an overflow it would need to
stop receiving until the overflow is taken care of.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-8-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For multishot we want a way to signal the caller that multishot has ended
but also this might not be an error return.
For example sockets return 0 when closed, which should end a multishot
recv, but still have a CQE with result 0
Introduce IOU_STOP_MULTISHOT which does this and indicates that the return
code is stored inside req->cqe
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-7-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The values returned are a bit confusing, where 0 and 1 have implied
meaning, so add some definitions for them.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-6-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Rather than passing an error back to the user with a buffer attached,
recycle the buffer immediately.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-5-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When using BUFFER_SELECT there is no technical requirement that the user
actually provides iov, and this removes one copy_from_user call.
So allow iov_len to be 0.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-4-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Attempt to restore bgid. This is needed when recycling unused buffers as
the next time around it will want the correct bgid.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-3-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
From recently io_uring provides an option to allocate a file index for
operation registering fixed files. However, it's utterly unusable with
mixed approaches when for a part of files the userspace knows better
where to place it, as it may race and users don't have any sane way to
pick a slot and hoping it will not be taken.
Let the userspace to register a range of fixed file slots in which the
auto-allocation happens. The use case is splittting the fixed table in
two parts, where on of them is used for auto-allocation and another for
slot-specified operations.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/66ab0394e436f38437cf7c44676e1920d09687ad.1656154403.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With IORING_OP_MSG_RING, one ring can send a message to another ring.
Extend that support to also allow sending a fixed file descriptor to
that ring, enabling one ring to pass a registered descriptor to another
one.
Arguments are extended to pass in:
sqe->addr3 fixed file slot in source ring
sqe->file_index fixed file slot in destination ring
IORING_OP_MSG_RING is extended to take a command argument in sqe->addr.
If set to zero (or IORING_MSG_DATA), it sends just a message like before.
If set to IORING_MSG_SEND_FD, a fixed file descriptor is sent according
to the above arguments.
Two common use cases for this are:
1) Server needs to be shutdown or restarted, pass file descriptors to
another onei
2) Backend is split, and one accepts connections, while others then get
the fd passed and handle the actual connection.
Both of those are classic SCM_RIGHTS use cases, and it's not possible to
support them with direct descriptors today.
By default, this will post a CQE to the target ring, similarly to how
IORING_MSG_DATA does it. If IORING_MSG_RING_CQE_SKIP is set, no message
is posted to the target ring. The issuer is expected to notify the
receiver side separately.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Put it with the filetable code, which is where it belongs. While doing
so, have the helpers take a ctx rather than an io_kiocb. It doesn't make
sense to use a request, as it's not an operation on the request itself.
It applies to the ring itself.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_uring_enter() takes ctx->refs, which was previously preventing racing
with register quiesce. However, as register now doesn't touch the refs,
we can freely kill extra ctx pinning and rely on the fact that we're
holding a file reference preventing the ring from being destroyed.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a11c57ad33a1be53541fce90669c1b79cf4d8940.1656153286.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_fail_links() is called with ->completion_lock held and for that
reason we'd want to keep it as small as we can. Instead of doing
__io_req_complete_post() for each linked request under the lock, fail
them in a task_work handler under ->uring_lock.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a2f68708b970a21f4e84ddfa7b3abd67a8fffb27.1656153285.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We really don't care about this at all in terms of performance. Outside
of having it already be marked unlikely(), shove it into a separate
__cold function.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The final poll_refs put in __io_arm_poll_handler() takes quite some
cycles. When we're arming from the original task context task_work won't
be run, so in this case we can assume that we won't race with task_works
and so not take the initial ownership ref.
One caveat is that after arming a poll we may race with it, so we have
to add a bunch of io_poll_get_ownership() hidden inside of
io_poll_can_finish_inline() whenever we want to complete arming inline.
For the same reason we can't just set REQ_F_DOUBLE_POLL in
__io_queue_proc() and so need to sync with the first poll entry by
taking its wq head lock.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8825315d7f5e182ac1578a031e546f79b1c97d01.1655990418.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
__io_arm_poll_handler() errors parsing is a horror, in case it failed it
returns 0 and the caller is expected to look at ipt.error, which already
led us to a number of problems before.
When it returns a valid mask, leave it as it's not, i.e. return 1 and
store the mask in ipt.result_mask. In case of a failure that can be
handled inline return an error code (negative value), and return 0 if
__io_arm_poll_handler() took ownership of the request and will complete
it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/018cacdaef5fe95d7dc56b32e85d752cab7607f6.1655990418.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>