Commit Graph

1032 Commits

Author SHA1 Message Date
Pavel Begunkov
6b231248e9 io_uring: consolidate overflow flushing
Consolidate __io_cqring_overflow_flush and io_cqring_overflow_kill()
into a single function as it once was, it's easier to work with it this
way.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/986b42c35e76a6be7aa0cdcda0a236a2222da3a7.1712708261.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:27 -06:00
Pavel Begunkov
8d09a88ef9 io_uring: always lock __io_cqring_overflow_flush
Conditional locking is never great, in case of
__io_cqring_overflow_flush(), which is a slow path, it's not justified.
Don't handle IOPOLL separately, always grab uring_lock for overflow
flushing.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/162947df299aa12693ac4b305dacedab32ec7976.1712708261.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:26 -06:00
Pavel Begunkov
408024b959 io_uring: open code io_cqring_overflow_flush()
There is only one caller of io_cqring_overflow_flush(), open code it

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a1fecd56d9dba923ed8d4d159727fa939d3baa2a.1712708261.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:26 -06:00
Pavel Begunkov
e45ec969d1 io_uring: remove extra SQPOLL overflow flush
c1edbf5f08 ("io_uring: flag SQPOLL busy condition to userspace")
added an extra overflowed CQE flush in the SQPOLL submission path due to
backpressure, which was later removed. Remove the flush and let
io_cqring_wait() / iopoll handle it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2a83b0724ca6ca9d16c7d79a51b77c81876b2e39.1712708261.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:26 -06:00
Pavel Begunkov
a5bff51850 io_uring: unexport io_req_cqe_overflow()
There are no users of io_req_cqe_overflow() apart from io_uring.c, make
it static.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f4295eb2f9eb98d5db38c0578f57f0b86bfe0d8c.1712708261.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:26 -06:00
Pavel Begunkov
8c9a6f549e io_uring: separate header for exported net bits
We're exporting some io_uring bits to networking, e.g. for implementing
a net callback for io_uring cmds, but we don't want to expose more than
needed. Add a separate header for networking.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David Wei <dw@davidwei.uk>
Link: https://lore.kernel.org/r/20240409210554.1878789-1-dw@davidwei.uk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:26 -06:00
Pavel Begunkov
d285da7dbd io_uring/net: set MSG_ZEROCOPY for sendzc in advance
We can set MSG_ZEROCOPY at the preparation step, do it so we don't have
to care about it later in the issue callback.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c2c22aaa577624977f045979a6db2b9fb2e5648c.1712534031.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:26 -06:00
Pavel Begunkov
6b7f864bb7 io_uring/net: get rid of io_notif_complete_tw_ext
io_notif_complete_tw_ext() can be removed and combined with
io_notif_complete_tw to make it simpler without sacrificing
anything.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/025a124a5e20e2474a57e2f04f16c422eb83063c.1712534031.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:26 -06:00
Pavel Begunkov
998632921d io_uring/net: merge ubuf sendzc callbacks
Splitting io_tx_ubuf_callback_ext from io_tx_ubuf_callback is a pre
mature optimisation that doesn't give us much. Merge the functions into
one and reclaim some simplicity back.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d44d68f6f7add33a0dcf0b7fd7b73c2dc543604f.1712534031.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:26 -06:00
Ming Lei
bbbef3e9d2 io_uring: return void from io_put_kbuf_comp()
The only caller doesn't handle the return value of io_put_kbuf_comp(), so
change its return type into void.

Also follow Jens's suggestion to rename it as io_put_kbuf_drop().

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20240407132759.4056167-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:26 -06:00
Pavel Begunkov
c29006a245 io_uring: remove io_req_put_rsrc_locked()
io_req_put_rsrc_locked() is a weird shim function around
io_req_put_rsrc(). All calls to io_req_put_rsrc() require holding
->uring_lock, so we can just use it directly.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a195bc78ac3d2c6fbaea72976e982fe51e50ecdd.1712331455.git.asml.silence@gmail.com
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:26 -06:00
Pavel Begunkov
d9713ad3fa io_uring: remove async request cache
io_req_complete_post() was a sole user of ->locked_free_list, but
since we just gutted the function, the cache is not used anymore and
can be removed.

->locked_free_list served as an asynhronous counterpart of the main
request (i.e. struct io_kiocb) cache for all unlocked cases like io-wq.
Now they're all forced to be completed into the main cache directly,
off of the normal completion path or via io_free_req().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7bffccd213e370abd4de480e739d8b08ab6c1326.1712331455.git.asml.silence@gmail.com
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:26 -06:00
Pavel Begunkov
de96e9ae69 io_uring: turn implicit assumptions into a warning
io_req_complete_post() is now io-wq only and shouldn't be used outside
of it, i.e. it relies that io-wq holds a ref for the request as
explained in a comment below. Let's add a warning to enforce the
assumption and make sure nobody would try to do anything weird.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1013b60c35d431d0698cafbc53c06f5917348c20.1712331455.git.asml.silence@gmail.com
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:26 -06:00
Ming Lei
f39130004d io_uring: kill dead code in io_req_complete_post
Since commit 8f6c829491fe ("io_uring: remove struct io_tw_state::locked"),
io_req_complete_post() is only called from io-wq submit work, where the
request reference is guaranteed to be grabbed and won't drop to zero
in io_req_complete_post().

Kill the dead code, meantime add req_ref_put() to put the reference.

Cc: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1d8297e2046553153e763a52574f0e0f4d512f86.1712331455.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:26 -06:00
Jens Axboe
285207f67c io_uring/kbuf: remove dead define
We no longer use IO_BUFFER_LIST_BUF_PER_PAGE, kill it.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:26 -06:00
Jens Axboe
1da2f311ba io_uring: fix warnings on shadow variables
There are a few of those:

io_uring/fdinfo.c:170:16: warning: declaration shadows a local variable [-Wshadow]
  170 |                 struct file *f = io_file_from_index(&ctx->file_table, i);
      |                              ^
io_uring/fdinfo.c:53:67: note: previous declaration is here
   53 | __cold void io_uring_show_fdinfo(struct seq_file *m, struct file *f)
      |                                                                   ^
io_uring/cancel.c:187:25: warning: declaration shadows a local variable [-Wshadow]
  187 |                 struct io_uring_task *tctx = node->task->io_uring;
      |                                       ^
io_uring/cancel.c:166:31: note: previous declaration is here
  166 |                              struct io_uring_task *tctx,
      |                                                    ^
io_uring/register.c:371:25: warning: declaration shadows a local variable [-Wshadow]
  371 |                 struct io_uring_task *tctx = node->task->io_uring;
      |                                       ^
io_uring/register.c:312:24: note: previous declaration is here
  312 |         struct io_uring_task *tctx = NULL;
      |                               ^

and a simple cleanup gets rid of them. For the fdinfo case, make a
distinction between the file being passed in (for the ring), and the
registered files we iterate. For the other two cases, just get rid of
shadowed variable, there's no reason to have a new one.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:26 -06:00
Jens Axboe
f15ed8b4d0 io_uring: move mapping/allocation helpers to a separate file
Move the related code from io_uring.c into memmap.c. No functional
changes in this patch, just cleaning it up a bit now that the full
transition is done.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:26 -06:00
Jens Axboe
18595c0a58 io_uring: use unpin_user_pages() where appropriate
There are a few cases of open-rolled loops around unpin_user_page(), use
the generic helper instead.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:26 -06:00
Jens Axboe
87585b0575 io_uring/kbuf: use vm_insert_pages() for mmap'ed pbuf ring
Rather than use remap_pfn_range() for this and manually free later,
switch to using vm_insert_page() and have it Just Work.

This requires a bit of effort on the mmap lookup side, as the ctx
uring_lock isn't held, which  otherwise protects buffer_lists from being
torn down, and it's not safe to grab from mmap context that would
introduce an ABBA deadlock between the mmap lock and the ctx uring_lock.
Instead, lookup the buffer_list under RCU, as the the list is RCU freed
already. Use the existing reference count to determine whether it's
possible to safely grab a reference to it (eg if it's not zero already),
and drop that reference when done with the mapping. If the mmap
reference is the last one, the buffer_list and the associated memory can
go away, since the vma insertion has references to the inserted pages at
that point.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:26 -06:00
Jens Axboe
e270bfd22a io_uring/kbuf: vmap pinned buffer ring
This avoids needing to care about HIGHMEM, and it makes the buffer
indexing easier as both ring provided buffer methods are now virtually
mapped in a contigious fashion.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:26 -06:00
Jens Axboe
1943f96b38 io_uring: unify io_pin_pages()
Move it into io_uring.c where it belongs, and use it in there as well
rather than have two implementations of this.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:26 -06:00
Jens Axboe
09fc75e0c0 io_uring: use vmap() for ring mapping
This is the last holdout which does odd page checking, convert it to
vmap just like what is done for the non-mmap path.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:26 -06:00
Jens Axboe
3ab1db3c60 io_uring: get rid of remap_pfn_range() for mapping rings/sqes
Rather than use remap_pfn_range() for this and manually free later,
switch to using vm_insert_pages() and have it Just Work.

If possible, allocate a single compound page that covers the range that
is needed. If that works, then we can just use page_address() on that
page. If we fail to get a compound page, allocate single pages and use
vmap() to map them into the kernel virtual address space.

This just covers the rings/sqes, the other remaining user of the mmap
remap_pfn_range() user will be converted separately. Once that is done,
we can kill the old alloc/free code.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:26 -06:00
Jens Axboe
22537c9f79 io_uring: use the right type for work_llist empty check
io_task_work_pending() uses wq_list_empty() on ctx->work_llist, but it's
not an io_wq_work_list, it's a struct llist_head. They both have
->first as head-of-list, and it turns out the checks are identical. But
be proper and use the right helper.

Fixes: dac6a0eae7 ("io_uring: ensure iopoll runs local task work as well")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:26 -06:00
Joel Granados
a80929d1cd io_uring: Remove the now superfluous sentinel elements from ctl_table array
This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels) which will
reduce the overall build time size of the kernel and run time memory
bloat by ~64 bytes per sentinel (further information Link :
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)

Remove sentinel element from kernel_io_uring_disabled_table

Signed-off-by: Joel Granados <j.granados@samsung.com>
Link: https://lore.kernel.org/r/20240328-jag-sysctl_remset_misc-v1-6-47c1463b3af2@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:26 -06:00
Jiapeng Chong
4e9706c6c8 io_uring: Remove unused function
The function are defined in the io_uring.c file, but not called
elsewhere, so delete the unused function.

io_uring/io_uring.c:646:20: warning: unused function '__io_cq_unlock'.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=8660
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240328022324.78029-1-jiapeng.chong@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:25 -06:00
Jens Axboe
77a1cd5e79 io_uring: re-arrange Makefile order
The object list is a bit of a mess, with core and opcode files mixed in.
Re-arrange it so that we have the core bits first, and then opcode
specific files after that.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:25 -06:00
Jens Axboe
05eb5fe226 io_uring: refill request cache in memory order
The allocator will generally return memory in order, but
__io_alloc_req_refill() then adds them to a stack and we'll extract them
in the opposite order. This obviously isn't a huge deal, but:

1) it makes debugging easier when they are in order
2) keeping them in-order is the right thing to do
3) reduces the code for adding them to the stack

Just add them in reverse to the stack.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:25 -06:00
Jens Axboe
da22bdf38b io_uring/poll: shrink alloc cache size to 32
This should be plenty, rather than the default of 128, and matches what
we have on the rsrc and futex side as well.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:25 -06:00
Jens Axboe
414d0f45c3 io_uring/alloc_cache: switch to array based caching
Currently lists are being used to manage this, but best practice is
usually to have these in an array instead as that it cheaper to manage.

Outside of that detail, games are also played with KASAN as the list
is inside the cached entry itself.

Finally, all users of this need a struct io_cache_entry embedded in
their struct, which is union'ized with something else in there that
isn't used across the free -> realloc cycle.

Get rid of all of that, and simply have it be an array. This will not
change the memory used, as we're just trading an 8-byte member entry
for the per-elem array size.

This reduces the overhead of the recycled allocations, and it reduces
the amount of code code needed to support recycling to about half of
what it currently is.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:25 -06:00
Jens Axboe
e10677a8f6 io_uring: drop ->prep_async()
It's now unused, drop the code related to it. This includes the
io_issue_defs->manual alloc field.

While in there, and since ->async_size is now being used a bit more
frequently and in the issue path, move it to io_issue_defs[].

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:25 -06:00
Jens Axboe
5eff57fa9f io_uring/uring_cmd: defer SQE copying until it's needed
The previous commit turned on async data for uring_cmd, and did the
basic conversion of setting everything up on the prep side. However, for
a lot of use cases, -EIOCBQUEUED will get returned on issue, as the
operation got successfully queued. For that case, a persistent SQE isn't
needed, as it's just used for issue.

Unless execution goes async immediately, defer copying the double SQE
until it's necessary.

This greatly reduces the overhead of such commands, as evidenced by
a perf diff from before and after this change:

    10.60%     -8.58%  [kernel.vmlinux]  [k] io_uring_cmd_prep

where the prep side drops from 10.60% to ~2%, which is more expected.
Performance also rises from ~113M IOPS to ~122M IOPS, bringing us back
to where it was before the async command prep.

Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:25 -06:00
Jens Axboe
d10f19dff5 io_uring/uring_cmd: switch to always allocating async data
Basic conversion ensuring async_data is allocated off the prep path. Adds
a basic alloc cache as well, as passthrough IO can be quite high in rate.

Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:25 -06:00
Jens Axboe
e2ea5a7069 io_uring/net: move connect to always using async data
While doing that, get rid of io_async_connect and just use the generic
io_async_msghdr. Both of them have a struct sockaddr_storage in there,
and while io_async_msghdr is bigger, if the same type can be used then
the netmsg_cache can get reused for connect as well.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:25 -06:00
Jens Axboe
d6f911a6b2 io_uring/rw: add iovec recycling
Let the io_async_rw hold on to the iovec and reuse it, rather than always
allocate and free them.

Also enables KASAN for the iovec entries, so that reuse can be detected
even while they are in the cache.

While doing so, shrink io_async_rw by getting rid of the bigger embedded
fast iovec. Since iovecs are being recycled now, shrink it from 8 to 1.
This reduces the io_async_rw size from 264 to 160 bytes, a 40% reduction.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:25 -06:00
Jens Axboe
cca6571381 io_uring/rw: cleanup retry path
We no longer need to gate a potential retry on whether or not the
context matches our original task, as all read/write operations have
been fully prepared upfront. This means there's never any re-import
needed, and hence we can always retry requests.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:25 -06:00
Jens Axboe
0d10bd77a1 io_uring: get rid of struct io_rw_state
A separate state struct is not needed anymore, just fold it in with
io_async_rw.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:25 -06:00
Jens Axboe
a9165b83c1 io_uring/rw: always setup io_async_rw for read/write requests
read/write requests try to put everything on the stack, and then alloc
and copy if a retry is needed. This necessitates a bunch of nasty code
that deals with intermediate state.

Get rid of this, and have the prep side setup everything that is needed
upfront, which greatly simplifies the opcode handlers.

This includes adding an alloc cache for io_async_rw, to make it cheap
to handle.

In terms of cost, this should be basically free and transparent. For
the worst case of {READ,WRITE}_FIXED which didn't need it before,
performance is unaffected in the normal peak workload that is being
used to test that. Still runs at 122M IOPS.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:25 -06:00
Jens Axboe
d80f940701 io_uring/net: drop 'kmsg' parameter from io_req_msg_cleanup()
Now that iovec recycling is being done, the iovec is no longer being
freed in there. Hence the kmsg parameter is now useless.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:25 -06:00
Jens Axboe
7519134178 io_uring/net: add iovec recycling
Right now the io_async_msghdr is recycled to avoid the overhead of
allocating+freeing it for every request. But the iovec is not included,
hence that will be allocated and freed for each transfer regardless.
This commit enables recyling of the iovec between io_async_msghdr
recycles. This avoids alloc+free for each one if an iovec is used, and
on top of that, it extends the cache hot nature of msg to the iovec as
well.

Also enables KASAN for the iovec entries, so that reuse can be detected
even while they are in the cache.

The io_async_msghdr also shrinks from 376 -> 288 bytes, an 88 byte
saving (or ~23% smaller), as the fast_iovec entry is dropped from 8
entries to a single entry. There's no point keeping a big fast iovec
entry, if iovecs aren't being allocated and freed continually.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:25 -06:00
Jens Axboe
9f8539fe29 io_uring/net: remove (now) dead code in io_netmsg_recycle()
All net commands have async data at this point, there's no reason to
check if this is the case or not.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:25 -06:00
Jens Axboe
6498c5c97c io_uring: kill io_msg_alloc_async_prep()
We now ONLY call io_msg_alloc_async() from inside prep handling, which
is always locked. No need for this helper anymore, or the check in
io_msg_alloc_async() on whether the ring is locked or not.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:25 -06:00
Jens Axboe
50220d6ac8 io_uring/net: get rid of ->prep_async() for send side
Move the io_async_msghdr out of the issue path and into prep handling,
e it's now done unconditionally and hence does not need to be part
of the issue path. This means any usage of io_sendrecv_prep_async() and
io_sendmsg_prep_async(), and hence the forced async setup path is now
unified with the normal prep setup.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:25 -06:00
Jens Axboe
c6f32c7d9e io_uring/net: get rid of ->prep_async() for receive side
Move the io_async_msghdr out of the issue path and into prep handling,
since it's now done unconditionally and hence does not need to be part
of the issue path. This reduces the footprint of the multishot fast
path of multiple invocations of ->issue() per prep, and also means that
using ->prep_async() can be dropped for recvmsg asthis is now done via
setup on the prep side.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:25 -06:00
Jens Axboe
3ba8345aec io_uring/net: always set kmsg->msg.msg_control_user before issue
We currently set this separately for async/sync entry, but let's just
move it to a generic pre-issue spot and eliminate the difference
between the two.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:25 -06:00
Jens Axboe
790b68b32a io_uring/net: always setup an io_async_msghdr
Rather than use an on-stack one and then need to allocate and copy if
async execution is required, always grab one upfront. This should be
very cheap, and potentially even have cache hotness benefits for
back-to-back send/recv requests.

For any recv type of request, this is probably a good choice in general,
as it's expected that no data is available initially. For send this is
not necessarily the case, as space in the socket buffer is expected to
be available. However, getting a cached io_async_msghdr is very cheap,
and as it should be cache hot, probably the difference here is neglible,
if any.

A nice side benefit is that io_setup_async_msg can get killed
completely, which has some nasty iovec manipulation code.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:25 -06:00
Jens Axboe
f5b00ab222 io_uring/net: unify cleanup handling
Now that recv/recvmsg both do the same cleanup, put it in the retry and
finish handlers.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:25 -06:00
Jens Axboe
4a3223f7bf io_uring/net: switch io_recv() to using io_async_msghdr
No functional changes in this patch, just in preparation for carrying
more state than what is available now, if necessary.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:25 -06:00
Jens Axboe
54cdcca05a io_uring/net: switch io_send() and io_send_zc() to using io_async_msghdr
No functional changes in this patch, just in preparation for carrying
more state then what is being done now, if necessary. While unifying
some of this code, add a generic send setup prep handler that they can
both use.

This gets rid of some manual msghdr and sockaddr on the stack, and makes
it look a bit more like the sendmsg/recvmsg variants. Going forward, more
can get unified on top.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:25 -06:00
Jens Axboe
0ae9b9a14d io_uring/alloc_cache: shrink default max entries from 512 to 128
In practice, we just need to recycle a few elements for (by far) most
use cases. Shrink the total size down from 512 to 128, which should be
more than plenty.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:25 -06:00
Jens Axboe
29f858a7c6 io_uring: remove timeout/poll specific cancelations
For historical reasons these were special cased, as they were the only
ones that needed cancelation. But now we handle cancelations generally,
and hence there's no need to check for these in
io_ring_ctx_wait_and_kill() when io_uring_try_cancel_requests() handles
both these and the rest as well.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:25 -06:00
Jens Axboe
2541762342 io_uring: flush delayed fallback task_work in cancelation
Just like we run the inline task_work, ensure we also factor in and
run the fallback task_work.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:24 -06:00
Pavel Begunkov
c133b3b06b io_uring: clean up io_lockdep_assert_cq_locked
Move CONFIG_PROVE_LOCKING checks inside of io_lockdep_assert_cq_locked()
and kill the else branch.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/bbf33c429c9f6d7207a8fe66d1a5866ec2c99850.1710799188.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:24 -06:00
Pavel Begunkov
0667db14e1 io_uring: refactor io_req_complete_post()
Make io_req_complete_post() to push all IORING_SETUP_IOPOLL requests
to task_work, it's much cleaner and should normally happen. We couldn't
do it before because there was a possibility of looping in

complete_post() -> tw -> complete_post() -> ...

Also, unexport the function and inline __io_req_complete_post().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/ea19c032ace3e0dd96ac4d991a063b0188037014.1710799188.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:24 -06:00
Pavel Begunkov
23fbdde620 io_uring: remove current check from complete_post
task_work execution is now always locked, and we shouldn't get into
io_req_complete_post() from them. That means that complete_post() is
always called out of the original task context and we don't even need to
check current.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/24ec27f27db0d8f58c974d8118dca1d345314ddc.1710799188.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:24 -06:00
Pavel Begunkov
902ce82c2a io_uring: get rid of intermediate aux cqe caches
io_post_aux_cqe(), which is used for multishot requests, delays
completions by putting CQEs into a temporary array for the purpose
completion lock/flush batching.

DEFER_TASKRUN doesn't need any locking, so for it we can put completions
directly into the CQ and defer post completion handling with a flag.
That leaves !DEFER_TASKRUN, which is not that interesting / hot for
multishot requests, so have conditional locking with deferred flush
for them.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/b1d05a81fd27aaa2a07f9860af13059e7ad7a890.1710799188.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:24 -06:00
Pavel Begunkov
e5c12945be io_uring: refactor io_fill_cqe_req_aux
The restriction on multishot execution context disallowing io-wq is
driven by rules of io_fill_cqe_req_aux(), it should only be called in
the master task context, either from the syscall path or in task_work.
Since task_work now always takes the ctx lock implying
IO_URING_F_COMPLETE_DEFER, we can just assume that the function is
always called with its defer argument set to true.

Kill the argument. Also rename the function for more consistency as
"fill" in CQE related functions was usually meant for raw interfaces
only copying data into the CQ without any locking, waking the user
and other accounting "post" functions take care of.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/93423d106c33116c7d06bf277f651aa68b427328.1710799188.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:24 -06:00
Pavel Begunkov
8e5b3b89ec io_uring: remove struct io_tw_state::locked
ctx is always locked for task_work now, so get rid of struct
io_tw_state::locked. Note I'm stopping one step before removing
io_tw_state altogether, which is not empty, because it still serves the
purpose of indicating which function is a tw callback and forcing users
not to invoke them carelessly out of a wrong context. The removal can
always be done later.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/e95e1ea116d0bfa54b656076e6a977bc221392a4.1710799188.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:24 -06:00
Pavel Begunkov
92219afb98 io_uring: force tw ctx locking
We can run normal task_work without locking the ctx, however we try to
lock anyway and most handlers prefer or require it locked. It might have
been interesting to multi-submitter ring with high contention completing
async read/write requests via task_work, however that will still need to
go through io_req_complete_post() and potentially take the lock for
rsrc node putting or some other case.

In other words, it's hard to care about it, so alawys force the locking.
The case described would also because of various io_uring caches.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/6ae858f2ef562e6ed9f13c60978c0d48926954ba.1710799188.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:24 -06:00
Pavel Begunkov
6e6b8c6212 io_uring/rw: avoid punting to io-wq directly
kiocb_done() should care to specifically redirecting requests to io-wq.
Remove the hopping to tw to then queue an io-wq, return -EAGAIN and let
the core code io_uring handle offloading.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/413564e550fe23744a970e1783dfa566291b0e6f.1710799188.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:24 -06:00
Pavel Begunkov
e1eef2e56c io_uring/cmd: fix tw <-> issue_flags conversion
!IO_URING_F_UNLOCKED does not translate to availability of the deferred
completion infra, IO_URING_F_COMPLETE_DEFER does, that what we should
pass and look for to use io_req_complete_defer() and other variants.

Luckily, it's not a real problem as two wrongs actually made it right,
at least as far as io_uring_cmd_work() goes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/aef76d34fe9410df8ecc42a14544fd76cd9d8b9e.1710799188.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:24 -06:00
Pavel Begunkov
6edd953b6e io_uring/cmd: kill one issue_flags to tw conversion
io_uring cmd converts struct io_tw_state to issue_flags and later back
to io_tw_state, it's awfully ill-fated, not to mention that intermediate
issue_flags state is not correct.

Get rid of the last conversion, drag through tw everything that came
with IO_URING_F_UNLOCKED, and replace io_req_complete_defer() with a
direct call to io_req_complete_defer(), at least for the time being.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/c53fa3df749752bd058cf6f824a90704822d6bcc.1710799188.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:24 -06:00
Pavel Begunkov
da12d9ab58 io_uring/cmd: move io_uring_try_cancel_uring_cmd()
io_uring_try_cancel_uring_cmd() is a part of the cmd handling so let's
move it closer to all cmd bits into uring_cmd.c

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/43a3937af4933655f0fd9362c381802f804f43de.1710799188.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:24 -06:00
Pavel Begunkov
4fe82aedeb io_uring/net: restore msg_control on sendzc retry
cac9e4418f ("io_uring/net: save msghdr->msg_control for retries")
reinstatiates msg_control before every __sys_sendmsg_sock(), since the
function can overwrite the value in msghdr. We need to do same for
zerocopy sendmsg.

Cc: stable@vger.kernel.org
Fixes: 493108d95f ("io_uring/net: zerocopy sendmsg")
Link: https://github.com/axboe/liburing/issues/1067
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/cc1d5d9df0576fa66ddad4420d240a98a020b267.1712596179.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-08 21:48:41 -06:00
Christian Brauner
210a03c9d5
fs: claw back a few FMODE_* bits
There's a bunch of flags that are purely based on what the file
operations support while also never being conditionally set or unset.
IOW, they're not subject to change for individual files. Imho, such
flags don't need to live in f_mode they might as well live in the fops
structs itself. And the fops struct already has that lonely
mmap_supported_flags member. We might as well turn that into a generic
fop_flags member and move a few flags from FMODE_* space into FOP_*
space. That gets us four FMODE_* bits back and the ability for new
static flags that are about file ops to not have to live in FMODE_*
space but in their own FOP_* space. It's not the most beautiful thing
ever but it gets the job done. Yes, there'll be an additional pointer
chase but hopefully that won't matter for these flags.

I suspect there's a few more we can move into there and that we can also
redirect a bunch of new flag suggestions that follow this pattern into
the fop_flags field instead of f_mode.

Link: https://lore.kernel.org/r/20240328-gewendet-spargel-aa60a030ef74@brauner
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-07 13:49:02 +02:00
Alexey Izbyshev
978e5c19df io_uring: Fix io_cqring_wait() not restoring sigmask on get_timespec64() failure
This bug was introduced in commit 950e79dd73 ("io_uring: minor
io_cqring_wait() optimization"), which was made in preparation for
adc8682ec6 ("io_uring: Add support for napi_busy_poll"). The latter
got reverted in cb31821673 ("Revert "io_uring: Add support for
napi_busy_poll""), so simply undo the former as well.

Cc: stable@vger.kernel.org
Fixes: 950e79dd73 ("io_uring: minor io_cqring_wait() optimization")
Signed-off-by: Alexey Izbyshev <izbyshev@ispras.ru>
Link: https://lore.kernel.org/r/20240405125551.237142-1-izbyshev@ispras.ru
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-05 20:05:41 -06:00
Jens Axboe
561e4f9451 io_uring/kbuf: hold io_buffer_list reference over mmap
If we look up the kbuf, ensure that it doesn't get unregistered until
after we're done with it. Since we're inside mmap, we cannot safely use
the io_uring lock. Rely on the fact that we can lookup the buffer list
under RCU now and grab a reference to it, preventing it from being
unregistered until we're done with it. The lookup returns the
io_buffer_list directly with it referenced.

Cc: stable@vger.kernel.org # v6.4+
Fixes: 5cf4f52e6d ("io_uring: free io_buffer_list entries via RCU")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-02 19:03:27 -06:00
Jens Axboe
6b69c4ab4f io_uring/kbuf: protect io_buffer_list teardown with a reference
No functional changes in this patch, just in preparation for being able
to keep the buffer list alive outside of the ctx->uring_lock.

Cc: stable@vger.kernel.org # v6.4+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-02 19:03:26 -06:00
Jens Axboe
3b80cff5a4 io_uring/kbuf: get rid of bl->is_ready
Now that xarray is being exclusively used for the buffer_list lookup,
this check is no longer needed. Get rid of it and the is_ready member.

Cc: stable@vger.kernel.org # v6.4+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-02 19:03:24 -06:00
Jens Axboe
09ab7eff38 io_uring/kbuf: get rid of lower BGID lists
Just rely on the xarray for any kind of bgid. This simplifies things, and
it really doesn't bring us much, if anything.

Cc: stable@vger.kernel.org # v6.4+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-02 19:03:13 -06:00
Jens Axboe
73eaa2b583 io_uring: use private workqueue for exit work
Rather than use the system unbound event workqueue, use an io_uring
specific one. This avoids dependencies with the tty, which also uses
the system_unbound_wq, and issues flushes of said workqueue from inside
its poll handling.

Cc: stable@vger.kernel.org
Reported-by: Rasmus Karlsson <rasmus.karlsson@pajlada.com>
Tested-by: Rasmus Karlsson <rasmus.karlsson@pajlada.com>
Tested-by: Iskren Chernev <me@iskren.info>
Link: https://github.com/axboe/liburing/issues/1113
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-02 07:35:16 -06:00
Jens Axboe
bee1d5becd io_uring: disable io-wq execution of multishot NOWAIT requests
Do the same check for direct io-wq execution for multishot requests that
commit 2a975d426c did for the inline execution, and disable multishot
mode (and revert to single shot) if the file type doesn't support NOWAIT,
and isn't opened in O_NONBLOCK mode. For multishot to work properly, it's
a requirement that nonblocking read attempts can be done.

Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-01 11:46:22 -06:00
Jens Axboe
2a975d426c io_uring/rw: don't allow multishot reads without NOWAIT support
Supporting multishot reads requires support for NOWAIT, as the
alternative would be always having io-wq execute the work item whenever
the poll readiness triggered. Any fast file type will have NOWAIT
support (eg it understands both O_NONBLOCK and IOCB_NOWAIT). If the
given file type does not, then simply resort to single shot execution.

Cc: stable@vger.kernel.org
Fixes: fc68fcda04 ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-01 11:41:50 -06:00
Jens Axboe
1251d2025c io_uring/sqpoll: early exit thread if task_context wasn't allocated
Ideally we'd want to simply kill the task rather than wake it, but for
now let's just add a startup check that causes the thread to exit.
This can only happen if io_uring_alloc_task_context() fails, which
generally requires fault injection.

Reported-by: Ubisectech Sirius <bugreport@ubisectech.com>
Fixes: af5d68f889 ("io_uring/sqpoll: manage task_work privately")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-18 20:22:42 -06:00
Jens Axboe
e21e1c45e1 io_uring: clear opcode specific data for an early failure
If failure happens before the opcode prep handler is called, ensure that
we clear the opcode specific area of the request, which holds data
specific to that request type. This prevents errors where opcode
handlers either don't get to clear per-request private data since prep
isn't even called.

Reported-and-tested-by: syzbot+f8e9a371388aa62ecab4@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-16 11:24:50 -06:00
Jens Axboe
f3a640cca9 io_uring/net: ensure async prep handlers always initialize ->done_io
If we get a request with IOSQE_ASYNC set, then we first run the prep
async handlers. But if we then fail setting it up and want to post
a CQE with -EINVAL, we use ->done_io. This was previously guarded with
REQ_F_PARTIAL_IO, and the normal setup handlers do set it up before any
potential errors, but we need to cover the async setup too.

Fixes: 9817ad8589 ("io_uring/net: remove dependency on REQ_F_PARTIAL_IO for sr->done_io")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-16 10:33:19 -06:00
Jens Axboe
2b35b8b43e io_uring/waitid: always remove waitid entry for cancel all
We know the request is either being removed, or already in the process of
being removed through task_work, so we can delete it from our waitid list
upfront. This is important for remove all conditions, as we otherwise
will find it multiple times and prevent cancelation progress.

Remove the dead check in cancelation as well for the hash_node being
empty or not. We already have a waitid reference check for ownership,
so we don't need to check the list too.

Cc: stable@vger.kernel.org
Fixes: f31ecf671d ("io_uring: add IORING_OP_WAITID support")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-15 15:42:49 -06:00
Jens Axboe
30dab608c3 io_uring/futex: always remove futex entry for cancel all
We know the request is either being removed, or already in the process of
being removed through task_work, so we can delete it from our futex list
upfront. This is important for remove all conditions, as we otherwise
will find it multiple times and prevent cancelation progress.

Cc: stable@vger.kernel.org
Fixes: 194bb58c60 ("io_uring: add support for futex wake and wait")
Fixes: 8f350194d5 ("io_uring: add support for vectored futex waits")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-15 15:37:15 -06:00
Pavel Begunkov
5e3afe580a io_uring: fix poll_remove stalled req completion
Taking the ctx lock is not enough to use the deferred request completion
infrastructure, it'll get queued into the list but no one would expect
it there, so it will sit there until next io_submit_flush_completions().
It's hard to care about the cancellation path, so complete it via tw.

Fixes: ef7dfac51d ("io_uring/poll: serialize poll linked timer start with poll removal")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c446740bc16858f8a2a8dcdce899812f21d15f23.1710514702.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-15 09:36:56 -06:00
Gabriel Krisman Bertazi
67d1189d10 io_uring: Fix release of pinned pages when __io_uaddr_map fails
Looking at the error path of __io_uaddr_map, if we fail after pinning
the pages for any reasons, ret will be set to -EINVAL and the error
handler won't properly release the pinned pages.

I didn't manage to trigger it without forcing a failure, but it can
happen in real life when memory is heavily fragmented.

Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Fixes: 223ef47431 ("io_uring: don't allow IORING_SETUP_NO_MMAP rings on highmem pages")
Link: https://lore.kernel.org/r/20240313213912.1920-1-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-13 16:08:25 -06:00
Pavel Begunkov
9219e4a9d4 io_uring/kbuf: rename is_mapped
In buffer lists we have ->is_mapped as well as ->is_mmap, it's
pretty hard to stay sane double checking which one means what,
and in the long run there is a high chance of an eventual bug.
Rename ->is_mapped into ->is_buf_ring.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c4838f4d8ad506ad6373f1c305aee2d2c1a89786.1710343154.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-13 14:50:42 -06:00
Pavel Begunkov
2c5c0ba117 io_uring: simplify io_pages_free
We never pass a null (top-level) pointer, remove the check.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0e1a46f9a5cd38e6876905e8030bdff9b0845e96.1710343154.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-13 14:50:42 -06:00
Pavel Begunkov
cef59d1ea7 io_uring: clean rings on NO_MMAP alloc fail
We make a few cancellation judgements based on ctx->rings, so let's
zero it afer deallocation for IORING_SETUP_NO_MMAP just like it's
done with the mmap case. Likely, it's not a real problem, but zeroing
is safer and better tested.

Cc: stable@vger.kernel.org
Fixes: 03d89a2de2 ("io_uring: support for user allocated memory for rings/sqes")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9ff6cdf91429b8a51699c210e1f6af6ea3f8bdcf.1710255382.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-12 09:21:36 -06:00
Jens Axboe
0a3737db84 io_uring/rw: return IOU_ISSUE_SKIP_COMPLETE for multishot retry
If read multishot is being invoked from the poll retry handler, then we
should return IOU_ISSUE_SKIP_COMPLETE rather than -EAGAIN. If not, then
a CQE will be posted with -EAGAIN rather than triggering the retry when
the file is flagged as readable again.

Cc: stable@vger.kernel.org
Reported-by: Sargun Dhillon <sargun@meta.com>
Fixes: fc68fcda04 ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-12 08:29:47 -06:00
Jens Axboe
6f0974eccb io_uring: don't save/restore iowait state
This kind of state is per-syscall, and since we're doing the waiting off
entering the io_uring_enter(2) syscall, there's no way that iowait can
already be set for this case. Simplify it by setting it if we need to,
and always clearing it to 0 when done.

Fixes: 7b72d661f1 ("io_uring: gate iowait schedule on having pending requests")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-11 15:02:59 -06:00
Linus Torvalds
d2c84bdce2 for-6.9/io_uring-20240310
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmXuD/AQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpsojEACNlJKqsebZv24szCR5ViBGqoDi/A5v5vZv
 1p7f0sVgpwFLuDu3CCb9IG1tuAiuhBa5yvBKKpyGuGglQd+7Sxqsgdc2Bv/76D7S
 Ej/fc1x5dxuvAvAetYk4yH2idPhYIBVIx3g2oz44bO4Ur3jFZ/yXzp+JtuKEuTba
 7kQmAXfN7c497XDsmSv1eJM/+D/LKjmvjqMX2gnXprw2qPgdAklXcUSnBYaS2JEt
 o4HGWAImJOV416d7QkOWgKfk6ksJbO3lFzQ6R+JdQCl6KVqc0+5u0oT06ZGVpSUf
 fQqfcV+cJw41dQB47Qr017ku0EdDI19L3YpL9/WOnNMBM421j1QER1cKiKfiHD2B
 LCOn+tvunxcGMzYonAFfgSF4XXFJWSK33TpvmmVsU3w0+YSC9oIqFfCxOdHuAJqB
 tHSuGHgzkufgqhNIQWHiWZEJJUW+MO4Dv2rUV6n+dfCz6JQG48Gs9clDv/tAEY4U
 4NzErfYLCsWlNaMPQK1f/b9dWjBXAnpJA4yq8jPyYB3GqjnVuX3Ze14UfwOWgv0B
 E++qgPsh30ShbP/NRHqS9tNQC2hIy27x/jzpTyKwxuoSs/nyeZg7lFXIPaQQo7wt
 GZhGzsMasbhoylqblB171NFlxpRetY9aYvHZ3OfUP4xAt1THVOzR6hZrBurOKMv/
 e8FBGBh/cg==
 =Hy//
 -----END PGP SIGNATURE-----

Merge tag 'for-6.9/io_uring-20240310' of git://git.kernel.dk/linux

Pull io_uring updates from Jens Axboe:

 - Make running of task_work internal loops more fair, and unify how the
   different methods deal with them (me)

 - Support for per-ring NAPI. The two minor networking patches are in a
   shared branch with netdev (Stefan)

 - Add support for truncate (Tony)

 - Export SQPOLL utilization stats (Xiaobing)

 - Multishot fixes (Pavel)

 - Fix for a race in manipulating the request flags via poll (Pavel)

 - Cleanup the multishot checking by making it generic, moving it out of
   opcode handlers (Pavel)

 - Various tweaks and cleanups (me, Kunwu, Alexander)

* tag 'for-6.9/io_uring-20240310' of git://git.kernel.dk/linux: (53 commits)
  io_uring: Fix sqpoll utilization check racing with dying sqpoll
  io_uring/net: dedup io_recv_finish req completion
  io_uring: refactor DEFER_TASKRUN multishot checks
  io_uring: fix mshot io-wq checks
  io_uring/net: add io_req_msg_cleanup() helper
  io_uring/net: simplify msghd->msg_inq checking
  io_uring/kbuf: rename REQ_F_PARTIAL_IO to REQ_F_BL_NO_RECYCLE
  io_uring/net: remove dependency on REQ_F_PARTIAL_IO for sr->done_io
  io_uring/net: correctly handle multishot recvmsg retry setup
  io_uring/net: clear REQ_F_BL_EMPTY in the multishot retry handler
  io_uring: fix io_queue_proc modifying req->flags
  io_uring: fix mshot read defer taskrun cqe posting
  io_uring/net: fix overflow check in io_recvmsg_mshot_prep()
  io_uring/net: correct the type of variable
  io_uring/sqpoll: statistics of the true utilization of sq threads
  io_uring/net: move recv/recvmsg flags out of retry loop
  io_uring/kbuf: flag request if buffer pool is empty after buffer pick
  io_uring/net: improve the usercopy for sendmsg/recvmsg
  io_uring/net: move receive multishot out of the generic msghdr path
  io_uring/net: unify how recvmsg and sendmsg copy in the msghdr
  ...
2024-03-11 11:35:31 -07:00
Gabriel Krisman Bertazi
606559dc4f io_uring: Fix sqpoll utilization check racing with dying sqpoll
Commit 3fcb9d1720 ("io_uring/sqpoll: statistics of the true
utilization of sq threads"), currently in Jens for-next branch, peeks at
io_sq_data->thread to report utilization statistics. But, If
io_uring_show_fdinfo races with sqpoll terminating, even though we hold
the ctx lock, sqd->thread might be NULL and we hit the Oops below.

Note that we could technically just protect the getrusage() call and the
sq total/work time calculations.  But showing some sq
information (pid/cpu) and not other information (utilization) is more
confusing than not reporting anything, IMO.  So let's hide it all if we
happen to race with a dying sqpoll.

This can be triggered consistently in my vm setup running
sqpoll-cancel-hang.t in a loop.

BUG: kernel NULL pointer dereference, address: 00000000000007b0
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 0 PID: 16587 Comm: systemd-coredum Not tainted 6.8.0-rc3-g3fcb9d17206e-dirty #69
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
RIP: 0010:getrusage+0x21/0x3e0
Code: 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 d1 48 89 e5 41 57 41 56 41 55 41 54 49 89 fe 41 52 53 48 89 d3 48 83 ec 30 <4c> 8b a7 b0 07 00 00 48 8d 7a 08 65 48 8b 04 25 28 00 00 00 48 89
RSP: 0018:ffffa166c671bb80 EFLAGS: 00010282
RAX: 00000000000040ca RBX: ffffa166c671bc60 RCX: ffffa166c671bc60
RDX: ffffa166c671bc60 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffffa166c671bbe0 R08: ffff9448cc3930c0 R09: 0000000000000000
R10: ffffa166c671bd50 R11: ffffffff9ee89260 R12: 0000000000000000
R13: ffff9448ce099480 R14: 0000000000000000 R15: ffff9448cff5b000
FS:  00007f786e225900(0000) GS:ffff94493bc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000007b0 CR3: 000000010d39c000 CR4: 0000000000750ef0
PKRU: 55555554
Call Trace:
 <TASK>
 ? __die_body+0x1a/0x60
 ? page_fault_oops+0x154/0x440
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? do_user_addr_fault+0x174/0x7c0
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? exc_page_fault+0x63/0x140
 ? asm_exc_page_fault+0x22/0x30
 ? getrusage+0x21/0x3e0
 ? seq_printf+0x4e/0x70
 io_uring_show_fdinfo+0x9db/0xa10
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? vsnprintf+0x101/0x4d0
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? seq_vprintf+0x34/0x50
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? seq_printf+0x4e/0x70
 ? seq_show+0x16b/0x1d0
 ? __pfx_io_uring_show_fdinfo+0x10/0x10
 seq_show+0x16b/0x1d0
 seq_read_iter+0xd7/0x440
 seq_read+0x102/0x140
 vfs_read+0xae/0x320
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? __do_sys_newfstat+0x35/0x60
 ksys_read+0xa5/0xe0
 do_syscall_64+0x50/0x110
 entry_SYSCALL_64_after_hwframe+0x6e/0x76
RIP: 0033:0x7f786ec1db4d
Code: e8 46 e3 01 00 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 80 3d d9 ce 0e 00 00 74 17 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 5b c3 66 2e 0f 1f 84 00 00 00 00 00 48 83 ec
RSP: 002b:00007ffcb361a4b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 000055a4c8fe42f0 RCX: 00007f786ec1db4d
RDX: 0000000000000400 RSI: 000055a4c8fe48a0 RDI: 0000000000000006
RBP: 00007f786ecfb0b0 R08: 00007f786ecfb2a8 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f786ecfaf60
R13: 000055a4c8fe42f0 R14: 0000000000000000 R15: 00007ffcb361a628
 </TASK>
Modules linked in:
CR2: 00000000000007b0
---[ end trace 0000000000000000 ]---
RIP: 0010:getrusage+0x21/0x3e0
Code: 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 d1 48 89 e5 41 57 41 56 41 55 41 54 49 89 fe 41 52 53 48 89 d3 48 83 ec 30 <4c> 8b a7 b0 07 00 00 48 8d 7a 08 65 48 8b 04 25 28 00 00 00 48 89
RSP: 0018:ffffa166c671bb80 EFLAGS: 00010282
RAX: 00000000000040ca RBX: ffffa166c671bc60 RCX: ffffa166c671bc60
RDX: ffffa166c671bc60 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffffa166c671bbe0 R08: ffff9448cc3930c0 R09: 0000000000000000
R10: ffffa166c671bd50 R11: ffffffff9ee89260 R12: 0000000000000000
R13: ffff9448ce099480 R14: 0000000000000000 R15: ffff9448cff5b000
FS:  00007f786e225900(0000) GS:ffff94493bc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000007b0 CR3: 000000010d39c000 CR4: 0000000000750ef0
PKRU: 55555554
Kernel panic - not syncing: Fatal exception
Kernel Offset: 0x1ce00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

Fixes: 3fcb9d1720 ("io_uring/sqpoll: statistics of the true utilization of sq threads")
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20240309003256.358-1-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-09 07:27:09 -07:00
Pavel Begunkov
1af04699c5 io_uring/net: dedup io_recv_finish req completion
There are two block in io_recv_finish() completing the request, which we
can combine and remove jumping.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0e338dcb33c88de83809fda021cba9e7c9681620.1709905727.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-08 07:59:20 -07:00
Pavel Begunkov
e0e4ab52d1 io_uring: refactor DEFER_TASKRUN multishot checks
We disallow DEFER_TASKRUN multishots from running by io-wq, which is
checked by individual opcodes in the issue path. We can consolidate all
it in io_wq_submit_work() at the same time moving the checks out of the
hot path.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e492f0f11588bb5aa11d7d24e6f53b7c7628afdb.1709905727.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-08 07:58:23 -07:00
Pavel Begunkov
3a96378e22 io_uring: fix mshot io-wq checks
When checking for concurrent CQE posting, we're not only interested in
requests running from the poll handler but also strayed requests ended
up in normal io-wq execution. We're disallowing multishots in general
from io-wq, not only when they came in a certain way.

Cc: stable@vger.kernel.org
Fixes: 17add5cea2 ("io_uring: force multishot CQEs into task context")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d8c5b36a39258036f93301cd60d3cd295e40653d.1709905727.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-08 07:58:23 -07:00
Jens Axboe
d9b441889c io_uring/net: add io_req_msg_cleanup() helper
For the fast inline path, we manually recycle the io_async_msghdr and
free the iovec, and then clear the REQ_F_NEED_CLEANUP flag to avoid
that needing doing in the slower path. We already do that in 2 spots, and
in preparation for adding more, add a helper and use it.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-08 07:57:27 -07:00
Jens Axboe
fb6328bc2a io_uring/net: simplify msghd->msg_inq checking
Just check for larger than zero rather than check for non-zero and
not -1. This is easier to read, and also protects against any errants
< 0 values that aren't -1.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-08 07:56:31 -07:00
Jens Axboe
186daf2385 io_uring/kbuf: rename REQ_F_PARTIAL_IO to REQ_F_BL_NO_RECYCLE
We only use the flag for this purpose, so rename it accordingly. This
further prevents various other use cases of it, keeping it clean and
consistent. Then we can also check it in one spot, when it's being
attempted recycled, and remove some dead code in io_kbuf_recycle_ring().

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-08 07:56:27 -07:00
Jens Axboe
9817ad8589 io_uring/net: remove dependency on REQ_F_PARTIAL_IO for sr->done_io
Ensure that prep handlers always initialize sr->done_io before any
potential failure conditions, and with that, we now it's always been
set even for the failure case.

With that, we don't need to use the REQ_F_PARTIAL_IO flag to gate on that.
Additionally, we should not overwrite req->cqe.res unless sr->done_io is
actually positive.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-08 07:56:21 -07:00
Jens Axboe
deaef31bc1 io_uring/net: correctly handle multishot recvmsg retry setup
If we loop for multishot receive on the initial attempt, and then abort
later on to wait for more, we miss a case where we should be copying the
io_async_msghdr from the stack to stable storage. This leads to the next
retry potentially failing, if the application had the msghdr on the
stack.

Cc: stable@vger.kernel.org
Fixes: 9bb66906f2 ("io_uring: support multishot in recvmsg")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-07 17:48:03 -07:00
Jens Axboe
b5311dbc2c io_uring/net: clear REQ_F_BL_EMPTY in the multishot retry handler
This flag should not be persistent across retries, so ensure we clear
it before potentially attemting a retry.

Fixes: c3f9109dbc ("io_uring/kbuf: flag request if buffer pool is empty after buffer pick")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-07 13:22:05 -07:00
Pavel Begunkov
1a8ec63b2b io_uring: fix io_queue_proc modifying req->flags
With multiple poll entries __io_queue_proc() might be running in
parallel with poll handlers and possibly task_work, we should not be
carelessly modifying req->flags there. io_poll_double_prepare() handles
a similar case with locking but it's much easier to move it into
__io_arm_poll_handler().

Cc: stable@vger.kernel.org
Fixes: 595e52284d ("io_uring/poll: don't enable lazy wake for POLLEXCLUSIVE")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/455cc49e38cf32026fa1b49670be8c162c2cb583.1709834755.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-07 11:10:28 -07:00
Pavel Begunkov
70581dcd06 io_uring: fix mshot read defer taskrun cqe posting
We can't post CQEs from io-wq with DEFER_TASKRUN set, normal completions
are handled but aux should be explicitly disallowed by opcode handlers.

Cc: stable@vger.kernel.org
Fixes: fc68fcda04 ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6fb7cba6f5366da25f4d3eb95273f062309d97fa.1709740837.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-07 06:30:33 -07:00
Dan Carpenter
8ede3db506 io_uring/net: fix overflow check in io_recvmsg_mshot_prep()
The "controllen" variable is type size_t (unsigned long).  Casting it
to int could lead to an integer underflow.

The check_add_overflow() function considers the type of the destination
which is type int.  If we add two positive values and the result cannot
fit in an integer then that's counted as an overflow.

However, if we cast "controllen" to an int and it turns negative, then
negative values *can* fit into an int type so there is no overflow.

Good: 100 + (unsigned long)-4 = 96  <-- overflow
 Bad: 100 + (int)-4 = 96 <-- no overflow

I deleted the cast of the sizeof() as well.  That's not a bug but the
cast is unnecessary.

Fixes: 9b0fc3c054 ("io_uring: fix types in io_recvmsg_multishot_overflow")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://lore.kernel.org/r/138bd2e2-ede8-4bcc-aa7b-f3d9de167a37@moroto.mountain
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-04 16:33:15 -07:00
Muhammad Usama Anjum
86bcacc957 io_uring/net: correct the type of variable
The namelen is of type int. It shouldn't be made size_t which is
unsigned. The signed number is needed for error checking before use.

Fixes: c55978024d ("io_uring/net: move receive multishot out of the generic msghdr path")
Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Link: https://lore.kernel.org/r/20240301144349.2807544-1-usama.anjum@collabora.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-04 16:33:11 -07:00