linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-22 12:11:40 +00:00

Author	SHA1	Message	Date
Pavel Begunkov	6b231248e9	io_uring: consolidate overflow flushing Consolidate __io_cqring_overflow_flush and io_cqring_overflow_kill() into a single function as it once was, it's easier to work with it this way. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/986b42c35e76a6be7aa0cdcda0a236a2222da3a7.1712708261.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:27 -06:00
Pavel Begunkov	8d09a88ef9	io_uring: always lock __io_cqring_overflow_flush Conditional locking is never great, in case of __io_cqring_overflow_flush(), which is a slow path, it's not justified. Don't handle IOPOLL separately, always grab uring_lock for overflow flushing. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/162947df299aa12693ac4b305dacedab32ec7976.1712708261.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Pavel Begunkov	408024b959	io_uring: open code io_cqring_overflow_flush() There is only one caller of io_cqring_overflow_flush(), open code it Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/a1fecd56d9dba923ed8d4d159727fa939d3baa2a.1712708261.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Pavel Begunkov	e45ec969d1	io_uring: remove extra SQPOLL overflow flush `c1edbf5f08` ("io_uring: flag SQPOLL busy condition to userspace") added an extra overflowed CQE flush in the SQPOLL submission path due to backpressure, which was later removed. Remove the flush and let io_cqring_wait() / iopoll handle it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2a83b0724ca6ca9d16c7d79a51b77c81876b2e39.1712708261.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Pavel Begunkov	a5bff51850	io_uring: unexport io_req_cqe_overflow() There are no users of io_req_cqe_overflow() apart from io_uring.c, make it static. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f4295eb2f9eb98d5db38c0578f57f0b86bfe0d8c.1712708261.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Pavel Begunkov	8c9a6f549e	io_uring: separate header for exported net bits We're exporting some io_uring bits to networking, e.g. for implementing a net callback for io_uring cmds, but we don't want to expose more than needed. Add a separate header for networking. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Link: https://lore.kernel.org/r/20240409210554.1878789-1-dw@davidwei.uk Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Pavel Begunkov	d285da7dbd	io_uring/net: set MSG_ZEROCOPY for sendzc in advance We can set MSG_ZEROCOPY at the preparation step, do it so we don't have to care about it later in the issue callback. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c2c22aaa577624977f045979a6db2b9fb2e5648c.1712534031.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Pavel Begunkov	6b7f864bb7	io_uring/net: get rid of io_notif_complete_tw_ext io_notif_complete_tw_ext() can be removed and combined with io_notif_complete_tw to make it simpler without sacrificing anything. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/025a124a5e20e2474a57e2f04f16c422eb83063c.1712534031.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Pavel Begunkov	998632921d	io_uring/net: merge ubuf sendzc callbacks Splitting io_tx_ubuf_callback_ext from io_tx_ubuf_callback is a pre mature optimisation that doesn't give us much. Merge the functions into one and reclaim some simplicity back. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d44d68f6f7add33a0dcf0b7fd7b73c2dc543604f.1712534031.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Ming Lei	bbbef3e9d2	io_uring: return void from io_put_kbuf_comp() The only caller doesn't handle the return value of io_put_kbuf_comp(), so change its return type into void. Also follow Jens's suggestion to rename it as io_put_kbuf_drop(). Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20240407132759.4056167-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Pavel Begunkov	c29006a245	io_uring: remove io_req_put_rsrc_locked() io_req_put_rsrc_locked() is a weird shim function around io_req_put_rsrc(). All calls to io_req_put_rsrc() require holding ->uring_lock, so we can just use it directly. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/a195bc78ac3d2c6fbaea72976e982fe51e50ecdd.1712331455.git.asml.silence@gmail.com Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Pavel Begunkov	d9713ad3fa	io_uring: remove async request cache io_req_complete_post() was a sole user of ->locked_free_list, but since we just gutted the function, the cache is not used anymore and can be removed. ->locked_free_list served as an asynhronous counterpart of the main request (i.e. struct io_kiocb) cache for all unlocked cases like io-wq. Now they're all forced to be completed into the main cache directly, off of the normal completion path or via io_free_req(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7bffccd213e370abd4de480e739d8b08ab6c1326.1712331455.git.asml.silence@gmail.com Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Pavel Begunkov	de96e9ae69	io_uring: turn implicit assumptions into a warning io_req_complete_post() is now io-wq only and shouldn't be used outside of it, i.e. it relies that io-wq holds a ref for the request as explained in a comment below. Let's add a warning to enforce the assumption and make sure nobody would try to do anything weird. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1013b60c35d431d0698cafbc53c06f5917348c20.1712331455.git.asml.silence@gmail.com Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Ming Lei	f39130004d	io_uring: kill dead code in io_req_complete_post Since commit 8f6c829491fe ("io_uring: remove struct io_tw_state::locked"), io_req_complete_post() is only called from io-wq submit work, where the request reference is guaranteed to be grabbed and won't drop to zero in io_req_complete_post(). Kill the dead code, meantime add req_ref_put() to put the reference. Cc: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1d8297e2046553153e763a52574f0e0f4d512f86.1712331455.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Jens Axboe	285207f67c	io_uring/kbuf: remove dead define We no longer use IO_BUFFER_LIST_BUF_PER_PAGE, kill it. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Jens Axboe	1da2f311ba	io_uring: fix warnings on shadow variables There are a few of those: io_uring/fdinfo.c:170:16: warning: declaration shadows a local variable [-Wshadow] 170 \| struct file f = io_file_from_index(&ctx->file_table, i); \| ^ io_uring/fdinfo.c:53:67: note: previous declaration is here 53 \| __cold void io_uring_show_fdinfo(struct seq_file m, struct file f) \| ^ io_uring/cancel.c:187:25: warning: declaration shadows a local variable [-Wshadow] 187 \| struct io_uring_task tctx = node->task->io_uring; \| ^ io_uring/cancel.c:166:31: note: previous declaration is here 166 \| struct io_uring_task tctx, \| ^ io_uring/register.c:371:25: warning: declaration shadows a local variable [-Wshadow] 371 \| struct io_uring_task tctx = node->task->io_uring; \| ^ io_uring/register.c:312:24: note: previous declaration is here 312 \| struct io_uring_task *tctx = NULL; \| ^ and a simple cleanup gets rid of them. For the fdinfo case, make a distinction between the file being passed in (for the ring), and the registered files we iterate. For the other two cases, just get rid of shadowed variable, there's no reason to have a new one. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Jens Axboe	f15ed8b4d0	io_uring: move mapping/allocation helpers to a separate file Move the related code from io_uring.c into memmap.c. No functional changes in this patch, just cleaning it up a bit now that the full transition is done. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Jens Axboe	18595c0a58	io_uring: use unpin_user_pages() where appropriate There are a few cases of open-rolled loops around unpin_user_page(), use the generic helper instead. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Jens Axboe	87585b0575	io_uring/kbuf: use vm_insert_pages() for mmap'ed pbuf ring Rather than use remap_pfn_range() for this and manually free later, switch to using vm_insert_page() and have it Just Work. This requires a bit of effort on the mmap lookup side, as the ctx uring_lock isn't held, which otherwise protects buffer_lists from being torn down, and it's not safe to grab from mmap context that would introduce an ABBA deadlock between the mmap lock and the ctx uring_lock. Instead, lookup the buffer_list under RCU, as the the list is RCU freed already. Use the existing reference count to determine whether it's possible to safely grab a reference to it (eg if it's not zero already), and drop that reference when done with the mapping. If the mmap reference is the last one, the buffer_list and the associated memory can go away, since the vma insertion has references to the inserted pages at that point. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Jens Axboe	e270bfd22a	io_uring/kbuf: vmap pinned buffer ring This avoids needing to care about HIGHMEM, and it makes the buffer indexing easier as both ring provided buffer methods are now virtually mapped in a contigious fashion. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Jens Axboe	1943f96b38	io_uring: unify io_pin_pages() Move it into io_uring.c where it belongs, and use it in there as well rather than have two implementations of this. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Jens Axboe	09fc75e0c0	io_uring: use vmap() for ring mapping This is the last holdout which does odd page checking, convert it to vmap just like what is done for the non-mmap path. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Jens Axboe	3ab1db3c60	io_uring: get rid of remap_pfn_range() for mapping rings/sqes Rather than use remap_pfn_range() for this and manually free later, switch to using vm_insert_pages() and have it Just Work. If possible, allocate a single compound page that covers the range that is needed. If that works, then we can just use page_address() on that page. If we fail to get a compound page, allocate single pages and use vmap() to map them into the kernel virtual address space. This just covers the rings/sqes, the other remaining user of the mmap remap_pfn_range() user will be converted separately. Once that is done, we can kill the old alloc/free code. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Jens Axboe	22537c9f79	io_uring: use the right type for work_llist empty check io_task_work_pending() uses wq_list_empty() on ctx->work_llist, but it's not an io_wq_work_list, it's a struct llist_head. They both have ->first as head-of-list, and it turns out the checks are identical. But be proper and use the right helper. Fixes: `dac6a0eae7` ("io_uring: ensure iopoll runs local task work as well") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Joel Granados	a80929d1cd	io_uring: Remove the now superfluous sentinel elements from ctl_table array This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels) which will reduce the overall build time size of the kernel and run time memory bloat by ~64 bytes per sentinel (further information Link : https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/) Remove sentinel element from kernel_io_uring_disabled_table Signed-off-by: Joel Granados <j.granados@samsung.com> Link: https://lore.kernel.org/r/20240328-jag-sysctl_remset_misc-v1-6-47c1463b3af2@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Jiapeng Chong	4e9706c6c8	io_uring: Remove unused function The function are defined in the io_uring.c file, but not called elsewhere, so delete the unused function. io_uring/io_uring.c:646:20: warning: unused function '__io_cq_unlock'. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=8660 Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Link: https://lore.kernel.org/r/20240328022324.78029-1-jiapeng.chong@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	77a1cd5e79	io_uring: re-arrange Makefile order The object list is a bit of a mess, with core and opcode files mixed in. Re-arrange it so that we have the core bits first, and then opcode specific files after that. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	05eb5fe226	io_uring: refill request cache in memory order The allocator will generally return memory in order, but __io_alloc_req_refill() then adds them to a stack and we'll extract them in the opposite order. This obviously isn't a huge deal, but: 1) it makes debugging easier when they are in order 2) keeping them in-order is the right thing to do 3) reduces the code for adding them to the stack Just add them in reverse to the stack. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	da22bdf38b	io_uring/poll: shrink alloc cache size to 32 This should be plenty, rather than the default of 128, and matches what we have on the rsrc and futex side as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	414d0f45c3	io_uring/alloc_cache: switch to array based caching Currently lists are being used to manage this, but best practice is usually to have these in an array instead as that it cheaper to manage. Outside of that detail, games are also played with KASAN as the list is inside the cached entry itself. Finally, all users of this need a struct io_cache_entry embedded in their struct, which is union'ized with something else in there that isn't used across the free -> realloc cycle. Get rid of all of that, and simply have it be an array. This will not change the memory used, as we're just trading an 8-byte member entry for the per-elem array size. This reduces the overhead of the recycled allocations, and it reduces the amount of code code needed to support recycling to about half of what it currently is. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	e10677a8f6	io_uring: drop ->prep_async() It's now unused, drop the code related to it. This includes the io_issue_defs->manual alloc field. While in there, and since ->async_size is now being used a bit more frequently and in the issue path, move it to io_issue_defs[]. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	5eff57fa9f	io_uring/uring_cmd: defer SQE copying until it's needed The previous commit turned on async data for uring_cmd, and did the basic conversion of setting everything up on the prep side. However, for a lot of use cases, -EIOCBQUEUED will get returned on issue, as the operation got successfully queued. For that case, a persistent SQE isn't needed, as it's just used for issue. Unless execution goes async immediately, defer copying the double SQE until it's necessary. This greatly reduces the overhead of such commands, as evidenced by a perf diff from before and after this change: 10.60% -8.58% [kernel.vmlinux] [k] io_uring_cmd_prep where the prep side drops from 10.60% to ~2%, which is more expected. Performance also rises from ~113M IOPS to ~122M IOPS, bringing us back to where it was before the async command prep. Tested-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	d10f19dff5	io_uring/uring_cmd: switch to always allocating async data Basic conversion ensuring async_data is allocated off the prep path. Adds a basic alloc cache as well, as passthrough IO can be quite high in rate. Tested-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	e2ea5a7069	io_uring/net: move connect to always using async data While doing that, get rid of io_async_connect and just use the generic io_async_msghdr. Both of them have a struct sockaddr_storage in there, and while io_async_msghdr is bigger, if the same type can be used then the netmsg_cache can get reused for connect as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	d6f911a6b2	io_uring/rw: add iovec recycling Let the io_async_rw hold on to the iovec and reuse it, rather than always allocate and free them. Also enables KASAN for the iovec entries, so that reuse can be detected even while they are in the cache. While doing so, shrink io_async_rw by getting rid of the bigger embedded fast iovec. Since iovecs are being recycled now, shrink it from 8 to 1. This reduces the io_async_rw size from 264 to 160 bytes, a 40% reduction. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	cca6571381	io_uring/rw: cleanup retry path We no longer need to gate a potential retry on whether or not the context matches our original task, as all read/write operations have been fully prepared upfront. This means there's never any re-import needed, and hence we can always retry requests. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	0d10bd77a1	io_uring: get rid of struct io_rw_state A separate state struct is not needed anymore, just fold it in with io_async_rw. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	a9165b83c1	io_uring/rw: always setup io_async_rw for read/write requests read/write requests try to put everything on the stack, and then alloc and copy if a retry is needed. This necessitates a bunch of nasty code that deals with intermediate state. Get rid of this, and have the prep side setup everything that is needed upfront, which greatly simplifies the opcode handlers. This includes adding an alloc cache for io_async_rw, to make it cheap to handle. In terms of cost, this should be basically free and transparent. For the worst case of {READ,WRITE}_FIXED which didn't need it before, performance is unaffected in the normal peak workload that is being used to test that. Still runs at 122M IOPS. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	d80f940701	io_uring/net: drop 'kmsg' parameter from io_req_msg_cleanup() Now that iovec recycling is being done, the iovec is no longer being freed in there. Hence the kmsg parameter is now useless. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	7519134178	io_uring/net: add iovec recycling Right now the io_async_msghdr is recycled to avoid the overhead of allocating+freeing it for every request. But the iovec is not included, hence that will be allocated and freed for each transfer regardless. This commit enables recyling of the iovec between io_async_msghdr recycles. This avoids alloc+free for each one if an iovec is used, and on top of that, it extends the cache hot nature of msg to the iovec as well. Also enables KASAN for the iovec entries, so that reuse can be detected even while they are in the cache. The io_async_msghdr also shrinks from 376 -> 288 bytes, an 88 byte saving (or ~23% smaller), as the fast_iovec entry is dropped from 8 entries to a single entry. There's no point keeping a big fast iovec entry, if iovecs aren't being allocated and freed continually. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	9f8539fe29	io_uring/net: remove (now) dead code in io_netmsg_recycle() All net commands have async data at this point, there's no reason to check if this is the case or not. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	6498c5c97c	io_uring: kill io_msg_alloc_async_prep() We now ONLY call io_msg_alloc_async() from inside prep handling, which is always locked. No need for this helper anymore, or the check in io_msg_alloc_async() on whether the ring is locked or not. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	50220d6ac8	io_uring/net: get rid of ->prep_async() for send side Move the io_async_msghdr out of the issue path and into prep handling, e it's now done unconditionally and hence does not need to be part of the issue path. This means any usage of io_sendrecv_prep_async() and io_sendmsg_prep_async(), and hence the forced async setup path is now unified with the normal prep setup. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	c6f32c7d9e	io_uring/net: get rid of ->prep_async() for receive side Move the io_async_msghdr out of the issue path and into prep handling, since it's now done unconditionally and hence does not need to be part of the issue path. This reduces the footprint of the multishot fast path of multiple invocations of ->issue() per prep, and also means that using ->prep_async() can be dropped for recvmsg asthis is now done via setup on the prep side. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	3ba8345aec	io_uring/net: always set kmsg->msg.msg_control_user before issue We currently set this separately for async/sync entry, but let's just move it to a generic pre-issue spot and eliminate the difference between the two. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	790b68b32a	io_uring/net: always setup an io_async_msghdr Rather than use an on-stack one and then need to allocate and copy if async execution is required, always grab one upfront. This should be very cheap, and potentially even have cache hotness benefits for back-to-back send/recv requests. For any recv type of request, this is probably a good choice in general, as it's expected that no data is available initially. For send this is not necessarily the case, as space in the socket buffer is expected to be available. However, getting a cached io_async_msghdr is very cheap, and as it should be cache hot, probably the difference here is neglible, if any. A nice side benefit is that io_setup_async_msg can get killed completely, which has some nasty iovec manipulation code. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	f5b00ab222	io_uring/net: unify cleanup handling Now that recv/recvmsg both do the same cleanup, put it in the retry and finish handlers. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	4a3223f7bf	io_uring/net: switch io_recv() to using io_async_msghdr No functional changes in this patch, just in preparation for carrying more state than what is available now, if necessary. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	54cdcca05a	io_uring/net: switch io_send() and io_send_zc() to using io_async_msghdr No functional changes in this patch, just in preparation for carrying more state then what is being done now, if necessary. While unifying some of this code, add a generic send setup prep handler that they can both use. This gets rid of some manual msghdr and sockaddr on the stack, and makes it look a bit more like the sendmsg/recvmsg variants. Going forward, more can get unified on top. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	0ae9b9a14d	io_uring/alloc_cache: shrink default max entries from 512 to 128 In practice, we just need to recycle a few elements for (by far) most use cases. Shrink the total size down from 512 to 128, which should be more than plenty. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	29f858a7c6	io_uring: remove timeout/poll specific cancelations For historical reasons these were special cased, as they were the only ones that needed cancelation. But now we handle cancelations generally, and hence there's no need to check for these in io_ring_ctx_wait_and_kill() when io_uring_try_cancel_requests() handles both these and the rest as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:25 -06:00
Jens Axboe	2541762342	io_uring: flush delayed fallback task_work in cancelation Just like we run the inline task_work, ensure we also factor in and run the fallback task_work. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:24 -06:00
Pavel Begunkov	c133b3b06b	io_uring: clean up io_lockdep_assert_cq_locked Move CONFIG_PROVE_LOCKING checks inside of io_lockdep_assert_cq_locked() and kill the else branch. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Tested-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/bbf33c429c9f6d7207a8fe66d1a5866ec2c99850.1710799188.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:24 -06:00
Pavel Begunkov	0667db14e1	io_uring: refactor io_req_complete_post() Make io_req_complete_post() to push all IORING_SETUP_IOPOLL requests to task_work, it's much cleaner and should normally happen. We couldn't do it before because there was a possibility of looping in complete_post() -> tw -> complete_post() -> ... Also, unexport the function and inline __io_req_complete_post(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Tested-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/ea19c032ace3e0dd96ac4d991a063b0188037014.1710799188.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:24 -06:00
Pavel Begunkov	23fbdde620	io_uring: remove current check from complete_post task_work execution is now always locked, and we shouldn't get into io_req_complete_post() from them. That means that complete_post() is always called out of the original task context and we don't even need to check current. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Tested-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/24ec27f27db0d8f58c974d8118dca1d345314ddc.1710799188.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:24 -06:00
Pavel Begunkov	902ce82c2a	io_uring: get rid of intermediate aux cqe caches io_post_aux_cqe(), which is used for multishot requests, delays completions by putting CQEs into a temporary array for the purpose completion lock/flush batching. DEFER_TASKRUN doesn't need any locking, so for it we can put completions directly into the CQ and defer post completion handling with a flag. That leaves !DEFER_TASKRUN, which is not that interesting / hot for multishot requests, so have conditional locking with deferred flush for them. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Tested-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/b1d05a81fd27aaa2a07f9860af13059e7ad7a890.1710799188.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:24 -06:00
Pavel Begunkov	e5c12945be	io_uring: refactor io_fill_cqe_req_aux The restriction on multishot execution context disallowing io-wq is driven by rules of io_fill_cqe_req_aux(), it should only be called in the master task context, either from the syscall path or in task_work. Since task_work now always takes the ctx lock implying IO_URING_F_COMPLETE_DEFER, we can just assume that the function is always called with its defer argument set to true. Kill the argument. Also rename the function for more consistency as "fill" in CQE related functions was usually meant for raw interfaces only copying data into the CQ without any locking, waking the user and other accounting "post" functions take care of. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Tested-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/93423d106c33116c7d06bf277f651aa68b427328.1710799188.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:24 -06:00
Pavel Begunkov	8e5b3b89ec	io_uring: remove struct io_tw_state::locked ctx is always locked for task_work now, so get rid of struct io_tw_state::locked. Note I'm stopping one step before removing io_tw_state altogether, which is not empty, because it still serves the purpose of indicating which function is a tw callback and forcing users not to invoke them carelessly out of a wrong context. The removal can always be done later. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Tested-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/e95e1ea116d0bfa54b656076e6a977bc221392a4.1710799188.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:24 -06:00
Pavel Begunkov	92219afb98	io_uring: force tw ctx locking We can run normal task_work without locking the ctx, however we try to lock anyway and most handlers prefer or require it locked. It might have been interesting to multi-submitter ring with high contention completing async read/write requests via task_work, however that will still need to go through io_req_complete_post() and potentially take the lock for rsrc node putting or some other case. In other words, it's hard to care about it, so alawys force the locking. The case described would also because of various io_uring caches. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Tested-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/6ae858f2ef562e6ed9f13c60978c0d48926954ba.1710799188.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:24 -06:00
Pavel Begunkov	6e6b8c6212	io_uring/rw: avoid punting to io-wq directly kiocb_done() should care to specifically redirecting requests to io-wq. Remove the hopping to tw to then queue an io-wq, return -EAGAIN and let the core code io_uring handle offloading. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Tested-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/413564e550fe23744a970e1783dfa566291b0e6f.1710799188.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:24 -06:00
Pavel Begunkov	e1eef2e56c	io_uring/cmd: fix tw <-> issue_flags conversion !IO_URING_F_UNLOCKED does not translate to availability of the deferred completion infra, IO_URING_F_COMPLETE_DEFER does, that what we should pass and look for to use io_req_complete_defer() and other variants. Luckily, it's not a real problem as two wrongs actually made it right, at least as far as io_uring_cmd_work() goes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Tested-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/aef76d34fe9410df8ecc42a14544fd76cd9d8b9e.1710799188.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:24 -06:00
Pavel Begunkov	6edd953b6e	io_uring/cmd: kill one issue_flags to tw conversion io_uring cmd converts struct io_tw_state to issue_flags and later back to io_tw_state, it's awfully ill-fated, not to mention that intermediate issue_flags state is not correct. Get rid of the last conversion, drag through tw everything that came with IO_URING_F_UNLOCKED, and replace io_req_complete_defer() with a direct call to io_req_complete_defer(), at least for the time being. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Tested-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/c53fa3df749752bd058cf6f824a90704822d6bcc.1710799188.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:24 -06:00
Pavel Begunkov	da12d9ab58	io_uring/cmd: move io_uring_try_cancel_uring_cmd() io_uring_try_cancel_uring_cmd() is a part of the cmd handling so let's move it closer to all cmd bits into uring_cmd.c Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Tested-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/43a3937af4933655f0fd9362c381802f804f43de.1710799188.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:24 -06:00
Pavel Begunkov	4fe82aedeb	io_uring/net: restore msg_control on sendzc retry `cac9e4418f` ("io_uring/net: save msghdr->msg_control for retries") reinstatiates msg_control before every __sys_sendmsg_sock(), since the function can overwrite the value in msghdr. We need to do same for zerocopy sendmsg. Cc: stable@vger.kernel.org Fixes: `493108d95f` ("io_uring/net: zerocopy sendmsg") Link: https://github.com/axboe/liburing/issues/1067 Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/cc1d5d9df0576fa66ddad4420d240a98a020b267.1712596179.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-08 21:48:41 -06:00
Christian Brauner	210a03c9d5	fs: claw back a few FMODE_* bits There's a bunch of flags that are purely based on what the file operations support while also never being conditionally set or unset. IOW, they're not subject to change for individual files. Imho, such flags don't need to live in f_mode they might as well live in the fops structs itself. And the fops struct already has that lonely mmap_supported_flags member. We might as well turn that into a generic fop_flags member and move a few flags from FMODE_* space into FOP_* space. That gets us four FMODE_* bits back and the ability for new static flags that are about file ops to not have to live in FMODE_* space but in their own FOP_* space. It's not the most beautiful thing ever but it gets the job done. Yes, there'll be an additional pointer chase but hopefully that won't matter for these flags. I suspect there's a few more we can move into there and that we can also redirect a bunch of new flag suggestions that follow this pattern into the fop_flags field instead of f_mode. Link: https://lore.kernel.org/r/20240328-gewendet-spargel-aa60a030ef74@brauner Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-04-07 13:49:02 +02:00
Alexey Izbyshev	978e5c19df	io_uring: Fix io_cqring_wait() not restoring sigmask on get_timespec64() failure This bug was introduced in commit `950e79dd73` ("io_uring: minor io_cqring_wait() optimization"), which was made in preparation for `adc8682ec6` ("io_uring: Add support for napi_busy_poll"). The latter got reverted in `cb31821673` ("Revert "io_uring: Add support for napi_busy_poll""), so simply undo the former as well. Cc: stable@vger.kernel.org Fixes: `950e79dd73` ("io_uring: minor io_cqring_wait() optimization") Signed-off-by: Alexey Izbyshev <izbyshev@ispras.ru> Link: https://lore.kernel.org/r/20240405125551.237142-1-izbyshev@ispras.ru Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-05 20:05:41 -06:00
Jens Axboe	561e4f9451	io_uring/kbuf: hold io_buffer_list reference over mmap If we look up the kbuf, ensure that it doesn't get unregistered until after we're done with it. Since we're inside mmap, we cannot safely use the io_uring lock. Rely on the fact that we can lookup the buffer list under RCU now and grab a reference to it, preventing it from being unregistered until we're done with it. The lookup returns the io_buffer_list directly with it referenced. Cc: stable@vger.kernel.org # v6.4+ Fixes: `5cf4f52e6d` ("io_uring: free io_buffer_list entries via RCU") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-02 19:03:27 -06:00
Jens Axboe	6b69c4ab4f	io_uring/kbuf: protect io_buffer_list teardown with a reference No functional changes in this patch, just in preparation for being able to keep the buffer list alive outside of the ctx->uring_lock. Cc: stable@vger.kernel.org # v6.4+ Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-02 19:03:26 -06:00
Jens Axboe	3b80cff5a4	io_uring/kbuf: get rid of bl->is_ready Now that xarray is being exclusively used for the buffer_list lookup, this check is no longer needed. Get rid of it and the is_ready member. Cc: stable@vger.kernel.org # v6.4+ Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-02 19:03:24 -06:00
Jens Axboe	09ab7eff38	io_uring/kbuf: get rid of lower BGID lists Just rely on the xarray for any kind of bgid. This simplifies things, and it really doesn't bring us much, if anything. Cc: stable@vger.kernel.org # v6.4+ Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-02 19:03:13 -06:00
Jens Axboe	73eaa2b583	io_uring: use private workqueue for exit work Rather than use the system unbound event workqueue, use an io_uring specific one. This avoids dependencies with the tty, which also uses the system_unbound_wq, and issues flushes of said workqueue from inside its poll handling. Cc: stable@vger.kernel.org Reported-by: Rasmus Karlsson <rasmus.karlsson@pajlada.com> Tested-by: Rasmus Karlsson <rasmus.karlsson@pajlada.com> Tested-by: Iskren Chernev <me@iskren.info> Link: https://github.com/axboe/liburing/issues/1113 Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-02 07:35:16 -06:00
Jens Axboe	bee1d5becd	io_uring: disable io-wq execution of multishot NOWAIT requests Do the same check for direct io-wq execution for multishot requests that commit `2a975d426c` did for the inline execution, and disable multishot mode (and revert to single shot) if the file type doesn't support NOWAIT, and isn't opened in O_NONBLOCK mode. For multishot to work properly, it's a requirement that nonblocking read attempts can be done. Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-01 11:46:22 -06:00
Jens Axboe	2a975d426c	io_uring/rw: don't allow multishot reads without NOWAIT support Supporting multishot reads requires support for NOWAIT, as the alternative would be always having io-wq execute the work item whenever the poll readiness triggered. Any fast file type will have NOWAIT support (eg it understands both O_NONBLOCK and IOCB_NOWAIT). If the given file type does not, then simply resort to single shot execution. Cc: stable@vger.kernel.org Fixes: `fc68fcda04` ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-01 11:41:50 -06:00
Jens Axboe	1251d2025c	io_uring/sqpoll: early exit thread if task_context wasn't allocated Ideally we'd want to simply kill the task rather than wake it, but for now let's just add a startup check that causes the thread to exit. This can only happen if io_uring_alloc_task_context() fails, which generally requires fault injection. Reported-by: Ubisectech Sirius <bugreport@ubisectech.com> Fixes: `af5d68f889` ("io_uring/sqpoll: manage task_work privately") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-18 20:22:42 -06:00
Jens Axboe	e21e1c45e1	io_uring: clear opcode specific data for an early failure If failure happens before the opcode prep handler is called, ensure that we clear the opcode specific area of the request, which holds data specific to that request type. This prevents errors where opcode handlers either don't get to clear per-request private data since prep isn't even called. Reported-and-tested-by: syzbot+f8e9a371388aa62ecab4@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-16 11:24:50 -06:00
Jens Axboe	f3a640cca9	io_uring/net: ensure async prep handlers always initialize ->done_io If we get a request with IOSQE_ASYNC set, then we first run the prep async handlers. But if we then fail setting it up and want to post a CQE with -EINVAL, we use ->done_io. This was previously guarded with REQ_F_PARTIAL_IO, and the normal setup handlers do set it up before any potential errors, but we need to cover the async setup too. Fixes: `9817ad8589` ("io_uring/net: remove dependency on REQ_F_PARTIAL_IO for sr->done_io") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-16 10:33:19 -06:00
Jens Axboe	2b35b8b43e	io_uring/waitid: always remove waitid entry for cancel all We know the request is either being removed, or already in the process of being removed through task_work, so we can delete it from our waitid list upfront. This is important for remove all conditions, as we otherwise will find it multiple times and prevent cancelation progress. Remove the dead check in cancelation as well for the hash_node being empty or not. We already have a waitid reference check for ownership, so we don't need to check the list too. Cc: stable@vger.kernel.org Fixes: `f31ecf671d` ("io_uring: add IORING_OP_WAITID support") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-15 15:42:49 -06:00
Jens Axboe	30dab608c3	io_uring/futex: always remove futex entry for cancel all We know the request is either being removed, or already in the process of being removed through task_work, so we can delete it from our futex list upfront. This is important for remove all conditions, as we otherwise will find it multiple times and prevent cancelation progress. Cc: stable@vger.kernel.org Fixes: `194bb58c60` ("io_uring: add support for futex wake and wait") Fixes: `8f350194d5` ("io_uring: add support for vectored futex waits") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-15 15:37:15 -06:00
Pavel Begunkov	5e3afe580a	io_uring: fix poll_remove stalled req completion Taking the ctx lock is not enough to use the deferred request completion infrastructure, it'll get queued into the list but no one would expect it there, so it will sit there until next io_submit_flush_completions(). It's hard to care about the cancellation path, so complete it via tw. Fixes: `ef7dfac51d` ("io_uring/poll: serialize poll linked timer start with poll removal") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c446740bc16858f8a2a8dcdce899812f21d15f23.1710514702.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-15 09:36:56 -06:00
Gabriel Krisman Bertazi	67d1189d10	io_uring: Fix release of pinned pages when __io_uaddr_map fails Looking at the error path of __io_uaddr_map, if we fail after pinning the pages for any reasons, ret will be set to -EINVAL and the error handler won't properly release the pinned pages. I didn't manage to trigger it without forcing a failure, but it can happen in real life when memory is heavily fragmented. Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Fixes: `223ef47431` ("io_uring: don't allow IORING_SETUP_NO_MMAP rings on highmem pages") Link: https://lore.kernel.org/r/20240313213912.1920-1-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-13 16:08:25 -06:00
Pavel Begunkov	9219e4a9d4	io_uring/kbuf: rename is_mapped In buffer lists we have ->is_mapped as well as ->is_mmap, it's pretty hard to stay sane double checking which one means what, and in the long run there is a high chance of an eventual bug. Rename ->is_mapped into ->is_buf_ring. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c4838f4d8ad506ad6373f1c305aee2d2c1a89786.1710343154.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-13 14:50:42 -06:00
Pavel Begunkov	2c5c0ba117	io_uring: simplify io_pages_free We never pass a null (top-level) pointer, remove the check. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0e1a46f9a5cd38e6876905e8030bdff9b0845e96.1710343154.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-13 14:50:42 -06:00
Pavel Begunkov	cef59d1ea7	io_uring: clean rings on NO_MMAP alloc fail We make a few cancellation judgements based on ctx->rings, so let's zero it afer deallocation for IORING_SETUP_NO_MMAP just like it's done with the mmap case. Likely, it's not a real problem, but zeroing is safer and better tested. Cc: stable@vger.kernel.org Fixes: `03d89a2de2` ("io_uring: support for user allocated memory for rings/sqes") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9ff6cdf91429b8a51699c210e1f6af6ea3f8bdcf.1710255382.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-12 09:21:36 -06:00
Jens Axboe	0a3737db84	io_uring/rw: return IOU_ISSUE_SKIP_COMPLETE for multishot retry If read multishot is being invoked from the poll retry handler, then we should return IOU_ISSUE_SKIP_COMPLETE rather than -EAGAIN. If not, then a CQE will be posted with -EAGAIN rather than triggering the retry when the file is flagged as readable again. Cc: stable@vger.kernel.org Reported-by: Sargun Dhillon <sargun@meta.com> Fixes: `fc68fcda04` ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-12 08:29:47 -06:00
Jens Axboe	6f0974eccb	io_uring: don't save/restore iowait state This kind of state is per-syscall, and since we're doing the waiting off entering the io_uring_enter(2) syscall, there's no way that iowait can already be set for this case. Simplify it by setting it if we need to, and always clearing it to 0 when done. Fixes: `7b72d661f1` ("io_uring: gate iowait schedule on having pending requests") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-11 15:02:59 -06:00
Linus Torvalds	d2c84bdce2	for-6.9/io_uring-20240310 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmXuD/AQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpsojEACNlJKqsebZv24szCR5ViBGqoDi/A5v5vZv 1p7f0sVgpwFLuDu3CCb9IG1tuAiuhBa5yvBKKpyGuGglQd+7Sxqsgdc2Bv/76D7S Ej/fc1x5dxuvAvAetYk4yH2idPhYIBVIx3g2oz44bO4Ur3jFZ/yXzp+JtuKEuTba 7kQmAXfN7c497XDsmSv1eJM/+D/LKjmvjqMX2gnXprw2qPgdAklXcUSnBYaS2JEt o4HGWAImJOV416d7QkOWgKfk6ksJbO3lFzQ6R+JdQCl6KVqc0+5u0oT06ZGVpSUf fQqfcV+cJw41dQB47Qr017ku0EdDI19L3YpL9/WOnNMBM421j1QER1cKiKfiHD2B LCOn+tvunxcGMzYonAFfgSF4XXFJWSK33TpvmmVsU3w0+YSC9oIqFfCxOdHuAJqB tHSuGHgzkufgqhNIQWHiWZEJJUW+MO4Dv2rUV6n+dfCz6JQG48Gs9clDv/tAEY4U 4NzErfYLCsWlNaMPQK1f/b9dWjBXAnpJA4yq8jPyYB3GqjnVuX3Ze14UfwOWgv0B E++qgPsh30ShbP/NRHqS9tNQC2hIy27x/jzpTyKwxuoSs/nyeZg7lFXIPaQQo7wt GZhGzsMasbhoylqblB171NFlxpRetY9aYvHZ3OfUP4xAt1THVOzR6hZrBurOKMv/ e8FBGBh/cg== =Hy// -----END PGP SIGNATURE----- Merge tag 'for-6.9/io_uring-20240310' of git://git.kernel.dk/linux Pull io_uring updates from Jens Axboe: - Make running of task_work internal loops more fair, and unify how the different methods deal with them (me) - Support for per-ring NAPI. The two minor networking patches are in a shared branch with netdev (Stefan) - Add support for truncate (Tony) - Export SQPOLL utilization stats (Xiaobing) - Multishot fixes (Pavel) - Fix for a race in manipulating the request flags via poll (Pavel) - Cleanup the multishot checking by making it generic, moving it out of opcode handlers (Pavel) - Various tweaks and cleanups (me, Kunwu, Alexander) * tag 'for-6.9/io_uring-20240310' of git://git.kernel.dk/linux: (53 commits) io_uring: Fix sqpoll utilization check racing with dying sqpoll io_uring/net: dedup io_recv_finish req completion io_uring: refactor DEFER_TASKRUN multishot checks io_uring: fix mshot io-wq checks io_uring/net: add io_req_msg_cleanup() helper io_uring/net: simplify msghd->msg_inq checking io_uring/kbuf: rename REQ_F_PARTIAL_IO to REQ_F_BL_NO_RECYCLE io_uring/net: remove dependency on REQ_F_PARTIAL_IO for sr->done_io io_uring/net: correctly handle multishot recvmsg retry setup io_uring/net: clear REQ_F_BL_EMPTY in the multishot retry handler io_uring: fix io_queue_proc modifying req->flags io_uring: fix mshot read defer taskrun cqe posting io_uring/net: fix overflow check in io_recvmsg_mshot_prep() io_uring/net: correct the type of variable io_uring/sqpoll: statistics of the true utilization of sq threads io_uring/net: move recv/recvmsg flags out of retry loop io_uring/kbuf: flag request if buffer pool is empty after buffer pick io_uring/net: improve the usercopy for sendmsg/recvmsg io_uring/net: move receive multishot out of the generic msghdr path io_uring/net: unify how recvmsg and sendmsg copy in the msghdr ...	2024-03-11 11:35:31 -07:00
Gabriel Krisman Bertazi	606559dc4f	io_uring: Fix sqpoll utilization check racing with dying sqpoll Commit `3fcb9d1720` ("io_uring/sqpoll: statistics of the true utilization of sq threads"), currently in Jens for-next branch, peeks at io_sq_data->thread to report utilization statistics. But, If io_uring_show_fdinfo races with sqpoll terminating, even though we hold the ctx lock, sqd->thread might be NULL and we hit the Oops below. Note that we could technically just protect the getrusage() call and the sq total/work time calculations. But showing some sq information (pid/cpu) and not other information (utilization) is more confusing than not reporting anything, IMO. So let's hide it all if we happen to race with a dying sqpoll. This can be triggered consistently in my vm setup running sqpoll-cancel-hang.t in a loop. BUG: kernel NULL pointer dereference, address: 00000000000007b0 PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP NOPTI CPU: 0 PID: 16587 Comm: systemd-coredum Not tainted 6.8.0-rc3-g3fcb9d17206e-dirty #69 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022 RIP: 0010:getrusage+0x21/0x3e0 Code: 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 d1 48 89 e5 41 57 41 56 41 55 41 54 49 89 fe 41 52 53 48 89 d3 48 83 ec 30 <4c> 8b a7 b0 07 00 00 48 8d 7a 08 65 48 8b 04 25 28 00 00 00 48 89 RSP: 0018:ffffa166c671bb80 EFLAGS: 00010282 RAX: 00000000000040ca RBX: ffffa166c671bc60 RCX: ffffa166c671bc60 RDX: ffffa166c671bc60 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffffa166c671bbe0 R08: ffff9448cc3930c0 R09: 0000000000000000 R10: ffffa166c671bd50 R11: ffffffff9ee89260 R12: 0000000000000000 R13: ffff9448ce099480 R14: 0000000000000000 R15: ffff9448cff5b000 FS: 00007f786e225900(0000) GS:ffff94493bc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000007b0 CR3: 000000010d39c000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> ? __die_body+0x1a/0x60 ? page_fault_oops+0x154/0x440 ? srso_alias_return_thunk+0x5/0xfbef5 ? do_user_addr_fault+0x174/0x7c0 ? srso_alias_return_thunk+0x5/0xfbef5 ? exc_page_fault+0x63/0x140 ? asm_exc_page_fault+0x22/0x30 ? getrusage+0x21/0x3e0 ? seq_printf+0x4e/0x70 io_uring_show_fdinfo+0x9db/0xa10 ? srso_alias_return_thunk+0x5/0xfbef5 ? vsnprintf+0x101/0x4d0 ? srso_alias_return_thunk+0x5/0xfbef5 ? seq_vprintf+0x34/0x50 ? srso_alias_return_thunk+0x5/0xfbef5 ? seq_printf+0x4e/0x70 ? seq_show+0x16b/0x1d0 ? __pfx_io_uring_show_fdinfo+0x10/0x10 seq_show+0x16b/0x1d0 seq_read_iter+0xd7/0x440 seq_read+0x102/0x140 vfs_read+0xae/0x320 ? srso_alias_return_thunk+0x5/0xfbef5 ? __do_sys_newfstat+0x35/0x60 ksys_read+0xa5/0xe0 do_syscall_64+0x50/0x110 entry_SYSCALL_64_after_hwframe+0x6e/0x76 RIP: 0033:0x7f786ec1db4d Code: e8 46 e3 01 00 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 80 3d d9 ce 0e 00 00 74 17 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 5b c3 66 2e 0f 1f 84 00 00 00 00 00 48 83 ec RSP: 002b:00007ffcb361a4b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 RAX: ffffffffffffffda RBX: 000055a4c8fe42f0 RCX: 00007f786ec1db4d RDX: 0000000000000400 RSI: 000055a4c8fe48a0 RDI: 0000000000000006 RBP: 00007f786ecfb0b0 R08: 00007f786ecfb2a8 R09: 0000000000000001 R10: 0000000000000000 R11: 0000000000000246 R12: 00007f786ecfaf60 R13: 000055a4c8fe42f0 R14: 0000000000000000 R15: 00007ffcb361a628 </TASK> Modules linked in: CR2: 00000000000007b0 ---[ end trace 0000000000000000 ]--- RIP: 0010:getrusage+0x21/0x3e0 Code: 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 d1 48 89 e5 41 57 41 56 41 55 41 54 49 89 fe 41 52 53 48 89 d3 48 83 ec 30 <4c> 8b a7 b0 07 00 00 48 8d 7a 08 65 48 8b 04 25 28 00 00 00 48 89 RSP: 0018:ffffa166c671bb80 EFLAGS: 00010282 RAX: 00000000000040ca RBX: ffffa166c671bc60 RCX: ffffa166c671bc60 RDX: ffffa166c671bc60 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffffa166c671bbe0 R08: ffff9448cc3930c0 R09: 0000000000000000 R10: ffffa166c671bd50 R11: ffffffff9ee89260 R12: 0000000000000000 R13: ffff9448ce099480 R14: 0000000000000000 R15: ffff9448cff5b000 FS: 00007f786e225900(0000) GS:ffff94493bc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000007b0 CR3: 000000010d39c000 CR4: 0000000000750ef0 PKRU: 55555554 Kernel panic - not syncing: Fatal exception Kernel Offset: 0x1ce00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) Fixes: `3fcb9d1720` ("io_uring/sqpoll: statistics of the true utilization of sq threads") Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20240309003256.358-1-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-09 07:27:09 -07:00
Pavel Begunkov	1af04699c5	io_uring/net: dedup io_recv_finish req completion There are two block in io_recv_finish() completing the request, which we can combine and remove jumping. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0e338dcb33c88de83809fda021cba9e7c9681620.1709905727.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-08 07:59:20 -07:00
Pavel Begunkov	e0e4ab52d1	io_uring: refactor DEFER_TASKRUN multishot checks We disallow DEFER_TASKRUN multishots from running by io-wq, which is checked by individual opcodes in the issue path. We can consolidate all it in io_wq_submit_work() at the same time moving the checks out of the hot path. Suggested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e492f0f11588bb5aa11d7d24e6f53b7c7628afdb.1709905727.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-08 07:58:23 -07:00
Pavel Begunkov	3a96378e22	io_uring: fix mshot io-wq checks When checking for concurrent CQE posting, we're not only interested in requests running from the poll handler but also strayed requests ended up in normal io-wq execution. We're disallowing multishots in general from io-wq, not only when they came in a certain way. Cc: stable@vger.kernel.org Fixes: `17add5cea2` ("io_uring: force multishot CQEs into task context") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d8c5b36a39258036f93301cd60d3cd295e40653d.1709905727.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-08 07:58:23 -07:00
Jens Axboe	d9b441889c	io_uring/net: add io_req_msg_cleanup() helper For the fast inline path, we manually recycle the io_async_msghdr and free the iovec, and then clear the REQ_F_NEED_CLEANUP flag to avoid that needing doing in the slower path. We already do that in 2 spots, and in preparation for adding more, add a helper and use it. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-08 07:57:27 -07:00
Jens Axboe	fb6328bc2a	io_uring/net: simplify msghd->msg_inq checking Just check for larger than zero rather than check for non-zero and not -1. This is easier to read, and also protects against any errants < 0 values that aren't -1. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-08 07:56:31 -07:00
Jens Axboe	186daf2385	io_uring/kbuf: rename REQ_F_PARTIAL_IO to REQ_F_BL_NO_RECYCLE We only use the flag for this purpose, so rename it accordingly. This further prevents various other use cases of it, keeping it clean and consistent. Then we can also check it in one spot, when it's being attempted recycled, and remove some dead code in io_kbuf_recycle_ring(). Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-08 07:56:27 -07:00
Jens Axboe	9817ad8589	io_uring/net: remove dependency on REQ_F_PARTIAL_IO for sr->done_io Ensure that prep handlers always initialize sr->done_io before any potential failure conditions, and with that, we now it's always been set even for the failure case. With that, we don't need to use the REQ_F_PARTIAL_IO flag to gate on that. Additionally, we should not overwrite req->cqe.res unless sr->done_io is actually positive. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-08 07:56:21 -07:00
Jens Axboe	deaef31bc1	io_uring/net: correctly handle multishot recvmsg retry setup If we loop for multishot receive on the initial attempt, and then abort later on to wait for more, we miss a case where we should be copying the io_async_msghdr from the stack to stable storage. This leads to the next retry potentially failing, if the application had the msghdr on the stack. Cc: stable@vger.kernel.org Fixes: `9bb66906f2` ("io_uring: support multishot in recvmsg") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-07 17:48:03 -07:00
Jens Axboe	b5311dbc2c	io_uring/net: clear REQ_F_BL_EMPTY in the multishot retry handler This flag should not be persistent across retries, so ensure we clear it before potentially attemting a retry. Fixes: `c3f9109dbc` ("io_uring/kbuf: flag request if buffer pool is empty after buffer pick") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-07 13:22:05 -07:00
Pavel Begunkov	1a8ec63b2b	io_uring: fix io_queue_proc modifying req->flags With multiple poll entries __io_queue_proc() might be running in parallel with poll handlers and possibly task_work, we should not be carelessly modifying req->flags there. io_poll_double_prepare() handles a similar case with locking but it's much easier to move it into __io_arm_poll_handler(). Cc: stable@vger.kernel.org Fixes: `595e52284d` ("io_uring/poll: don't enable lazy wake for POLLEXCLUSIVE") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/455cc49e38cf32026fa1b49670be8c162c2cb583.1709834755.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-07 11:10:28 -07:00
Pavel Begunkov	70581dcd06	io_uring: fix mshot read defer taskrun cqe posting We can't post CQEs from io-wq with DEFER_TASKRUN set, normal completions are handled but aux should be explicitly disallowed by opcode handlers. Cc: stable@vger.kernel.org Fixes: `fc68fcda04` ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/6fb7cba6f5366da25f4d3eb95273f062309d97fa.1709740837.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-07 06:30:33 -07:00
Dan Carpenter	8ede3db506	io_uring/net: fix overflow check in io_recvmsg_mshot_prep() The "controllen" variable is type size_t (unsigned long). Casting it to int could lead to an integer underflow. The check_add_overflow() function considers the type of the destination which is type int. If we add two positive values and the result cannot fit in an integer then that's counted as an overflow. However, if we cast "controllen" to an int and it turns negative, then negative values can fit into an int type so there is no overflow. Good: 100 + (unsigned long)-4 = 96 <-- overflow Bad: 100 + (int)-4 = 96 <-- no overflow I deleted the cast of the sizeof() as well. That's not a bug but the cast is unnecessary. Fixes: `9b0fc3c054` ("io_uring: fix types in io_recvmsg_multishot_overflow") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/r/138bd2e2-ede8-4bcc-aa7b-f3d9de167a37@moroto.mountain Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-04 16:33:15 -07:00
Muhammad Usama Anjum	86bcacc957	io_uring/net: correct the type of variable The namelen is of type int. It shouldn't be made size_t which is unsigned. The signed number is needed for error checking before use. Fixes: `c55978024d` ("io_uring/net: move receive multishot out of the generic msghdr path") Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Link: https://lore.kernel.org/r/20240301144349.2807544-1-usama.anjum@collabora.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-04 16:33:11 -07:00

1 2 3 4 5 ...

1032 Commits