linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-22 04:02:20 +00:00

Author	SHA1	Message	Date
Jens Axboe	b6f58a3f4a	io_uring: move struct io_kiocb from task_struct to io_uring_task Rather than store the task_struct itself in struct io_kiocb, store the io_uring specific task_struct. The lifetimes are the same in terms of io_uring, and this avoids doing some dereferences through the task_struct. For the hot path of putting local task references, we can deref req->tctx instead, which we'll need anyway in that function regardless of whether it's local or remote references. This is mostly straightforward, except the original task PF_EXITING check needs a bit of tweaking. task_work is _always_ run from the originating task, except in the fallback case, where it's run from a kernel thread. Replace the potentially racy (in case of fallback work) checks for req->task->flags with current->flags. It's either still the original task, in which case PF_EXITING will be sane, or it has PF_KTHREAD set, in which case it's fallback work. Both cases should prevent moving forward with the given request. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-06 13:55:38 -07:00
Jens Axboe	b54a14041e	io_uring/rsrc: add io_rsrc_node_lookup() helper There are lots of spots open-coding this functionality, add a generic helper that does the node lookup in a speculation safe way. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-02 15:45:30 -06:00
Jens Axboe	3597f2786b	io_uring/rsrc: unify file and buffer resource tables For files, there's nr_user_files/file_table/file_data, and buffers have nr_user_bufs/user_bufs/buf_data. There's no reason why file_table and file_data can't be the same thing, and ditto for the buffer side. That gets rid of more io_ring_ctx state that's in two spots rather than just being in one spot, as it should be. Put all the registered file data in one location, and ditto on the buffer front. This also avoids having both io_rsrc_data->nodes being an allocated array, and ->user_bufs[] or ->file_table.nodes. There's no reason to have this information duplicated. Keep it in one spot, io_rsrc_data, along with how many resources are available. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-02 15:45:23 -06:00
Jens Axboe	8abf47a8d6	io_uring/cancel: get rid of init_hash_table() helper All it does is initialize the lists, just move the INIT_HLIST_HEAD() into the one caller. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:27 -06:00
Jens Axboe	ba4366f57b	io_uring/poll: get rid of per-hashtable bucket locks Any access to the table is protected by ctx->uring_lock now anyway, the per-bucket locking doesn't buy us anything. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:27 -06:00
Jens Axboe	1da2f311ba	io_uring: fix warnings on shadow variables There are a few of those: io_uring/fdinfo.c:170:16: warning: declaration shadows a local variable [-Wshadow] 170 | struct file *f = io_file_from_index(&ctx->file_table, i); | ^ io_uring/fdinfo.c:53:67: note: previous declaration is here 53 | __cold void io_uring_show_fdinfo(struct seq_file *m, struct file *f) | ^ io_uring/cancel.c:187:25: warning: declaration shadows a local variable [-Wshadow] 187 | struct io_uring_task *tctx = node->task->io_uring; | ^ io_uring/cancel.c:166:31: note: previous declaration is here 166 | struct io_uring_task *tctx, | ^ io_uring/register.c:371:25: warning: declaration shadows a local variable [-Wshadow] 371 | struct io_uring_task *tctx = node->task->io_uring; | ^ io_uring/register.c:312:24: note: previous declaration is here 312 | struct io_uring_task *tctx = NULL; | ^ and a simple cleanup gets rid of them. For the fdinfo case, make a distinction between the file being passed in (for the ring), and the registered files we iterate. For the other two cases, just get rid of the shadowed variable, there's no reason to have a new one. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:10:26 -06:00
Jens Axboe	521223d7c2	io_uring/cancel: don't default to setting req->work.cancel_seq Just leave it unset by default, avoiding dipping into the last cacheline (which is otherwise untouched) for the fast path of using poll to drive networked traffic. Add a flag that tells us if the sequence is valid or not, and then we can defer actually assigning the flag and sequence until someone runs cancelations. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-08 13:27:06 -07:00
Jens Axboe	73363c262d	io_uring: use fget/fput consistently Normally within a syscall it's fine to use fdget/fdput for grabbing a file from the file table, and it's fine within io_uring as well. We do that via io_uring_enter(2), io_uring_register(2), and then also for cancel which is invoked from the latter. io_uring cannot close its own file descriptors as that is explicitly rejected, and for the cancel side of things, the file itself is just used as a lookup cookie. However, it is more prudent to ensure that full references are always grabbed. For anything threaded, either explicitly in the application itself or through use of the io-wq worker threads, this is what happens anyway. Generalize it and use fget/fput throughout. Also see the below link for more details. Link: https://lore.kernel.org/io-uring/CAG48ez1htVSO3TqmrF8QcX2WFuYTRM-VZ_N10i-VZgbtg=NNqw@mail.gmail.com/ Suggested-by: Jann Horn <jannh@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-11-28 11:56:29 -07:00
Jens Axboe	194bb58c60	io_uring: add support for futex wake and wait Add support for FUTEX_WAKE/WAIT primitives. IORING_OP_FUTEX_WAKE is a mix of FUTEX_WAKE and FUTEX_WAKE_BITSET, as it does support passing in a bitset. Similarly, IORING_OP_FUTEX_WAIT is a mix of FUTEX_WAIT and FUTEX_WAIT_BITSET. Both of them use the futex2 interface. FUTEX_WAKE is straightforward, as those can always be done directly from the io_uring submission without needing async handling. For FUTEX_WAIT, things are a bit more complicated. If the futex isn't ready, then we rely on a callback via futex_queue->wake() when someone wakes up the futex. From that callback, we queue up task_work with the original task, which will post a CQE and wake it, if necessary. Cancelations are supported, both from the application point-of-view, but also to be able to cancel pending waits if the ring exits before all events have occurred. The return value of futex_unqueue() is used to gate who wins the potential race between cancelation and futex wakeups. Whoever gets a 'ret == 1' return from that claims ownership of the io_uring futex request. This is just the barebones wait/wake support. PI or REQUEUE support is not added at this point, unclear if we might look into that later. Likewise, explicit timeouts are not supported either. It is expected that users that need timeouts would do so via the usual io_uring mechanism to do that using linked timeouts. The SQE format is as follows: `addr` Address of futex `fd` futex2(2) FUTEX2_* flags `futex_flags` io_uring specific command flags. None valid now. `addr2` Value of futex `addr3` Mask to wake/wait Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-09-29 02:36:57 -06:00
Jens Axboe	f31ecf671d	io_uring: add IORING_OP_WAITID support This adds support for a fully async version of waitid(2). If an event isn't immediately available, wait for a callback to trigger a retry. The format of the sqe is as follows: sqe->len The 'which', the idtype being queried/waited for. sqe->fd The 'pid' (or id) being waited for. sqe->file_index The 'options' being set. sqe->addr2 A pointer to siginfo_t, if any, being filled in. buf_index, addr3, and waitid_flags are reserved/unused for now. waitid_flags will be used for options for this request type. One interesting use case may be to add multi-shot support, so that the request stays armed and posts a notification every time a monitored process state change occurs. Note that this does not support rusage, on Arnd's recommendation. See the waitid(2) man page for details on the arguments. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-09-21 12:04:45 -06:00
Jens Axboe	f77569d22a	io_uring/cancel: wire up IORING_ASYNC_CANCEL_OP for sync cancel Allow usage of IORING_ASYNC_CANCEL_OP through the sync cancelation API as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-07-17 10:05:48 -06:00
Jens Axboe	d7b8b079a8	io_uring/cancel: support opcode based lookup and cancelation Add IORING_ASYNC_CANCEL_OP flag for cancelation, which allows the application to target cancelation based on the opcode of the original request. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-07-17 10:05:48 -06:00
Jens Axboe	8165b56604	io_uring/cancel: add IORING_ASYNC_CANCEL_USERDATA Add a flag to explicitly match on user_data in the request for cancelation purposes. This is the default behavior if none of the other match flags are set, but if we ALSO want to match on user_data, then this flag can be set. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-07-17 10:05:48 -06:00
Jens Axboe	3a372b6692	io_uring/cancel: fix sequence matching for IORING_ASYNC_CANCEL_ANY We always need to check/update the cancel sequence if IORING_ASYNC_CANCEL_ALL is set. Also kill the redundant check for IORING_ASYNC_CANCEL_ANY at the end, if we get here we know it's not set as we would've matched it higher up. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-07-17 10:05:48 -06:00
Jens Axboe	aa5cd116f3	io_uring/cancel: abstract out request match helper We have different match code in a variety of spots. Start the cleanup of this by abstracting out a helper that can be used to check if a given request matches the cancelation criteria outlined in io_cancel_data. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-07-17 10:05:48 -06:00
Christoph Hellwig	60a666f097	io_uring: use io_file_from_index in __io_sync_cancel Use io_file_from_index instead of open coding it. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230620113235.920399-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-20 09:36:22 -06:00
Jens Axboe	23fffb2f09	io_uring/cancel: re-grab ctx mutex after finishing wait If we have a signal pending during cancelations, it'll cause the task_work run to return an error. Since we didn't run task_work, the current task is left in TASK_INTERRUPTIBLE state when we need to re-grab the ctx mutex, and the kernel will rightfully complain about that. Move the lock grabbing for the error cases outside the loop to avoid that issue. Reported-by: syzbot+7df055631cd1be4586fd@syzkaller.appspotmail.com Link: https://lore.kernel.org/io-uring/0000000000003a14a905f05050b0@google.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-12-21 13:31:40 -07:00
Dylan Yudaken	c0e0d6ba25	io_uring: add IORING_SETUP_DEFER_TASKRUN Allow deferring async tasks until the user calls io_uring_enter(2) with the IORING_ENTER_GETEVENTS flag. Enable this mode with a flag at io_uring_setup time. This functionality requires that the later io_uring_enter will be called from the same submission task, and therefore restrict this flag to work only when IORING_SETUP_SINGLE_ISSUER is also set. Being able to hand pick when tasks are run prevents the problem where there is current work to be done, however task work runs anyway. For example, a common workload would obtain a batch of CQEs, and process each one. Interrupting this to run additional task work would add latency but not gain anything. If instead task work is deferred to just before more CQEs are obtained then no additional latency is added. The way this is implemented is by trying to keep task work local to an io_ring_ctx, rather than to the submission task. This is required, as the application will want to wake up only a single io_ring_ctx at a time to process work, and so the lists of work have to be kept separate. This has some other benefits like not having to check the task continually in handle_tw_list (and potentially unlocking/locking those), and reducing locks in the submit & process completions path. There are networking cases where using this option can reduce request latency by 50%. For example a contrived example using [1] where the client sends 2k data and receives the same data back while doing some system calls (to trigger task work) shows this reduction. The reason ends up being that if sending responses is delayed by processing task work, then the client side sits idle. Whereas reordering the sends first means that the client runs its workload in parallel with the local task work.
[1]: Using https://github.com/DylanZA/netbench/tree/defer_run Client: ./netbench --client_only 1 --control_port 10000 --host <host> --tx "epoll --threads 16 --per_thread 1 --size 2048 --resp 2048 --workload 1000" Server: ./netbench --server_only 1 --control_port 10000 --rx "io_uring --defer_taskrun 0 --workload 100" --rx "io_uring --defer_taskrun 1 --workload 100" Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220830125013.570060-5-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-21 10:30:42 -06:00
Jens Axboe	47abea041f	io_uring: fix off-by-one in sync cancelation file check The passed in index should be validated against the number of registered files we have; it needs to be smaller than that number to avoid going one beyond the end. Fixes: `78a861b949` ("io_uring: add sync cancelation API through io_uring_register()") Reported-by: Luo Likang <luolikang@nsfocus.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-23 07:26:08 -06:00
Stefan Metzmacher	f2ccb5aed7	io_uring: make io_kiocb_to_cmd() typesafe We need to make sure (at build time) that struct io_cmd_data is not cast to a structure that's larger. Signed-off-by: Stefan Metzmacher <metze@samba.org> Link: https://lore.kernel.org/r/c024cdf25ae19fc0319d4180e2298bade8ed17b8.1660201408.git.metze@samba.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-12 17:01:00 -06:00
Jens Axboe	78a861b949	io_uring: add sync cancelation API through io_uring_register() The io_uring cancelation API is async, like any other API that we expose there. For the case of finding a request to cancel, or not finding one, it is fully sync in that when submission returns, the CQEs for both the cancelation request and the targeted request have been posted to the CQ ring. However, if the targeted work is being executed by io-wq, the API can only start the act of canceling it. This makes it difficult to use in some circumstances, as the caller then has to wait for the CQEs to come in and match on the same cancelation data there. Provide an IORING_REGISTER_SYNC_CANCEL command for io_uring_register() that does sync cancelations, always. For the io-wq case, it'll wait for the cancelation to come in before returning. The only expected returns from this API are: 0 Request found and canceled fine. > 0 Requests found and canceled. Only happens if asked to cancel multiple requests, and if the work wasn't in progress. -ENOENT Request not found. -ETIME A timeout on the operation was requested, but the timeout expired before we could cancel. and we won't get -EALREADY via this API. If the timeout value passed in is -1 (tv_sec and tv_nsec), then that means that no timeout is requested. Otherwise, the timespec passed in is the amount of time the sync cancel will wait for a successful cancelation. Link: https://github.com/axboe/liburing/discussions/608 Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:15 -06:00
Jens Axboe	7d8ca72501	io_uring: add IORING_ASYNC_CANCEL_FD_FIXED cancel flag In preparation for not having a request to pass in that carries this state, add a separate cancelation flag that allows the caller to ask for a fixed file for cancelation. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:15 -06:00
Jens Axboe	88f52eaad2	io_uring: have cancelation API accept io_uring_task directly We just use the io_kiocb passed in to find the io_uring_task, and we already pass in the ctx via cd->ctx anyway. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:15 -06:00
Pavel Begunkov	27a9d66fec	io_uring: kill extra io_uring_types.h includes io_uring/io_uring.h already includes io_uring_types.h, no need to include it every time. Kill it in a bunch of places, it prepares us for following patches. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/94d8c943fbe0ef949981c508ddcee7fc1c18850f.1655384063.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:14 -06:00
Pavel Begunkov	5d7943d99d	io_uring: propagate locking state to poll cancel Poll cancellation will soon need to grab ->uring_lock inside, pass the locking state, i.e. issue_flags, inside the cancellation functions. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b86781d047727c07163443b57551a3fa57c7c5e1.1655371007.git.asml.silence@gmail.com Reviewed-by: Hao Xu <howeyxu@tencent.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:13 -06:00
Pavel Begunkov	e6f89be614	io_uring: introduce a struct for hash table Instead of passing around a pointer to hash buckets, add a bit of type safety and wrap it into a structure. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d65bc3faba537ec2aca9eabf334394936d44bd28.1655371007.git.asml.silence@gmail.com Reviewed-by: Hao Xu <howeyxu@tencent.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:13 -06:00
Pavel Begunkov	4dfab8abb4	io_uring: clean up io_try_cancel Get rid of an unnecessary extra goto in io_try_cancel() and simplify the function. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/48cf5417b43a8386c6c364dba1ad9b4c7382d158.1655371007.git.asml.silence@gmail.com Reviewed-by: Hao Xu <howeyxu@tencent.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:13 -06:00
Hao Xu	38513c464d	io_uring: switch cancel_hash to use per entry spinlock Add a new io_hash_bucket structure so that each bucket in cancel_hash has a separate spinlock. Use a per entry lock for cancel_hash; this removes some completion lock invocations and removes contention between different cancel_hash entries. Signed-off-by: Hao Xu <howeyxu@tencent.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/05d1e135b0c8bce9d1441e6346776589e5783e26.1655371007.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:13 -06:00
Jens Axboe	7aaff708a7	io_uring: move cancelation into its own file This also helps cleanup the io_uring.h cancel parts, as we can make things static in the cancel.c file, mostly. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:12 -06:00

29 Commits