linux

Author	SHA1	Message	Date
Pavel Begunkov	62906e89e6	io_uring: remove file batch-get optimisation For requests with non-fixed files, instead of grabbing just one reference, we get by the number of left requests, so the following requests using the same file can take it without atomics. However, it's not all win. If there is one request in the middle not using files or having a fixed file, we'll need to put back the left references. Even worse if an application submits requests dealing with different files, it will do a put for each new request, so doubling the number of atomics needed. Also, even if not used, it's still takes some cycles in the submission path. If a file used many times, it rather makes sense to pre-register it, if not, we may fall in the described pitfall. So, this optimisation is a matter of use case. Go with the simpliest code-wise way, remove it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:32 -06:00
Pavel Begunkov	6294f3686b	io_uring: clean up tctx_task_work() After recent fixes, tctx_task_work() always does proper spinlocking before looking into ->task_list, so now we don't need atomics for ->task_state, replace it with non-atomic task_running using the critical section. Tide it up, combine two separate block with spinlocking, and always try to splice in there, so we do less locking when new requests are arriving during the function execution. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> [axboe: fix missing ->task_running reset on task_work_add() failure] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:32 -06:00
Pavel Begunkov	5d70904367	io_uring: inline io_poll_remove_waitqs Inline io_poll_remove_waitqs() into its only user and clean it up. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2f1a91a19ffcd591531dc4c61e2f11c64a2d6a6d.1628536684.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:26 -06:00
Pavel Begunkov	90f67366cb	io_uring: remove extra argument for overflow flush Unlike __io_cqring_overflow_flush(), nobody does forced flushing with io_cqring_overflow_flush(), so removed the argument from it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7594f869ca41b7cfb5a35a3c7c2d402242834e9e.1628536684.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:19 -06:00
Pavel Begunkov	cd0ca2e048	io_uring: inline struct io_comp_state Inline struct io_comp_state into struct io_submit_state. They are already coupled tightly, together with mixed responsibilities it only brings confusion having them separately. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e55bba77426b399e3a2e54e3c6c267c6a0fc4b57.1628536684.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:09:48 -06:00
Pavel Begunkov	bb943b8265	io_uring: use inflight_entry instead of compl.list req->compl.list is used to cache freed requests, and so can't overlap in time with req->inflight_entry. So, use inflight_entry to link requests and remove compl.list. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e430e79d22d70a190d718831bda7bfed1daf8976.1628536684.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:09:43 -06:00
Pavel Begunkov	7255834ed6	io_uring: remove redundant args from cache_free We don't use @tsk argument of io_req_cache_free(), remove it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/6a28b4a58ee0aaf0db98e2179b9c9f06f9b0cca1.1628536684.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:09:43 -06:00
Pavel Begunkov	c34b025f2d	io_uring: cache __io_free_req()'d requests Don't kfree requests in __io_free_req() but put them back into the internal request cache. That makes allocations more sustainable and will be used for refcounting optimisations. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9f4950fbe7771c8d41799366d0a3a08ac3040236.1628536684.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:09:43 -06:00
Pavel Begunkov	f56165e62f	io_uring: move io_fallback_req_func() Move io_fallback_req_func() to kill yet another forward declaration. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d0a8f9d9a0057ed761d6237167d51c9378798d2d.1628536684.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:09:22 -06:00
Pavel Begunkov	e9dbe221f5	io_uring: optimise putting task struct We cache all the reference to task + tctx, so if io_put_task() is called by the corresponding task itself, we can save on atomics and return the refs right back into the cache. It's beneficial for all inline completions, and also iopolling, when polling and submissions are done by the same task, including SQPOLL\|IOPOLL. Note: io_uring_cancel_generic() can return refs to the cache as well, so those should be flushed in the loop for tctx_inflight() to work right. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/6fe9646b3cb70e46aca1f58426776e368c8926b3.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:08:06 -06:00
Pavel Begunkov	af066f31eb	io_uring: drop exec checks from io_req_task_submit In case of on-exec io_uring cancellations, tasks already wait for all submitted requests to get completed/cancelled, so we don't need to check for ->in_execve separately. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/be8707049f10df9d20ca03dc4ca3316239b5e8e0.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:08:06 -06:00
Pavel Begunkov	bbbca09489	io_uring: kill unused IO_IOPOLL_BATCH IO_IOPOLL_BATCH is not used, delete it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b2bdf19dbee2c9fc8865bbab9412135a14e24a64.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:08:06 -06:00
Pavel Begunkov	58d3be2c60	io_uring: improve ctx hang handling If io_ring_exit_work() can't get it done in 5 minutes, something is going very wrong, don't keep spinning at HZ / 20 rate, it doesn't help and it may take much of CPU time if there is a lot of workers stuck as such. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9e2d1ca81d569f6bc628af1a42ff6663bff7ce9c.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:59 -06:00
Pavel Begunkov	d3fddf6ddd	io_uring: deduplicate open iopoll check Move IORING_SETUP_IOPOLL check into __io_openat_prep(), so both openat and openat2 reuse it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9a73ce83e4ee60d011180ef177eecef8e87ff2a2.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:59 -06:00
Pavel Begunkov	543af3a13d	io_uring: inline io_free_req_deferred Inline io_free_req_deferred(), there is no reason to keep it separated. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/ce04b7180d4eac0d69dd00677b227eefe80c2cc5.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:59 -06:00
Pavel Begunkov	b9bd2bea0f	io_uring: move io_rsrc_node_alloc() definition Move the function together with io_rsrc_node_ref_zero() in the source file as it is to get rid of forward declarations. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4d81f6f833e7d017860b24463a9a68b14a8a5ed2.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:59 -06:00
Pavel Begunkov	6a290a1442	io_uring: move io_put_task() definition Move the function in the source file as it is to get rid of forward declarations. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/33d917d69e4206557c75a5b98fe22bcdf77ce47d.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:59 -06:00
Pavel Begunkov	e73c5c7cd3	io_uring: extract a helper for ctx quiesce Refactor __io_uring_register() by extracting a helper responsible for ctx queisce. Looks better and will make it easier to add more optimisations. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0339e0027504176be09237eefa7945bf9a6f153d.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:59 -06:00
Pavel Begunkov	90291099f2	io_uring: optimise io_cqring_wait() hot path Turns out we always init struct io_wait_queue in io_cqring_wait(), even if it's not used after, i.e. there are already enough of CQEs. And often it's exactly what happens, for instance, requests may have been completed inline, or in case of io_uring_enter(submit=N, wait=1). It shows up in my profiler, so optimise it by delaying the struct init. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/6f1b81c60b947d165583dc333947869c3d85d037.1628471125.git.asml.silence@gmail.com [axboe: fixed up for new cqring wait] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:59 -06:00
Pavel Begunkov	282cdc8693	io_uring: add more locking annotations for submit Add more annotations for submission path functions holding ->uring_lock. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/128ec4185e26fbd661dd3a424aa66108ee8ff951.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:59 -06:00
Pavel Begunkov	a2416e1ec2	io_uring: don't halt iopoll too early IOPOLL users should care more about getting completions for requests they submitted, but not in "device did/completed something". Currently, io_do_iopoll() may return a positive number, which will instruct io_iopoll_check() to break the loop and end the syscall, even if there is not enough CQEs or none at all. Don't return positive numbers, so io_iopoll_check() exits only when it gets an actual error, need reschedule or got enough CQEs. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/641a88f751623b6758303b3171f0a4141f06726e.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:59 -06:00
Pavel Begunkov	864ea921b0	io_uring: refactor io_alloc_req Replace the main if of io_flush_cached_reqs() with inverted condition + goto, so all the cases are handled in the same way. And also extract io_preinit_req() to make it cleaner and easier to refer to. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1abcba1f7b55dc53bf1dbe95036e345ffb1d5b01.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:59 -06:00
Pavel Begunkov	2215bed924	io_uring: remove unnecessary PF_EXITING check We prefer nornal task_works even if it would fail requests inside. Kill a PF_EXITING check in io_req_task_work_add(), task_work_add() handles well dying tasks, i.e. return error when can't enqueue due to late stages of do_exit(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/fc14297e8441cd8f5d1743a2488cf0df09bf48ac.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:56 -06:00
Pavel Begunkov	ebc11b6c6b	io_uring: clean io-wq callbacks Move io-wq callbacks closer to each other, so it's easier to work with them, and rename io_free_work() into io_wq_free_work() for consistency. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/851bbc7f0f86f206d8c1333efee8bcb9c26e419f.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:56 -06:00
Pavel Begunkov	c97d8a0f68	io_uring: avoid touching inode in rw prep If we use fixed files, we can be sure (almost) that REQ_F_ISREG is set. However, for non-reg files io_prep_rw() still will look into inode to double check, and that's expensive and can be avoided. The only caveat is that it only currently works with 64+ bit architectures, see FFS_ISREG, so we should consider that. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0a62780c491ca2522cd52db4ae3f16e03aafed0f.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:56 -06:00
Pavel Begunkov	b191e2dfe5	io_uring: rename io_file_supports_async() io_file_supports_async() checks whether a file supports nowait operations, so "async" in the name is misleading. Rename it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/33d55b5ce43aa1884c637c1957f1e30d30dc3bec.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:56 -06:00
Pavel Begunkov	ac177053bb	io_uring: inline fixed part of io_file_get() Optimise io_file_get() with registered files, which is in a hot path, by inlining parts of the function. Saves a function call, and inefficiencies of passing arguments, e.g. evaluating (sqe_flags & IOSQE_FIXED_FILE). It couldn't have been done before as compilers were refusing to inline it because of the function size. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/52115cd6ce28f33bd0923149c0e6cb611084a0b1.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:56 -06:00
Pavel Begunkov	042b0d85ea	io_uring: use kvmalloc for fixed files Instead of hand-coded two-level tables for registered files, allocate them with kvmalloc(). In many cases small enough tables are enough, and so can be kmalloc()'ed removing an extra memory load and a bunch of bit logic instructions from the hot path. If the table is larger, we trade off all the pros with a TLB-assisted memory lookup. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/280421d3b48775dabab773006bb5588c7b2dabc0.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:56 -06:00
Jens Axboe	5fd4617840	io_uring: be smarter about waking multiple CQ ring waiters Currently we only wake the first waiter, even if we have enough entries posted to satisfy multiple waiters. Improve that situation so that every waiter knows how much the CQ tail has to advance before they can be safely woken up. With this change, if we have N waiters each asking for 1 event and we get 4 completions, then we wake up 4 waiters. If we have N waiters asking for 2 completions and we get 4 completions, then we wake up the first two. Previously, only the first waiter would've been woken up. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:56 -06:00
Jens Axboe	a30f895ad3	io_uring: fix xa_alloc_cycle() error return value check We currently check for ret != 0 to indicate error, but '1' is a valid return and just indicates that the allocation succeeded with a wrap. Correct the check to be for < 0, like it was before the xarray conversion. Cc: stable@vger.kernel.org Fixes: `61cf93700f` ("io_uring: Convert personality_idr to XArray") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-20 14:59:58 -06:00
Pavel Begunkov	9cb0073b30	io_uring: pin ctx on fallback execution Pin ring in io_fallback_req_func() by briefly elevating ctx->refs in case any task_work handler touches ctx after releasing a request. Fixes: `9011bf9a13` ("io_uring: fix stuck fallback reqs") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/833a494713d235ec144284a9bbfe418df4f6b61c.1629235576.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-17 16:06:14 -06:00
Jens Axboe	21f965221e	io_uring: only assign io_uring_enter() SQPOLL error in actual error case If an SQPOLL based ring is newly created and an application issues an io_uring_enter(2) system call on it, then we can return a spurious -EOWNERDEAD error. This happens because there's nothing to submit, and if the caller doesn't specify any other action, the initial error assignment of -EOWNERDEAD never gets overwritten. This causes us to return it directly, even if it isn't valid. Move the error assignment into the actual failure case instead. Cc: stable@vger.kernel.org Fixes: `d9d05217cb` ("io_uring: stop SQPOLL submit on creator's death") Reported-by: Sherlock Holo sherlockya@gmail.com Link: https://github.com/axboe/liburing/issues/413 Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-14 12:38:21 -06:00
Pavel Begunkov	43597aac1f	io_uring: fix ctx-exit io_rsrc_put_work() deadlock __io_rsrc_put_work() might need ->uring_lock, so nobody should wait for rsrc nodes holding the mutex. However, that's exactly what io_ring_ctx_free() does with io_wait_rsrc_data(). Split it into rsrc wait + dealloc, and move the first one out of the lock. Cc: stable@vger.kernel.org Fixes: `b60c8dce33` ("io_uring: preparation for rsrc tagging") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0130c5c2693468173ec1afab714e0885d2c9c363.1628559783.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-09 19:59:28 -06:00
Jens Axboe	c018db4a57	io_uring: drop ctx->uring_lock before flushing work item Ammar reports that he's seeing a lockdep splat on running test/rsrc_tags from the regression suite: ====================================================== WARNING: possible circular locking dependency detected 5.14.0-rc3-bluetea-test-00249-gc7d102232649 #5 Tainted: G OE ------------------------------------------------------ kworker/2:4/2684 is trying to acquire lock: ffff88814bb1c0a8 (&ctx->uring_lock){+.+.}-{3:3}, at: io_rsrc_put_work+0x13d/0x1a0 but task is already holding lock: ffffc90001c6be70 ((work_completion)(&(&ctx->rsrc_put_work)->work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x530 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 ((work_completion)(&(&ctx->rsrc_put_work)->work)){+.+.}-{0:0}: __flush_work+0x31b/0x490 io_rsrc_ref_quiesce.part.0.constprop.0+0x35/0xb0 __do_sys_io_uring_register+0x45b/0x1060 do_syscall_64+0x35/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xae -> #0 (&ctx->uring_lock){+.+.}-{3:3}: __lock_acquire+0x119a/0x1e10 lock_acquire+0xc8/0x2f0 __mutex_lock+0x86/0x740 io_rsrc_put_work+0x13d/0x1a0 process_one_work+0x236/0x530 worker_thread+0x52/0x3b0 kthread+0x135/0x160 ret_from_fork+0x1f/0x30 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock((work_completion)(&(&ctx->rsrc_put_work)->work)); lock(&ctx->uring_lock); lock((work_completion)(&(&ctx->rsrc_put_work)->work)); lock(&ctx->uring_lock); * DEADLOCK * 2 locks held by kworker/2:4/2684: #0: ffff88810004d938 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x1bc/0x530 #1: ffffc90001c6be70 ((work_completion)(&(&ctx->rsrc_put_work)->work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x530 stack backtrace: CPU: 2 PID: 2684 Comm: kworker/2:4 Tainted: G OE 5.14.0-rc3-bluetea-test-00249-gc7d102232649 #5 Hardware name: Acer Aspire ES1-421/OLVIA_BE, BIOS V1.05 07/02/2015 Workqueue: events io_rsrc_put_work Call Trace: dump_stack_lvl+0x6a/0x9a check_noncircular+0xfe/0x110 __lock_acquire+0x119a/0x1e10 lock_acquire+0xc8/0x2f0 ? io_rsrc_put_work+0x13d/0x1a0 __mutex_lock+0x86/0x740 ? io_rsrc_put_work+0x13d/0x1a0 ? io_rsrc_put_work+0x13d/0x1a0 ? io_rsrc_put_work+0x13d/0x1a0 ? process_one_work+0x1ce/0x530 io_rsrc_put_work+0x13d/0x1a0 process_one_work+0x236/0x530 worker_thread+0x52/0x3b0 ? process_one_work+0x530/0x530 kthread+0x135/0x160 ? set_kthread_struct+0x40/0x40 ret_from_fork+0x1f/0x30 which is due to holding the ctx->uring_lock when flushing existing pending work, while the pending work flushing may need to grab the uring lock if we're using IOPOLL. Fix this by dropping the uring_lock a bit earlier as part of the flush. Cc: stable@vger.kernel.org Link: https://github.com/axboe/liburing/issues/404 Tested-by: Ammar Faizi <ammarfaizi2@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-09 19:59:06 -06:00
Jens Axboe	4956b9eaad	io_uring: rsrc ref lock needs to be IRQ safe Nadav reports running into the below splat on re-enabling softirqs: WARNING: CPU: 2 PID: 1777 at kernel/softirq.c:364 __local_bh_enable_ip+0xaa/0xe0 Modules linked in: CPU: 2 PID: 1777 Comm: umem Not tainted 5.13.1+ #161 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/22/2020 RIP: 0010:__local_bh_enable_ip+0xaa/0xe0 Code: a9 00 ff ff 00 74 38 65 ff 0d a2 21 8c 7a e8 ed 1a 20 00 fb 66 0f 1f 44 00 00 5b 41 5c 5d c3 65 8b 05 e6 2d 8c 7a 85 c0 75 9a <0f> 0b eb 96 e8 2d 1f 20 00 eb a5 4c 89 e7 e8 73 4f 0c 00 eb ae 65 RSP: 0018:ffff88812e58fcc8 EFLAGS: 00010046 RAX: 0000000000000000 RBX: 0000000000000201 RCX: dffffc0000000000 RDX: 0000000000000007 RSI: 0000000000000201 RDI: ffffffff8898c5ac RBP: ffff88812e58fcd8 R08: ffffffff8575dbbf R09: ffffed1028ef14f9 R10: ffff88814778a7c3 R11: ffffed1028ef14f8 R12: ffffffff85c9e9ae R13: ffff88814778a000 R14: ffff88814778a7b0 R15: ffff8881086db890 FS: 00007fbcfee17700(0000) GS:ffff8881e0300000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000c0402a5008 CR3: 000000011c1ac003 CR4: 00000000003706e0 Call Trace: _raw_spin_unlock_bh+0x31/0x40 io_rsrc_node_ref_zero+0x13e/0x190 io_dismantle_req+0x215/0x220 io_req_complete_post+0x1b8/0x720 __io_complete_rw.isra.0+0x16b/0x1f0 io_complete_rw+0x10/0x20 where it's clear we end up calling the percpu count release directly from the completion path, as it's in atomic mode and we drop the last ref. For file/block IO, this can be from IRQ context already, and the softirq locking for rsrc isn't enough. Just make the lock fully IRQ safe, and ensure we correctly safe state from the release path as we don't know the full context there. Reported-by: Nadav Amit <nadav.amit@gmail.com> Tested-by: Nadav Amit <nadav.amit@gmail.com> Link: https://lore.kernel.org/io-uring/C187C836-E78B-4A31-B24C-D16919ACA093@gmail.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-09 19:58:59 -06:00
Nadav Amit	20c0b380f9	io_uring: Use WRITE_ONCE() when writing to sq_flags The compiler should be forbidden from any strange optimization for async writes to user visible data-structures. Without proper protection, the compiler can cause write-tearing or invent writes that would confuse the userspace. However, there are writes to sq_flags which are not protected by WRITE_ONCE(). Use WRITE_ONCE() for these writes. This is purely a theoretical issue. Presumably, any compiler is very unlikely to do such optimizations. Fixes: `75b28affdd` ("io_uring: allocate the two rings together") Cc: Jens Axboe <axboe@kernel.dk> Cc: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Nadav Amit <namit@vmware.com> Link: https://lore.kernel.org/r/20210808001342.964634-3-namit@vmware.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-08 21:21:11 -06:00
Nadav Amit	ef98eb0409	io_uring: clear TIF_NOTIFY_SIGNAL when running task work When using SQPOLL, the submission queue polling thread calls task_work_run() to run queued work. However, when work is added with TWA_SIGNAL - as done by io_uring itself - the TIF_NOTIFY_SIGNAL remains set afterwards and is never cleared. Consequently, when the submission queue polling thread checks whether signal_pending(), it may always find a pending signal, if task_work_add() was ever called before. The impact of this bug might be different on different kernel versions. It appears that on 5.14 it would only cause unnecessary calculation and prevent the polling thread from sleeping. On 5.13, where the bug was found, it stops the polling thread from finding newly submitted work. Instead of task_work_run(), use tracehook_notify_signal() that clears TIF_NOTIFY_SIGNAL. Test for TIF_NOTIFY_SIGNAL in addition to current->task_works to avoid a race in which task_works is cleared but the TIF_NOTIFY_SIGNAL is set. Fixes: `685fe7feed` ("io-wq: eliminate the need for a manager thread") Cc: Jens Axboe <axboe@kernel.dk> Cc: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Nadav Amit <namit@vmware.com> Link: https://lore.kernel.org/r/20210808001342.964634-2-namit@vmware.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-08 21:21:11 -06:00
Hao Xu	a890d01e4e	io_uring: fix poll requests leaking second poll entries For pure poll requests, it doesn't remove the second poll wait entry when it's done, neither after vfs_poll() or in the poll completion handler. We should remove the second poll wait entry. And we use io_poll_remove_double() rather than io_poll_remove_waitqs() since the latter has some redundant logic. Fixes: `88e41cf928` ("io_uring: add multishot mode for IORING_OP_POLL_ADD") Cc: stable@vger.kernel.org # 5.13+ Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Link: https://lore.kernel.org/r/20210728030322.12307-1-haoxu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-07-28 07:24:57 -06:00
Jens Axboe	ef04688871	io_uring: don't block level reissue off completion path Some setups, like SCSI, can throw spurious -EAGAIN off the softirq completion path. Normally we expect this to happen inline as part of submission, but apparently SCSI has a weird corner case where it can happen as part of normal completions. This should be solved by having the -EAGAIN bubble back up the stack as part of submission, but previous attempts at this failed and we're not just quite there yet. Instead we currently use REQ_F_REISSUE to handle this case. For now, catch it in io_rw_should_reissue() and prevent a reissue from a bogus path. Cc: stable@vger.kernel.org Reported-by: Fabian Ebner <f.ebner@proxmox.com> Tested-by: Fabian Ebner <f.ebner@proxmox.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-07-28 07:24:38 -06:00
Jens Axboe	773af69121	io_uring: always reissue from task_work context As a safeguard, if we're going to queue async work, do it from task_work from the original task. This ensures that we can always sanely create threads, regards of what the reissue context may be. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-07-27 10:49:48 -06:00
Jens Axboe	110aa25c3c	io_uring: fix race in unified task_work running We use a bit to manage if we need to add the shared task_work, but a list + lock for the pending work. Before aborting a current run of the task_work we check if the list is empty, but we do so without grabbing the lock that protects it. This can lead to races where we think we have nothing left to run, where in practice we could be racing with a task adding new work to the list. If we do hit that race condition, we could be left with work items that need processing, but the shared task_work is not active. Ensure that we grab the lock before checking if the list is empty, so we know if it's safe to exit the run or not. Link: https://lore.kernel.org/io-uring/c6bd5987-e9ae-cd02-49d0-1b3ac1ef65b1@tnonline.net/ Cc: stable@vger.kernel.org # 5.11+ Reported-by: Forza <forza@tnonline.net> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-07-26 10:42:56 -06:00
Pavel Begunkov	44eff40a32	io_uring: fix io_prep_async_link locking io_prep_async_link() may be called after arming a linked timeout, automatically making it unsafe to traverse the linked list. Guard with completion_lock if there was a linked timeout. Cc: stable@vger.kernel.org # 5.9+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/93f7c617e2b4f012a2a175b3dab6bc2f27cebc48.1627304436.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-07-26 08:58:04 -06:00
Jens Axboe	991468dcf1	io_uring: explicitly catch any illegal async queue attempt Catch an illegal case to queue async from an unrelated task that got the ring fd passed to it. This should not be possible to hit, but better be proactive and catch it explicitly. io-wq is extended to check for early IO_WQ_WORK_CANCEL being set on a work item as well, so it can run the request through the normal cancelation path. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-07-23 16:44:51 -06:00
Jens Axboe	3c30ef0f78	io_uring: never attempt iopoll reissue from release path There are two reasons why this shouldn't be done: 1) Ring is exiting, and we're canceling requests anyway. Any request should be canceled anyway. In theory, this could iterate for a number of times if someone else is also driving the target block queue into request starvation, however the likelihood of this happening is miniscule. 2) If the original task decided to pass the ring to another task, then we don't want to be reissuing from this context as it may be an unrelated task or context. No assumptions should be made about the context in which ->release() is run. This can only happen for pure read/write, and we'll get -EFAULT on them anyway. Link: https://lore.kernel.org/io-uring/YPr4OaHv0iv0KTOc@zeniv-ca.linux.org.uk/ Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-07-23 16:32:48 -06:00
Jens Axboe	0cc936f74b	io_uring: fix early fdput() of file A previous commit shuffled some code around, and inadvertently used struct file after fdput() had been called on it. As we can't touch the file post fdput() dropping our reference, move the fdput() to after that has been done. Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/io-uring/YPnqM0fY3nM5RdRI@zeniv-ca.linux.org.uk/ Fixes: `f2a48dd09b` ("io_uring: refactor io_sq_offload_create()") Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-07-22 17:11:46 -06:00
Yang Yingliang	362a9e6528	io_uring: fix memleak in io_init_wq_offload() I got memory leak report when doing fuzz test: BUG: memory leak unreferenced object 0xffff888107310a80 (size 96): comm "syz-executor.6", pid 4610, jiffies 4295140240 (age 20.135s) hex dump (first 32 bytes): 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00 .....N.......... backtrace: [<000000001974933b>] kmalloc include/linux/slab.h:591 [inline] [<000000001974933b>] kzalloc include/linux/slab.h:721 [inline] [<000000001974933b>] io_init_wq_offload fs/io_uring.c:7920 [inline] [<000000001974933b>] io_uring_alloc_task_context+0x466/0x640 fs/io_uring.c:7955 [<0000000039d0800d>] __io_uring_add_tctx_node+0x256/0x360 fs/io_uring.c:9016 [<000000008482e78c>] io_uring_add_tctx_node fs/io_uring.c:9052 [inline] [<000000008482e78c>] __do_sys_io_uring_enter fs/io_uring.c:9354 [inline] [<000000008482e78c>] __se_sys_io_uring_enter fs/io_uring.c:9301 [inline] [<000000008482e78c>] __x64_sys_io_uring_enter+0xabc/0xc20 fs/io_uring.c:9301 [<00000000b875f18f>] do_syscall_x64 arch/x86/entry/common.c:50 [inline] [<00000000b875f18f>] do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80 [<000000006b0a8484>] entry_SYSCALL_64_after_hwframe+0x44/0xae CPU0 CPU1 io_uring_enter io_uring_enter io_uring_add_tctx_node io_uring_add_tctx_node __io_uring_add_tctx_node __io_uring_add_tctx_node io_uring_alloc_task_context io_uring_alloc_task_context io_init_wq_offload io_init_wq_offload hash = kzalloc hash = kzalloc ctx->hash_map = hash ctx->hash_map = hash <- one of the hash is leaked When calling io_uring_enter() in parallel, the 'hash_map' will be leaked, add uring_lock to protect 'hash_map'. Fixes: `e941894eae` ("io-wq: make buffered file write hashed work map per-ctx") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20210720083805.3030730-1-yangyingliang@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-07-20 07:51:47 -06:00
Pavel Begunkov	46fee9ab02	io_uring: remove double poll entry on arm failure __io_queue_proc() can enqueue both poll entries and still fail afterwards, so the callers trying to cancel it should also try to remove the second poll entry (if any). For example, it may leave the request alive referencing a io_uring context but not accessible for cancellation: [ 282.599913][ T1620] task:iou-sqp-23145 state:D stack:28720 pid:23155 ppid: 8844 flags:0x00004004 [ 282.609927][ T1620] Call Trace: [ 282.613711][ T1620] __schedule+0x93a/0x26f0 [ 282.634647][ T1620] schedule+0xd3/0x270 [ 282.638874][ T1620] io_uring_cancel_generic+0x54d/0x890 [ 282.660346][ T1620] io_sq_thread+0xaac/0x1250 [ 282.696394][ T1620] ret_from_fork+0x1f/0x30 Cc: stable@vger.kernel.org Fixes: `18bceab101` ("io_uring: allow POLL_ADD with double poll_wait() users") Reported-and-tested-by: syzbot+ac957324022b7132accf@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0ec1228fc5eda4cb524eeda857da8efdc43c331c.1626774457.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-07-20 07:50:42 -06:00
Pavel Begunkov	68b11e8b15	io_uring: explicitly count entries for poll reqs If __io_queue_proc() fails to add a second poll entry, e.g. kmalloc() failed, but it goes on with a third waitqueue, it may succeed and overwrite the error status. Count the number of poll entries we added, so we can set pt->error to zero at the beginning and find out when the mentioned scenario happens. Cc: stable@vger.kernel.org Fixes: `18bceab101` ("io_uring: allow POLL_ADD with double poll_wait() users") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9d6b9e561f88bcc0163623b74a76c39f712151c3.1626774457.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-07-20 07:50:42 -06:00
Pavel Begunkov	1b48773f9f	io_uring: fix io_drain_req() io_drain_req() return whether the request has been consumed or not, not an error code. Fix a stupid mistake slipped from optimisation patches. Reported-by: syzbot+ba6fcd859210f4e9e109@syzkaller.appspotmail.com Fixes: `76cc33d791` ("io_uring: refactor io_req_defer()") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4d3c53c4274ffff307c8ae062fc7fda63b978df2.1626039606.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-07-11 16:39:06 -06:00
Pavel Begunkov	9c6882608b	io_uring: use right task for exiting checks When we use delayed_work for fallback execution of requests, current will be not of the submitter task, and so checks in io_req_task_submit() may not behave as expected. Currently, it leaves inline completions not flushed, so making io_ring_exit_work() to hang. Use the submitter task for all those checks. Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/cb413c715bed0bc9c98b169059ea9c8a2c770715.1625881431.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-07-11 16:39:06 -06:00
Jens Axboe	9ce85ef2cb	io_uring: remove dead non-zero 'poll' check Colin reports that Coverity complains about checking for poll being non-zero after having dereferenced it multiple times. This is a valid complaint, and actually a leftover from back when this code was based on the aio poll code. Kill the redundant check. Link: https://lore.kernel.org/io-uring/fe70c532-e2a7-3722-58a1-0fa4e5c5ff2c@canonical.com/ Reported-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-07-09 08:20:28 -06:00
Pavel Begunkov	8f487ef2cb	io_uring: mitigate unlikely iopoll lag We have requests like IORING_OP_FILES_UPDATE that don't go through ->iopoll_list but get completed in place under ->uring_lock, and so after dropping the lock io_iopoll_check() should expect that some CQEs might have get completed in a meanwhile. Currently such events won't be accounted in @nr_events, and the loop will continue to poll even if there is enough of CQEs. It shouldn't be a problem as it's not likely to happen and so, but not nice either. Just return earlier in this case, it should be enough. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/66ef932cc66a34e3771bbae04b2953a8058e9d05.1625747741.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-07-08 14:07:43 -06:00
Pavel Begunkov	c32aace0cf	io_uring: fix drain alloc fail return code After a recent change io_drain_req() started to fail requests with result=0 in case of allocation failure, where it should be and have been -ENOMEM. Fixes: `76cc33d791` ("io_uring: refactor io_req_defer()") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e068110ac4293e0c56cfc4d280d0f22b9303ec08.1625682153.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-07-07 12:49:32 -06:00
Pavel Begunkov	e09ee51060	io_uring: fix exiting io_req_task_work_add leaks If one entered io_req_task_work_add() not seeing PF_EXITING, it will set a ->task_state bit and try task_work_add(), which may fail by that moment. If that happens the function would try to cancel the request. However, in a meanwhile there might come other io_req_task_work_add() callers, which will see the bit set and leave their requests in the list, which will never be executed. Don't propagate an error, but clear the bit first and then fallback all requests that we can splice from the list. The callback functions have to be able to deal with PF_EXITING, so poll and apoll was modified via changing io_poll_rewait(). Fixes: `7cbf1722d5` ("io_uring: provide FIFO ordering for task_work") Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/060002f19f1fdbd130ba24aef818ea4d3080819b.1625142209.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-07-01 13:40:32 -06:00
Pavel Begunkov	5b0a6acc73	io_uring: simplify task_work func Since we don't really use req->task_work anymore, get rid of it together with the nasty ->func aliasing between ->io_task_work and ->task_work, and hide ->fallback_node inside of io_task_work. Also, as task_work is gone now, replace the callback type from task_work_func_t to a function taking io_kiocb to avoid casting and simplify code. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-07-01 13:40:23 -06:00
Pavel Begunkov	9011bf9a13	io_uring: fix stuck fallback reqs When task_work_add() fails, we use ->exit_task_work to queue the work. That will be run only in the cancellation path, which happens either when the ctx is dying or one of tasks with inflight requests is exiting or executing. There is a good chance that such a request would just get stuck in the list potentially hodling a file, all io_uring rsrc recycling or some other resources. Nothing terrible, it'll go away at some point, but we don't want to lock them up for longer than needed. Replace that hand made ->exit_task_work with delayed_work + llist inspired by fput_many(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-07-01 13:40:17 -06:00
Hao Xu	e149bd742b	io_uring: code clean for kiocb_done() A simple code clean for kiocb_done() Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-30 14:15:40 -06:00
Hao Xu	915b3dde9b	io_uring: spin in iopoll() only when reqs are in a single queue We currently spin in iopoll() when requests to be iopolled are for same file(device), while one device may have multiple hardware queues. given an example: hw_queue_0 \| hw_queue_1 req(30us) req(10us) If we first spin on iopolling for the hw_queue_0. the avg latency would be (30us + 30us) / 2 = 30us. While if we do round robin, the avg latency would be (30us + 10us) / 2 = 20us since we reap the request in hw_queue_1 in time. So it's better to do spinning only when requests are in same hardware queue. Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-30 14:15:40 -06:00
Pavel Begunkov	99ebe4efbd	io_uring: pre-initialise some of req fields Most of requests are allocated from an internal cache, so it's waste of time fully initialising them every time. Instead, let's pre-init some of the fields we can during initial allocation (e.g. kmalloc(), see io_alloc_req()) and keep them valid on request recycling. There are four of them in this patch: ->ctx is always stays the same ->link is NULL on free, it's an invariant ->result is not even needed to init, just a precaution ->async_data we now clean in io_dismantle_req() as it's likely to never be allocated. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/892ba0e71309bba9fe9e0142472330bbf9d8f05d.1624739600.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-30 14:15:40 -06:00
Pavel Begunkov	5182ed2e33	io_uring: refactor io_submit_flush_completions Don't init req_batch before we actually need it. Also, add a small clean up for req declaration. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/ad85512e12bd3a20d521e9782750300970e5afc8.1624739600.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-30 14:15:40 -06:00
Pavel Begunkov	4cfb25bf88	io_uring: optimise hot path restricted checks Move likely/unlikely from io_check_restriction() to specifically ctx->restricted check, because doesn't do what it supposed to and make the common path take an extra jump. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/22bf70d0a543dfc935d7276bdc73081784e30698.1624739600.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-30 14:15:40 -06:00
Pavel Begunkov	e5dc480d4e	io_uring: remove not needed PF_EXITING check Since cancellation got moved before exit_signals(), there is no one left who can call io_run_task_work() with PF_EXIING set, so remove the check. Note that __io_req_task_submit() still needs a similar check. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f7f305ececb1e6044ea649fb983ca754805bb884.1624739600.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-30 14:15:40 -06:00
Pavel Begunkov	dd432ea520	io_uring: mainstream sqpoll task_work running task_works are widely used, so place io_run_task_work() directly into the main path of io_sq_thread(), and remove it from other places where it's not needed anymore. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/24eb5e35d519c590d3dffbd694b4c61a5fe49029.1624739600.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-30 14:15:39 -06:00
Pavel Begunkov	b2d9c3da77	io_uring: refactor io_arm_poll_handler() gcc 11 goes a weird path and duplicates most of io_arm_poll_handler() for READ and WRITE cases. Help it and move all pollin vs pollout specific bits under a single if-else, so there is no temptation for this kind of unfolding. before vs after: text data bss dec hex filename 85362 12650 8 98020 17ee4 ./fs/io_uring.o 85186 12650 8 97844 17e34 ./fs/io_uring.o Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1deea0037293a922a0358e2958384b2e42437885.1624739600.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-30 14:15:39 -06:00
Olivier Langlois	59b735aeeb	io_uring: reduce latency by reissueing the operation It is quite frequent that when an operation fails and returns EAGAIN, the data becomes available between that failure and the call to vfs_poll() done by io_arm_poll_handler(). Detecting the situation and reissuing the operation is much faster than going ahead and push the operation to the io-wq. Performance improvement testing has been performed with: Single thread, 1 TCP connection receiving a 5 Mbps stream, no sqpoll. 4 measurements have been taken: 1. The time it takes to process a read request when data is already available 2. The time it takes to process by calling twice io_issue_sqe() after vfs_poll() indicated that data was available 3. The time it takes to execute io_queue_async_work() 4. The time it takes to complete a read request asynchronously 2.25% of all the read operations did use the new path. ready data (baseline) avg 3657.94182918628 min 580 max 20098 stddev 1213.15975908162 reissue completion average 7882.67567567568 min 2316 max 28811 stddev 1982.79172973284 insert io-wq time average 8983.82276995305 min 3324 max 87816 stddev 2551.60056552038 async time completion average 24670.4758861127 min 10758 max 102612 stddev 3483.92416873804 Conclusion: On average reissuing the sqe with the patch code is 1.1uSec faster and in the worse case scenario 59uSec faster than placing the request on io-wq On average completion time by reissuing the sqe with the patch code is 16.79uSec faster and in the worse case scenario 73.8uSec faster than async completion. Signed-off-by: Olivier Langlois <olivier@trillion01.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9e8441419bb1b8f3c3fcc607b2713efecdef2136.1624364038.git.olivier@trillion01.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-30 14:15:39 -06:00
Jens Axboe	22634bc562	io_uring: add IOPOLL and reserved field checks to IORING_OP_UNLINKAT We can't support IOPOLL with non-pollable request types, and we should check for unused/reserved fields like we do for other request types. Fixes: `14a1143b68` ("io_uring: add support for IORING_OP_UNLINKAT") Cc: stable@vger.kernel.org Reported-by: Dmitry Kadashev <dkadashev@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-30 14:15:39 -06:00
Jens Axboe	ed7eb25922	io_uring: add IOPOLL and reserved field checks to IORING_OP_RENAMEAT We can't support IOPOLL with non-pollable request types, and we should check for unused/reserved fields like we do for other request types. Fixes: `80a261fd00` ("io_uring: add support for IORING_OP_RENAMEAT") Cc: stable@vger.kernel.org Reported-by: Dmitry Kadashev <dkadashev@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-30 14:15:39 -06:00
Pavel Begunkov	12dcb58ac7	io_uring: refactor io_openat2() Put do_filp_open() fail path of io_openat2() under a single if, deduplicating put_unused_fd(), making it look better and helping the hot path. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f4c84d25c049d0af2adc19c703bbfef607200209.1624543113.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-30 14:15:39 -06:00
Pavel Begunkov	16340eab61	io_uring: update sqe layout build checks Add missing BUILD_BUG_SQE_ELEM() for ->buf_group verifying that SQE layout doesn't change. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1f9d21bd74599b856b3a632be4c23ffa184a3ef0.1624543113.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-30 14:15:39 -06:00
Pavel Begunkov	fe7e325750	io_uring: fix code style problems Fix a bunch of problems mostly found by checkpatch.pl Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/cfaf9a2f27b43934144fe9422a916bd327099f44.1624543113.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-30 14:15:39 -06:00
Pavel Begunkov	1a924a8082	io_uring: refactor io_sq_thread() Move needs_sched declaration into the block where it's used, so it's harder to misuse/wrongfully reuse. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e4a07db1353ee38b924dd1b45394cf8e746130b4.1624543113.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-30 14:15:39 -06:00
Pavel Begunkov	948e19479c	io_uring: don't change sqpoll creds if not needed SQPOLL doesn't need to change creds if it's not submitting requests. Move creds overriding into __io_sq_thread() after checking if there are SQEs pending. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c54368da2357ac539e0a333f7cfff70d5fb045b2.1624543113.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-30 14:15:38 -06:00
Olivier Langlois	4ce8ad95f0	io_uring: Create define to modify a SQPOLL parameter The magic number used to cap the number of entries extracted from an io_uring instance SQ before moving to the other instances is an interesting parameter to experiment with. A define has been created to make it easy to change its value from a single location. Signed-off-by: Olivier Langlois <olivier@trillion01.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b401640063e77ad3e9f921e09c9b3ac10a8bb923.1624473200.git.olivier@trillion01.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-23 20:38:21 -06:00
Olivier Langlois	9971350177	io_uring: Fix race condition when sqp thread goes to sleep If an asynchronous completion happens before the task is preparing itself to wait and set its state to TASK_INTERRUPTIBLE, the completion will not wake up the sqp thread. Cc: stable@vger.kernel.org Signed-off-by: Olivier Langlois <olivier@trillion01.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d1419dc32ec6a97b453bee34dc03fa6a02797142.1624473200.git.olivier@trillion01.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-23 20:38:21 -06:00
Pavel Begunkov	7a778f9dc3	io_uring: improve in tctx_task_work() resubmission If task_state is cleared, io_req_task_work_add() will go the slow path adding a task_work, setting the task_state, waking up the task and so on. Not to mention it's expensive. tctx_task_work() first clears the state and then executes all the work items queued, so if any of them resubmits or adds new task_work items, it would unnecessarily go through the slow path of io_req_task_work_add(). Let's clear the ->task_state at the end. We still have to check ->task_list for emptiness afterward to synchronise with io_req_task_work_add(), do that, and set the state back if we're going to retry, because clearing not-ours task_state on the next iteration would be buggy. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1ef72cdac7022adf0cd7ce4bfe3bb5c82a62eb93.1623949695.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-18 09:22:02 -06:00
Pavel Begunkov	16f7207038	io_uring: don't resched with empty task_list Entering tctx_task_work() with empty task_list is a strange scenario, that can happen only on rare occasion during task exit, so let's not check for task_list emptiness in advance and do it do-while style. The code still correct for the empty case, just would do extra work about which we don't care. Do extra step and do the check before cond_resched(), so we don't resched if have nothing to execute. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c4173e288e69793d03c7d7ce826f9d28afba718a.1623949695.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-18 09:22:02 -06:00
Pavel Begunkov	c6538be9e4	io_uring: refactor tctx task_work list splicing We don't need a full copy of tctx->task_list in tctx_task_work(), but only a first one, so just assign node directly. Taking into account that task_works are run in a context of a task, it's very unlikely to first see non-empty tctx->task_list and then splice it empty, can only happen with task_work cancellations that is not-normal slow path anyway. Hence, get rid of the check in the end, it's there not for validity but "performance" purposes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d076c83fedb8253baf43acb23b8fafd7c5da1714.1623949695.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-18 09:22:02 -06:00
Pavel Begunkov	ebd0df2e63	io_uring: optimise task_work submit flushing tctx_task_work() tries to fetch a next batch of requests, but before it would flush completions from the previous batch that may be sub-optimal. E.g. io_req_task_queue() executes a head of the link where all the linked may be enqueued through the same io_req_task_queue(). And there are more cases for that. Do the flushing at the end, so it can cache completions of several waves of a single tctx_task_work(), and do the flush at the very end. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3cac83934e4fbce520ff8025c3524398b3ae0270.1623949695.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-18 09:22:02 -06:00
Pavel Begunkov	3f18407dc6	io_uring: inline __tctx_task_work() Inline __tctx_task_work() into tctx_task_work() in preparation for further optimisations. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f9c05c4bc9763af7bd8e25ebc3c5f7b6f69148f8.1623949695.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-18 09:22:02 -06:00
Pavel Begunkov	a3dbdf54da	io_uring: refactor io_get_sequence() Clean up io_get_sequence() and add a comment describing the magic around sequence correction. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f55dc409936b8afa4698d24b8677a34d31077ccb.1623949695.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-18 09:22:02 -06:00
Pavel Begunkov	c854357bc1	io_uring: clean all flags in io_clean_op() at once Clean all flags in io_clean_op() in the end in one operation, will save us a couple of operation and binary size. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b8efe1f022a037f74e7fe497c69fb554d59bfeaf.1623949695.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-18 09:22:02 -06:00
Pavel Begunkov	1dacb4df4e	io_uring: simplify iovec freeing in io_clean_op() We don't get REQ_F_NEED_CLEANUP for rw unless there is ->free_iovec set, so remove the optimisation of NULL checking it inline, kfree() will take care if that would ever be the case. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/a233dc655d3d45bd4f69b73d55a61de46d914415.1623949695.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-18 09:22:02 -06:00
Pavel Begunkov	b8e64b5300	io_uring: track request creds with a flag Currently, if req->creds is not NULL, then there are creds assigned. Track the invariant with a new flag in req->flags. No need to clear the field at init, and also cleanup can be efficiently moved into io_clean_op(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/5f8baeb8d3b909487f555542350e2eac97005556.1623949695.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-18 09:22:02 -06:00
Pavel Begunkov	c10d1f986b	io_uring: move creds from io-wq work to io_kiocb io-wq now doesn't have anything to do with creds now, so move ->creds from struct io_wq_work into request (aka struct io_kiocb). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/8520c72ab8b8f4b96db12a228a2ab4c094ae64e1.1623949695.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-18 09:22:02 -06:00
Pavel Begunkov	2a2758f26d	io_uring: refactor io_submit_flush_completions() struct io_comp_state is always contained in struct io_ring_ctx, don't pass them into io_submit_flush_completions() separately, it makes the interface cleaner and simplifies it for the compiler. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/44d6ca57003a82484338e95197024dbd65a1b376.1623949695.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-18 09:22:02 -06:00
Jens Axboe	fe76421d1d	io_uring: allow user configurable IO thread CPU affinity io-wq defaults to per-node masks for IO workers. This works fine by default, but isn't particularly handy for workloads that prefer more specific affinities, for either performance or isolation reasons. This adds IORING_REGISTER_IOWQ_AFF that allows the user to pass in a CPU mask that is then applied to IO thread workers, and an IORING_UNREGISTER_IOWQ_AFF that simply resets the masks back to the default of per-node. Note that no care is given to existing IO threads, they will need to go through a reschedule before the affinity is correct if they are already running or sleeping. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-17 10:25:50 -06:00
Olivier Langlois	236daeae36	io_uring: Add to traces the req pointer when available The req pointer uniquely identify a specific request. Having it in traces can provide valuable insights that is not possible to have if the calling process is reusing the same user_data value. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Olivier Langlois <olivier@trillion01.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-16 06:41:46 -06:00
Pavel Begunkov	2335f6f5dd	io_uring: optimise io_commit_cqring() In most cases io_commit_cqring() is just an smp_store_release(), and it's hot enough, especially for IRQ rw, to want it to save on a function call. Mark it inline and extract a non-inlined slow path doing drain and timeout flushing. The inlined part is pretty slim to not cause binary bloating. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7350f8b6b92caa50a48a80be39909f0d83eddd93.1623772051.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-15 15:44:34 -06:00
Pavel Begunkov	3c19966d37	io_uring: shove more drain bits out of hot path Place all drain_next logic into io_drain_req(), so it's never executed if there was no drained requests before. The only thing we need is to set ->drain_active if we see a request with IOSQE_IO_DRAIN, do that in io_init_req() where flags are definitely in registers. Also, all drain-related code is encapsulated in io_drain_req(), makes it cleaner. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/68bf4f7395ddaafbf1a26bd97b57d57d45a9f900.1623772051.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-15 15:44:34 -06:00
Pavel Begunkov	10c669040e	io_uring: switch !DRAIN fast path when possible ->drain_used is one way, which is not optimal if users use DRAIN but very rarely. However, we can just clear it in io_drain_req() when all drained before requests are gone. Also rename the flag to reflect the change and be more clear about it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7f37a240857546a94df6348507edddacab150460.1623772051.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-15 15:44:33 -06:00
Pavel Begunkov	27f6b318de	io_uring: fix min types mismatch in table alloc fs/io_uring.c: In function 'io_alloc_page_table': include/linux/minmax.h:20:28: warning: comparison of distinct pointer types lacks a cast Cast everything to size_t using min_t. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Fixes: `9123c8ffce` ("io_uring: add helpers for 2 level table alloc") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/50f420a956bca070a43810d4a805293ed54f39d8.1623759527.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-15 15:40:17 -06:00
Fam Zheng	dd9ae8a0b2	io_uring: Fix comment of io_get_sqe The sqe_ptr argument has been gone since `709b302fad` (io_uring: simplify io_get_sqring, 2020-04-08), made the return value of the function. Update the comment accordingly. Signed-off-by: Fam Zheng <fam.zheng@bytedance.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20210604164256.12242-1-fam.zheng@bytedance.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-15 15:39:16 -06:00
Pavel Begunkov	441b8a7803	io_uring: optimise non-drain path Replace drain checks with one-way flag set upon seeing the first IOSQE_IO_DRAIN request. There are several places where it cuts cycles well: 1) It's much faster than the fast check with two conditions in io_drain_req() including pretty complex list_empty_careful(). 2) We can mark io_queue_sqe() inline now, that's a huge win. 3) It replaces timeout and drain checks in io_commit_cqring() with a single flags test. Also great not touching ->defer_list there without a reason so limiting cache bouncing. It adds a small amount of overhead to drain path, but it's negligible. The main nuisance is that once it meets any DRAIN request in io_uring instance lifetime it will _always_ go through a slower path, so drain-less and offset-mode timeout less applications are preferable. The overhead in that case would be not big, but it's worth to bear in mind. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/98d2fff8c4da5144bb0d08499f591d4768128ea3.1623709150.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-15 15:38:40 -06:00
Pavel Begunkov	76cc33d791	io_uring: refactor io_req_defer() Rename io_req_defer() into io_drain_req() and refactor it uncoupling it from io_queue_sqe() error handling and preparing for coming optimisations. Also, prioritise non IOSQE_ASYNC path. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4f17dd56e7fbe52d1866f8acd8efe3284d2bebcb.1623709150.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-15 15:38:40 -06:00
Pavel Begunkov	0499e582aa	io_uring: move uring_lock location ->uring_lock is prevalently used for submission, even though it protects many other things like iopoll, registeration, selected bufs, and more. And it's placed together with ->cq_wait poked on completion and CQ waiting sides. Move them apart, ->uring_lock goes to the submission data, and cq_wait to completion related chunk. The last one requires some reshuffling so everything needed by io_cqring_ev_posted*() is in one cacheline. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/dea5e845caee4c98aa0922b46d713154d81f7bd8.1623709150.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-15 15:38:40 -06:00
Pavel Begunkov	311997b3fc	io_uring: wait heads renaming We use several wait_queue_head's for different purposes, but namings are confusing. First rename ctx->cq_wait into ctx->poll_wait, because this one is used for polling an io_uring instance. Then rename ctx->wait into ctx->cq_wait, which is responsible for CQE waiting. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/47b97a097780c86c67b20b6ccc4e077523dce682.1623709150.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-15 15:38:40 -06:00
Pavel Begunkov	5ed7a37d21	io_uring: clean up check_overflow flag There are no users of ->sq_check_overflow, only ->cq_check_overflow is used. Combine it and move out of completion related part of struct io_ring_ctx. A not so obvious benefit of it is fitting all completion side fields into a single cacheline. It was taking 2 lines before with 56B padding, and io_cqring_ev_posted*() were still touching both of them. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/25927394964df31d113e3c729416af573afff5f5.1623709150.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-15 15:38:40 -06:00
Pavel Begunkov	5e159204d7	io_uring: small io_submit_sqe() optimisation submit_state.link is used only to assemble a link and not used for actual submission, so clear it before io_queue_sqe() in io_submit_sqe(), awhile it's hot and in caches and queueing doesn't spoil it. May also potentially help compiler with spilling or to do other optimisations. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1579939426f3ad6b55af3005b1389bbbed7d780d.1623709150.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-15 15:38:40 -06:00
Pavel Begunkov	f18ee4cf0a	io_uring: optimise completion timeout flushing io_commit_cqring() might be very hot and we definitely don't want to touch ->timeout_list there, because 1) it's shared with the submission side so might lead to cache bouncing and 2) may need to load an extra cache line, especially for IRQ completions. We're interested in it at the completion side only when there are offset-mode timeouts, which are not so popular. Replace list_empty(->timeout_list) hot path check with a new one-way flag, which is set when we prepare the first offset-mode timeout. note: the flag sits in the same line as briefly used after ->rings Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e4892ec68b71a69f92ffbea4a1499be3ec0d463b.1623709150.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-15 15:38:40 -06:00
Pavel Begunkov	15641e4270	io_uring: don't cache number of dropped SQEs Kill ->cached_sq_dropped and wire DRAIN sequence number correction via ->cq_extra, which is there exactly for that purpose. User visible dropped counter will be populated by incrementing it instead of keeping a copy, similarly as it was done not so long ago with cq_overflow. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/088aceb2707a534d531e2770267c4498e0507cc1.1623709150.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-15 15:38:40 -06:00
Pavel Begunkov	17d3aeb33c	io_uring: refactor io_get_sqe() The line of io_get_sqe() evaluating @head consists of too many operations including READ_ONCE(), it's not convenient for probing. Refactor it also improving readability. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/866ad6e4ef4851c7c61f6b0e08dbd0a8d1abce84.1623709150.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-15 15:38:39 -06:00
Pavel Begunkov	7f1129d227	io_uring: shuffle more fields into SQ ctx section Since moving locked_free_* out of struct io_submit_state ctx->submit_state is accessed on submission side only, so move it into the submission section. Same goes for rsrc table pointers/nodes/etc., they must be taken and checked during submission because sync'ed by uring_lock, so move them there as well. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/8a5899a50afc6ccca63249e716f580b246f3dec6.1623709150.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-15 15:38:39 -06:00
Pavel Begunkov	b52ecf8cb5	io_uring: move ctx->flags from SQ cacheline ctx->flags are heavily used by both, completion and submission sides, so move it out from the ctx fields related to submissions. Instead, place it together with ctx->refs, because it's already cacheline-aligned and so pads lots of space, and both almost never change. Also, in most occasions they are accessed together as refs are taken at submission time and put back during completion. Do same with ctx->rings, where the pointer itself is never modified apart from ring init/free. Note: in percpu mode, struct percpu_ref doesn't modify the struct itself but takes indirection with ref->percpu_count_ptr. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4c48c173e63d35591383ba2b87e8b8e8dfdbd23d.1623709150.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-15 15:38:39 -06:00
Pavel Begunkov	c7af47cf0f	io_uring: keep SQ pointers in a single cacheline sq_array and sq_sqes are always used together, however they are in different cachelines, where the borderline is right before cq_overflow_list is rather rarely touched. Move the fields together so it loads only one cacheline. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3ef2411a94874da06492506a8897eff679244f49.1623709150.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-15 15:38:39 -06:00
Colin Ian King	fdd1dc316e	io_uring: Fix incorrect sizeof operator for copy_from_user call Static analysis is warning that the sizeof being used is should be of *data->tags[i] and not data->tags[i]. Although these are the same size on 64 bit systems it is not a portable assumption to assume this is true for all cases. Fix this by using a temporary pointer tag_slot to make the code a clearer. Addresses-Coverity: ("Sizeof not portable") Fixes: `d878c81610` ("io_uring: hide rsrc tag copy into generic helpers") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20210615130011.57387-1-colin.king@canonical.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-15 15:37:11 -06:00
Pavel Begunkov	aeab9506ef	io_uring: inline io_iter_do_read() There are only two calls in source code of io_iter_do_read(), the function is small and pretty hot though is failed to get inlined. Makr it as inline. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/25a26dae7660da73fbc2244b361b397ef43d3caf.1623634182.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-14 08:23:13 -06:00
Pavel Begunkov	78cc687be9	io_uring: unify SQPOLL and user task cancellations Merge io_uring_cancel_sqpoll() and __io_uring_cancel() as it's easier to have a conditional ctx traverse inside than keeping them in sync. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/adfe24d6dad4a3883a40eee54352b8b65ac851bb.1623634181.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-14 08:23:13 -06:00
Pavel Begunkov	09899b1915	io_uring: cache task struct refs tctx in submission part is always synchronised because is executed from the task's context, so we can batch allocate tctx/task references and store them across syscall boundaries. It avoids enough of operations, including an atomic for getting task ref and a percpu_counter_add() function call, which still fallback to spinlock for large batching cases (around >=32). Should be good for SQPOLL submitting in small portions and coming at some moment bpf submissions. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/14b327b973410a3eec1f702ecf650e100513aca9.1623634181.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-14 08:23:13 -06:00
Pavel Begunkov	2d091d62b1	io_uring: don't vmalloc rsrc tags We don't really need vmalloc for keeping tags, it's not a hot path and is there out of convenience, so replace it with two level tables to not litter kernel virtual memory mappings. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/241a3422747113a8909e7e1030eb585d4a349e0d.1623634181.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-14 08:23:13 -06:00
Pavel Begunkov	9123c8ffce	io_uring: add helpers for 2 level table alloc Some parts like fixed file table use 2 level tables, factor out helpers for allocating/deallocating them as more users are to come. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1709212359cd82eb416d395f86fc78431ccfc0aa.1623634181.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-14 08:23:13 -06:00
Pavel Begunkov	157d257f99	io_uring: remove rsrc put work irq save/restore io_rsrc_put_work() is executed by workqueue in non-irq context, so no need for irqsave/restore variants of spinlocking. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2a7f77220735f4ad404ac885b4d73bdf42d2f836.1623634181.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-14 08:23:13 -06:00
Pavel Begunkov	d878c81610	io_uring: hide rsrc tag copy into generic helpers Make io_rsrc_data_alloc() taking care of rsrc tags loading on registration, so we don't need to repeat it for each new rsrc type. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/5609680697bd09735de10561b75edb95283459da.1623634181.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-14 08:23:13 -06:00
Pavel Begunkov	eef51daa72	io_uring: rename function task_file What at some moment was references to struct file used to control lifetimes of task/ctx is now just internal tctx structures/nodes, so rename outdated task_file() routines into something more sensible. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e2fbce42932154c2631ce58ffbffaa232afe18d5.1623634181.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-14 08:23:12 -06:00
Pavel Begunkov	cb3d8972c7	io_uring: refactor io_iopoll_req_issued A simple refactoring of io_iopoll_req_issued(), move in_async inside so we don't pass it around and save on double checking it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1513bfde4f0c835be25ac69a82737ab0668d7665.1623634181.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-14 08:23:12 -06:00
Pavel Begunkov	976517f162	io_uring: fix blocking inline submission There is a complaint against sys_io_uring_enter() blocking if it submits stdin reads. The problem is in __io_file_supports_async(), which sees that it's a cdev and allows it to be processed inline. Punt char devices using generic rules of io_file_supports_async(), including checking for presence of *_iter() versions of rw callbacks. Apparently, it will affect most of cdevs with some exceptions like null and zero devices. Cc: stable@vger.kernel.org Reported-by: Birk Hirdman <lonjil@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d60270856b8a4560a639ef5f76e55eb563633599.1623236455.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-14 08:23:05 -06:00
Pavel Begunkov	40dad765c0	io_uring: enable shmem/memfd memory registration Relax buffer registration restictions, which filters out file backed memory, and allow shmem/memfd as they have normal anonymous pages underneath. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-14 08:23:05 -06:00
Pavel Begunkov	d0acdee296	io_uring: don't bounce submit_state cachelines struct io_submit_state contains struct io_comp_state and so locked_free_, that renders cachelines around ->locked_free being invalidated on most non-inline completions, that may terrorise caches if submissions and completions are done by different tasks. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/290cb5412b76892e8631978ee8ab9db0c6290dd5.1621201931.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-14 08:23:05 -06:00
Pavel Begunkov	d068b5068d	io_uring: rename io_get_cqring Rename io_get_cqring() into io_get_cqe() for consistency with SQ, and just because the old name is not as clear. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/a46a53e3f781de372f5632c184e61546b86515ce.1621201931.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-14 08:23:05 -06:00
Pavel Begunkov	8f6ed49a44	io_uring: kill cached_cq_overflow There are two copies of cq_overflow, shared with userspace and internal cached one. It was needed for DRAIN accounting, but now we have yet another knob to tune the accounting, i.e. cq_extra, and we can throw away the internal counter and just increment the one in the shared ring. If user modifies it as so never gets the right overflow value ever again, it's its problem, even though before we would have restored it back by next overflow. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/8427965f5175dd051febc63804909861109ce859.1621201931.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-14 08:23:05 -06:00
Pavel Begunkov	ea5ab3b579	io_uring: deduce cq_mask from cq_entries No need to cache cq_mask, it's exactly cq_entries - 1, so just deduce it to not carry it around. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d439efad0503c8398451dae075e68a04362fbc8d.1621201931.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-14 08:23:05 -06:00
Pavel Begunkov	a566c5562d	io_uring: remove dependency on ring->sq/cq_entries We have numbers of {sq,cq} entries cached in ctx, don't look up them in user-shared rings as 1) it may fetch additional cacheline 2) user may change it and so it's always error prone. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/745d31bc2da41283ddd0489ef784af5c8d6310e9.1621201931.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-14 08:23:05 -06:00
Pavel Begunkov	b13a8918d3	io_uring: better locality for rsrc fields ring has two types of resource-related fields: used for request submission, and field needed for update/registration. Reshuffle them into these two groups for better locality and readability. The second group is not in the hot path, so it's natural to place them somewhere in the end. Also update an outdated comment. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/05b34795bb4440f4ec4510f08abd5a31830f8ca0.1621201931.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-14 08:23:04 -06:00
Pavel Begunkov	b986af7e2d	io_uring: shuffle rarely used ctx fields There is a bunch of scattered around ctx fields that are almost never used, e.g. only on ring exit, plunge them to the end, better locality, better aesthetically. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/782ff94b00355923eae757d58b1a47821b5b46d4.1621201931.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-14 08:23:04 -06:00
Pavel Begunkov	93d2bcd2cb	io_uring: make fail flag not link specific The main difference is in req_set_fail_links() renamed into req_set_fail(), which now sets REQ_F_FAIL_LINK/REQ_F_FAIL flag unconditional on whether it has been a link or not. It only matters in io_disarm_next(), which already handles it well, and all calls to it have a fast path checking REQ_F_LINK/HARDLINK. It looks cleaner, and sheds binary size text data bss dec hex filename 84235 12390 8 96633 17979 ./fs/io_uring.o 84151 12414 8 96573 1793d ./fs/io_uring.o Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e2224154dd6e53b665ac835d29436b177872fa10.1621201931.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-14 08:23:04 -06:00
Pavel Begunkov	3dd0c97a9e	io_uring: get rid of files in exit cancel We don't match against files on cancellation anymore, so no need to drag around files_struct anymore, just pass a flag telling whether only inflight or all requests should be killed. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7bfc5409a78f8e2d6b27dec3293ec2d248677348.1621201931.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-14 08:23:04 -06:00
Pavel Begunkov	acfb381d9d	io_uring: simplify waking sqo_sq_wait Going through submission in __io_sq_thread() and still having a full SQ is rather unexpected, so remove a check for SQ fullness and just wake up whoever wait on sqo_sq_wait. Also skip if it doesn't do submission in the first place, likely may to happen for SQPOLL sharing and/or IOPOLL. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e2e91751e87b1a39f8d63ef884aaff578123f61e.1621201931.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-14 08:23:04 -06:00
Pavel Begunkov	21f2fc080f	io_uring: remove unused park_task_work As sqpoll cancel via task_work is killed, remove everything related to park_task_work as it's not used anymore. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/310d8b76a2fbbf3e139373500e04ad9af7ee3dbb.1621201931.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-14 08:23:04 -06:00
Pavel Begunkov	aaa9f0f481	io_uring: improve sq_thread waiting check If SQPOLL task finds a ring requesting it to continue running, no need to set wake flag to rest of the rings as it will be cleared in a moment anyway, so hide it in a single sqd->ctx_list loop. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1ee5a696d9fd08645994c58ee147d149a8957d94.1621201931.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-14 08:23:04 -06:00
Pavel Begunkov	e4b6d902a9	io_uring: improve sqpoll event/state handling As sqd->state changes rarely, don't check every event one by one but look them all at once. Add a helper function. Also don't go into event waiting sleeping with STOP flag set. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/645025f95c7eeec97f88ff497785f4f1d6f3966f.1621201931.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-14 08:23:04 -06:00
Pavel Begunkov	9690557e22	io_uring: add feature flag for rsrc tags Add IORING_FEAT_RSRC_TAGS indicating that io_uring supports a bunch of new IORING_REGISTER operations, in particular IORING_REGISTER_[FILES[,UPDATE]2,BUFFERS[2,UPDATE]] that support rsrc tagging, and also indicating implemented dynamic fixed buffer updates. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9b995d4045b6c6b4ab7510ca124fd25ac2203af7.1623339162.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-10 16:33:51 -06:00
Pavel Begunkov	992da01aa9	io_uring: change registration/upd/rsrc tagging ABI There are ABI moments about recently added rsrc registration/update and tagging that might become a nuisance in the future. First, IORING_REGISTER_RSRC[_UPD] hide different types of resources under it, so breaks fine control over them by restrictions. It works for now, but once those are wanted under restrictions it would require a rework. It was also inconvenient trying to fit a new resource not supporting all the features (e.g. dynamic update) into the interface, so better to return to IORING_REGISTER_* top level dispatching. Second, register/update were considered to accept a type of resource, however that's not a good idea because there might be several ways of registration of a single resource type, e.g. we may want to add non-contig buffers or anything more exquisite as dma mapped memory. So, remove IORING_RSRC_[FILE,BUFFER] out of the ABI, and place them internally for now to limit changes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9b554897a7c17ad6e3becc48dfed2f7af9f423d5.1623339162.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-10 16:33:51 -06:00
Pavel Begunkov	216e583596	io_uring: fix misaccounting fix buf pinned pages As Andres reports "... io_sqe_buffer_register() doesn't initialize imu. io_buffer_account_pin() does imu->acct_pages++, before calling io_account_mem(ctx, imu->acct_pages).", leading to evevntual -ENOMEM. Initialise the field. Reported-by: Andres Freund <andres@anarazel.de> Fixes: `41edf1a5ec` ("io_uring: keep table of pointers to ubufs") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/438a6f46739ae5e05d9c75a0c8fa235320ff367c.1622285901.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-05-29 19:27:21 -06:00
Marco Elver	b16ef427ad	io_uring: fix data race to avoid potential NULL-deref Commit `ba5ef6dc8a` ("io_uring: fortify tctx/io_wq cleanup") introduced setting tctx->io_wq to NULL a bit earlier. This has caused KCSAN to detect a data race between accesses to tctx->io_wq: write to 0xffff88811d8df330 of 8 bytes by task 3709 on cpu 1: io_uring_clean_tctx fs/io_uring.c:9042 [inline] __io_uring_cancel fs/io_uring.c:9136 io_uring_files_cancel include/linux/io_uring.h:16 [inline] do_exit kernel/exit.c:781 do_group_exit kernel/exit.c:923 get_signal kernel/signal.c:2835 arch_do_signal_or_restart arch/x86/kernel/signal.c:789 handle_signal_work kernel/entry/common.c:147 [inline] exit_to_user_mode_loop kernel/entry/common.c:171 [inline] ... read to 0xffff88811d8df330 of 8 bytes by task 6412 on cpu 0: io_uring_try_cancel_iowq fs/io_uring.c:8911 [inline] io_uring_try_cancel_requests fs/io_uring.c:8933 io_ring_exit_work fs/io_uring.c:8736 process_one_work kernel/workqueue.c:2276 ... With the config used, KCSAN only reports data races with value changes: this implies that in the case here we also know that tctx->io_wq was non-NULL. Therefore, depending on interleaving, we may end up with: [CPU 0] \| [CPU 1] io_uring_try_cancel_iowq() \| io_uring_clean_tctx() if (!tctx->io_wq) // false \| ... ... \| tctx->io_wq = NULL io_wq_cancel_cb(tctx->io_wq, ...) \| ... -> NULL-deref \| Note: It is likely that thus far we've gotten lucky and the compiler optimizes the double-read into a single read into a register -- but this is never guaranteed, and can easily change with a different config! Fix the data race by restoring the previous behaviour, where both setting io_wq to NULL and put of the wq are _serialized_ after concurrent io_uring_try_cancel_iowq() via acquisition of the uring_lock and removal of the node in io_uring_del_task_file(). Fixes: `ba5ef6dc8a` ("io_uring: fortify tctx/io_wq cleanup") Suggested-by: Pavel Begunkov <asml.silence@gmail.com> Reported-by: syzbot+bf2b3d0435b9b728946c@syzkaller.appspotmail.com Signed-off-by: Marco Elver <elver@google.com> Cc: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20210527092547.2656514-1-elver@google.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-05-27 07:44:49 -06:00
Pavel Begunkov	17a91051fe	io_uring/io-wq: close io-wq full-stop gap There is an old problem with io-wq cancellation where requests should be killed and are in io-wq but are not discoverable, e.g. in @next_hashed or @linked vars of io_worker_handle_work(). It adds some unreliability to individual request canellation, but also may potentially get __io_uring_cancel() stuck. For instance: 1) An __io_uring_cancel()'s cancellation round have not found any request but there are some as desribed. 2) __io_uring_cancel() goes to sleep 3) Then workers wake up and try to execute those hidden requests that happen to be unbound. As we already cancel all requests of io-wq there, set IO_WQ_BIT_EXIT in advance, so preventing 3) from executing unbound requests. The workers will initially break looping because of getting a signal as they are threads of the dying/exec()'ing user task. Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/abfcf8c54cb9e8f7bfbad7e9a0cc5433cc70bdc2.1621781238.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-05-25 19:39:58 -06:00
Pavel Begunkov	ba5ef6dc8a	io_uring: fortify tctx/io_wq cleanup We don't want anyone poking into tctx->io_wq awhile it's being destroyed by io_wq_put_and_exit(), and even though it shouldn't even happen, if buggy would be preferable to get a NULL-deref instead of subtle delayed failure or UAF. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/827b021de17926fd807610b3e53a5a5fa8530856.1621513214.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-05-20 07:29:11 -06:00
Pavel Begunkov	7a27472770	io_uring: don't modify req->poll for rw __io_queue_proc() is used by both poll and apoll, so we should not access req->poll directly but selecting right struct io_poll_iocb depending on use case. Reported-and-tested-by: syzbot+a84b8783366ecb1c65d0@syzkaller.appspotmail.com Fixes: `ea6a693d86` ("io_uring: disable multishot poll for double poll add cases") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4a6a1de31142d8e0250fe2dfd4c8923d82a5bbfc.1621251795.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-05-17 07:28:48 -06:00
Pavel Begunkov	489809e2e2	io_uring: increase max number of reg buffers Since recent changes instead of storing a large array of struct io_mapped_ubuf, we store pointers to them, that is 4 times slimmer and we should not to so worry about restricting max number of registererd buffer slots, increase the limit 4 times. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d3dee1da37f46da416aa96a16bf9e5094e10584d.1620990371.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-05-14 06:06:34 -06:00
Pavel Begunkov	2d74d0421e	io_uring: further remove sqpoll limits on opcodes There are three types of requests that left disabled for sqpoll, namely epoll ctx, statx, and resources update. Since SQPOLL task is now closely mimics a userspace thread, remove the restrictions. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/909b52d70c45636d8d7897582474ea5aab5eed34.1620990306.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-05-14 06:06:23 -06:00
Pavel Begunkov	447c19f3b5	io_uring: fix ltout double free on completion race Always remove linked timeout on io_link_timeout_fn() from the master request link list, otherwise we may get use-after-free when first io_link_timeout_fn() puts linked timeout in the fail path, and then will be found and put on master's free. Cc: stable@vger.kernel.org # 5.10+ Fixes: `90cd7e4249` ("io_uring: track link timeout's master explicitly") Reported-and-tested-by: syzbot+5a864149dd970b546223@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/69c46bf6ce37fec4fdcd98f0882e18eb07ce693a.1620990121.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-05-14 06:06:15 -06:00
Pavel Begunkov	a298232ee6	io_uring: fix link timeout refs WARNING: CPU: 0 PID: 10242 at lib/refcount.c:28 refcount_warn_saturate+0x15b/0x1a0 lib/refcount.c:28 RIP: 0010:refcount_warn_saturate+0x15b/0x1a0 lib/refcount.c:28 Call Trace: __refcount_sub_and_test include/linux/refcount.h:283 [inline] __refcount_dec_and_test include/linux/refcount.h:315 [inline] refcount_dec_and_test include/linux/refcount.h:333 [inline] io_put_req fs/io_uring.c:2140 [inline] io_queue_linked_timeout fs/io_uring.c:6300 [inline] __io_queue_sqe+0xbef/0xec0 fs/io_uring.c:6354 io_submit_sqe fs/io_uring.c:6534 [inline] io_submit_sqes+0x2bbd/0x7c50 fs/io_uring.c:6660 __do_sys_io_uring_enter fs/io_uring.c:9240 [inline] __se_sys_io_uring_enter+0x256/0x1d60 fs/io_uring.c:9182 io_link_timeout_fn() should put only one reference of the linked timeout request, however in case of racing with the master request's completion first io_req_complete() puts one and then io_put_req_deferred() is called. Cc: stable@vger.kernel.org # 5.12+ Fixes: `9ae1f8dd37` ("io_uring: fix inconsistent lock state") Reported-by: syzbot+a2910119328ce8e7996f@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/ff51018ff29de5ffa76f09273ef48cb24c720368.1620417627.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-05-08 22:11:49 -06:00
Thadeu Lima de Souza Cascardo	d1f8280887	io_uring: truncate lengths larger than MAX_RW_COUNT on provide buffers Read and write operations are capped to MAX_RW_COUNT. Some read ops rely on that limit, and that is not guaranteed by the IORING_OP_PROVIDE_BUFFERS. Truncate those lengths when doing io_add_buffers, so buffer addresses still use the uncapped length. Also, take the chance and change struct io_buffer len member to __u32, so it matches struct io_provide_buffer len member. This fixes CVE-2021-3491, also reported as ZDI-CAN-13546. Fixes: `ddf0322db7` ("io_uring: add IORING_OP_PROVIDE_BUFFERS") Reported-by: Billy Jheng Bing-Jhong (@st424204) Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-05-05 15:17:35 -06:00
Zqiang	bb6659cc0a	io_uring: Fix memory leak in io_sqe_buffers_register() unreferenced object 0xffff8881123bf0a0 (size 32): comm "syz-executor557", pid 8384, jiffies 4294946143 (age 12.360s) backtrace: [<ffffffff81469b71>] kmalloc_node include/linux/slab.h:579 [inline] [<ffffffff81469b71>] kvmalloc_node+0x61/0xf0 mm/util.c:587 [<ffffffff815f0b3f>] kvmalloc include/linux/mm.h:795 [inline] [<ffffffff815f0b3f>] kvmalloc_array include/linux/mm.h:813 [inline] [<ffffffff815f0b3f>] kvcalloc include/linux/mm.h:818 [inline] [<ffffffff815f0b3f>] io_rsrc_data_alloc+0x4f/0xc0 fs/io_uring.c:7164 [<ffffffff815f26d8>] io_sqe_buffers_register+0x98/0x3d0 fs/io_uring.c:8383 [<ffffffff815f84a7>] __io_uring_register+0xf67/0x18c0 fs/io_uring.c:9986 [<ffffffff81609222>] __do_sys_io_uring_register fs/io_uring.c:10091 [inline] [<ffffffff81609222>] __se_sys_io_uring_register fs/io_uring.c:10071 [inline] [<ffffffff81609222>] __x64_sys_io_uring_register+0x112/0x230 fs/io_uring.c:10071 [<ffffffff842f616a>] do_syscall_64+0x3a/0xb0 arch/x86/entry/common.c:47 [<ffffffff84400068>] entry_SYSCALL_64_after_hwframe+0x44/0xae Fix data->tags memory leak, through io_rsrc_data_free() to release data memory space. Reported-by: syzbot+0f32d05d8b6cd8d7ea3e@syzkaller.appspotmail.com Signed-off-by: Zqiang <qiang.zhang@windriver.com> Link: https://lore.kernel.org/r/20210430082515.13886-1-qiang.zhang@windriver.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-30 06:44:22 -06:00
Colin Ian King	cf3770e784	io_uring: Fix premature return from loop and memory leak Currently the -EINVAL error return path is leaking memory allocated to data. Fix this by not returning immediately but instead setting the error return variable to -EINVAL and breaking out of the loop. Kudos to Pavel Begunkov for suggesting a correct fix. Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20210429104602.62676-1-colin.king@canonical.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-29 13:26:19 -06:00
Pavel Begunkov	47b228ce6f	io_uring: fix unchecked error in switch_start() io_rsrc_node_switch_start() can fail, don't forget to check returned error code. Reported-by: syzbot+a4715dd4b7c866136f79@syzkaller.appspotmail.com Fixes: `eae071c9b4` ("io_uring: prepare fixed rw for dynanic buffers") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c4c06e2f3f0c8e43bd8d0a266c79055bcc6b6e60.1619693112.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-29 13:26:19 -06:00
Pavel Begunkov	6224843d56	io_uring: allow empty slots for reg buffers Allow empty reg buffer slots any request using which should fail. This allows users to not register all buffers in advance, but do it lazily and/or on demand via updates. That is achieved by setting iov_base and iov_len to zero for registration and/or buffer updates. Empty buffer can't have a non-zero tag. Implementation details: to not add extra overhead to io_import_fixed(), create a dummy buffer crafted to fail any request using it, and set it to all empty buffer slots. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7e95e4d700082baaf010c648c72ac764c9cc8826.1619611868.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-29 13:26:19 -06:00
Pavel Begunkov	b0d658ec88	io_uring: add more build check for uapi Add a couple of BUILD_BUG_ON() checking some rsrc uapi structs and SQE flags. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/ff960df4d5026b9fb5bfd80994b9d3667d3926da.1619536280.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-29 13:26:18 -06:00
Pavel Begunkov	dddca22636	io_uring: dont overlap internal and user req flags CQE flags take one byte that we store in req->flags together with other REQ_F_* internal flags. CQE flags are copied directly into req and then verified that requires some handling on failures, e.g. to make sure that that copy doesn't set some of the internal flags. Move all internal flags to take bits after the first byte, so we don't need extra handling and make it safer overall. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b8b5b02d1ab9d786fcc7db4a3fe86db6b70b8987.1619536280.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-29 13:26:18 -06:00
Pavel Begunkov	2840f710f2	io_uring: fix drain with rsrc CQEs Resource emitted CQEs are not bound to requests, so fix up counters used for DRAIN/defer logic. Fixes: `b60c8dce33` ("io_uring: preparation for rsrc tagging") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2b32f5f0a40d5928c3466d028f936e167f0654be.1619536280.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-29 13:26:18 -06:00
Linus Torvalds	625434dafd	for-5.13/io_uring-2021-04-27 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmCIRBUQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpjt5D/9de6zCaha6CyfIIPiU+crropQ2jPzO49cb WzcOCmdhSv0GtYlhdnIqCOo5p8mRDWJAEBU9upTDTCWOx9hwr5Ms0TCNQHxuQ/T0 4Ll+/cMsOxeTypiykfMtOG9TEmYSria2vTJKLgpyaP4ohfJa3uT7r2NZ8NK/8T4t wwbJ+jCSKewelI1l0XD8k8LBU39FS/KRgLTdfYj/rCW3PWt/ZE2eSIYjZQvMCVOC 3fIdgOOJAMQVQafz+YAeJd2E+/l5/8YcJVKpJMVtBNbqTHIjA4EsInZauy8TpBgW OzJ3I+XdF70qZM119tI/nXw3sb0e+UV0fRsIXLkOwTEBzowernrAtsEwAOP+qFKS 2YnqSKOSjMO5d5Mpkz6T0MDMloU45jph88lUH0RoShVxGa7jv+TMOL6QU1oOyxc1 +gPPbApQs9WtSZDHsTJ0xFLpol804UDQmwb38mHdzedDVSE7iip1jANkw6LEhKkJ Mlg60ZF1Z305G+cDhrbs02ZGVa+fzbrtXtLlTqZw8bNX9lBp0JLtDpzskjbnUmck 6A04nfg+Eto5GvAn+FRBuOCPridLEk2K6ygko/gwQWsYCgqkCgRuqjlIQCSZy5iu jHEFixIXKn6eACf+YzLVxSLyEQrmFyDSypbN7LvzoKJYo/loy8Q1+42nGlrVC3zi +CB1NokPng== =ZJ8L -----END PGP SIGNATURE----- Merge tag 'for-5.13/io_uring-2021-04-27' of git://git.kernel.dk/linux-block Pull io_uring updates from Jens Axboe: - Support for multi-shot mode for POLL requests - More efficient reference counting. This is shamelessly stolen from the mm side. Even though referencing is mostly single/dual user, the 128 count was retained to keep the code the same. Maybe this should/could be made generic at some point. - Removal of the need to have a manager thread for each ring. The manager threads only job was checking and creating new io-threads as needed, instead we handle this from the queue path. - Allow SQPOLL without CAP_SYS_ADMIN or CAP_SYS_NICE. Since 5.12, this thread is "just" a regular application thread, so no need to restrict use of it anymore. - Cleanup of how internal async poll data lifetime is managed. - Fix for syzbot reported crash on SQPOLL cancelation. - Make buffer registration more like file registrations, which includes flexibility in avoiding full set unregistration and re-registration. - Fix for io-wq affinity setting. - Be a bit more defensive in task->pf_io_worker setup. - Various SQPOLL fixes. - Cleanup of SQPOLL creds handling. - Improvements to in-flight request tracking. - File registration cleanups. - Tons of cleanups and little fixes * tag 'for-5.13/io_uring-2021-04-27' of git://git.kernel.dk/linux-block: (156 commits) io_uring: maintain drain logic for multishot poll requests io_uring: Check current->io_uring in io_uring_cancel_sqpoll io_uring: fix NULL reg-buffer io_uring: simplify SQPOLL cancellations io_uring: fix work_exit sqpoll cancellations io_uring: Fix uninitialized variable up.resv io_uring: fix invalid error check after malloc io_uring: io_sq_thread() no longer needs to reset current->pf_io_worker kernel: always initialize task->pf_io_worker to NULL io_uring: update sq_thread_idle after ctx deleted io_uring: add full-fledged dynamic buffers support io_uring: implement fixed buffers registration similar to fixed files io_uring: prepare fixed rw for dynanic buffers io_uring: keep table of pointers to ubufs io_uring: add generic rsrc update with tags io_uring: add IORING_REGISTER_RSRC io_uring: enumerate dynamic resources io_uring: add generic path for rsrc update io_uring: preparation for rsrc tagging io_uring: decouple CQE filling from requests ...	2021-04-28 14:56:09 -07:00
Hao Xu	7b289c3833	io_uring: maintain drain logic for multishot poll requests Now that we have multishot poll requests, one SQE can emit multiple CQEs. given below example: sqe0(multishot poll)-->sqe1-->sqe2(drain req) sqe2 is designed to issue after sqe0 and sqe1 completed, but since sqe0 is a multishot poll request, sqe2 may be issued after sqe0's event triggered twice before sqe1 completed. This isn't what users leverage drain requests for. Here the solution is to wait for multishot poll requests fully completed. To achieve this, we should reconsider the req_need_defer equation, the original one is: all_sqes(excluding dropped ones) == all_cqes(including dropped ones) This means we issue a drain request when all the previous submitted SQEs have generated their CQEs. Now we should consider multishot requests, we deduct all the multishot CQEs except the cancellation one, In this way a multishot poll request behave like a normal request, so: all_sqes == all_cqes - multishot_cqes(except cancellations) Here we introduce cq_extra for it. Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Link: https://lore.kernel.org/r/1618298439-136286-1-git-send-email-haoxu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-27 07:38:58 -06:00
Palash Oswal	6d042ffb59	io_uring: Check current->io_uring in io_uring_cancel_sqpoll syzkaller identified KASAN: null-ptr-deref Write in io_uring_cancel_sqpoll. io_uring_cancel_sqpoll is called by io_sq_thread before calling io_uring_alloc_task_context. This leads to current->io_uring being NULL. io_uring_cancel_sqpoll should not have to deal with threads where current->io_uring is NULL. In order to cast a wider safety net, perform input sanitisation directly in io_uring_cancel_sqpoll and return for NULL value of current->io_uring. This is safe since if current->io_uring isn't set, then there's no way for the task to have submitted any requests. Reported-by: syzbot+be51ca5a4d97f017cd50@syzkaller.appspotmail.com Cc: stable@vger.kernel.org Signed-off-by: Palash Oswal <hello@oswalpalash.com> Link: https://lore.kernel.org/r/20210427125148.21816-1-hello@oswalpalash.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-27 07:37:39 -06:00
Pavel Begunkov	0b8c0e7c96	io_uring: fix NULL reg-buffer io_import_fixed() doesn't expect a registered buffer slot to be NULL and would fail stumbling on it. We don't allow it, but if during __io_sqe_buffers_update() rsrc removal succeeds but following register fails, we'll get such a situation. Do it atomically and don't remove buffers until we sure that a new one can be set. Fixes: `634d00df5e` ("io_uring: add full-fledged dynamic buffers support") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/830020f9c387acddd51962a3123b5566571b8c6d.1619446608.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-26 09:03:57 -06:00
Pavel Begunkov	9f59a9d88d	io_uring: simplify SQPOLL cancellations All sqpoll rings (even sharing sqpoll task) are currently dead bound to the task that created them, iow when owner task dies it kills all its SQPOLL rings and their inflight requests via task_work infra. It's neither the nicist way nor the most convenient as adds extra locking/waiting and dependencies. Leave it alone and rely on SIGKILL being delivered on its thread group exit, so there are only two cases left: 1) thread group is dying, so sqpoll task gets a signal and exit itself cancelling all requests. 2) an sqpoll ring is dying. Because refs_kill() is called the sqpoll not going to submit any new request, and that's what we need. And io_ring_exit_work() will do all the cancellation itself before actually killing ctx, so sqpoll doesn't need to worry about it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3cd7f166b9c326a2c932b70e71a655b03257b366.1619389911.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-26 06:59:25 -06:00
Pavel Begunkov	28090c1338	io_uring: fix work_exit sqpoll cancellations After closing an SQPOLL ring, io_ring_exit_work() kicks in and starts doing cancellations via io_uring_try_cancel_requests(). It will go through io_uring_try_cancel_iowq(), which uses ctx->tctx_list, but as SQPOLL task don't have a ctx note, its io-wq won't be reachable and so is left not cancelled. It will eventually cancelled when one of the tasks dies, but if a thread group survives for long and changes rings, it will spawn lots of unreclaimed resources and live locked works. Cancel SQPOLL task's io-wq separately in io_ring_exit_work(). Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/a71a7fe345135d684025bb529d5cb1d8d6b46e10.1619389911.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-26 06:59:25 -06:00
Colin Ian King	615cee49b3	io_uring: Fix uninitialized variable up.resv The variable up.resv is not initialized and is being checking for a non-zero value in the call to _io_register_rsrc_update. Fix this by explicitly setting the variable to 0. Addresses-Coverity: ("Uninitialized scalar variable)" Fixes: `c3bdad0271` ("io_uring: add generic rsrc update with tags") Signed-off-by: Colin Ian King <colin.king@canonical.com> Link: https://lore.kernel.org/r/20210426094735.8320-1-colin.king@canonical.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-26 06:51:09 -06:00
Pavel Begunkov	a2b4198cab	io_uring: fix invalid error check after malloc Now we allocate io_mapped_ubuf instead of bvec, so we clearly have to check its address after allocation. Fixes: `41edf1a5ec` ("io_uring: keep table of pointers to ubufs") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d28eb1bc4384284f69dbce35b9f70c115ff6176f.1619392565.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-26 06:50:35 -06:00
Stefan Metzmacher	a2a7cc32a5	io_uring: io_sq_thread() no longer needs to reset current->pf_io_worker This is done by create_io_thread() now. Signed-off-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-25 10:29:05 -06:00
Hao Xu	2b4ae19c6d	io_uring: update sq_thread_idle after ctx deleted we shall update sq_thread_idle anytime we do ctx deletion from ctx_list Fixes:734551df6f9b ("io_uring: fix shared sqpoll cancellation hangs") Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Link: https://lore.kernel.org/r/1619256380-236460-1-git-send-email-haoxu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-25 10:14:25 -06:00
Pavel Begunkov	634d00df5e	io_uring: add full-fledged dynamic buffers support Hook buffers into all rsrc infrastructure, including tagging and updates. Suggested-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/119ed51d68a491dae87eb55fb467a47870c86aad.1619356238.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-25 10:14:04 -06:00
Bijan Mottahedeh	bd54b6fe33	io_uring: implement fixed buffers registration similar to fixed files Apply fixed_rsrc functionality for fixed buffers support. Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> [rebase, remove multi-level tables, fix unregister on exit] Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/17035f4f75319dc92962fce4fc04bc0afb5a68dc.1619356238.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-25 10:14:04 -06:00
Pavel Begunkov	eae071c9b4	io_uring: prepare fixed rw for dynanic buffers With dynamic buffer updates, registered buffers in the table may change at any moment. First of all we want to prevent future races between updating and importing (i.e. io_import_fixed()), where the latter one may happen without uring_lock held, e.g. from io-wq. Save the first loaded io_mapped_ubuf buffer and reuse. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/21a2302d07766ae956640b6f753292c45200fe8f.1619356238.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-25 10:14:04 -06:00
Pavel Begunkov	41edf1a5ec	io_uring: keep table of pointers to ubufs Instead of keeping a table of ubufs convert them into pointers to ubuf, so we can atomically read one pointer and be sure that the content of ubuf won't change. Because it was already dynamically allocating imu->bvec, throw both imu and bvec into a single structure so they can be allocated together. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b96efa4c5febadeccf41d0e849ac099f4c83b0d3.1619356238.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-25 10:14:04 -06:00
Pavel Begunkov	c3bdad0271	io_uring: add generic rsrc update with tags Add IORING_REGISTER_RSRC_UPDATE, which also supports passing in rsrc tags. Implement it for registered files. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d4dc66df204212f64835ffca2c4eb5e8363f2f05.1619356238.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-25 10:14:04 -06:00
Pavel Begunkov	792e35824b	io_uring: add IORING_REGISTER_RSRC Add a new io_uring_register() opcode for rsrc registeration. Instead of accepting a pointer to resources, fds or iovecs, it @arg is now pointing to a struct io_uring_rsrc_register, and the second argument tells how large that struct is to make it easily extendible by adding new fields. All that is done mainly to be able to pass in a pointer with tags. Pass it in and enable CQE posting for file resources. Doesn't support setting tags on update yet. A design choice made here is to not post CQEs on rsrc de-registration, but only when we updated-removed it by rsrc dynamic update. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c498aaec32a4bb277b2406b9069662c02cdda98c.1619356238.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-25 10:14:04 -06:00
Pavel Begunkov	fdecb66281	io_uring: enumerate dynamic resources As resources are getting more support and common parts, it'll be more convenient to index resources and use it for indexing. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f0be63e9310212d5601d36277c2946ff7a040485.1619356238.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-25 10:14:04 -06:00
Pavel Begunkov	98f0b3b4f1	io_uring: add generic path for rsrc update Extract some common parts for rsrc update, will be used reg buffers support dynamic (i.e. quiesce-lee) managing. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b49c3ff6b9ff0e530295767604fe4de64d349e04.1619356238.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-25 10:14:04 -06:00
Pavel Begunkov	b60c8dce33	io_uring: preparation for rsrc tagging We need a way to notify userspace when a lazily removed resource actually died out. This will be done by associating a tag, which is u64 exactly like req->user_data, with each rsrc (e.g. buffer of file). A CQE will be posted once a resource is actually put down. Tag 0 is a special value set by default, for whcih it don't generate an CQE, so providing the old behaviour. Don't expose it to the userspace yet, but prepare internally, allocate buffers, add all posting hooks, etc. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2e6beec5eabe7216bb61fb93cdf5aaf65812a9b0.1619356238.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-25 10:14:04 -06:00
Pavel Begunkov	d4d19c19d6	io_uring: decouple CQE filling from requests Make __io_cqring_fill_event() agnostic of struct io_kiocb, pass all the data needed directly into it. Will be used to post rsrc removal completions, which don't have an associated request. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c9b8da9e42772db2033547dfebe479dc972a0f2c.1619356238.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-25 10:14:04 -06:00
Pavel Begunkov	44b31f2fa2	io_uring: return back rsrc data free helper Add io_rsrc_data_free() helper for destroying rsrc_data, easier for search and the function will get more stuff to destroy shortly. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/562d1d53b5ff184f15b8949a63d76ef19c4ba9ec.1619356238.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-25 10:14:04 -06:00
Pavel Begunkov	fff4db76be	io_uring: move __io_sqe_files_unregister A preparation patch moving __io_sqe_files_unregister() definition closer to other "files" functions without any modification. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/95caf17fe837e67bd1f878395f07049062a010d4.1619356238.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-25 10:14:04 -06:00
Hao Xu	724cb4f9ec	io_uring: check sqring and iopoll_list before shedule do this to avoid race below: userspace kernel \| check sqring and iopoll_list submit sqe \| check IORING_SQ_NEED_WAKEUP \| (which is not set) \| \| \| set IORING_SQ_NEED_WAKEUP wait cqe \| schedule(never wakeup again) Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Link: https://lore.kernel.org/r/1619018351-75883-1-git-send-email-haoxu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-23 08:26:41 -06:00
Pavel Begunkov	f2a48dd09b	io_uring: refactor io_sq_offload_create() Just a bit of code tossing in io_sq_offload_create(), so it looks a bit better. No functional changes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/939776f90de8d2cdd0414e1baa29c8ec0926b561.1618916549.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-20 12:55:28 -06:00
Pavel Begunkov	07db298a1c	io_uring: safer sq_creds putting Put sq_creds as a part of io_ring_ctx_free(), it's easy to miss doing it in io_sq_thread_finish(), especially considering past mistakes related to ring creation failures. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3becb1866467a1de82a97345a0a90d7fb8ff875e.1618916549.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-20 12:55:28 -06:00
Pavel Begunkov	3a0a690235	io_uring: move inflight un-tracking into cleanup REQ_F_INFLIGHT deaccounting doesn't do any spinlocking or resource freeing anymore, so it's safe to move it into the normal cleanup flow, i.e. into io_clean_op(), so making it cleaner. Also move io_req_needs_clean() to be first in io_dismantle_req() so it doesn't reload req->flags. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/90653a3a5de4107e3a00536fa4c2ea5f2c38a4ac.1618916549.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-20 12:55:28 -06:00
Pavel Begunkov	734551df6f	io_uring: fix shared sqpoll cancellation hangs [ 736.982891] INFO: task iou-sqp-4294:4295 blocked for more than 122 seconds. [ 736.982897] Call Trace: [ 736.982901] schedule+0x68/0xe0 [ 736.982903] io_uring_cancel_sqpoll+0xdb/0x110 [ 736.982908] io_sqpoll_cancel_cb+0x24/0x30 [ 736.982911] io_run_task_work_head+0x28/0x50 [ 736.982913] io_sq_thread+0x4e3/0x720 We call io_uring_cancel_sqpoll() one by one for each ctx either in sq_thread() itself or via task works, and it's intended to cancel all requests of a specified context. However the function uses per-task counters to track the number of inflight requests, so it counts more requests than available via currect io_uring ctx and goes to sleep for them to appear (e.g. from IRQ), that will never happen. Cancel a bit more than before, i.e. all ctxs that share sqpoll and continue to use shared counters. Don't forget that we should not remove ctx from the list before running that task_work sqpoll-cancel, otherwise the function wouldn't be able to find the context and will hang. Reported-by: Joakim Hassila <joj@mac.com> Reported-by: Jens Axboe <axboe@kernel.dk> Fixes: `37d1e2e364` ("io_uring: move SQPOLL thread io-wq forked worker") Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1bded7e6c6b32e0bae25fce36be2868e46b116a0.1618752958.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-19 11:34:32 -06:00
Pavel Begunkov	3b763ba1c7	io_uring: remove extra sqpoll submission halting SQPOLL task won't submit requests for a context that is currently dying, so no need to remove ctx from sqd_list prior the main loop of io_ring_exit_work(). Kill it, will be removed by io_sq_thread_finish() and only brings confusion and lockups. Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f220c2b786ba0f9499bebc9f3cd9714d29efb6a5.1618752958.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-19 11:34:32 -06:00
Pavel Begunkov	75c4021aac	io_uring: check register restriction afore quiesce Move restriction checks of __io_uring_register() before quiesce, saves from waiting for requests in fail case and simplifies the code a bit. Also add array_index_nospec() for safety Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/88d7913c9280ee848fdb7b584eea37a465391cee.1618488258.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-17 19:20:08 -06:00
Pavel Begunkov	38134ada0c	io_uring: fix overflows checks in provide buffers Colin reported before possible overflow and sign extension problems in io_provide_buffers_prep(). As Linus pointed out previous attempt did nothing useful, see `d81269fecb` ("io_uring: fix provide_buffers sign extension"). Do that with help of check_<op>_overflow helpers. And fix struct io_provide_buf::len type, as it doesn't make much sense to keep it signed. Reported-by: Colin Ian King <colin.king@canonical.com> Fixes: `efe68c1ca8` ("io_uring: validate the full range of provided buffers for access") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/46538827e70fce5f6cdb50897cff4cacc490f380.1618488258.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-17 19:20:07 -06:00
Pavel Begunkov	c82d5bc703	io_uring: don't fail submit with overflow backlog Don't fail submission attempts if there are CQEs in the overflow backlog, but give away the decision making to the userspace. It might be very inconvenient to the userspace, especially if submission and completion are done by different threads. We can remove it because of recent changes, where requests are now not locked by the backlog, backlog entries are allocated separately, so they take less space and cgroup accounted. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-17 19:19:41 -06:00
Jens Axboe	a7be7c23cf	io_uring: fix merge error for async resubmit A hand-edit while applying this patch on top of a new base resulted in a reverted check for re-issue, resulting in spurious -EAGAIN errors. Fixes: `8c130827f4` ("io_uring: don't alter iopoll reissue fail ret code") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-16 09:47:08 -06:00
Jens Axboe	75652a30ff	io_uring: tie req->apoll to request lifetime We manage these separately right now, just tie it to the request lifetime and make it be part of the usual REQ_F_NEED_CLEANUP logic. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-16 09:47:02 -06:00
Jens Axboe	4e3d9ff905	io_uring: put flag checking for needing req cleanup in one spot We have this in two spots right now, which is a bit fragile. In preparation for moving REQ_F_POLLED cleanup into the same spot, move the check into a separate helper so we only have it once. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-16 09:45:47 -06:00
Jens Axboe	ea6a693d86	io_uring: disable multishot poll for double poll add cases The re-add handling isn't correct for the multi wait case, so let's just disable it for now explicitly until we can get that sorted out. This just turns it into a one-shot request. Since we pass back whether or not a poll request terminates in multishot mode on completion, this should not break properly behaving applications that check for IORING_CQE_F_MORE on completion. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-15 20:17:11 -06:00
Pavel Begunkov	c7d95613c7	io_uring: fix early sqd_list removal sqpoll hangs [ 245.463317] INFO: task iou-sqp-1374:1377 blocked for more than 122 seconds. [ 245.463334] task:iou-sqp-1374 state:D flags:0x00004000 [ 245.463345] Call Trace: [ 245.463352] __schedule+0x36b/0x950 [ 245.463376] schedule+0x68/0xe0 [ 245.463385] __io_uring_cancel+0xfb/0x1a0 [ 245.463407] do_exit+0xc0/0xb40 [ 245.463423] io_sq_thread+0x49b/0x710 [ 245.463445] ret_from_fork+0x22/0x30 It happens when sqpoll forgot to run park_task_work and goes to exit, then exiting user may remove ctx from sqd_list, and so corresponding io_sq_thread() -> io_uring_cancel_sqpoll() won't be executed. Hopefully it just stucks in do_exit() in this case. Fixes: `dbe1bdbb39` ("io_uring: handle signals for IO threads like a normal thread") Reported-by: Joakim Hassila <joj@mac.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-14 13:07:27 -06:00
Pavel Begunkov	c5de00366e	io_uring: move poll update into remove not add Having poll update function as a part of IORING_OP_POLL_ADD is not great, we have to do hack around struct layouts and add some overhead in the way of more popular POLL_ADD. Even more serious drawback is that POLL_ADD requires file and always grabs it, and so poll update, which doesn't need it. Incorporate poll update into IORING_OP_POLL_REMOVE instead of IORING_OP_POLL_ADD. It also more consistent with timeout remove/update. Fixes: `b69de288e9` ("io_uring: allow events and user_data update of running poll requests") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-14 10:43:49 -06:00
Pavel Begunkov	9096af3e9c	io_uring: add helper for parsing poll events Isolate poll mask SQE parsing and preparations into a new function, which will be reused shortly. Fixes: `b69de288e9` ("io_uring: allow events and user_data update of running poll requests") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-14 10:43:47 -06:00
Pavel Begunkov	9ba5fac8cf	io_uring: fix POLL_REMOVE removing apoll Don't allow REQ_OP_POLL_REMOVE to kill apoll requests, users should not know about it. Also, remove weird -EACCESS in io_poll_update(), it shouldn't know anything about apoll, and have to work even if happened to have a poll and an async poll'ed request with same user_data. Fixes: `b69de288e9` ("io_uring: allow events and user_data update of running poll requests") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-14 10:43:42 -06:00
Pavel Begunkov	7f00651aeb	io_uring: refactor io_ring_exit_work() Don't reinit io_ring_exit_work()'s exit work/completions on each iteration, that's wasteful. Also add list_rotate_left(), so if we failed to complete the task job, we don't try it again and again but defer it until others are processed. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-14 10:42:31 -06:00
Pavel Begunkov	f39c8a5b11	io_uring: inline io_iopoll_getevents() io_iopoll_getevents() is of no use to us anymore, io_iopoll_check() handles all the cases. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7e50b8917390f38bee4f822c6f4a6a98a27be037.1618278933.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-13 09:37:55 -06:00
Pavel Begunkov	e9979b36a4	io_uring: skip futile iopoll iterations The only way to get out of io_iopoll_getevents() and continue iterating is to have empty iopoll_list, otherwise the main loop would just exit. So, instead of the unlock on 8th time heuristic, do that based on iopoll_list. Also, as no one can add new requests to iopoll_list while io_iopoll_check() hold uring_lock, it's useless to spin with the list empty, return in that case. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/5b8ebe84f5fff7ffa1f708952dfef7fc78b668e2.1618278933.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-13 09:37:55 -06:00
Pavel Begunkov	cce4b8b0ce	io_uring: don't fail overflow on in_idle As CQE overflows are now untied from requests and so don't hold any ref, we don't need to handle exiting/exec'ing cases there anymore. Moreover, it's much nicer in regards to userspace to save overflowed CQEs whenever possible, so remove failing on in_idle. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d873b7dab75c7f3039ead9628a745bea01f2cfd2.1618278933.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-13 09:37:55 -06:00
Pavel Begunkov	e31001a3ab	io_uring: clean up io_poll_remove_waitqs() Move some parts of io_poll_remove_waitqs() that are opcode independent. Looks better and stresses that both do __io_poll_remove_one(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/bbc717f82117cc335c89cbe67ec8d72608178732.1618278933.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-13 09:37:55 -06:00
Pavel Begunkov	fd9c7bc542	io_uring: refactor hrtimer_try_to_cancel uses Don't save return values of hrtimer_try_to_cancel() in a variable, but use right away. It's in general safer to not have an intermediate variable, which may be reused and passed out wrongly, but it be contracted out. Also clean io_timeout_extract(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d2566ef7ce632e6882dc13e022a26249b3fd30b5.1618278933.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-13 09:37:55 -06:00
Pavel Begunkov	8c855885b8	io_uring: add timeout completion_lock annotation Add one more sparse locking annotation for readability in io_kill_timeout(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/bdbb22026024eac29203c1aa0045c4954a2488d1.1618278933.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-13 09:37:54 -06:00
Pavel Begunkov	9d8058926b	io_uring: split poll and poll update structures struct io_poll_iocb became pretty nasty combining also update fields. Split them, so we would have more clarity to it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b2f74d64ffebb57a648f791681af086c7211e3a4.1618278933.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-13 09:37:54 -06:00
Pavel Begunkov	66d2d00d0a	io_uring: fix uninit old data for poll event upd Both IORING_POLL_UPDATE_EVENTS and IORING_POLL_UPDATE_USER_DATA need old_user_data to find/cancel a poll request, but it's set only for the first one. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/ab08fd35b7652e977f9a475f01741b04102297f1.1618278933.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-13 09:37:54 -06:00
Pavel Begunkov	084804002e	io_uring: fix leaking reg files on exit If io_sqe_files_unregister() faults on io_rsrc_ref_quiesce(), it will fail to do unregister leaving files referenced. And that may well happen because of a strayed signal or just because it does allocations inside. In io_ring_ctx_free() do an unsafe version of unregister, as it's guaranteed to not have requests by that point and so quiesce is useless. Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e696e9eade571b51997d0dc1d01f144c6d685c05.1618278933.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-13 09:37:54 -06:00
Pavel Begunkov	f70865db5f	io_uring: return back safer resurrect Revert of revert of "io_uring: wait potential ->release() on resurrect", which adds a helper for resurrect not racing completion reinit, as was removed because of a strange bug with no clear root or link to the patch. Was improved, instead of rcu_synchronize(), just wait_for_completion() because we're at 0 refs and it will happen very shortly. Specifically use non-interruptible version to ignore all pending signals that may have ended prior interruptible wait. This reverts commit `cb5e1b8130`. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7a080c20f686d026efade810b116b72f88abaff9.1618101759.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-12 09:33:10 -06:00
Pavel Begunkov	e4335ed33e	io_uring: improve hardlink code generation req_set_fail_links() condition checking is bulky. Even though it's always in a slow path, it's inlined and generates lots of extra code, simplify it be moving HARDLINK checking into helpers killing linked requests. text data bss dec hex filename before: 79318 12330 8 91656 16608 ./fs/io_uring.o after: 79126 12330 8 91464 16548 ./fs/io_uring.o Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/96a9387db658a9d5a44ecbfd57c2a62cb888c9b6.1618101759.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-12 09:33:07 -06:00
Pavel Begunkov	88885f66e8	io_uring: improve sqo stop Set IO_SQ_THREAD_SHOULD_STOP before taking sqd lock, so the sqpoll task sees earlier. Not a problem, it will stop eventually. Also check invariant that it's stopped only once. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/653b24ee93843a50ff65a45847d9138f5adb76d7.1618101759.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-12 09:33:04 -06:00
Pavel Begunkov	aeca241b0b	io_uring: split file table from rsrc nodes We don't need to store file tables in rsrc nodes, for now it's easier to handle tables not generically, so move file tables into the context. A nice side effect is having one less pointer dereference for request with fixed file initialisation. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/de9fc4cd3545f24c26c03be4556f58ba3d18b9c3.1618101759.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-12 09:33:01 -06:00
Pavel Begunkov	87094465d0	io_uring: cleanup buffer register In preparation for more changes do a little cleanup of io_sqe_buffers_register(). Move all args/invariant checking into it from io_buffers_map_alloc(), because it's confusing. And add a bit more cleaning for the loop. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/93292cb9708c8455e5070cc855861d94e11ca042.1618101759.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-12 09:32:58 -06:00
Pavel Begunkov	7f61a1e9ef	io_uring: add buffer unmap helper Add a helper for unmapping registered buffers, better than double indexing and will be reused in the future. Suggested-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/66cbc6ea863be865bac7b7080ed6a3d5c542b71f.1618101759.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-12 09:32:55 -06:00
Pavel Begunkov	3e9424989b	io_uring: simplify io_rsrc_data refcounting We don't take many references of struct io_rsrc_data, only one per each io_rsrc_node, so using percpu refs is overkill. Use atomic ref instead, which is much simpler. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1551d90f7c9b183cf2f0d7b5e5b923430acb03fa.1618101759.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-12 09:32:47 -06:00
Jens Axboe	a1ff1e3f0e	io_uring: provide io_resubmit_prep() stub for !CONFIG_BLOCK Randy reports the following error on CONFIG_BLOCK not being set: ../fs/io_uring.c: In function ‘kiocb_done’: ../fs/io_uring.c:2766:7: error: implicit declaration of function ‘io_resubmit_prep’; did you mean ‘io_put_req’? [-Werror=implicit-function-declaration] if (io_resubmit_prep(req)) { Provide a dummy stub for io_resubmit_prep() like we do for io_rw_should_reissue(), which also helps remove an ifdef sequence from io_complete_rw_iopoll() as well. Fixes: `8c130827f4` ("io_uring: don't alter iopoll reissue fail ret code") Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-12 06:40:02 -06:00
Pavel Begunkov	8d13326e56	io_uring: optimise fill_event() by inlining There are three cases where we much care about performance of io_cqring_fill_event() -- flushing inline completions, iopoll and io_req_complete_post(). Inline a hot part of fill_event() into them. All others are not as important and we don't want to bloat binary for them, so add a noinline version of the function for all other use use cases. nops test(batch=32): 16.932 vs 17.822 KIOPS Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/a11d59424bf4417aca33f5ec21008bb3b0ebd11e.1618101759.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 19:30:41 -06:00
Pavel Begunkov	ff64216423	io_uring: always pass cflags into fill_event() A simple preparation patch inlining io_cqring_fill_event(), which only role was to pass cflags=0 into an actual fill event. It helps to keep number of related helpers sane in following patches. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/704f9c85b7d9843e4ad50a9f057200c58f5adc6e.1618101759.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 19:30:41 -06:00
Pavel Begunkov	44c769de6f	io_uring: optimise non-eventfd post-event Eventfd is not the canonical way of using io_uring, annotate io_should_trigger_evfd() with likely so it improves code generation for non-eventfd branch. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/42fdaa51c68d39479f02cef4fe5bcb24624d60fa.1618101759.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 19:30:41 -06:00
Pavel Begunkov	4af3417a34	io_uring: refactor compat_msghdr import Add an entry for user pointer to compat_msghdr into io_connect, so it's explicit that we may use it as this, and removes annoying casts. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/73fd644dea1518f528d3648981cf777ce6e537e9.1618101759.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 19:30:41 -06:00
Pavel Begunkov	0bdf3398b0	io_uring: enable inline completion for more cases Take advantage of delayed/inline completion flushing and pass right issue flags for completion of open, open2, fadvise and poll remove opcodes. All others either already use it or always punted and never executed inline. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0badc7512e82f7350b73bb09abbebbecbdd5dab8.1618101759.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 19:30:41 -06:00
Pavel Begunkov	a1fde923e3	io_uring: refactor io_close A small refactoring shrinking it and making easier to read. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/19b24eed7cd491a0243b50366dd2a23b558e2665.1618101759.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 19:30:41 -06:00
Pavel Begunkov	3f48cf18f8	io_uring: unify files and task cancel Now __io_uring_cancel() and __io_uring_files_cancel() are very similar and mostly differ by how we count requests, merge them and allow tctx_inflight() to handle counting. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1a5986a97df4dc1378f3fe0ca1eb483dbcf42112.1618101759.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 19:30:41 -06:00
Pavel Begunkov	b303fe2e5a	io_uring: track inflight requests through counter Instead of keeping requests in a inflight_list, just track them with a per tctx atomic counter. Apart from it being much easier and more consistent with task cancel, it frees ->inflight_entry from being shared between iopoll and cancel-track, so less headache for us. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3c2ee0863cd7eeefa605f3eaff4c1c461a6f1157.1618101759.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 19:30:41 -06:00
Pavel Begunkov	368b208085	io_uring: unify task and files cancel loops Move tracked inflight number check up the stack into __io_uring_files_cancel() so it's similar to task cancel. Will be used for further cleaning. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/dca5a395efebd1e3e0f3bbc6b9640c5e8aa7e468.1618101759.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 19:30:40 -06:00
Pavel Begunkov	0ea13b448e	io_uring: simplify apoll hash removal hash_del() works well with non-hashed nodes, there's no need to check if it is hashed first. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 19:30:40 -06:00
Pavel Begunkov	e27414bef7	io_uring: refactor io_poll_complete() Remove error parameter from io_poll_complete(), 0 is always passed, and do a bit of cleaning on top. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 19:30:40 -06:00
Pavel Begunkov	f40b964a66	io_uring: clean up io_poll_task_func() io_poll_complete() always fills an event (even an overflowed one), so we always should do io_cqring_ev_posted() afterwards. And that's what is currently happening, because second EPOLLONESHOT check is always true, it can't return !done for oneshots. Remove those branching, it's much easier to read. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 19:30:40 -06:00
Jens Axboe	cb3b200e4f	io_uring: don't attempt re-add of multishot poll request if racing We currently allow racy updates to multishot requests, but we can end up double adding the poll request if both completion and update does it. Ensure that we skip re-add on the update side if someone else is completing it. Fixes: `b69de288e9` ("io_uring: allow events and user_data update of running poll requests") Reported-by: Joakim Hassila <joj@mac.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 19:30:35 -06:00
Pavel Begunkov	53a3126756	io_uring: kill outdated comment about splice punt The splice/tee comment in io_prep_async_work() isn't relevant since the section was moved, delete it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/892a549c89c3d422b679677b8e68ffd3fcb736b6.1617287883.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 19:30:35 -06:00
Pavel Begunkov	a04b0ac0cb	io_uring: encapsulate fixed files into struct Add struct io_fixed_file representing a single registered file, first to hide ugly struct file *, which may be misleading, and secondly to retype it to unsigned long as conversions to it and back to file for handling and masking FFS_* flags are getting nasty. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/78669731a605a7614c577c3de552631cfaf0869a.1617287883.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 19:30:35 -06:00
Pavel Begunkov	846a4ef22b	io_uring: refactor file tables alloc/free Introduce a heler io_free_file_tables() doing all the cleaning, there are several places where it's hand coded. Also move all allocations into io_sqe_alloc_file_tables() and rename it, so all of it is in one place. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/502a84ebf41ff119b095e59661e678eacb752bf8.1617287883.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 19:30:35 -06:00
Pavel Begunkov	f4f7d21ce4	io_uring: don't quiesce intial files register There is no reason why we would want to fully quiesce ring on IORING_REGISTER_FILES, if it's already registered we fail. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/563bb8060bb2d3efbc32fce6101678281c574d2a.1617287883.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 19:30:35 -06:00
Pavel Begunkov	9a321c9849	io_uring: set proper FFS* flags on reg file update Set FFS_* flags (e.g. FFS_ASYNC_READ) not only in initial registration but also on registered files update. Not a bug, but may miss getting profit out of the feature. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/df29a841a2d3d3695b509cdffce5070777d9d942.1617287883.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 19:30:35 -06:00
Pavel Begunkov	044118069a	io_uring: deduplicate NOSIGNAL setting Set MSG_NOSIGNAL and REQ_F_NOWAIT in send/recv prep routines and don't duplicate it in all four send/recv handlers. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e1133a3ed1c0e192975b7341ea4b0bf91f63b132.1617287883.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 19:30:35 -06:00
Pavel Begunkov	df9727affa	io_uring: put link timeout req consistently Don't put linked timeout req in io_async_find_and_cancel() but do it in io_link_timeout_fn(), so we have only one point for that and won't have to do it differently as it's now (put vs put_deferred). Btw, improve a bit io_async_find_and_cancel()'s locking. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d75b70957f245275ab7cba83e0ac9c1b86aae78a.1617287883.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 19:30:34 -06:00
Pavel Begunkov	c4ea060e85	io_uring: simplify overflow handling Overflowed CQEs doesn't lock requests anymore, so we don't care so much about cancelling them, so kill cq_overflow_flushed and simplify the code. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/5799867aeba9e713c32f49aef78e5e1aef9fbc43.1617287883.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 19:30:34 -06:00
Pavel Begunkov	e07785b002	io_uring: lock annotate timeouts and poll Add timeout and poll ->comletion_lock annotations for Sparse, makes life easier while looking at the functions. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2345325643093d41543383ba985a735aeb899eac.1617287883.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 19:30:34 -06:00
Pavel Begunkov	47e90392c8	io_uring: kill unused forward decls Kill unused forward declarations for io_ring_file_put() and io_queue_next(). Also btw rename the first one. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/64aa27c3f9662e14615cc119189f5eaf12989671.1617287883.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 19:30:34 -06:00
Pavel Begunkov	4751f53d74	io_uring: store reg buffer end instead of length It's a bit more convenient for us to store a registered buffer end address instead of length, see struct io_mapped_ubuf, as it allow to not recompute it every time. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/39164403fe92f1dc437af134adeec2423cdf9395.1617287883.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 19:30:34 -06:00
Pavel Begunkov	75769e3f73	io_uring: improve import_fixed overflow checks Replace a hand-coded overflow check with a specialised function. Even though compilers are smart enough to generate identical binary (i.e. check carry bit), but it's more foolproof and conveys the intention better. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e437dcdc929bacbb6f11a4824ecbbf17225cb82a.1617287883.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 19:30:34 -06:00
Pavel Begunkov	0aec38fda2	io_uring: refactor io_async_cancel() Remove extra tctx==NULL checks that are already done by io_async_cancel_one(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/70c2a8b958d942e86958a28af0452966ce1095b0.1617287883.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 19:30:34 -06:00
Pavel Begunkov	e146a4a3f6	io_uring: remove unused hash_wait No users of io_uring_ctx::hash_wait left, kill it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e25cb83c233a5f75f15275596b49fbafbea606fa.1617287883.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 19:30:34 -06:00
Pavel Begunkov	7394161cb8	io_uring: better ref handling in poll_remove_one Instead of io_put_req() to drop not a final ref, use req_ref_put(), which is slimmer and will also check the invariant. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/85b5774ce13ae55cc2e705abdc8cbafe1212f1bd.1617287883.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 19:30:34 -06:00
Pavel Begunkov	89b5066ea1	io_uring: combine lock/unlock sections on exit io_ring_exit_work() already does uring_lock lock/unlock, no need to repeat it for lock waiting trick in io_ring_ctx_free(). Move the waiting with comments and spinlocking into io_ring_exit_work. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/a8ae0589b0ea64ad4791e2c282e4e9b713dd7024.1617287883.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 19:30:34 -06:00
Pavel Begunkov	215c390260	io_uring: remove useless is_dying check on quiesce rsrc_data refs should always be valid for potential submitters, io_rsrc_ref_quiesce() restores it before unlocking, so percpu_ref_is_dying() check in io_sqe_files_unregister() does nothing and misleading. Concurrent quiesce is prevented with struct io_rsrc_data::quiesce. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/bf97055e1748ee3a382e66daf384a469eb90b931.1617287883.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 19:30:34 -06:00
Pavel Begunkov	28a9fe2521	io_uring: reuse io_rsrc_node_destroy() Reuse io_rsrc_node_destroy() in __io_rsrc_put_work(). Also move it to a more appropriate place -- to the other node routines, and remove forward declaration. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/cccafba41aee1e5bb59988704885b1340aef3a27.1617287883.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 19:30:34 -06:00
Pavel Begunkov	a7f0ed5acd	io_uring: ctx-wide rsrc nodes If we're going to ever support multiple types of resources we need shared rsrc nodes to not bloat requests, that is implemented in this patch. It also gives a nicer API and saves one pointer dereference in io_req_set_rsrc_node(). We may say that all requests bound to a resource belong to one and only one rsrc node, and considering that nodes are removed and recycled strictly in-order, this separates requests into generations, where generation are changed on each node switch (i.e. io_rsrc_node_switch()). The API is simple, io_rsrc_node_switch() switches to a new generation if needed, and also optionally kills a passed in io_rsrc_data. Each call to io_rsrc_node_switch() have to be preceded with io_rsrc_node_switch_start(). The start function is idempotent and should not necessarily be followed by switch. One difference is that once a node was set it will always retain a valid rsrc node, even on unregister. It may be a nuisance at the moment, but makes much sense for multiple types of resources. Another thing changed is that nodes are bound to/associated with a io_rsrc_data later just before killing (i.e. switching). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7e9c693b4b9a2f47aa784b616ce29843021bb65a.1617287883.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 19:30:34 -06:00
Pavel Begunkov	e7c78371bb	io_uring: refactor io_queue_rsrc_removal() Pass rsrc_node into io_queue_rsrc_removal() explicitly. Just a simple preparation patch, makes following changes nicer. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/002889ce4de7baf287f2b010eef86ffe889174c6.1617287883.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 19:30:34 -06:00
Pavel Begunkov	40ae0ff70f	io_uring: move rsrc_put callback into io_rsrc_data io_rsrc_node's callback operates only on a single io_rsrc_data and only with its resources, so rsrc_put() callback is actually a property of io_rsrc_data. Move it there, it makes code much nicecr. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9417c2fba3c09e8668f05747006a603d416d34b4.1617287883.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 19:30:34 -06:00
Pavel Begunkov	82fbcfa996	io_uring: encapsulate rsrc node manipulations io_rsrc_node_get() and io_rsrc_node_set() are always used together, merge them into one so most users don't even see io_rsrc_node and don't need to care about it. It helped to catch io_sqe_files_register() inferring rsrc data argument for get and set differently, not a problem but a good sign. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0827b080b2e61b3dec795380f7e1a1995595d41f.1617287883.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 19:30:34 -06:00
Pavel Begunkov	f3baed3992	io_uring: use rsrc prealloc infra for files reg Keep it consistent with update and use io_rsrc_node_prealloc() + io_rsrc_node_get() in io_sqe_files_register() as well, that will be used in future patches, not as error prone and allows to deduplicate rsrc_node init. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/cf87321e6be5e38f4dc7fe5079d2aa6945b1ace0.1617287883.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 19:30:34 -06:00
Pavel Begunkov	221aa92409	io_uring: simplify io_rsrc_node_ref_zero Replace queue_delayed_work() with mod_delayed_work() in io_rsrc_node_ref_zero() as the later one can schedule a new work, and cleanup it further for better readability. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3b2b23e3a1ea4bbf789cd61815d33e05d9ff945e.1617287883.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 19:30:34 -06:00
Pavel Begunkov	b895c9a632	io_uring: name rsrc bits consistently Keep resource related structs' and functions' naming consistent, in particular use "io_rsrc" prefix for everything. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/962f5acdf810f3a62831e65da3932cde24f6d9df.1617287883.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 19:30:34 -06:00
Jens Axboe	b2e720ace2	io_uring: fix race around poll update and poll triggering Joakim reports that in some conditions he sees a multishot poll request being canceled, and that it coincides with getting -EALREADY on modification. As part of the poll update procedure, there's a small window where the request is marked as canceled, and if this coincides with the event actually triggering, then we can get a spurious -ECANCELED and termination of the multishot request. Don't mark the poll request as being canceled for update. We also don't care if we race on removal unless it's a one-shot request, we can safely updated for either case. Fixes: `b69de288e9` ("io_uring: allow events and user_data update of running poll requests") Reported-by: Joakim Hassila <joj@mac.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 19:30:17 -06:00
Pavel Begunkov	50e96989d7	io_uring: reg buffer overflow checks hardening We are safe with overflows in io_sqe_buffer_register() because it will just yield alloc failure, but it's nicer to check explicitly. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2b0625551be3d97b80a5fd21c8cd79dc1c91f0b5.1616624589.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:42:00 -06:00
Jens Axboe	548d819d1e	io_uring: allow SQPOLL without CAP_SYS_ADMIN or CAP_SYS_NICE Now that we have any worker being attached to the original task as threads, accounting of CPU time is directly attributed to the original task as well. This means that we no longer have to restrict SQPOLL to needing elevated privileges, as it's really no different from just having the task spawn a busy looping thread in userspace. Reported-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:42:00 -06:00
Jens Axboe	685fe7feed	io-wq: eliminate the need for a manager thread io-wq relies on a manager thread to create/fork new workers, as needed. But there's really no strong need for it anymore. We have the following cases that fork a new worker: 1) Work queue. This is done from the task itself always, and it's trivial to create a worker off that path, if needed. 2) All workers have gone to sleep, and we have more work. This is called off the sched out path. For this case, use a task_work items to queue a fork-worker operation. 3) Hashed work completion. Don't think we need to do anything off this case. If need be, it could just use approach 2 as well. Part of this change is incrementing the running worker count before the fork, to avoid cases where we observe we need a worker and then queue creation of one. Then new work comes in, we fork a new one. That last queue operation should have waited for the previous worker to come up, it's quite possible we don't even need it. Hence move the worker running from before we fork it off to more efficiently handle that case. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:42:00 -06:00
Jens Axboe	b69de288e9	io_uring: allow events and user_data update of running poll requests This adds two new POLL_ADD flags, IORING_POLL_UPDATE_EVENTS and IORING_POLL_UPDATE_USER_DATA. As with the other POLL_ADD flag, these are masked into sqe->len. If set, the POLL_ADD will have the following behavior: - sqe->addr must contain the the user_data of the poll request that needs to be modified. This field is otherwise invalid for a POLL_ADD command. - If IORING_POLL_UPDATE_EVENTS is set, sqe->poll_events must contain the new mask for the existing poll request. There are no checks for whether these are identical or not, if a matching poll request is found, then it is re-armed with the new mask. - If IORING_POLL_UPDATE_USER_DATA is set, sqe->off must contain the new user_data for the existing poll request. A POLL_ADD with any of these flags set may complete with any of the following results: 1) 0, which means that we successfully found the existing poll request specified, and performed the re-arm procedure. Any error from that re-arm will be exposed as a completion event for that original poll request, not for the update request. 2) -ENOENT, if no existing poll request was found with the given user_data. 3) -EALREADY, if the existing poll request was already in the process of being removed/canceled/completing. 4) -EACCES, if an attempt was made to modify an internal poll request (eg not one originally issued ass IORING_OP_POLL_ADD). The usual -EINVAL cases apply as well, if any invalid fields are set in the sqe for this command type. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:42:00 -06:00
Jens Axboe	b2cb805f6d	io_uring: abstract out a io_poll_find_helper() We'll need this helper for another purpose, for now just abstract it out and have io_poll_cancel() use it for lookups. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:42:00 -06:00
Jens Axboe	5082620fb2	io_uring: terminate multishot poll for CQ ring overflow If we hit overflow and fail to allocate an overflow entry for the completion, terminate the multishot poll mode. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:59 -06:00
Jens Axboe	b2c3f7e171	io_uring: abstract out helper for removing poll waitqs/hashes No functional changes in this patch, just preparation for kill multishot poll on CQ overflow. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:59 -06:00
Jens Axboe	88e41cf928	io_uring: add multishot mode for IORING_OP_POLL_ADD The default io_uring poll mode is one-shot, where once the event triggers, the poll command is completed and won't trigger any further events. If we're doing repeated polling on the same file or socket, then it can be more efficient to do multishot, where we keep triggering whenever the event becomes true. This deviates from the usual norm of having one CQE per SQE submitted. Add a CQE flag, IORING_CQE_F_MORE, which tells the application to expect further completion events from the submitted SQE. Right now the only user of this is POLL_ADD in multishot mode. Since sqe->poll_events is using the space that we normally use for adding flags to commands, use sqe->len for the flag space for POLL_ADD. Multishot mode is selected by setting IORING_POLL_ADD_MULTI in sqe->len. An application should expect more CQEs for the specificed SQE if the CQE is flagged with IORING_CQE_F_MORE. In multishot mode, only cancelation or an error will terminate the poll request, in which case the flag will be cleared. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:59 -06:00
Jens Axboe	7471e1afab	io_uring: include cflags in completion trace event We should be including the completion flags for better introspection on exactly what completion event was logged. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:59 -06:00
Pavel Begunkov	6c2450ae55	io_uring: allocate memory for overflowed CQEs Instead of using a request itself for overflowed CQE stashing, allocate a separate entry. The disadvantage is that the allocation may fail and it will be accounted as lost (see rings->cq_overflow), so we lose reliability in case of memory pressure if the application is driving the CQ ring into overflow. However, it opens a way for for multiple CQEs per an SQE and even generating SQE-less CQEs. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> [axboe: use GFP_ATOMIC \| __GFP_ACCOUNT] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:59 -06:00
Jens Axboe	464dca612b	io_uring: mask in error/nval/hangup consistently for poll Instead of masking these in as part of regular POLL_ADD prep, do it in io_init_poll_iocb(), and include NVAL as that's generally unmaskable, and RDHUP alongside the HUP that is already set. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:59 -06:00
Pavel Begunkov	9532b99bd9	io_uring: optimise rw complete error handling Expect read/write to succeed and create a hot path for this case, in particular hide all error handling with resubmission under a single check with the desired result. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:59 -06:00
Pavel Begunkov	ab454438aa	io_uring: hide iter revert in resubmit_prep Move iov_iter_revert() resetting iterator in case of -EIOCBQUEUED into io_resubmit_prep(), so we don't do heavy revert in hot path, also saves a couple of checks. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:59 -06:00
Pavel Begunkov	8c130827f4	io_uring: don't alter iopoll reissue fail ret code When reissue_prep failed in io_complete_rw_iopoll(), we change return code to -EIO to prevent io_iopoll_complete() from doing resubmission. Mark requests with a new flag (i.e. REQ_F_DONT_REISSUE) instead and retain the original return value. It also removes io_rw_reissue() from io_iopoll_complete() that will be used later. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:59 -06:00
Pavel Begunkov	1c98679db9	io_uring: optimise kiocb_end_write for !ISREG file_end_write() is only for regular files, so the function do a couple of dereferences to get inode and check for it. However, we already have REQ_F_ISREG at hand, just use it and inline file_end_write(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:59 -06:00
Pavel Begunkov	59d7001345	io_uring: kill unused REQ_F_NO_FILE_TABLE current->files are always valid now even for io-wq threads, so kill not used anymore REQ_F_NO_FILE_TABLE. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:59 -06:00
Pavel Begunkov	e1d675df1a	io_uring: don't init req->work fully in advance req->work is mostly unused unless it's punted, and io_init_req() is too hot for fully initialising it. Fortunately, we can skip init work.next as it's controlled by io-wq, and can not touch work.flags by moving everything related into io_prep_async_work(). The only field left is req->work.creds, but there is nothing can be done, keep maintaining it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:59 -06:00
Pavel Begunkov	05356d86c6	io_uring: remove tctx->sqpoll struct io_uring_task::sqpoll is not used anymore, kill it Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:59 -06:00
Pavel Begunkov	682076801a	io_uring: don't do extra EXITING cancellations io_match_task() matches all requests with PF_EXITING task, even though those may be valid requests. It was necessary for SQPOLL cancellation, but now it kills all requests before exiting via io_uring_cancel_sqpoll(), so it's not needed. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:59 -06:00
Pavel Begunkov	d4729fbde7	io_uring: don't clear REQ_F_LINK_TIMEOUT REQ_F_LINK_TIMEOUT is a hint that to look for linked timeouts to cancel, we're leaving it even when it's already fired. Hence don't care to clear it in io_kill_linked_timeout(), it's safe and is called only once. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:59 -06:00
Pavel Begunkov	c15b79dee5	io_uring: optimise io_req_task_work_add() Inline io_task_work_add() into io_req_task_work_add(). They both work with a request, so keeping them separate doesn't make things much more clear, but merging allows optimise it. Apart from small wins like not reading req->ctx or not calculating @notify in the hot path, i.e. with tctx->task_state set, it avoids doing wake_up_process() for every single add, but only after actually done task_work_add(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:59 -06:00
Pavel Begunkov	e1d767f078	io_uring: abolish old io_put_file() io_put_file() doesn't do a good job at generating a good code. Inline it, so we can check REQ_F_FIXED_FILE first, prioritising FIXED_FILE case over requests without files, and saving a memory load in that case. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:59 -06:00
Pavel Begunkov	094bae49e5	io_uring: optimise io_dismantle_req() fast path Reshuffle io_dismantle_req() checks to put most of slow path stuff under a single if. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:59 -06:00
Pavel Begunkov	68fb897966	io_uring: inline io_clean_op()'s fast path Inline io_clean_op(), leaving __io_clean_op() but renaming it. This will be used in following patches. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:59 -06:00
Pavel Begunkov	2593553a01	io_uring: remove __io_req_task_cancel() Both io_req_complete_failed() and __io_req_task_cancel() do the same thing: set failure flag, put both req refs and emit an CQE. The former one is a bit more advance as it puts req back into a req cache, so make it to take over __io_req_task_cancel() and remove the last one. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:59 -06:00
Pavel Begunkov	dac7a09864	io_uring: add helper flushing locked_free_list Add a new helper io_flush_cached_locked_reqs() that splices locked_free_list to free_list, and does it right doing all sync and invariant reinit. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:59 -06:00
Pavel Begunkov	a05432fb49	io_uring: refactor io_free_req_deferred() We don't care about ret value in io_free_req_deferred(), make the code a bit more concise. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:59 -06:00
Pavel Begunkov	0d85035a73	io_uring: inline io_put_req and friends One big omission is that io_put_req() haven't been marked inline, and at least gcc 9 doesn't inline it, not to mention that it's really hot and extra function call is intolerable, especially when it doesn't put a final ref. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:59 -06:00
Pavel Begunkov	8dd03afe61	io_uring: refactor rsrc refnode allocation There are two problems: 1) we always allocate refnodes in advance and free them if those haven't been used. It's expensive, takes two allocations, where one of them is percpu. And it may be pretty common not actually using them. 2) Current API with allocating a refnode and setting some of the fields is error prone, we don't ever want to have a file node runninng fixed buffer callback... Solve both with pre-init/get API. Pre-init just leaves the node for later if not used, and for get (i.e. io_rsrc_refnode_get()), you need to explicitly pass all arguments setting callbacks/etc., so it's more resilient. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:58 -06:00
Pavel Begunkov	dd78f49260	io_uring: refactor io_flush_cached_reqs() Emphasize that return value of io_flush_cached_reqs() depends on number of requests in the cache. It looks nicer and might help tools from false-negative analyses. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:58 -06:00
Pavel Begunkov	1840038e11	io_uring: optimise success case of __io_queue_sqe Move the case of successfully issued request by doing that check first. It's not much of a difference, just generates slightly better code for me. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:58 -06:00
Pavel Begunkov	de968c182b	io_uring: inline __io_queue_linked_timeout() Inline __io_queue_linked_timeout(), we don't need it Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:58 -06:00
Pavel Begunkov	966706579a	io_uring: keep io_req_free_batch() call locality Don't do a function call (io_dismantle_req()) in the middle and place it to near other function calls, otherwise may lead to excessive register spilling. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:58 -06:00
Pavel Begunkov	cf27f3b149	io_uring: optimise tctx node checks/alloc First of all, w need to set tctx->sqpoll only when we add a new entry into ->xa, so move it from the hot path. Also extract a hot path for io_uring_add_task_file() as an inline helper. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:58 -06:00
Pavel Begunkov	33f993da98	io_uring: optimise io_uring_enter() Add unlikely annotations, because my compiler pretty much mispredicts every first check, and apart jumping around in the fast path, it also generates extra instructions, like in advance setting ret value. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:58 -06:00
Pavel Begunkov	493f3b158a	io_uring: don't take ctx refs in task_work handler __tctx_task_work() guarantees that ctx won't be killed while running task_works, so we can remove now unnecessary ctx pinning for internally armed polling. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:58 -06:00
Jens Axboe	45ab03b19e	io_uring: transform ret == 0 for poll cancelation completions We can set canceled == true and complete out-of-line, ensure that we catch that and correctly return -ECANCELED if the poll operation got canceled. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:58 -06:00
Jens Axboe	b9b0e0d39c	io_uring: correct comment on poll vs iopoll The correct function is io_iopoll_complete(), which deals with completions of IOPOLL requests, not io_poll_complete(). Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:58 -06:00
Jens Axboe	7b29f92da3	io_uring: cache async and regular file state for fixed files We have to dig quite deep to check for particularly whether or not a file supports a fast-path nonblock attempt. For fixed files, we can do this lookup once and cache the state instead. This adds two new bits to track whether we support async read/write attempt, and lines up the REQ_F_ISREG bit with those two. The file slot re-uses the last 3 (or 2, for 32-bit) of the file pointer to cache that state, and then we mask it in when we go and use a fixed file. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:58 -06:00
Jens Axboe	d44f554e10	io_uring: don't check for io_uring_fops for fixed files We don't allow them at registration time, so limit the check for needing inflight tracking in io_file_get() to the non-fixed path. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:58 -06:00
Pavel Begunkov	c9dca27dc7	io_uring: simplify io_sqd_update_thread_idle() Use a more comprehensible() max instead of hand coding it with ifs in io_sqd_update_thread_idle(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:58 -06:00
Jens Axboe	abc54d6343	io_uring: switch to atomic_t for io_kiocb reference count io_uring manipulates references twice for each request, and hence is very sensitive to performance of the reference count. This commit borrows a trick from: commit `f958d7b528` Author: Linus Torvalds <torvalds@linux-foundation.org> Date: Thu Apr 11 10:06:20 2019 -0700 mm: make page ref count overflow check tighter and more explicit and switches to atomic_t for references, while still retaining overflow and underflow checks. This is good for a 2-3% increase in peak IOPS on a single core. Before: IOPS=2970879, IOS/call=31/31, inflight=128 (128) IOPS=2952597, IOS/call=31/31, inflight=128 (128) IOPS=2943904, IOS/call=31/31, inflight=128 (128) IOPS=2930006, IOS/call=31/31, inflight=96 (96) and after: IOPS=3054354, IOS/call=31/31, inflight=128 (128) IOPS=3059038, IOS/call=31/31, inflight=128 (128) IOPS=3060320, IOS/call=31/31, inflight=128 (128) IOPS=3068256, IOS/call=31/31, inflight=96 (96) Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:58 -06:00
Jens Axboe	de9b4ccad7	io_uring: wrap io_kiocb reference count manipulation in helpers No functional changes in this patch, just in preparation for handling the references a bit more efficiently. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:58 -06:00
Pavel Begunkov	179ae0d15e	io_uring: simplify io_resubmit_prep() If not for async_data NULL check, io_resubmit_prep() is already an rw specific version of io_req_prep_async(), but slower because 1) it always goes through io_import_iovec() even if following io_setup_async_rw() the result 2) instead of initialising iovec/iter in-place it does it on-stack and then copies with io_setup_async_rw(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:58 -06:00
Pavel Begunkov	b7e298d265	io_uring: merge defer_prep() and prep_async() Merge two function and do renaming in favour of the second one, it relays the meaning better. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:58 -06:00
Pavel Begunkov	26f0505a9c	io_uring: rethink def->needs_async_data needs_async_data controls allocation of async_data, and used in two cases. 1) when async setup requires it (by io_req_prep_async() or handler themselves), and 2) when op always needs additional space to operate, like timeouts do. Opcode preps already don't bother about the second case and do allocation unconditionally, restrict needs_async_data to the first case only and rename it into needs_async_setup. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> [axboe: update for IOPOLL fix] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:58 -06:00
Pavel Begunkov	6cb78689fa	io_uring: untie alloc_async_data and needs_async_data All opcode handlers pretty well know whether they need async data or not, and can skip testing for needs_async_data. The exception is rw the generic path, but those test the flag by hand anyway. So, check the flag and make io_alloc_async_data() allocating unconditionally. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:58 -06:00
Pavel Begunkov	2e052d443d	io_uring: refactor out send/recv async setup IORING_OP_[SEND,RECV] don't need async setup neither will get into io_req_prep_async(). Remove them from io_req_prep_async() and remove needs_async_data checks from the related setup functions. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:58 -06:00
Pavel Begunkov	8c3f9cd160	io_uring: use better types for cflags __io_cqring_fill_event() takes cflags as long to squeeze it into u32 in an CQE, awhile all users pass int or unsigned. Replace it with unsigned int and store it as u32 in struct io_completion to match CQE. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:58 -06:00
Pavel Begunkov	9fb8cb49c7	io_uring: refactor provide/remove buffer locking Always complete request holding the mutex instead of doing that strange dancing with conditional ordering. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:58 -06:00
Pavel Begunkov	f41db2732d	io_uring: add a helper failing not issued requests Add a simple helper doing CQE posting, marking request for link-failure, and putting both submission and completion references. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:58 -06:00
Pavel Begunkov	dafecf19e2	io_uring: further deduplicate file slot selection io_fixed_file_slot() and io_file_from_index() behave pretty similarly, DRY and call one from another. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:58 -06:00
Pavel Begunkov	2c4b8eb643	io_uring: reuse io_req_task_queue_fail() Use io_req_task_queue_fail() on the fail path of io_req_task_queue(). It's unlikely to happen, so don't care about additional overhead, but allows to keep all the req->result invariant in a single function. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:57 -06:00
Pavel Begunkov	e83acd7d37	io_uring: avoid taking ctx refs for task-cancel Don't bother to take a ctx->refs for io_req_task_cancel() because it take uring_lock before putting a request, and the context is promised to stay alive until unlock happens. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:41:57 -06:00
Pavel Begunkov	9728463737	io_uring: fix rw req completion WARNING: at fs/io_uring.c:8578 io_ring_exit_work.cold+0x0/0x18 As reissuing is now passed back by REQ_F_REISSUE and kiocb_done() internally uses __io_complete_rw(), it may stop after setting the flag so leaving a dangling request. There are tricky edge cases, e.g. reading beyound file, boundary, so the easiest way is to hand code reissue in kiocb_done() as __io_complete_rw() was doing for us before. Fixes: `230d50d448` ("io_uring: move reissue into regular IO path") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f602250d292f8a84cca9a01d747744d1e797be26.1617842918.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-08 13:32:59 -06:00
Pavel Begunkov	6ad7f2332e	io_uring: clear F_REISSUE right after getting it There are lots of ways r/w request may continue its path after getting REQ_F_REISSUE, it's not necessarily io-wq and can be, e.g. apoll, and submitted via io_async_task_func() -> __io_req_task_submit() Clear the flag right after getting it, so the next attempt is well prepared regardless how the request will be executed. Fixes: `230d50d448` ("io_uring: move reissue into regular IO path") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/11dcead939343f4e27cab0074d34afcab771bfa4.1617842918.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-07 22:10:19 -06:00

... 4 5 6 7 8 ...

1762 Commits