linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-24 05:02:12 +00:00

Author	SHA1	Message	Date
Hrvoje Zeba	8042d6ce8c	io_uring: remove superfluous check for sqe->off in io_accept() This field contains a pointer to addrlen and checking to see if it's set returns -EINVAL if the caller sets addr & addrlen pointers. Fixes: `17f2fe35d0` ("io_uring: add support for IORING_OP_ACCEPT") Signed-off-by: Hrvoje Zeba <zeba.hrvoje@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:11 -07:00
Jens Axboe	181e448d87	io_uring: async workers should inherit the user creds If we don't inherit the original task creds, then we can confuse users like fuse that pass creds in the request header. See link below on identical aio issue. Link: https://lore.kernel.org/linux-fsdevel/26f0d78e-99ca-2f1b-78b9-433088053a61@scylladb.com/T/#u Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:11 -07:00
Jens Axboe	576a347b7a	io-wq: have io_wq_create() take a 'data' argument We currently pass in 4 arguments outside of the bounded size. In preparation for adding one more argument, let's bundle them up in a struct to make it more readable. No functional changes in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:11 -07:00
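The refactor described above can be sketched in isolation. This is a minimal illustration, not the kernel's code: `struct wq_create_data`, `struct wq`, and `wq_create()` are hypothetical names, and the four bundled members are assumptions chosen to show the pattern of replacing positional arguments with one struct.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not the kernel's definitions. */
struct wq_create_data {
    void *mm;                       /* memory context for the workers */
    void *user;                     /* accounting owner */
    void (*get_work)(void *work);   /* take a reference on a work item */
    void (*put_work)(void *work);   /* drop that reference */
};

struct wq {
    unsigned bounded;               /* kept as its own argument */
    struct wq_create_data data;
};

/* One struct pointer replaces four positional arguments, so adding a
 * fifth field later does not change any call site's argument list. */
static int wq_create(struct wq *wq, unsigned bounded,
                     const struct wq_create_data *data)
{
    assert(wq != NULL);
    if (data == NULL || bounded == 0)
        return -1;
    wq->bounded = bounded;
    wq->data = *data;
    return 0;
}
```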
Pavel Begunkov	311ae9e159	io_uring: fix dead-hung for non-iter fixed rw Read/write requests to devices without implemented read/write_iter using fixed buffers can cause general protection fault, which totally hangs a machine. io_import_fixed() initialises iov_iter with bvec, but loop_rw_iter() accesses it as iovec, dereferencing random address. kmap() page by page in this case Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:11 -07:00
Jens Axboe	f8e85cf255	io_uring: add support for IORING_OP_CONNECT This allows an application to call connect() in an async fashion. Like other opcodes, we first try a non-blocking connect, then punt to async context if we have to. Note that we can still return -EINPROGRESS, and in that case the caller should use IORING_OP_POLL_ADD to do an async wait for completion of the connect request (just like for regular connect(2), except we can do it async here too). Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:11 -07:00
Jens Axboe	c4a2ed72c9	io_uring: only return -EBUSY for submit on non-flushed backlog We return -EBUSY on submit when we have a CQ ring overflow backlog, but that can be a bit problematic if the application is using pure userspace poll of the CQ ring. For that case, if the ring briefly overflowed and we have pending entries in the backlog, the submit flushes the backlog successfully but still returns -EBUSY. If we're able to fully flush the CQ ring backlog, let the submission proceed. Reported-by: Dan Melnic <dmm@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:11 -07:00
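The rule in the commit above (return -EBUSY only when the backlog could not be fully flushed) can be modelled with a toy ring. Everything here is an assumption made for illustration: `struct toy_ring`, `toy_flush_backlog()`, and `toy_submit()` are hypothetical and only mimic the decision, not the kernel's data structures.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy model: a CQ ring with limited free slots and an overflow backlog. */
struct toy_ring {
    size_t cq_free;     /* free slots in the CQ ring */
    size_t backlog;     /* overflowed completions waiting for a slot */
};

/* Move as many backlogged completions into the ring as fit.
 * Returns 1 if the backlog is now empty, 0 otherwise. */
static int toy_flush_backlog(struct toy_ring *r)
{
    size_t n = r->backlog < r->cq_free ? r->backlog : r->cq_free;

    r->backlog -= n;
    r->cq_free -= n;
    return r->backlog == 0;
}

/* Submit path: refuse with -EBUSY only if the flush left entries behind. */
static int toy_submit(struct toy_ring *r, int to_submit)
{
    assert(r != NULL);
    if (r->backlog != 0 && !toy_flush_backlog(r))
        return -EBUSY;
    return to_submit;
}
```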
Pavel Begunkov	f9bd67f69a	io_uring: only !null ptr to io_issue_sqe() Pass only non-null @nxt to io_issue_sqe() and handle it at the caller's side. And propagate it. - kiocb_done() is only called from io_read() and io_write(), which are only called from io_issue_sqe(), so it's @nxt != NULL - io_put_req_find_next() is called either with explicitly non-null local nxt, or from one of the functions in io_issue_sqe() switch (or their callees). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:11 -07:00
Pavel Begunkov	b18fdf71e0	io_uring: simplify io_req_link_next() "if (nxt)" is always true, as it was checked in the while's condition. io_wq_current_is_worker() is unnecessary, as non-async callers don't pass nxt, so io_queue_async_work() will be called for them anyway. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:11 -07:00
Pavel Begunkov	944e58bfed	io_uring: pass only !null to io_req_find_next() Make io_req_find_next() and io_req_link_next() to accept only non-null nxt, and handle it in callers. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:11 -07:00
Pavel Begunkov	70cf9f3270	io_uring: remove io_free_req_find_next() There is only one one-liner user of io_free_req_find_next(). Inline it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:11 -07:00
Pavel Begunkov	9835d6fafb	io_uring: add likely/unlikely in io_get_sqring() The number of SQEs to submit is specified by a user, so io_get_sqring() in most of the cases succeeds. Hint compilers about that. Checking ASM generated by gcc 9.2.0 for x64, there is one branch misprediction. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:10 -07:00
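A small sketch of the branch-hint idea follows. The `likely`/`unlikely` macros have the same shape as the kernel's and rely on `__builtin_expect`, a GCC/Clang builtin; `struct toy_sq` and `toy_get_entry()` are hypothetical and stand in for a ring fetch that usually succeeds.

```c
#include <assert.h>
#include <stddef.h>

/* Tell the compiler which way a branch usually goes. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical ring: head chases tail, mask wraps the index. */
struct toy_sq {
    unsigned head, tail;
    const int *entries;
    unsigned mask;
};

/* Returns 1 and stores the next entry, or 0 if the ring is empty.
 * The caller asked for N entries, so the non-empty case is the hot path. */
static int toy_get_entry(struct toy_sq *sq, int *out)
{
    assert(sq != NULL && out != NULL);
    if (likely(sq->head != sq->tail)) {
        *out = sq->entries[sq->head++ & sq->mask];
        return 1;
    }
    return 0;
}
```

The hint changes only code layout, never the result, so it is safe to apply wherever the common case is known in advance.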
Pavel Begunkov	d732447fed	io_uring: rename __io_submit_sqe() __io_submit_sqe() is issuing requests, so call it as such. Moreover, it ends by calling io_iopoll_req_issued(). Rename it and make terminology clearer. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:10 -07:00
Jens Axboe	915967f69c	io_uring: improve trace_io_uring_defer() trace point We don't have shadow requests anymore, so get rid of the shadow argument. Add the user_data argument, as that's often useful to easily match up requests, instead of having to look at request pointers. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:10 -07:00
Pavel Begunkov	1b4a51b6d0	io_uring: drain next sqe instead of shadowing There's an issue with the shadow drain logic in that we drop the completion lock after deciding to defer a request, then re-grab it later and assume that the state is still the same. In the mean time, someone else completing a request could have found and issued it. This can cause a stall in the queue, by having a shadow request inserted that nobody is going to drain. Additionally, if we fail allocating the shadow request, we simply ignore the drain. Instead of using a shadow request, defer the next request/link instead. This also has the following advantages: - removes semi-duplicated code - doesn't allocate memory for shadows - works better if only the head marked for drain - doesn't need complex synchronisation On the flip side, it removes the shadow->seq == last_drain_in_in_link->seq optimization. That shouldn't be a common case, and can always be added back, if needed. Fixes: `4fe2c96315` ("io_uring: add support for link with drain") Cc: Jackie Liu <liuyun01@kylinos.cn> Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:10 -07:00
Jens Axboe	b76da70fc3	io_uring: close lookup gap for dependent next work When we find new work to process within the work handler, we queue the linked timeout before we have issued the new work. This can be problematic for very short timeouts, as we have a window where the new work isn't visible. Allow the work handler to store a callback function for this in the work item, and flag it with IO_WQ_WORK_CB if the caller has done so. If that is set, then io-wq will call the callback when it has setup the new work item. Reported-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:10 -07:00
Jens Axboe	4d7dd46297	io_uring: allow finding next link independent of req reference count We currently try and start the next link when we put the request, and only if we were going to free it. This means that the optimization to continue executing requests from the same context often fails, as we're not putting the final reference. Add REQ_F_LINK_NEXT to keep track of this, and allow io_uring to find the next request more efficiently. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:06 -07:00
Jens Axboe	eb065d301e	io_uring: io_allocate_scq_urings() should return a sane state We currently rely on the ring destroy on cleaning things up in case of failure, but io_allocate_scq_urings() can leave things half initialized if only parts of it fails. Be nice and return with either everything setup in success, or return an error with things nicely cleaned up. Reported-by: syzbot+0d818c0d39399188f393@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:06 -07:00
Pavel Begunkov	bbad27b2f6	io_uring: Always REQ_F_FREE_SQE for allocated sqe Always mark requests with allocated sqe and deallocate it in __io_free_req(). It's easier to follow and doesn't add edge cases. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:06 -07:00
Jens Axboe	5d960724b0	io_uring: io_fail_links() should only consider first linked timeout We currently clear the linked timeout field if we cancel such a timeout, but we should only attempt to cancel if it's the first one we see. Others should simply be freed like other requests, as they haven't been started yet. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:06 -07:00
Pavel Begunkov	09fbb0a83e	io_uring: Fix leaking linked timeouts Consider a dependent link: REQ -> LINK_TIMEOUT -> LINK_TIMEOUT 1. submission stage: submission references for REQ and LINK_TIMEOUT are dropped. So, references respectively (1,1,2) 2. io_put(REQ) + FAIL_LINKS stage: calls io_fail_links(), which for all linked timeouts will call cancel_timeout() and drop 1 reference. So, references after: (0,0,1). That's a leak. Make it treat only the first linked timeout as such, and pass others through __io_double_put_req(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:06 -07:00
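The reference arithmetic in the commit above can be replayed with plain integers. This is a toy model under stated assumptions: the array of counters and the `fail_links_old()`, `fail_links_fixed()`, and `leaked_after_put()` helpers are hypothetical, and only the (1,1,2) starting state and the per-step decrements come from the commit text.

```c
#include <assert.h>

/* Toy model of the chain REQ -> LINK_TIMEOUT -> LINK_TIMEOUT. */
enum { REQ, LT1, LT2, NREQ };

/* Old behaviour: every linked timeout is cancelled and loses one reference. */
static void fail_links_old(int refs[NREQ])
{
    refs[LT1] -= 1;
    refs[LT2] -= 1;     /* still holds its submission reference afterwards */
}

/* Fixed behaviour: only the first linked timeout is treated as a timeout;
 * later ones drop both references, like a double put. */
static void fail_links_fixed(int refs[NREQ])
{
    refs[LT1] -= 1;
    refs[LT2] -= 2;
}

/* Returns how many references are left over (leaked) after REQ is put. */
static int leaked_after_put(void (*fail_links)(int *))
{
    int refs[NREQ] = { 1, 1, 2 };   /* state after the submission stage */

    refs[REQ] -= 1;                 /* io_put(REQ) */
    fail_links(refs);
    assert(refs[REQ] >= 0 && refs[LT1] >= 0 && refs[LT2] >= 0);
    return refs[REQ] + refs[LT1] + refs[LT2];
}
```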
Pavel Begunkov	f70193d6d8	io_uring: remove redundant check Pass any IORING_OP_LINK_TIMEOUT request further, where it will eventually fail in io_issue_sqe(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:06 -07:00
Pavel Begunkov	d3b35796b1	io_uring: break links for failed defer If io_req_defer() failed, it needs to cancel a dependent link. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:06 -07:00
Jens Axboe	fba38c272a	io_uring: request cancellations should break links We currently don't explicitly break links if a request is cancelled, but we should. Add explicitly link breakage for all types of request cancellations that we support. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:05 -07:00
Jens Axboe	b0dd8a4126	io_uring: correct poll cancel and linked timeout expiration completion Currently a poll request fills a completion entry of 0, even if it got cancelled. This is odd, and it makes it harder to support with chains. Ensure that it returns -ECANCELED in the completions events if it got cancelled, and furthermore ensure that the linked timeout that triggered it completes with -ETIME if we did indeed trigger the completions through a timeout. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:05 -07:00
Jens Axboe	e0e328c4b3	io_uring: remove dead REQ_F_SEQ_PREV flag With the conversion to io-wq, we no longer use that flag. Kill it. Fixes: `561fb04a6a` ("io_uring: replace workqueue usage with io-wq") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:05 -07:00
Jens Axboe	94ae5e77a9	io_uring: fix sequencing issues with linked timeouts We have an issue with timeout links that are deeper in the submit chain, because we only handle it upfront, not from later submissions. Move the prep + issue of the timeout link to the async work prep handler, and do it normally for non-async queue. If we validate and prepare the timeout links upfront when we first see them, there's nothing stopping us from supporting any sort of nesting. Fixes: `2665abfd75` ("io_uring: add support for linked SQE timeouts") Reported-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:05 -07:00
Jens Axboe	ad8a48acc2	io_uring: make req->timeout be dynamically allocated There are a few reasons for this: - As a prep to improving the linked timeout logic - io_timeout is the biggest member in the io_kiocb opcode union This also enables a few cleanups, like unifying the timer setup between IORING_OP_TIMEOUT and IORING_OP_LINK_TIMEOUT, and not needing multiple arguments to the link/prep helpers. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:01 -07:00
Jens Axboe	978db57e2c	io_uring: make io_double_put_req() use normal completion path If we don't use the normal completion path, we may skip killing links that should be errored and freed. Add __io_double_put_req() for use within the completion path itself, other calls should just use io_double_put_req(). Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:48:31 -07:00
Jens Axboe	0e0702dac2	io_uring: cleanup return values from the queueing functions __io_queue_sqe(), io_queue_sqe(), io_queue_link_head() all return 0/err, but the caller doesn't care since the errors are handled inline. Clean these up and just make them void. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:48:31 -07:00
Jens Axboe	95a5bbae05	io_uring: io_async_cancel() should pass in 'nxt' request pointer If we have a linked request, this enables us to pass it back directly without having to go through async context. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:48:31 -07:00
Linus Torvalds	fb4b3d3fd0	for-5.5/io_uring-20191121 Merge tag 'for-5.5/io_uring-20191121' of git://git.kernel.dk/linux-block Pull io_uring updates from Jens Axboe: "A lot of stuff has been going on this cycle, with improving the support for networked IO (and hence unbounded request completion times) being one of the major themes. There's been a set of fixes done this week, I'll send those out as well once we're certain we're fully happy with them. This contains: - Unification of the "normal" submit path and the SQPOLL path (Pavel) - Support for sparse (and bigger) file sets, and updating of those file sets without needing to unregister/register again. - Independently sized CQ ring, instead of just making it always 2x the SQ ring size. This makes it more flexible for networked applications. - Support for overflowed CQ ring, never dropping events but providing backpressure on submits. - Add support for absolute timeouts, not just relative ones. - Support for generic cancellations. This divorces io_uring from workqueues as well, which additionally gets us one step closer to generic async system call support. - With cancellations, we can support grabbing the process file table as well, just like we do mm context. This allows support for system calls that create file descriptors, like accept4() support that's built on top of that. - Support for io_uring tracing (Dmitrii) - Support for linked timeouts. These abort an operation if it isn't completed by the time noted in the linked timeout. - Speedup tracking of poll requests - Various cleanups making the code easier to follow (Jackie, Pavel, Bob, YueHaibing, me) - Update MAINTAINERS with new io_uring list" * tag 'for-5.5/io_uring-20191121' of git://git.kernel.dk/linux-block: (64 commits) io_uring: make POLL_ADD/POLL_REMOVE scale better io-wq: remove now redundant struct io_wq_nulls_list io_uring: Fix getting file for non-fd opcodes io_uring: introduce req_need_defer() io_uring: clean up io_uring_cancel_files() io-wq: ensure free/busy list browsing see all items io-wq: ensure we have a stable view of ->cur_work for cancellations io_wq: add get/put_work handlers to io_wq_create() io_uring: check for validity of ->rings in teardown io_uring: fix potential deadlock in io_poll_wake() io_uring: use correct "is IO worker" helper io_uring: fix -ENOENT issue with linked timer with short timeout io_uring: don't do flush cancel under inflight_lock io_uring: flag SQPOLL busy condition to userspace io_uring: make ASYNC_CANCEL work with poll and timeout io_uring: provide fallback request for OOM situations io_uring: convert accept4() -ERESTARTSYS into -EINTR io_uring: fix error clear of ->file_table in io_sqe_files_register() io_uring: separate the io_free_req and io_free_req_find_next interface io_uring: keep io_put_req only responsible for release and put req ...	2019-11-25 10:40:27 -08:00
Jens Axboe	eac406c61c	io_uring: make POLL_ADD/POLL_REMOVE scale better One of the obvious use cases for these commands is networking, where it's not uncommon to have tons of sockets open and polled for. The current implementation uses a list for insertion and lookup, which works fine for file based use cases where the count is usually low, but it breaks down somewhat for higher numbers of files / sockets. A test case with 30k sockets being polled for and cancelled takes: real 0m6.968s user 0m0.002s sys 0m6.936s with the patch it takes: real 0m0.233s user 0m0.010s sys 0m0.176s If you go to 50k sockets, it gets even more abysmal with the current code: real 0m40.602s user 0m0.010s sys 0m40.555s with the patch it takes: real 0m0.398s user 0m0.000s sys 0m0.341s Change is pretty straight forward, just replace the cancel_list with a red/black tree. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-14 12:09:58 -07:00
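The list-to-tree change above can be illustrated in userspace with the POSIX `<search.h>` tree functions instead of the kernel's rbtree API. This is a sketch under assumptions: `struct toy_poll`, the choice of `user_data` as the lookup key, and the helper names are hypothetical. POSIX only promises a binary search tree here; glibc's implementation is a red-black tree, which is what gives logarithmic lookup instead of a linear list walk.

```c
#include <assert.h>
#include <search.h>
#include <stddef.h>

/* Hypothetical poll request, keyed by user_data. */
struct toy_poll {
    unsigned long long user_data;
    int cancelled;
};

static int cmp_poll(const void *a, const void *b)
{
    const struct toy_poll *pa = a, *pb = b;

    return (pa->user_data > pb->user_data) - (pa->user_data < pb->user_data);
}

/* Insert into the tree; returns 0 on success, -1 on allocation failure. */
static int toy_poll_add(void **root, struct toy_poll *p)
{
    assert(root != NULL && p != NULL);
    return tsearch(p, root, cmp_poll) ? 0 : -1;
}

/* Find by key, mark cancelled, and unlink. Returns 0, or -1 if not found. */
static int toy_poll_cancel(void **root, unsigned long long user_data)
{
    struct toy_poll key = { user_data, 0 };
    void *node = tfind(&key, root, cmp_poll);

    if (node == NULL)
        return -1;
    /* A tree node stores a pointer to the item that was inserted. */
    struct toy_poll *p = *(struct toy_poll **)node;
    p->cancelled = 1;
    tdelete(p, root, cmp_poll);
    return 0;
}
```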
Pavel Begunkov	a320e9fa1e	io_uring: Fix getting file for non-fd opcodes For timeout requests and bunch of others io_uring tries to grab a file with specified fd, which is usually stdin/fd=0. Update io_op_needs_file() Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-13 19:41:01 -07:00
Bob Liu	9d858b2148	io_uring: introduce req_need_defer() Makes the code easier to read. Signed-off-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-13 19:41:01 -07:00
Bob Liu	2f6d9b9d63	io_uring: clean up io_uring_cancel_files() We don't use the return value anymore, drop it. Also drop the unnecessary double cancel_req value check. Signed-off-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-13 19:41:01 -07:00
Jens Axboe	5e559561a8	io_uring: ensure registered buffer import returns the IO length A test case was reported where two linked reads with registered buffers failed the second link always. This is because we set the expected value of a request in req->result, and if we don't get this result, then we fail the dependent links. For some reason the registered buffer import returned -ERROR/0, while the normal import returns -ERROR/length. This broke linked commands with registered buffers. Fix this by making io_import_fixed() correctly return the mapped length. Cc: stable@vger.kernel.org # v5.3 Reported-by: 李通洲 <carter.li@eoitek.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-13 16:15:14 -07:00
Pavel Begunkov	5683e5406e	io_uring: Fix getting file for timeout For timeout requests io_uring tries to grab a file with specified fd, which is usually stdin/fd=0. Update io_op_needs_file() Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-13 15:25:57 -07:00
Jens Axboe	7d7230652e	io_wq: add get/put_work handlers to io_wq_create() For cancellation, we need to ensure that the work item stays valid for as long as ->cur_work is valid. Right now we can't safely dereference the work item even under the wqe->lock, because while the ->cur_work pointer will remain valid, the work could be completing and be freed in parallel. Only invoke ->get/put_work() on items we know that the caller queued themselves. Add IO_WQ_WORK_INTERNAL for io-wq to use, which is needed when we're queueing a flush item, for instance. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-13 11:37:54 -07:00
Jens Axboe	15dff286d0	io_uring: check for validity of ->rings in teardown Normally the rings are always valid, the exception is if we failed to allocate the rings at setup time. syzbot reports this: RSP: 002b:00007ffd6e8aa078 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441229 RDX: 0000000000000002 RSI: 0000000020000140 RDI: 0000000000000d0d RBP: 00007ffd6e8aa090 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: ffffffffffffffff R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000 kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 1 PID: 8903 Comm: syz-executor410 Not tainted 5.4.0-rc7-next-20191113 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:199 [inline] RIP: 0010:__io_commit_cqring fs/io_uring.c:496 [inline] RIP: 0010:io_commit_cqring+0x1e1/0xdb0 fs/io_uring.c:592 Code: 03 0f 8e df 09 00 00 48 8b 45 d0 4c 8d a3 c0 00 00 00 4c 89 e2 48 c1 ea 03 44 8b b8 c0 01 00 00 48 b8 00 00 00 00 00 fc ff df <0f> b6 14 02 4c 89 e0 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 61 RSP: 0018:ffff88808f51fc08 EFLAGS: 00010006 RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff815abe4a RDX: 0000000000000018 RSI: ffffffff81d168d5 RDI: ffff8880a9166100 RBP: ffff88808f51fc70 R08: 0000000000000004 R09: ffffed1011ea3f7d R10: ffffed1011ea3f7c R11: 0000000000000003 R12: 00000000000000c0 R13: ffff8880a91661c0 R14: 1ffff1101522cc10 R15: 0000000000000000 FS: 0000000001e7a880(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020000140 CR3: 000000009a74c000 CR4: 00000000001406e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 
Call Trace: io_cqring_overflow_flush+0x6b9/0xa90 fs/io_uring.c:673 io_ring_ctx_wait_and_kill+0x24f/0x7c0 fs/io_uring.c:4260 io_uring_create fs/io_uring.c:4600 [inline] io_uring_setup+0x1256/0x1cc0 fs/io_uring.c:4626 __do_sys_io_uring_setup fs/io_uring.c:4639 [inline] __se_sys_io_uring_setup fs/io_uring.c:4636 [inline] __x64_sys_io_uring_setup+0x54/0x80 fs/io_uring.c:4636 do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x441229 Code: e8 5c ae 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 bb 0a fc ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007ffd6e8aa078 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441229 RDX: 0000000000000002 RSI: 0000000020000140 RDI: 0000000000000d0d RBP: 00007ffd6e8aa090 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: ffffffffffffffff R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000 Modules linked in: ---[ end trace b0f5b127a57f623f ]--- RIP: 0010:__read_once_size include/linux/compiler.h:199 [inline] RIP: 0010:__io_commit_cqring fs/io_uring.c:496 [inline] RIP: 0010:io_commit_cqring+0x1e1/0xdb0 fs/io_uring.c:592 Code: 03 0f 8e df 09 00 00 48 8b 45 d0 4c 8d a3 c0 00 00 00 4c 89 e2 48 c1 ea 03 44 8b b8 c0 01 00 00 48 b8 00 00 00 00 00 fc ff df <0f> b6 14 02 4c 89 e0 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 61 RSP: 0018:ffff88808f51fc08 EFLAGS: 00010006 RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff815abe4a RDX: 0000000000000018 RSI: ffffffff81d168d5 RDI: ffff8880a9166100 RBP: ffff88808f51fc70 R08: 0000000000000004 R09: ffffed1011ea3f7d R10: ffffed1011ea3f7c R11: 0000000000000003 R12: 00000000000000c0 R13: ffff8880a91661c0 R14: 1ffff1101522cc10 R15: 0000000000000000 FS: 0000000001e7a880(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 
CR0: 0000000080050033 CR2: 0000000020000140 CR3: 000000009a74c000 CR4: 00000000001406e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 which is exactly the case of failing to allocate the SQ/CQ rings, and then entering shutdown. Check if the rings are valid before trying to access them at shutdown time. Reported-by: syzbot+21147d79607d724bd6f3@syzkaller.appspotmail.com Fixes: `1d7bb1d50f` ("io_uring: add support for backlogged CQ ring") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-13 09:11:36 -07:00
Jens Axboe	7c9e7f0fe0	io_uring: fix potential deadlock in io_poll_wake() We attempt to run the poll completion inline, but we're using trylock to do so. This avoids a deadlock since we're grabbing the locks in reverse order at this point, we already hold the poll wq lock and we're trying to grab the completion lock, while the normal rules are the reverse of that order. IO completion for a timeout link will need to grab the completion lock, but that's not safe from this context. Put the completion under the completion_lock in io_poll_wake(), and mark the request as entering the completion with the completion_lock already held. Fixes: `2665abfd75` ("io_uring: add support for linked SQE timeouts") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-12 12:26:34 -07:00
Jens Axboe	960e432dfa	io_uring: use correct "is IO worker" helper Since we switched to io-wq, the dependent link optimization for when to pass back work inline has been broken. Fix this by providing a suitable io-wq helper for io_uring to use to detect when to do this. Fixes: `561fb04a6a` ("io_uring: replace workqueue usage with io-wq") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-12 08:02:26 -07:00
Jens Axboe	93bd25bb69	io_uring: make timeout sequence == 0 mean no sequence Currently we make sequence == 0 be the same as sequence == 1, but that's not super useful if the intent is really to have a timeout that's just a pure timeout. If the user passes in sqe->off == 0, then don't apply any sequence logic to the request, let it purely be driven by the timeout specified. Reported-by: 李通洲 <carter.li@eoitek.com> Reviewed-by: 李通洲 <carter.li@eoitek.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-12 00:18:51 -07:00
Jens Axboe	76a46e066e	io_uring: fix -ENOENT issue with linked timer with short timeout If you prep a read (for example) that needs to get punted to async context with a timer, if the timeout is sufficiently short, the timer request will get completed with -ENOENT as it could not find the read. The issue is that we prep and start the timer before we start the read. Hence the timer can trigger before the read is even started, and the end result is then that the timer completes with -ENOENT, while the read starts instead of being cancelled by the timer. Fix this by splitting the linked timer into two parts: 1) Prep and validate the linked timer 2) Start timer The read is then started between steps 1 and 2, so we know that the timer will always have a consistent view of the read request state. Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-11 16:33:22 -07:00
Jens Axboe	768134d4f4	io_uring: don't do flush cancel under inflight_lock We can't safely cancel under the inflight lock. If the work hasn't been started yet, then io_wq_cancel_work() simply marks the work as cancelled and invokes the work handler. But if the work completion needs to grab the inflight lock because it's grabbing user files, then we'll deadlock trying to finish the work as we already hold that lock. Instead grab a reference to the request, if it isn't already zero. If it's zero, then we know it's going through completion anyway, and we can safely ignore it. If it's not zero, then we can drop the lock and attempt to cancel from there. This also fixes a missing finish_wait() at the end of io_uring_cancel_files(). Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-11 16:33:17 -07:00
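The "grab a reference if it isn't already zero" step above is a common pattern. The sketch below uses C11 atomics rather than the kernel's refcount helpers; `toy_ref_get_not_zero()` is a hypothetical name, and the loop shows the general compare-and-swap technique, not the kernel's implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Take a reference only if the count has not already reached zero.
 * A zero count means the object is on its way through completion,
 * so the caller must leave it alone. */
static bool toy_ref_get_not_zero(atomic_int *refs)
{
    assert(refs != NULL);

    int old = atomic_load(refs);
    while (old != 0) {
        /* On failure, 'old' is reloaded with the current value. */
        if (atomic_compare_exchange_weak(refs, &old, old + 1))
            return true;
    }
    return false;
}
```

Once the reference is held, the lock protecting the list can be dropped and the cancel attempted without risking a use-after-free.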
Jens Axboe	c1edbf5f08	io_uring: flag SQPOLL busy condition to userspace Now that we have backpressure, for SQPOLL, we have one more condition that warrants flagging that the application needs to enter the kernel: we failed to submit IO due to backpressure. Make sure we catch that and flag it appropriately. If we run into backpressure issues with the SQPOLL thread, flag it as such to the application by setting IORING_SQ_NEED_WAKEUP. This will cause the application to enter the kernel, and that will flush the backlog and clear the condition. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-11 16:33:11 -07:00
Jens Axboe	47f467686e	io_uring: make ASYNC_CANCEL work with poll and timeout It's a little confusing that we have multiple types of command cancellation opcodes now that we have a generic one. Make the generic one work with POLL_ADD and TIMEOUT commands as well, that makes for an easier to use API for the application. The fact that they currently don't is a bit confusing. Add a helper that takes care of it, so we can use it from both IORING_OP_ASYNC_CANCEL and from the linked timeout cancellation. Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-11 16:33:05 -07:00
Jens Axboe	0ddf92e848	io_uring: provide fallback request for OOM situations One thing that really sucks for userspace APIs is if the kernel passes back -ENOMEM/-EAGAIN for resource shortages. The application really has no idea of what to do in those cases. Should it try and reap completions? Probably a good idea. Will it solve the issue? Who knows. This patch adds a simple fallback mechanism if we fail to allocate memory for a request. If we fail allocating memory from the slab for a request, we punt to a pre-allocated request. There's just one of these per io_ring_ctx, but the important part is if we ever return -EBUSY to the application, the application knows that it can wait for events and make forward progress when events have completed. This is the important part. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-11 16:32:55 -07:00
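A userspace sketch of the fallback idea: one reserved object per context that is handed out only when the normal allocator fails. All names (`struct toy_ctx`, `toy_get_req()`, `toy_put_req()`) and the `simulate_oom` switch are assumptions made for illustration; `malloc()` stands in for the slab allocator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct toy_req { int opcode; };

struct toy_ctx {
    struct toy_req fallback;    /* one pre-allocated request per context */
    bool fallback_busy;
    bool simulate_oom;          /* test hook standing in for a failing allocator */
};

/* Try the normal allocator first; fall back to the reserved request. */
static struct toy_req *toy_get_req(struct toy_ctx *ctx)
{
    assert(ctx != NULL);

    struct toy_req *req = ctx->simulate_oom ? NULL : malloc(sizeof(*req));
    if (req != NULL)
        return req;
    if (ctx->fallback_busy)
        return NULL;            /* caller reports -EBUSY and waits for events */
    ctx->fallback_busy = true;
    return &ctx->fallback;
}

/* Release a request; the reserved one is marked free instead of freed. */
static void toy_put_req(struct toy_ctx *ctx, struct toy_req *req)
{
    if (req == &ctx->fallback)
        ctx->fallback_busy = false;
    else
        free(req);
}
```

Because the reserved request is returned to the context on completion, a caller that sees the busy condition can always make progress by waiting for one completion.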
Jens Axboe	8e3cca1270	io_uring: convert accept4() -ERESTARTSYS into -EINTR If we cancel a pending accept operation with a signal, we get -ERESTARTSYS returned. Turn that into -EINTR for userspace, we should not be returning -ERESTARTSYS. Fixes: `17f2fe35d0` ("io_uring: add support for IORING_OP_ACCEPT") Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-10 20:29:49 -07:00
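The conversion above amounts to a one-line result fixup. In this sketch `toy_fixup_accept_result()` is a hypothetical helper, and `TOY_ERESTARTSYS` is defined locally because the kernel-internal ERESTARTSYS code (512 in the kernel's include/linux/errno.h) is not exported through the userspace `<errno.h>`.

```c
#include <assert.h>
#include <errno.h>

/* Kernel-internal restart code; defined here because userspace
 * headers do not provide it. */
#define TOY_ERESTARTSYS 512

/* Map the internal restart code to the error userspace expects;
 * every other result passes through unchanged. */
static int toy_fixup_accept_result(int ret)
{
    if (ret == -TOY_ERESTARTSYS)
        ret = -EINTR;
    return ret;
}
```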
Jens Axboe	46568e9be7	io_uring: fix error clear of ->file_table in io_sqe_files_register() syzbot reports that when using failslab and friends, we can get a double free in io_sqe_files_unregister(): BUG: KASAN: double-free or invalid-free in io_sqe_files_unregister+0x20b/0x300 fs/io_uring.c:3185 CPU: 1 PID: 8819 Comm: syz-executor452 Not tainted 5.4.0-rc6-next-20191108 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x197/0x210 lib/dump_stack.c:118 print_address_description.constprop.0.cold+0xd4/0x30b mm/kasan/report.c:374 kasan_report_invalid_free+0x65/0xa0 mm/kasan/report.c:468 __kasan_slab_free+0x13a/0x150 mm/kasan/common.c:450 kasan_slab_free+0xe/0x10 mm/kasan/common.c:480 __cache_free mm/slab.c:3426 [inline] kfree+0x10a/0x2c0 mm/slab.c:3757 io_sqe_files_unregister+0x20b/0x300 fs/io_uring.c:3185 io_ring_ctx_free fs/io_uring.c:3998 [inline] io_ring_ctx_wait_and_kill+0x348/0x700 fs/io_uring.c:4060 io_uring_release+0x42/0x50 fs/io_uring.c:4068 __fput+0x2ff/0x890 fs/file_table.c:280 ____fput+0x16/0x20 fs/file_table.c:313 task_work_run+0x145/0x1c0 kernel/task_work.c:113 exit_task_work include/linux/task_work.h:22 [inline] do_exit+0x904/0x2e60 kernel/exit.c:817 do_group_exit+0x135/0x360 kernel/exit.c:921 __do_sys_exit_group kernel/exit.c:932 [inline] __se_sys_exit_group kernel/exit.c:930 [inline] __x64_sys_exit_group+0x44/0x50 kernel/exit.c:930 do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x43f2c8 Code: 31 b8 c5 f7 ff ff 48 8b 5c 24 28 48 8b 6c 24 30 4c 8b 64 24 38 4c 8b 6c 24 40 4c 8b 74 24 48 4c 8b 7c 24 50 48 83 c4 58 c3 66 <0f> 1f 84 00 00 00 00 00 48 8d 35 59 ca 00 00 0f b6 d2 48 89 fb 48 RSP: 002b:00007ffd5b976008 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000000000043f2c8 RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000 RBP: 
00000000004bf0a8 R08: 00000000000000e7 R09: ffffffffffffffd0 R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000001 R13: 00000000006d1180 R14: 0000000000000000 R15: 0000000000000000 This happens if we fail allocating the file tables. For that case we do free the file table correctly, but we forget to set it to NULL. This means that ring teardown will see it as being non-NULL, and attempt to free it again. Fix this by clearing the file_table pointer if we free the table. Reported-by: syzbot+3254bc44113ae1e331ee@syzkaller.appspotmail.com Fixes: `65e19f54d2` ("io_uring: support for larger fixed file sets") Reviewed-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-10 20:29:49 -07:00
Jackie Liu	c69f8dbe24	io_uring: separate the io_free_req and io_free_req_find_next interface Similar to the distinction between io_put_req and io_put_req_find_next, io_free_req has been modified similarly, with no functional changes. Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-10 20:29:49 -07:00
Jackie Liu	ec9c02ad4c	io_uring: keep io_put_req only responsible for release and put req We already have io_put_req_find_next to find the next req of the link. we should not use the io_put_req function to find them. They should be functions of the same level. Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-10 20:29:49 -07:00
Jackie Liu	a197f664a0	io_uring: remove passed in 'ctx' function parameter ctx if possible Many times, the core of the function is req, and req has already set req->ctx at initialization time, so there is no need to pass in the ctx from the caller. Cleanup, no functional change. Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-10 20:29:49 -07:00
Jens Axboe	206aefde4f	io_uring: reduce/pack size of io_ring_ctx With the recent flurry of additions and changes to io_uring, the layout of io_ring_ctx has become a bit stale. We're right now at 704 bytes in size on my x86-64 build, or 11 cachelines. This patch does two things: - We have two completion structs embedded, that we only use for quiesce of the ctx (or shutdown) and for sqthread init cases. That's 2x32 bytes right there, let's dynamically allocate them. - Reorder the struct a bit with an eye on cachelines, use cases, and holes. With this patch, we're down to 512 bytes, or 8 cachelines. Reviewed-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-10 20:29:49 -07:00
Jens Axboe	5f8fd2d3e0	io_uring: properly mark async work as bounded vs unbounded Now that io-wq supports separating the two request lifetime types, mark the following IO as having unbounded runtimes: - Any read/write to a non-regular file - Any specific networked IO - Any poll command Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-07 11:57:17 -07:00
Jens Axboe	c5def4ab84	io-wq: add support for bounded vs unbunded work io_uring supports request types that basically have two different lifetimes: 1) Bounded completion time. These are requests like disk reads or writes, which we know will finish in a finite amount of time. 2) Unbounded completion time. These are generally networked IO, where we have no idea how long they will take to complete. Another example is POLL commands. This patch provides support for io-wq to handle these differently, so we don't starve bounded requests by tying up workers for too long. By default all work is bounded, unless otherwise specified in the work item. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-07 11:41:35 -07:00
Jens Axboe	1d7bb1d50f	io_uring: add support for backlogged CQ ring Currently we drop completion events, if the CQ ring is full. That's fine for requests with bounded completion times, but it may make it harder or impossible to use io_uring with networked IO where request completion times are generally unbounded. Or with POLL, for example, which is also unbounded. After this patch, we never overflow the ring, we simply store requests in a backlog for later flushing. This flushing is done automatically by the kernel. To prevent the backlog from growing indefinitely, if the backlog is non-empty, we apply back pressure on IO submissions. Any attempt to submit new IO with a non-empty backlog will get an -EBUSY return from the kernel. This is a signal to the application that it has backlogged CQ events, and that it must reap those before being allowed to submit more IO. Note that if we do return -EBUSY, we will have filled whatever backlogged events into the CQ ring first, if there's room. This means the application can safely reap events WITHOUT entering the kernel and waiting for them, they are already available in the CQ ring. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-09 11:45:29 -07:00
Jens Axboe	78e19bbef3	io_uring: pass in io_kiocb to fill/add CQ handlers This is in preparation for handling CQ ring overflow a bit smarter. We should not have any functional changes in this patch. Most of the changes are fairly straight forward, the only ones that stick out a bit are the ones that change __io_free_req() to take the reference count into account. If the request hasn't been submitted yet, we know it's safe to simply ignore references and free it. But let's clean these up too, as later patches will depend on the caller doing the right thing if the completion logging grabs a reference to the request. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-08 06:57:27 -07:00
Jens Axboe	84f97dc233	io_uring: make io_cqring_events() take 'ctx' as argument The rings can be derived from the ctx, and we need the ctx there for a future change. No functional changes in this patch. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-08 06:57:21 -07:00
Jens Axboe	2665abfd75	io_uring: add support for linked SQE timeouts While we have support for generic timeouts, we don't have a way to tie a timeout to a specific SQE. The generic timeouts simply trigger wakeups on the CQ ring. This adds support for IORING_OP_LINK_TIMEOUT. This command is only valid as a link to a previous command. The timeout specific can be either relative or absolute, following the same rules as IORING_OP_TIMEOUT. If the timeout triggers before the dependent command completes, it will attempt to cancel that command. Likewise, if the dependent command completes before the timeout triggers, it will cancel the timeout. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-07 19:12:40 -07:00
Jens Axboe	e977d6d34f	io_uring: abstract out io_async_cancel_one() helper We're going to need this helper in a future patch, so move it out of io_async_cancel() and into its own separate function. No functional changes in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-07 12:31:31 -07:00
Pavel Begunkov	267bc90442	io_uring: use inlined struct sqe_submit req->submit is always up-to-date, use it directly Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-06 20:23:02 -07:00
Pavel Begunkov	50585b9a07	io_uring: Use submit info inlined into req Stack allocated struct sqe_submit is passed down to the submission path along with a request (a.k.a. struct io_kiocb), and will be copied into req->submit for async requests. As space for it is already allocated, fill req->submit in the first place instead of using on-stack one. As a result: 1. sqe->submit is the only place for sqe_submit and is always valid, so we don't need to track which one to use. 2. don't need to copy in case of async 3. allows to simplify the code by not carrying it as an argument all the way down 4. allows to reduce number of function arguments / potentially improve spilling The downside is that stack is most probably be cached, that's not true for just allocated memory for a request. Another concern is cache pollution. Though, a request would be touched and fetched along with req->submit at some point anyway, so shouldn't be a problem. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-06 20:23:00 -07:00
Pavel Begunkov	196be95cd5	io_uring: allocate io_kiocb upfront Let io_submit_sqes() allocate io_kiocb before fetching an sqe. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-06 20:21:37 -07:00
Pavel Begunkov	e5eb6366ac	io_uring: io_queue_link*() right after submit After a call to io_submit_sqe(), it's already known whether it needs to queue a link or not. Do it there, as it's simpler and doesn't keep an extra variable across the loop. Reviewed-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-06 11:20:11 -07:00
Pavel Begunkov	ae9428ca61	io_uring: Merge io_submit_sqes and io_ring_submit io_submit_sqes() and io_ring_submit() are doing the same stuff with a little difference. Deduplicate them. Reviewed-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-06 11:20:07 -07:00
Jens Axboe	3aa5fa0305	io_uring: kill dead REQ_F_LINK_DONE flag We had no more use for this flag after the conversion to io-wq, kill it off. Fixes: `561fb04a6a` ("io_uring: replace workqueue usage with io-wq") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-05 20:34:32 -07:00
Jens Axboe	f1f40853c0	io_uring: fixup a few spots where link failure isn't flagged If a request fails, we need to ensure we set REQ_F_FAIL_LINK on it if REQ_F_LINK is set. Any failure in the chain should break the chain. We were missing a few spots where this should be done. It might be nice to generalize this somewhat at some point, as long as we factor in the fact that failure looks different for each request type. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-05 20:33:16 -07:00
Jens Axboe	89723d0bd6	io_uring: enable optimized link handling for IORING_OP_POLL_ADD As introduced by commit: `ba816ad61f` ("io_uring: run dependent links inline if possible") enable inline dependent link running for poll commands. io_poll_complete_work() is the most important change, as it allows a linked sequence of { POLL, READ } (for example) to proceed inline instead of needing to get punted to another async context. The submission side only potentially matters for sqthread, but may as well include that bit. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-05 15:32:58 -07:00
Jens Axboe	51c3ff62ca	io_uring: add completion trace event We currently don't have a completion event trace, add one of those. And to better be able to match up submissions and completions, add user_data to the submission trace as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-04 07:07:52 -07:00
Jackie Liu	e9ffa5c2b7	io_uring: set -EINTR directly when a signal wakes up in io_cqring_wait We didn't use -ERESTARTSYS to tell the application layer to restart the system call, but instead return -EINTR. we can set -EINTR directly when wakeup by the signal, which can help us save an assignment operation and comparison operation. Reviewed-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-01 08:36:36 -06:00
Jens Axboe	62755e35df	io_uring: support for generic async request cancel This adds support for IORING_OP_ASYNC_CANCEL, which will attempt to cancel requests that have been punted to async context and are now in-flight. This works for regular read/write requests to files, as long as they haven't been started yet. For socket based IO (or things like accept4(2)), we can cancel work that is already running as well. To cancel a request, the sqe must have ->addr set to the user_data of the request it wishes to cancel. If the request is cancelled successfully, the original request is completed with -ECANCELED and the cancel request is completed with a result of 0. If the request was already running, the original may or may not complete in error. The cancel request will complete with -EALREADY for that case. And finally, if the request to cancel wasn't found, the cancel request is completed with -ENOENT. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-01 08:35:31 -06:00
Jens Axboe	6873e0bd6a	io_uring: ensure we clear io_kiocb->result before each issue We use io_kiocb->result == -EAGAIN as a way to know if we need to re-submit a polled request, as -EAGAIN reporting happens out-of-line for IO submission failures. This field is cleared when we originally allocate the request, but it isn't reset when we retry the submission from async context. This can cause issues where we think something needs a re-issue, but we're really just reading stale data. Reset ->result whenever we re-prep a request for polled submission. Cc: stable@vger.kernel.org Fixes: `9e645e1105` ("io_uring: add support for sqe links") Reported-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-30 14:45:22 -06:00
Jens Axboe	975c99a570	io_uring: io_wq_create() returns an error pointer, not NULL syzbot reported an issue where we crash at setup time if failslab is used. The issue is that io_wq_create() returns an error pointer on failure, not NULL. Hence io_uring thought the io-wq was setup just fine, but in reality it's a garbage error pointer. Use IS_ERR() instead of a NULL check, and assign ret appropriately. Reported-by: syzbot+221cc24572a2fed23b6b@syzkaller.appspotmail.com Fixes: `561fb04a6a` ("io_uring: replace workqueue usage with io-wq") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-30 08:42:56 -06:00
Jens Axboe	842f96124c	io_uring: fix race with canceling timeouts If we get -1 from hrtimer_try_to_cancel(), we know that the timer is running. Hence leave all completion to the timeout handler. If we don't, we can corrupt the list and miss a completion. Fixes: `11365043e5` ("io_uring: add support for canceling timeout requests") Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com> Tested-by: Hrvoje Zeba <zeba.hrvoje@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-29 15:43:30 -06:00
Jens Axboe	65e19f54d2	io_uring: support for larger fixed file sets There's been a few requests for supporting more fixed files than 1024. This isn't really tricky to do, we just need to split up the file table into multiple tables and index appropriately. As we do so, reduce the max single file table to 512. This enables us to do single page allocs always for the tables, which is an improvement over the situation prior. This patch adds support for up to 64K files, which should be enough for everyone. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-29 12:43:06 -06:00
Jens Axboe	b7620121dc	io_uring: protect fixed file indexing with array_index_nospec() We index the file tables with a user given value. After we check it's within our limits, use array_index_nospec() to prevent any spectre attacks here. Suggested-by: Jann Horn <jannh@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-29 12:43:06 -06:00
Jens Axboe	17f2fe35d0	io_uring: add support for IORING_OP_ACCEPT This allows an application to call accept4() in an async fashion. Like other opcodes, we first try a non-blocking accept, then punt to async context if we have to. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-29 12:43:06 -06:00
Jens Axboe	fcb323cc53	io_uring: io_uring: add support for async work inheriting files This is in preparation for adding opcodes that need to add new files in a process file table, system calls like open(2) or accept4(2). If an opcode needs this, it must set IO_WQ_WORK_NEEDS_FILES in the work item. If work that needs to get punted to async context have this set, the async worker will assume the original task file table before executing the work. Note that opcodes that need access to the current files of an application cannot be done through IORING_SETUP_SQPOLL. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-29 12:43:06 -06:00
Jens Axboe	561fb04a6a	io_uring: replace workqueue usage with io-wq Drop various work-arounds we have for workqueues: - We no longer need the async_list for tracking sequential IO. - We don't have to maintain our own mm tracking/setting. - We don't need a separate workqueue for buffered writes. This didn't even work that well to begin with, as it was suboptimal for multiple buffered writers on multiple files. - We can properly cancel pending interruptible work. This fixes deadlocks with particularly socket IO, where we cannot cancel them when the io_uring is closed. Hence the ring will wait forever for these requests to complete, which may never happen. This is different from disk IO where we know requests will complete in a finite amount of time. - Due to being able to cancel interruptible work that is already running, we can implement file table support for work. We need that for supporting system calls that add to a process file table. - It gets us one step closer to adding async support for any system call. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-29 12:43:06 -06:00
Pavel Begunkov	95a1b3ff9a	io_uring: Fix mm_fault with READ/WRITE_FIXED Commit `fb5ccc9878` ("io_uring: Fix broken links with offloading") introduced a potential performance regression with unconditionally taking mm even for READ/WRITE_FIXED operations. Return the logic handling it back. mm-faulted requests will go through the generic submission path, so honoring links and drains, but will fail further on req->has_user check. Fixes: `fb5ccc9878` ("io_uring: Fix broken links with offloading") Cc: stable@vger.kernel.org # v5.4 Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-29 10:24:23 -06:00
Pavel Begunkov	fa45622808	io_uring: remove index from sqe_submit submit->index is used only for inbound check in submission path (i.e. head < ctx->sq_entries). However, it always will be true, as 1. it's already validated by io_get_sqring() 2. ctx->sq_entries can't be changed in between, because of held ctx->uring_lock and ctx->refs. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-29 10:24:21 -06:00
Dmitrii Dolgov	c826bd7a74	io_uring: add set of tracing events To trace io_uring activity one can get an information from workqueue and io trace events, but looks like some parts could be hard to identify via this approach. Making what happens inside io_uring more transparent is important to be able to reason about many aspects of it, hence introduce the set of tracing events. All such events could be roughly divided into two categories: * those, that are helping to understand correctness (from both kernel and an application point of view). E.g. a ring creation, file registration, or waiting for available CQE. Proposed approach is to get a pointer to an original structure of interest (ring context, or request), and then find relevant events. io_uring_queue_async_work also exposes a pointer to work_struct, to be able to track down corresponding workqueue events. * those, that provide performance related information. Mostly it's about events that change the flow of requests, e.g. whether an async work was queued, or delayed due to some dependencies. Another important case is how io_uring optimizations (e.g. registered files) are utilized. Signed-off-by: Dmitrii Dolgov <9erthalion6@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-29 10:24:18 -06:00
Jens Axboe	11365043e5	io_uring: add support for canceling timeout requests We might have cases where the need for a specific timeout is gone, add support for canceling an existing timeout operation. This works like the POLL_REMOVE command, where the application passes in the user_data of the timeout it wishes to cancel in the sqe->addr field. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-29 10:22:50 -06:00
Jens Axboe	a41525ab2e	io_uring: add support for absolute timeouts This is a pretty trivial addition on top of the relative timeouts we have now, but it's handy for ensuring tighter timing for those that are building scheduling primitives on top of io_uring. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-29 10:22:48 -06:00
Jackie Liu	ba5290ccb6	io_uring: replace s->needs_lock with s->in_async There is no function change, just to clean up the code, use s->in_async to make the code know where it is. Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-29 10:22:47 -06:00
Jens Axboe	33a107f0a1	io_uring: allow application controlled CQ ring size We currently size the CQ ring as twice the SQ ring, to allow some flexibility in not overflowing the CQ ring. This is done because the SQE life time is different than that of the IO request itself, the SQE is consumed as soon as the kernel has seen the entry. Certain application don't need a huge SQ ring size, since they just submit IO in batches. But they may have a lot of requests pending, and hence need a big CQ ring to hold them all. By allowing the application to control the CQ ring size multiplier, we can cater to those applications more efficiently. If an application wants to define its own CQ ring size, it must set IORING_SETUP_CQSIZE in the setup flags, and fill out io_uring_params->cq_entries. The value must be a power of two. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-29 10:22:46 -06:00
Jens Axboe	c3a31e6056	io_uring: add support for IORING_REGISTER_FILES_UPDATE Allows the application to remove/replace/add files to/from a file set. Passes in a struct: struct io_uring_files_update { __u32 offset; __s32 *fds; }; that holds an array of fds, size of array passed in through the usual nr_args part of the io_uring_register() system call. The logic is as follows: 1) If ->fds[i] is -1, the existing file at i + ->offset is removed from the set. 2) If ->fds[i] is a valid fd, the existing file at i + ->offset is replaced with ->fds[i]. For case #2, if the existing file is currently empty (fd == -1), the new fd is simply added to the array. Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-29 10:22:44 -06:00
Jens Axboe	08a451739a	io_uring: allow sparse fixed file sets This is in preparation for allowing updates to fixed file sets without requiring a full unregister+register. Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-29 10:22:43 -06:00
Jens Axboe	ba816ad61f	io_uring: run dependent links inline if possible Currently any dependent link is executed from a new workqueue context, which means that we'll be doing a context switch per link in the chain. If we are running the completion of the current request from our async workqueue and find that the next request is a link, then run it directly from the workqueue context instead of forcing another switch. This improves the performance of linked SQEs, and reduces the CPU overhead. Reviewed-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-29 10:22:41 -06:00
Jens Axboe	044c1ab399	io_uring: don't touch ctx in setup after ring fd install syzkaller reported an issue where it looks like a malicious app can trigger a use-after-free of reading the ctx ->sq_array and ->rings value right after having installed the ring fd in the process file table. Defer ring fd installation until after we're done reading those values. Fixes: `75b28affdd` ("io_uring: allocate the two rings together") Reported-by: syzbot+6f03d895a6cd0d06187f@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-28 09:15:33 -06:00
Pavel Begunkov	7b20238d28	io_uring: Fix leaked shadow_req io_queue_link_head() owns shadow_req after taking it as an argument. By not freeing it in case of an error, it can leak the request along with taken ctx->refs. Reviewed-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-27 21:29:18 -06:00
Jens Axboe	2b2ed9750f	io_uring: fix bad inflight accounting for SETUP_IOPOLL\|SETUP_SQTHREAD We currently assume that submissions from the sqthread are successful, and if IO polling is enabled, we use that value for knowing how many completions to look for. But if we overflowed the CQ ring or some requests simply got errored and already completed, they won't be available for polling. For the case of IO polling and SQTHREAD usage, look at the pending poll list. If it ever hits empty then we know that we don't have anymore pollable requests inflight. For that case, simply reset the inflight count to zero. Reported-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-25 10:58:53 -06:00
Jens Axboe	498ccd9eda	io_uring: used cached copies of sq->dropped and cq->overflow We currently use the ring values directly, but that can lead to issues if the application is malicious and changes these values on our behalf. Created in-kernel cached versions of them, and just overwrite the user side when we update them. This is similar to how we treat the sq/cq ring tail/head updates. Reported-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-25 10:58:45 -06:00
Pavel Begunkov	935d1e4590	io_uring: Fix race for sqes with userspace io_ring_submit() finalises with 1. io_commit_sqring(), which releases sqes to the userspace 2. Then calls to io_queue_link_head(), accessing released head's sqe Reorder them. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-25 09:02:01 -06:00
Pavel Begunkov	fb5ccc9878	io_uring: Fix broken links with offloading io_sq_thread() processes sqes by 8 without considering links. As a result, links will be randomly subdivided. The easiest way to fix it is to call io_get_sqring() inside io_submit_sqes() as io_ring_submit() does. Downsides: 1. This removes optimisation of not grabbing mm_struct for fixed files 2. It submits all sqes in one go, without finer-grained scheduling with cq processing. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-25 09:01:59 -06:00
Pavel Begunkov	84d55dc5b9	io_uring: Fix corrupted user_data There is a bug, where failed linked requests are returned not with specified @user_data, but with garbage from a kernel stack. The reason is that io_fail_links() uses req->user_data, which is uninitialised when called from io_queue_sqe() on fail path. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-25 09:01:58 -06:00
zhangyi (F)	a1f58ba46f	io_uring: correct timeout req sequence when inserting a new entry The sequence number of the timeout req (req->sequence) indicate the expected completion request. Because of each timeout req consume a sequence number, so the sequence of each timeout req on the timeout list shouldn't be the same. But now, we may get the same number (also incorrect) if we insert a new entry before the last one, such as submit such two timeout reqs on a new ring instance below. req->sequence req_1 (count = 2): 2 req_2 (count = 1): 2 Then, if we submit a nop req, req_2 will still timeout even the nop req finished. This patch fix this problem by adjust the sequence number of each reordered reqs when inserting a new entry. Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-23 22:09:56 -06:00
zhangyi (F)	ef03681ae8	io_uring : correct timeout req sequence when waiting timeout The sequence number of reqs on the timeout_list before the timeout req should be adjusted in io_timeout_fn(), because the current timeout req will consume a slot in the cq_ring and the cq_tail pointer will be increased; otherwise other timeout reqs may return in advance without waiting for enough wait_nr. Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-23 22:09:56 -06:00
Jens Axboe	bc808bced3	io_uring: revert "io_uring: optimize submit_and_wait API" There are cases where it isn't always safe to block for submission, even if the caller asked to wait for events as well. Revert the previous optimization of doing that. This reverts two commits: `bf7ec93c64` `c576666863` Fixes: `c576666863` ("io_uring: optimize submit_and_wait API") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-23 22:09:56 -06:00
Linus Torvalds	d418d07005	for-linus-2019-10-18 Merge tag 'for-linus-2019-10-18' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: - NVMe pull request from Keith that address deadlocks, double resets, memory leaks, and other regression. 
- Fixup elv_support_iosched() for bio based devices (Damien) - Fixup for the ahci PCS quirk (Dan) - Socket O_NONBLOCK handling fix for io_uring (me) - Timeout sequence io_uring fixes (yangerkun) - MD warning fix for parameter default_layout (Song) - blkcg activation fixes (Tejun) - blk-rq-qos node deletion fix (Tejun) * tag 'for-linus-2019-10-18' of git://git.kernel.dk/linux-block: nvme-pci: Set the prp2 correctly when using more than 4k page io_uring: fix logic error in io_timeout io_uring: fix up O_NONBLOCK handling for sockets md/raid0: fix warning message for parameter default_layout libata/ahci: Fix PCS quirk application blk-rq-qos: fix first node deletion of rq_qos_del() blkcg: Fix multiple bugs in blkcg_activate_policy() io_uring: consider the overflow of sequence for timeout req nvme-tcp: fix possible leakage during error flow nvmet-loop: fix possible leakage during error flow block: Fix elv_support_iosched() nvme-tcp: Initialize sk->sk_ll_usec only with NET_RX_BUSY_POLL nvme: Wait for reset state when required nvme: Prevent resets during paused controller state nvme: Restart request timers in resetting state nvme: Remove ADMIN_ONLY state nvme-pci: Free tagset if no IO queues nvme: retain split access workaround for capability reads nvme: fix possible deadlock when nvme_update_formats fails	2019-10-18 22:29:36 -04:00


275 Commits