linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-21 19:41:42 +00:00

Author	SHA1	Message	Date
Jens Axboe	5f3829fdd6	io_uring/filetable: kill io_reset_alloc_hint() helper It's only used internally, and in one spot, just open-code ti. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-02 15:45:30 -06:00
Jens Axboe	cb1717a7cd	io_uring/filetable: remove io_file_from_index() helper It's only used in fdinfo, nothing really gained from having this helper. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-02 15:45:30 -06:00
Jens Axboe	b54a14041e	io_uring/rsrc: add io_rsrc_node_lookup() helper There are lots of spots open-coding this functionality, add a generic helper that does the node lookup in a speculation safe way. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-02 15:45:30 -06:00
Jens Axboe	3597f2786b	io_uring/rsrc: unify file and buffer resource tables For files, there's nr_user_files/file_table/file_data, and buffers have nr_user_bufs/user_bufs/buf_data. There's no reason why file_table and file_data can't be the same thing, and ditto for the buffer side. That gets rid of more io_ring_ctx state that's in two spots rather than just being in one spot, as it should be. Put all the registered file data in one locations, and ditto on the buffer front. This also avoids having both io_rsrc_data->nodes being an allocated array, and ->user_bufs[] or ->file_table.nodes. There's no reason to have this information duplicated. Keep it in one spot, io_rsrc_data, along with how many resources are available. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-02 15:45:23 -06:00
Jens Axboe	f38f284764	io_uring: only initialize io_kiocb rsrc_nodes when needed Add the empty node initializing to the preinit part of the io_kiocb allocation, and reset them if they have been used. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-02 15:44:30 -06:00
Jens Axboe	0701db7439	io_uring/rsrc: add an empty io_rsrc_node for sparse buffer entries Rather than allocate an io_rsrc_node for an empty/sparse buffer entry, add a const entry that can be used for that. This just needs checking for writing the tag, and the put check needs to check for that sparse node rather than NULL for validity. This avoids allocating rsrc nodes for sparse buffer entries. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-02 15:44:30 -06:00
Jens Axboe	fbbb8e991d	io_uring/rsrc: get rid of io_rsrc_node allocation cache It's not going to be needed in the fast path going forward, so kill it off. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-02 15:44:30 -06:00
Jens Axboe	7029acd8a9	io_uring/rsrc: get rid of per-ring io_rsrc_node list Work in progress, but get rid of the per-ring serialization of resource nodes, like registered buffers and files. Main issue here is that one node can otherwise hold up a bunch of other nodes from getting freed, which is especially a problem for file resource nodes and networked workloads where some descriptors may not see activity in a long time. As an example, instantiate an io_uring ring fd and create a sparse registered file table. Even 2 will do. Then create a socket and register it as fixed file 0, F0. The number of open files in the app is now 5, with 0/1/2 being the usual stdin/out/err, 3 being the ring fd, and 4 being the socket. Register this socket (eg "the listener") in slot 0 of the registered file table. Now add an operation on the socket that uses slot 0. Finally, loop N times, where each loop creates a new socket, registers said socket as a file, then unregisters the socket, and finally closes the socket. This is roughly similar to what a basic accept loop would look like. At the end of this loop, it's not unreasonable to expect that there would still be 5 open files. Each socket created and registered in the loop is also unregistered and closed. But since the listener socket registered first still has references to its resource node due to still being active, each subsequent socket unregistration is stuck behind it for reclaim. Hence 5 + N files are still open at that point, where N is awaiting the final put held up by the listener socket. Rewrite the io_rsrc_node handling to NOT rely on serialization. Struct io_kiocb now gets explicit resource nodes assigned, with each holding a reference to the parent node. A parent node is either of type FILE or BUFFER, which are the two types of nodes that exist. A request can have two nodes assigned, if it's using both registered files and buffers. Since request issue and task_work completion is both under the ring private lock, no atomics are needed to handle these references. It's a simple unlocked inc/dec. As before, the registered buffer or file table each hold a reference as well to the registered nodes. Final put of the node will remove the node and free the underlying resource, eg unmap the buffer or put the file. Outside of removing the stall in resource reclaim described above, it has the following advantages: 1) It's a lot simpler than the previous scheme, and easier to follow. No need to specific quiesce handling anymore. 2) There are no resource node allocations in the fast path, all of that happens at resource registration time. 3) The structs related to resource handling can all get simplified quite a bit, like io_rsrc_node and io_rsrc_data. io_rsrc_put can go away completely. 4) Handling of resource tags is much simpler, and doesn't require persistent storage as it can simply get assigned up front at registration time. Just copy them in one-by-one at registration time and assign to the resource node. The only real downside is that a request is now explicitly limited to pinning 2 resources, one file and one buffer, where before just assigning a resource node to a request would pin all of them. The upside is that it's easier to follow now, as an individual resource is explicitly referenced and assigned to the request. With this in place, the above mentioned example will be using exactly 5 files at the end of the loop, not N. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-02 15:44:18 -06:00
Jens Axboe	1d60d74e85	io_uring/rw: fix missing NOWAIT check for O_DIRECT start write When io_uring starts a write, it'll call kiocb_start_write() to bump the super block rwsem, preventing any freezes from happening while that write is in-flight. The freeze side will grab that rwsem for writing, excluding any new writers from happening and waiting for existing writes to finish. But io_uring unconditionally uses kiocb_start_write(), which will block if someone is currently attempting to freeze the mount point. This causes a deadlock where freeze is waiting for previous writes to complete, but the previous writes cannot complete, as the task that is supposed to complete them is blocked waiting on starting a new write. This results in the following stuck trace showing that dependency with the write blocked starting a new write: task:fio state:D stack:0 pid:886 tgid:886 ppid:876 Call trace: __switch_to+0x1d8/0x348 __schedule+0x8e8/0x2248 schedule+0x110/0x3f0 percpu_rwsem_wait+0x1e8/0x3f8 __percpu_down_read+0xe8/0x500 io_write+0xbb8/0xff8 io_issue_sqe+0x10c/0x1020 io_submit_sqes+0x614/0x2110 __arm64_sys_io_uring_enter+0x524/0x1038 invoke_syscall+0x74/0x268 el0_svc_common.constprop.0+0x160/0x238 do_el0_svc+0x44/0x60 el0_svc+0x44/0xb0 el0t_64_sync_handler+0x118/0x128 el0t_64_sync+0x168/0x170 INFO: task fsfreeze:7364 blocked for more than 15 seconds. Not tainted 6.12.0-rc5-00063-g76aaf945701c #7963 with the attempting freezer stuck trying to grab the rwsem: task:fsfreeze state:D stack:0 pid:7364 tgid:7364 ppid:995 Call trace: __switch_to+0x1d8/0x348 __schedule+0x8e8/0x2248 schedule+0x110/0x3f0 percpu_down_write+0x2b0/0x680 freeze_super+0x248/0x8a8 do_vfs_ioctl+0x149c/0x1b18 __arm64_sys_ioctl+0xd0/0x1a0 invoke_syscall+0x74/0x268 el0_svc_common.constprop.0+0x160/0x238 do_el0_svc+0x44/0x60 el0_svc+0x44/0xb0 el0t_64_sync_handler+0x118/0x128 el0t_64_sync+0x168/0x170 Fix this by having the io_uring side honor IOCB_NOWAIT, and only attempt a blocking grab of the super block rwsem if it isn't set. For normal issue where IOCB_NOWAIT would always be set, this returns -EAGAIN which will have io_uring core issue a blocking attempt of the write. That will in turn also get completions run, ensuring forward progress. Since freezing requires CAP_SYS_ADMIN in the first place, this isn't something that can be triggered by a regular user. Cc: stable@vger.kernel.org # 5.10+ Reported-by: Peter Mann <peter.mann@sh.cz> Link: https://lore.kernel.org/io-uring/38c94aec-81c9-4f62-b44e-1d87f5597644@sh.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-31 08:21:02 -06:00
Jens Axboe	e410ffca58	io_uring/rsrc: kill io_charge_rsrc_node() It's only used from __io_req_set_rsrc_node(), and it takes both the ctx and node itself, while never using the ctx. Just open-code the basic refs++ in __io_req_set_rsrc_node() instead. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:28 -06:00
Jens Axboe	743fb58a35	io_uring/splice: open code 2nd direct file assignment In preparation for not pinning the whole registered file table, open code the second potential direct file assignment. This will be handled by appropriate helpers in the future, for now just do it manually. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:28 -06:00
Jens Axboe	aaa736b186	io_uring: specify freeptr usage for SLAB_TYPESAFE_BY_RCU io_kiocb cache Doesn't matter right now as there's still some bytes left for it, but let's prepare for the io_kiocb potentially growing and add a specific freeptr offset for it. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:28 -06:00
Jens Axboe	ff1256b8f3	io_uring/rsrc: move struct io_fixed_file to rsrc.h header There's no need for this internal structure to be visible, move it to the private rsrc.h header instead. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:28 -06:00
Jens Axboe	a85f31052b	io_uring/nop: add support for testing registered files and buffers Useful for testing performance/efficiency impact of registered files and buffers, vs (particularly) non-registered files. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:28 -06:00
Jens Axboe	aa00f67adc	io_uring: add support for fixed wait regions Generally applications have 1 or a few waits of waiting, yet they pass in a struct io_uring_getevents_arg every time. This needs to get copied and, in turn, the timeout value needs to get copied. Rather than do this for every invocation, allow the application to register a fixed set of wait regions that can simply be indexed when asking the kernel to wait on events. At ring setup time, the application can register a number of these wait regions and initialize region/index 0 upfront: struct io_uring_reg_wait reg; reg = io_uring_setup_reg_wait(ring, nr_regions, &ret); / set timeout and mark as set, sigmask/sigmask_sz as needed / reg->ts.tv_sec = 0; reg->ts.tv_nsec = 100000; reg->flags = IORING_REG_WAIT_TS; where nr_regions >= 1 && nr_regions <= PAGE_SIZE / sizeof(reg). The above initializes index 0, but 63 other regions can be initialized, if needed. Now, instead of doing: struct __kernel_timespec timeout = { .tv_nsec = 100000, }; io_uring_submit_and_wait_timeout(ring, &cqe, nr, &t, NULL); to wait for events for each submit_and_wait, or just wait, operation, it can just reference the above region at offset 0 and do: io_uring_submit_and_wait_reg(ring, &cqe, nr, 0); to achieve the same goal of waiting 100usec without needing to copy both struct io_uring_getevents_arg (24b) and struct __kernel_timeout (16b) for each invocation. Struct io_uring_reg_wait looks as follows: struct io_uring_reg_wait { struct __kernel_timespec ts; __u32 min_wait_usec; __u32 flags; __u64 sigmask; __u32 sigmask_sz; __u32 pad[3]; __u64 pad2[2]; }; embedding the timeout itself in the region, rather than passing it as a pointer as well. Note that the signal mask is still passed as a pointer, both for compatability reasons, but also because there doesn't seem to be a lot of high frequency waits scenarios that involve setting and resetting the signal mask for each wait. The application is free to modify any region before a wait call, or it can use keep multiple regions with different settings to avoid needing to modify the same one for wait calls. Up to a page size of regions is mapped by default, allowing PAGE_SIZE / 64 available regions for use. The registered region must fit within a page. On a 4kb page size system, that allows for 64 wait regions if a full page is used, as the size of struct io_uring_reg_wait is 64b. The region registered must be aligned to io_uring_reg_wait in size. It's valid to register less than 64 entries. In network performance testing with zero-copy, this reduced the time spent waiting on the TX side from 3.12% to 0.3% and the RX side from 4.4% to 0.3%. Wait regions are fixed for the lifetime of the ring - once registered, they are persistent until the ring is torn down. The regions support minimum wait timeout as well as the regular waits. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:28 -06:00
Jens Axboe	371b47da25	io_uring: change io_get_ext_arg() to use uaccess begin + end In scenarios where a high frequency of wait events are seen, the copy of the struct io_uring_getevents_arg is quite noticeable in the profiles in terms of time spent. It can be seen as up to 3.5-4.5%. Rewrite the copy-in logic, saving about 0.5% of the time. Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:27 -06:00
Jens Axboe	0a54a7dd0a	io_uring: switch struct ext_arg from __kernel_timespec to timespec64 This avoids intermediate storage for turning a __kernel_timespec user pointer into an on-stack struct timespec64, only then to turn it into a ktime_t. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:27 -06:00
Jens Axboe	b898b8c99e	io_uring/sqpoll: wait on sqd->wait for thread parking io_sqd_handle_event() just does a mutex unlock/lock dance when it's supposed to park, somewhat relying on full ordering with the thread trying to park it which does a similar unlock/lock dance on sqd->lock. However, with adaptive spinning on mutexes, this can waste an awful lot of time. Normally this isn't very noticeable, as parking and unparking the thread isn't a common (or fast path) occurence. However, in testing ring resizing, it's testing exactly that, as each resize will require the SQPOLL to safely park and unpark. Have io_sq_thread_park() explicitly wait on sqd->park_pending being zero before attempting to grab the sqd->lock again. In a resize test, this brings the runtime of SQPOLL down from about 60 seconds to a few seconds, just like the !SQPOLL tests. And saves a ton of spinning time on the mutex, on both sides. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:27 -06:00
Jens Axboe	79cfe9e59c	io_uring/register: add IORING_REGISTER_RESIZE_RINGS Once a ring has been created, the size of the CQ and SQ rings are fixed. Usually this isn't a problem on the SQ ring side, as it merely controls the available number of requests that can be submitted in a single system call, and there's rarely a need to change that. For the CQ ring, it's a different story. For most efficient use of io_uring, it's important that the CQ ring never overflows. This means that applications must size it for the worst case scenario, which can be wasteful. Add IORING_REGISTER_RESIZE_RINGS, which allows an application to resize the existing rings. It takes a struct io_uring_params argument, the same one which is used to setup the ring initially, and resizes rings according to the sizes given. Certain properties are always inherited from the original ring setup, like SQE128/CQE32 and other setup options. The implementation only allows flag associated with how the CQ ring is sized and clamped. Existing unconsumed SQE and CQE entries are copied as part of the process. If either the SQ or CQ resized destination ring cannot hold the entries already present in the source rings, then the operation is failed with -EOVERFLOW. Any register op holds ->uring_lock, which prevents new submissions, and the internal mapping holds the completion lock as well across moving CQ ring state. To prevent races between mmap and ring resizing, add a mutex that's solely used to serialize ring resize and mmap. mmap_sem can't be used here, as as fork'ed process may be doing mmaps on the ring as well. The ctx->resize_lock is held across mmap operations, and the resize will grab it before swapping out the already mapped new data. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:27 -06:00
Jens Axboe	d090bffab6	io_uring/memmap: explicitly return -EFAULT for mmap on NULL rings The later mapping will actually check this too, but in terms of code clarify, explicitly check for whether or not the rings and sqes are valid during validation. That makes it explicit that if they are non-NULL, they are valid and can get mapped. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:27 -06:00
Jens Axboe	81d8191eb9	io_uring: abstract out a bit of the ring filling logic Abstract out a io_uring_fill_params() helper, which fills out the necessary bits of struct io_uring_params. Add it to io_uring.h as well, in preparation for having another internal user of it. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:27 -06:00
Jens Axboe	09d0a8ea7f	io_uring: move max entry definition and ring sizing into header In preparation for needing this somewhere else, move the definitions for the maximum CQ and SQ ring size into io_uring.h. Make the rings_size() helper available as well, and have it take just the setup flags argument rather than the fill ring pointer. That's all that is needed. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:27 -06:00
Pavel Begunkov	882dec6c39	io_uring/net: clean up io_msg_copy_hdr Put sr->umsg into a local variable, so it doesn't repeat "sr->umsg->" for every field. It looks nicer, and likely without the patch it compiles into a bunch of umsg memory reads. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/26c2f30b491ea7998bfdb5bb290662572a61064d.1729607201.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:27 -06:00
Pavel Begunkov	5283878735	io_uring/net: don't alias send user pointer reads We keep user pointers in an union, which could be a user buffer or a user pointer to msghdr. What is confusing is that it potenitally reads and assigns sqe->addr as one type but then uses it as another via the union. Even more, it's not even consistent across copy and zerocopy versions. Make send and sendmsg setup helpers read sqe->addr and treat it as the right type from the beginning. The end goal would be to get rid of the use of struct io_sr_msg::umsg for send requests as we only need it at the prep side. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/685d788605f5d78af18802fcabf61ba65cfd8002.1729607201.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:27 -06:00
Pavel Begunkov	ad438d070a	io_uring/net: don't store send address ptr For non "msg" requests we copy the address at the prep stage and there is no need to store the address user pointer long term. Pass the SQE into io_send_setup(), let it parse it, and remove struct io_sr_msg addr addr_len fields. It saves some space and also less confusing. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/db3dce544e17ca9d4b17d2506fbbac1da8a87824.1729607201.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:27 -06:00
Pavel Begunkov	93db98f6f1	io_uring/net: split send and sendmsg prep helpers A preparation patch splitting io_sendmsg_prep_setup into two separate helpers for send and sendmsg variants. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1a2319471ba040e053b7f1d22f4af510d1118eca.1729607201.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:27 -06:00
Jens Axboe	51c967c6c9	io_uring/net: move send zc fixed buffer import to issue path Let's keep it close with the actual import, there's no reason to do this on the prep side. With that, we can drop one of the branches checking for whether or not IORING_RECVSEND_FIXED_BUF is set. As a side-effect, get rid of req->imu usage. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:27 -06:00
Jens Axboe	1caa00d6b6	io_uring: remove 'issue_flags' argument for io_req_set_rsrc_node() All callers already hold the ring lock and hence are passing '0', remove the argument and the conditional locking that it controlled. Suggested-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:27 -06:00
Jens Axboe	003f82b58c	io_uring/rw: get rid of using req->imu It's assigned in the same function that it's being used, get rid of it. A local variable will do just fine. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:27 -06:00
Jens Axboe	892d3e80e1	io_uring/uring_cmd: get rid of using req->imu It's pretty pointless to use io_kiocb as intermediate storage for this, so split the validity check and the actual usage. The resource node is assigned upfront at prep time, to prevent it from going away. The actual import is never called with the ctx->uring_lock held, so grab it for the import. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:27 -06:00
Jens Axboe	c919790060	io_uring/rsrc: don't assign bvec twice in io_import_fixed() iter->bvec is already set to imu->bvec - remove the one dead assignment and turn the other one into an addition instead. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:27 -06:00
Pavel Begunkov	2946f08ae9	io_uring: clean up cqe trace points We have too many helpers posting CQEs, instead of tracing completion events before filling in a CQE and thus having to pass all the data, set the CQE first, pass it to the tracing helper and let it extract everything it needs. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b83c1ca9ee5aed2df0f3bb743bf5ed699cce4c86.1729267437.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:27 -06:00
Pavel Begunkov	9b296c625a	io_uring: static_key for !IORING_SETUP_NO_SQARRAY IORING_SETUP_NO_SQARRAY should be preferred and used by default by liburing, optimise flag checking in io_get_sqe() with a static key. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c164a48542fbb080115e2377ecf160c758562742.1729264988.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:27 -06:00
Pavel Begunkov	1e6e7602cc	io_uring: kill io_llist_xchg io_llist_xchg is only used to set the list to NULL, which can also be done with llist_del_all(). Use the latter and kill io_llist_xchg. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d6765112680d2e86a58b76166b7513391ff4e5d7.1729264960.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:27 -06:00
Jens Axboe	b6b3eb19dd	io_uring: move cancel hash tables to kvmalloc/kvfree Convert to using kvmalloc/kfree() for the hash tables, and while at it, make it handle low memory situations better. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:27 -06:00
Jens Axboe	8abf47a8d6	io_uring/cancel: get rid of init_hash_table() helper All it does is initialize the lists, just move the INIT_HLIST_HEAD() into the one caller. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:27 -06:00
Jens Axboe	ba4366f57b	io_uring/poll: get rid of per-hashtable bucket locks Any access to the table is protected by ctx->uring_lock now anyway, the per-bucket locking doesn't buy us anything. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:27 -06:00
Jens Axboe	879ba46a38	io_uring/poll: get rid of io_poll_tw_hash_eject() It serves no purposes anymore, all it does is delete the hash list entry. task_work always has the ring locked. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:27 -06:00
Jens Axboe	085268829b	io_uring/poll: get rid of unlocked cancel hash io_uring maintains two hash lists of inflight requests: 1) ctx->cancel_table_locked. This is used when the caller has the ctx->uring_lock held already. This is only an issue side parameter, as removal or task_work will always have it held. 2) ctx->cancel_table. This is used when the issuer does NOT have the ctx->uring_lock held, and relies on the table spinlocks for access. However, it's pretty trivial to simply grab the lock in the one spot where we care about it, for insertion. With that, we can kill the unlocked table (and get rid of the _locked postfix for the other one). Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:27 -06:00
Jens Axboe	829ab73e7b	io_uring/poll: remove 'ctx' argument from io_poll_req_delete() It's always req->ctx being used anyway, having this as a separate argument (that is then not even used) just makes it more confusing. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:26 -06:00
Jens Axboe	a377132154	io_uring/msg_ring: add support for sending a sync message Normally MSG_RING requires both a source and a destination ring. But some users don't always have a ring avilable to send a message from, yet they still need to notify a target ring. Add support for using io_uring_register(2) without having a source ring, using a file descriptor of -1 for that. Internally those are called blind registration opcodes. Implement IORING_REGISTER_SEND_MSG_RING as a blind opcode, which simply takes an sqe that the application can put on the stack and use the normal liburing helpers to initialize it. Then the app can call: io_uring_register(-1, IORING_REGISTER_SEND_MSG_RING, &sqe, 1); and get the same behavior in terms of the target, where a CQE is posted with the details given in the sqe. For now this takes a single sqe pointer argument, and hence arg must be set to that, and nr_args must be 1. Could easily be extended to take an array of sqes, but for now let's keep it simple. Link: https://lore.kernel.org/r/20240924115932.116167-3-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:26 -06:00
Jens Axboe	95d6c9229a	io_uring/msg_ring: refactor a few helper functions Mostly just to skip them taking an io_kiocb, rather just pass in the ctx and io_msg directly. In preparation for being able to issue a MSG_RING request without having an io_kiocb. No functional changes in this patch. Link: https://lore.kernel.org/r/20240924115932.116167-2-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:26 -06:00
Jens Axboe	f4bb2f65bb	io_uring/eventfd: move ctx->evfd_last_cq_tail into io_ev_fd Everything else about the io_uring eventfd support is nicely kept private to that code, except the cached_cq_tail tracking. With everything else in place, move io_eventfd_flush_signal() to using the ev_fd grab+release helpers, which then enables the direct use of io_ev_fd for this tracking too. Link: https://lore.kernel.org/r/20240921080307.185186-7-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:26 -06:00
Jens Axboe	83a4f865e2	io_uring/eventfd: abstract out ev_fd grab + release helpers In preparation for needing the ev_fd grabbing (and releasing) from another path, abstract out two helpers for that. Link: https://lore.kernel.org/r/20240921080307.185186-6-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:26 -06:00
Jens Axboe	3ca5a35604	io_uring/eventfd: move trigger check into a helper It's a bit hard to read what guards the triggering, move it into a helper and add a comment explaining it too. This additionally moves the ev_fd == NULL check in there as well. Link: https://lore.kernel.org/r/20240921080307.185186-5-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:26 -06:00
Jens Axboe	60c5f15800	io_uring/eventfd: move actual signaling part into separate helper In preparation for using this from multiple spots, move the signaling into a helper. Link: https://lore.kernel.org/r/20240921080307.185186-4-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:26 -06:00
Jens Axboe	3c90b80df5	io_uring/eventfd: check for the need to async notifier earlier It's not necessary to do this post grabbing a reference. With that, we can drop the out goto path as well. Link: https://lore.kernel.org/r/20240921080307.185186-3-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:26 -06:00
Jens Axboe	165126dc5e	io_uring/eventfd: abstract out ev_fd put helper We call this in two spot, have a helper for it. In preparation for extending this part. Link: https://lore.kernel.org/r/20240921080307.185186-2-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 13:43:26 -06:00
Jens Axboe	dc7e76ba7a	io_uring: IORING_OP_F[GS]ETXATTR is fine with REQ_F_FIXED_FILE Rejection of IOSQE_FIXED_FILE combined with IORING_OP_[GS]ETXATTR is fine - these do not take a file descriptor, so such combination makes no sense. The checks are misplaced, though - as it is, they triggers on IORING_OP_F[GS]ETXATTR as well, and those do take a file reference, no matter the origin. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-10-19 20:40:10 -04:00
Jens Axboe	ae6a888a43	io_uring/rw: fix wrong NOWAIT check in io_rw_init_file() A previous commit improved how !FMODE_NOWAIT is dealt with, but inadvertently negated a check whilst doing so. This caused -EAGAIN to be returned from reading files with O_NONBLOCK set. Fix up the check for REQ_F_SUPPORT_NOWAIT. Reported-by: Julian Orth <ju.orth@gmail.com> Link: https://github.com/axboe/liburing/issues/1270 Fixes: `f7c9134385` ("io_uring/rw: allow pollable non-blocking attempts for !FMODE_NOWAIT") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-19 09:25:45 -06:00
Jens Axboe	8f7033aa40	io_uring/sqpoll: ensure task state is TASK_RUNNING when running task_work When the sqpoll is exiting and cancels pending work items, it may need to run task_work. If this happens from within io_uring_cancel_generic(), then it may be under waiting for the io_uring_task waitqueue. This results in the below splat from the scheduler, as the ring mutex may be attempted grabbed while in a TASK_INTERRUPTIBLE state. Ensure that the task state is set appropriately for that, just like what is done for the other cases in io_run_task_work(). do not call blocking ops when !TASK_RUNNING; state=1 set at [<0000000029387fd2>] prepare_to_wait+0x88/0x2fc WARNING: CPU: 6 PID: 59939 at kernel/sched/core.c:8561 __might_sleep+0xf4/0x140 Modules linked in: CPU: 6 UID: 0 PID: 59939 Comm: iou-sqp-59938 Not tainted 6.12.0-rc3-00113-g8d020023b155 #7456 Hardware name: linux,dummy-virt (DT) pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--) pc : __might_sleep+0xf4/0x140 lr : __might_sleep+0xf4/0x140 sp : ffff80008c5e7830 x29: ffff80008c5e7830 x28: ffff0000d93088c0 x27: ffff60001c2d7230 x26: dfff800000000000 x25: ffff0000e16b9180 x24: ffff80008c5e7a50 x23: 1ffff000118bcf4a x22: ffff0000e16b9180 x21: ffff0000e16b9180 x20: 000000000000011b x19: ffff80008310fac0 x18: 1ffff000118bcd90 x17: 30303c5b20746120 x16: 74657320313d6574 x15: 0720072007200720 x14: 0720072007200720 x13: 0720072007200720 x12: ffff600036c64f0b x11: 1fffe00036c64f0a x10: ffff600036c64f0a x9 : dfff800000000000 x8 : 00009fffc939b0f6 x7 : ffff0001b6327853 x6 : 0000000000000001 x5 : ffff0001b6327850 x4 : ffff600036c64f0b x3 : ffff8000803c35bc x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0000e16b9180 Call trace: __might_sleep+0xf4/0x140 mutex_lock+0x84/0x124 io_handle_tw_list+0xf4/0x260 tctx_task_work_run+0x94/0x340 io_run_task_work+0x1ec/0x3c0 io_uring_cancel_generic+0x364/0x524 io_sq_thread+0x820/0x124c ret_from_fork+0x10/0x20 Cc: stable@vger.kernel.org Fixes: `af5d68f889` ("io_uring/sqpoll: manage task_work privately") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-17 08:38:04 -06:00
Jens Axboe	858e686a30	io_uring/rsrc: ignore dummy_ubuf for buffer cloning For placeholder buffers, &dummy_ubuf is assigned which is a static value. When buffers are attempted cloned, don't attempt to grab a reference to it, as we both don't need it and it'll actively fail as dummy_ubuf doesn't have a valid reference count setup. Link: https://lore.kernel.org/io-uring/Zw8dkUzsxQ5LgAJL@ly-workstation/ Reported-by: Lai, Yi <yi1.lai@linux.intel.com> Fixes: `7cc2a6eadc` ("io_uring: add IORING_REGISTER_COPY_BUFFERS method") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-16 07:09:25 -06:00
Jens Axboe	28aabffae6	io_uring/sqpoll: close race on waiting for sqring entries When an application uses SQPOLL, it must wait for the SQPOLL thread to consume SQE entries, if it fails to get an sqe when calling io_uring_get_sqe(). It can do so by calling io_uring_enter(2) with the flag value of IORING_ENTER_SQ_WAIT. In liburing, this is generally done with io_uring_sqring_wait(). There's a natural expectation that once this call returns, a new SQE entry can be retrieved, filled out, and submitted. However, the kernel uses the cached sq head to determine if the SQRING is full or not. If the SQPOLL thread is currently in the process of submitting SQE entries, it may have updated the cached sq head, but not yet committed it to the SQ ring. Hence the kernel may find that there are SQE entries ready to be consumed, and return successfully to the application. If the SQPOLL thread hasn't yet committed the SQ ring entries by the time the application returns to userspace and attempts to get a new SQE, it will fail getting a new SQE. Fix this by having io_sqring_full() always use the user visible SQ ring head entry, rather than the internally cached one. Cc: stable@vger.kernel.org # 5.10+ Link: https://github.com/axboe/liburing/discussions/1267 Reported-by: Benedek Thaler <thaler@thaler.hu> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-15 09:13:51 -06:00
Al Viro	be5498cac2	remove pointless includes of <linux/fdtable.h> some of those used to be needed, some had been cargo-culted for no reason... Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-10-07 13:34:41 -04:00
Jens Axboe	f7c9134385	io_uring/rw: allow pollable non-blocking attempts for !FMODE_NOWAIT The checking for whether or not io_uring can do a non-blocking read or write attempt is gated on FMODE_NOWAIT. However, if the file is pollable, it's feasible to just check if it's currently in a state in which it can sanely receive or send _some_ data. This avoids unnecessary io-wq punts, and repeated worthless retries before doing that punt, by assuming that some data can get delivered or received if poll tells us that is true. It also allows multishot reads to properly work with these types of files, enabling a bit of a cleanup of the logic that: `c9d952b910` ("io_uring/rw: fix cflags posting for single issue multishot read") had to put in place. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-06 20:58:53 -06:00
Jens Axboe	c9d952b910	io_uring/rw: fix cflags posting for single issue multishot read If multishot gets disabled, and hence the request will get terminated rather than persist for more iterations, then posting the CQE with the right cflags is still important. Most notably, the buffer reference needs to be included. Refactor the return of __io_read() a bit, so that the provided buffer is always put correctly, and hence returned to the application. Reported-by: Sharon Rosner <Sharon Rosner> Link: https://github.com/axboe/liburing/issues/1257 Cc: stable@vger.kernel.org Fixes: `2a975d426c` ("io_uring/rw: don't allow multishot reads without NOWAIT support") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-06 08:05:47 -06:00
Jens Axboe	c314094cb4	io_uring/net: harden multishot termination case for recv If the recv returns zero, or an error, then it doesn't matter if more data has already been received for this buffer. A condition like that should terminate the multishot receive. Rather than pass in the collected return value, pass in whether to terminate or keep the recv going separately. Note that this isn't a bug right now, as the only way to get there is via setting MSG_WAITALL with multishot receive. And if an application does that, then -EINVAL is returned anyway. But it seems like an easy bug to introduce, so let's make it a bit more explicit. Link: https://github.com/axboe/liburing/issues/1246 Cc: stable@vger.kernel.org Fixes: `b3fdea6ecb` ("io_uring: multishot recv") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-09-30 08:26:59 -06:00
Min-Hua Chen	17ea56b752	io_uring: fix casts to io_req_flags_t Apply __force cast to restricted io_req_flags_t type to fix the following sparse warning: io_uring/io_uring.c:2026:23: sparse: warning: cast to restricted io_req_flags_t No functional changes intended. Signed-off-by: Min-Hua Chen <minhuadotchen@gmail.com> Link: https://lore.kernel.org/r/20240922104132.157055-1-minhuadotchen@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-09-24 13:31:04 -06:00
Guixin Liu	3a87e26429	io_uring: fix memory leak when cache init fail Exit the percpu ref when cache init fails to free the data memory with in struct percpu_ref. Fixes: `206aefde4f` ("io_uring: reduce/pack size of io_ring_ctx") Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20240923100512.64638-1-kanie@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-09-24 13:31:00 -06:00
Linus Torvalds	3147a0689d	for-6.12/io_uring-20240922 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmbvv30QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpj3+EACs346FzM8PlZe1GxBZ6OnQX80blwoldAxC +Abl5xjoJKUgA7rY3lJVBRNR6olA/4I2VD3g8b3RT6lpd/oKzPFg7FOj5Dc/oN+c Fo6C7zZdr8caokpL4pfwgyG8ZNssQgRg8e0kRSw8A7AMo1zUazqAXtxjRzeMEOLC 1kWRYGdHCbVjx+hRIyX6KKP427Z5nXvcqFOC0BOpd5jDNYVh9WjNNyUE7trkGJ7o 1cjlpaaOURS0yU/4hue6tRnM8LDjaImyTyISvBWzKfKvpc19K1alOQNvHIoIeiBQ 5MgCNkSpbRmUTrydYVEQXl0Cia2d5+0KQsavUB9nZ8M++NftbRr/i26xT8ReZzXI NjaedDF+MyOKeJaft2ZeKH8GgWolysMBa4e89CveRxosa/6gwHCkkB4UK9b3gaBB Fij1zh/7fIVG7Tz8yNUDyGe6DzOEol1bn1KnL35/9nuCCRnSAM0vRPwJSkurlQ8B PqVUS3BArn+LQZmSZ3HJVKOHv2QAY8etqWizvVmu4DB9Ar+uZ6Ur2uwfMN9JAODP Fm2qVvxS73QlrvisdbnVbTzqBnqh3Rs4mb5my/gCWO1s67qtu3abSJCSzcnyxQdd yBMDegJxTNv6DErNjPEF4qDODwSTIzswr//kOeLns1EtDGfrK8nxUfIKPQUwLSTO Y7h2ru83uA== =goTY -----END PGP SIGNATURE----- Merge tag 'for-6.12/io_uring-20240922' of git://git.kernel.dk/linux Pull more io_uring updates from Jens Axboe: "Mostly just a set of fixes in here, or little changes that didn't get included in the initial pull request. This contains: - Move the SQPOLL napi polling outside the submission lock (Olivier) - Rename of the "copy buffers" API that got added in the 6.12 merge window. There's really no copying going on, it's just referencing the buffers. After a bit of consideration, decided that it was better to simply rename this to avoid potential confusion (me) - Shrink struct io_mapped_ubuf from 48 to 32 bytes, by changing it to start + len tracking rather than having start / end in there, and by removing the caching of folio_mask when we can just calculate it from folio_shift when we need it (me) - Fixes for the SQPOLL affinity checking (me, Felix) - Fix for how cqring waiting checks for the presence of task_work. Just check it directly rather than check for a specific notification mechanism (me) - Tweak to how request linking is represented in tracing (me) - Fix a syzbot report that deliberately sets up a huge list of overflow entries, and then hits rcu stalls when flushing this list. Just check for the need to preempt, and drop/reacquire locks in the loop. There's no state maintained over the loop itself, and each entry is yanked from head-of-list (me)" * tag 'for-6.12/io_uring-20240922' of git://git.kernel.dk/linux: io_uring: check if we need to reschedule during overflow flush io_uring: improve request linking trace io_uring: check for presence of task_work rather than TIF_NOTIFY_SIGNAL io_uring/sqpoll: do the napi busy poll outside the submission block io_uring: clean up a type in io_uring_register_get_file() io_uring/sqpoll: do not put cpumask on stack io_uring/sqpoll: retain test for whether the CPU is valid io_uring/rsrc: change ubuf->ubuf_end to length tracking io_uring/rsrc: get rid of io_mapped_ubuf->folio_mask io_uring: rename "copy buffers" to "clone buffers"	2024-09-24 11:11:38 -07:00
Linus Torvalds	f8ffbc365f	struct fd layout change (and conversion to accessor helpers) -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZvDNmgAKCRBZ7Krx/gZQ 63zrAP9vI0rf55v27twiabe9LnI7aSx5ckoqXxFIFxyT3dOYpQD/bPmoApnWDD3d 592+iDgLsema/H/0/CqfqlaNtDNY8Q0= =HUl5 -----END PGP SIGNATURE----- Merge tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull 'struct fd' updates from Al Viro: "Just the 'struct fd' layout change, with conversion to accessor helpers" * tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: add struct fd constructors, get rid of __to_fd() struct fd: representation change introduce fd_file(), convert all accessors to it.	2024-09-23 09:35:36 -07:00
Jens Axboe	eac2ca2d68	io_uring: check if we need to reschedule during overflow flush In terms of normal application usage, this list will always be empty. And if an application does overflow a bit, it'll have a few entries. However, nothing obviously prevents syzbot from running a test case that generates a ton of overflow entries, and then flushing them can take quite a while. Check for needing to reschedule while flushing, and drop our locks and do so if necessary. There's no state to maintain here as overflows always prune from head-of-list, hence it's fine to drop and reacquire the locks at the end of the loop. Link: https://lore.kernel.org/io-uring/66ed061d.050a0220.29194.0053.GAE@google.com/ Reported-by: syzbot+5fca234bd7eb378ff78e@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-09-20 02:51:20 -06:00
Jens Axboe	eed138d67d	io_uring: improve request linking trace Right now any link trace is listed as being linked after the head request in the chain, but it's more useful to note explicitly which request a given new request is chained to. Change the link trace to dump the tail request so that chains are immediately apparent when looking at traces. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-09-20 00:17:46 -06:00
Jens Axboe	04beb6e0e0	io_uring: check for presence of task_work rather than TIF_NOTIFY_SIGNAL If some part of the kernel adds task_work that needs executing, in terms of signaling it'll generally use TWA_SIGNAL or TWA_RESUME. Those two directly translate to TIF_NOTIFY_SIGNAL or TIF_NOTIFY_RESUME, and can be used for a variety of use case outside of task_work. However, io_cqring_wait_schedule() only tests explicitly for TIF_NOTIFY_SIGNAL. This means it can miss if task_work got added for the task, but used a different kind of signaling mechanism (or none at all). Normally this doesn't matter as any task_work will be run once the task exits to userspace, except if: 1) The ring is setup with DEFER_TASKRUN 2) The local work item may generate normal task_work For condition 2, this can happen when closing a file and it's the final put of that file, for example. This can cause stalls where a task is waiting to make progress inside io_cqring_wait(), but there's nothing else that will wake it up. Hence change the "should we schedule or loop around" check to check for the presence of task_work explicitly, rather than just TIF_NOTIFY_SIGNAL as the mechanism. While in there, also change the ordering of what type of task_work first in terms of ordering, to both make it consistent with other task_work runs in io_uring, but also to better handle the case of defer task_work generating normal task_work, like in the above example. Reported-by: Jan Hendrik Farr <kernel@jfarr.cc> Link: https://github.com/axboe/liburing/issues/1235 Cc: stable@vger.kernel.org Fixes: `846072f16e` ("io_uring: mimimise io_cqring_wait_schedule") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-09-19 11:56:55 -06:00
Linus Torvalds	bdf56c7580	slab updates for 6.12 -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEe7vIQRWZI0iWSE3xu+CwddJFiJoFAmbn5g0ACgkQu+CwddJF iJq+Uwf/aqnLNEpjUBzwUUhSojCpPnTtiyjv+AILTxoSTHmbu8OvN0W79+Rpbdmk O4QapAK+BCs+VL2VATwCCufcJ75Z78txO+buQE0DgwluFTIYZ+IwpUMPsK04ln6A FD1/uvP1QFx60heqcp2c4zWFBUpg4DE6ufx2A5kieO268lFcWLxyVlcdgRU79ZCt uAcV2yDLk3GvPGfxZwPKEmZUo/FmuSoBv0XgT+eWxmTu/R7hcpFse49OyjBH8Tvb 8d/RCIFgXOr8dTIjtds7eenwB/is4TkRlctezEQ0jO9/JwL/BVOgXZjD1qCtNWqz is4TWK7VV+vdq1RD+0xC2hV/+uGEwQ== =+WAm -----END PGP SIGNATURE----- Merge tag 'slab-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab updates from Vlastimil Babka: "This time it's mostly refactoring and improving APIs for slab users in the kernel, along with some debugging improvements. - kmem_cache_create() refactoring (Christian Brauner) Over the years have been growing new parameters to kmem_cache_create() where most of them are needed only for a small number of caches - most recently the rcu_freeptr_offset parameter. To avoid adding new parameters to kmem_cache_create() and adjusting all its callers, or creating new wrappers such as kmem_cache_create_rcu(), we can now pass extra parameters using the new struct kmem_cache_args. Not explicitly initialized fields default to values interpreted as unused. kmem_cache_create() is for now a wrapper that works both with the new form: kmem_cache_create(name, object_size, args, flags) and the legacy form: kmem_cache_create(name, object_size, align, flags, ctor) - kmem_cache_destroy() waits for kfree_rcu()'s in flight (Vlastimil Babka, Uladislau Rezki) Since SLOB removal, kfree() is allowed for freeing objects allocated by kmem_cache_create(). By extension kfree_rcu() as allowed as well, which can allow converting simple call_rcu() callbacks that only do kmem_cache_free(), as there was never a kmem_cache_free_rcu() variant. However, for caches that can be destroyed e.g. on module removal, the cache owners knew to issue rcu_barrier() first to wait for the pending call_rcu()'s, and this is not sufficient for pending kfree_rcu()'s due to its internal batching optimizations. Ulad has provided a new kvfree_rcu_barrier() and to make the usage less error-prone, kmem_cache_destroy() calls it. Additionally, destroying SLAB_TYPESAFE_BY_RCU caches now again issues rcu_barrier() synchronously instead of using an async work, because the past motivation for async work no longer applies. Users of custom call_rcu() callbacks should however keep calling rcu_barrier() before cache destruction. - Debugging use-after-free in SLAB_TYPESAFE_BY_RCU caches (Jann Horn) Currently, KASAN cannot catch UAFs in such caches as it is legal to access them within a grace period, and we only track the grace period when trying to free the underlying slab page. The new CONFIG_SLUB_RCU_DEBUG option changes the freeing of individual object to be RCU-delayed, after which KASAN can poison them. - Delayed memcg charging (Shakeel Butt) In some cases, the memcg is uknown at allocation time, such as receiving network packets in softirq context. With kmem_cache_charge() these may be now charged later when the user and its memcg is known. - Misc fixes and improvements (Pedro Falcato, Axel Rasmussen, Christoph Lameter, Yan Zhen, Peng Fan, Xavier)" * tag 'slab-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: (34 commits) mm, slab: restore kerneldoc for kmem_cache_create() io_uring: port to struct kmem_cache_args slab: make __kmem_cache_create() static inline slab: make kmem_cache_create_usercopy() static inline slab: remove kmem_cache_create_rcu() file: port to struct kmem_cache_args slab: create kmem_cache_create() compatibility layer slab: port KMEM_CACHE_USERCOPY() to struct kmem_cache_args slab: port KMEM_CACHE() to struct kmem_cache_args slab: remove rcu_freeptr_offset from struct kmem_cache slab: pass struct kmem_cache_args to do_kmem_cache_create() slab: pull kmem_cache_open() into do_kmem_cache_create() slab: pass struct kmem_cache_args to create_cache() slab: port kmem_cache_create_usercopy() to struct kmem_cache_args slab: port kmem_cache_create_rcu() to struct kmem_cache_args slab: port kmem_cache_create() to struct kmem_cache_args slab: add struct kmem_cache_args slab: s/__kmem_cache_create/do_kmem_cache_create/g memcg: add charging of already allocated slab objects mm/slab: Optimize the code logic in find_mergeable() ...	2024-09-18 08:53:53 +02:00
Olivier Langlois	53d69bdd5b	io_uring/sqpoll: do the napi busy poll outside the submission block there are many small reasons justifying this change. 1. busy poll must be performed even on rings that have no iopoll and no new sqe. It is quite possible that a ring configured for inbound traffic with multishot be several hours without receiving new request submissions 2. NAPI busy poll does not perform any credential validation 3. If the thread is awaken by task work, processing the task work is prioritary over NAPI busy loop. This is why a second loop has been created after the io_sq_tw() call instead of doing the busy loop in __io_sq_thread() outside its credential acquisition block. Signed-off-by: Olivier Langlois <olivier@trillion01.com> Link: https://lore.kernel.org/r/de7679adf1249446bd47426db01d82b9603b7224.1726161831.git.olivier@trillion01.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-09-16 20:24:37 -06:00
Dan Carpenter	2f6a55e423	io_uring: clean up a type in io_uring_register_get_file() Originally "fd" was unsigned int but it was changed to int when we pulled this code into a separate function in commit `0b6d253e08` ("io_uring/register: provide helper to get io_ring_ctx from 'fd'"). This doesn't really cause a runtime problem because the call to array_index_nospec() will clamp negative fds to 0 and nothing else uses the negative values. Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/r/6f6cb630-079f-4fdf-bf95-1082e0a3fc6e@stanley.mountain Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-09-16 12:04:10 -06:00
Felix Moessbauer	7f44beadcc	io_uring/sqpoll: do not put cpumask on stack Putting the cpumask on the stack is deprecated for a long time (since `2d3854a37e`), as these can be big. Given that, change the on-stack allocation of allowed_mask to be dynamically allocated. Fixes: `f011c9cf04` ("io_uring/sqpoll: do not allow pinning outside of cpuset") Signed-off-by: Felix Moessbauer <felix.moessbauer@siemens.com> Link: https://lore.kernel.org/r/20240916111150.1266191-1-felix.moessbauer@siemens.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-09-16 07:49:08 -06:00
Linus Torvalds	adfc3ded5c	for-6.12/io_uring-discard-20240913 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmbkboUQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpj7DD/oDqQ13NOHuotVbufPRDWuG6+UEaN/Pukp/ RYDWwYu/DB4v7LVWBV9COqN5jQqY2wrMpgBdZqtEnDtC7yjN6QYAT4TQdfIq/HNo NooN4ULmJzOOC6sR9MBGyzOsCbz7kmRt1nBZ7vdEXMrLXeX9JDX3bDrELf7jhKsk 84lKE/Mxs530LSzxAtN9KaOQncK5gXen4WSrZsYraU2vJFAPBkJwQGAL5pOdmsp9 NqvNE3QonPr4v99XnDJH80q44afuqffUITPjtGX52tBMO3CCUQFUpZp5fiUjfa1v Okz+SyeBE6gB7c008BGqTOgmKdQOMs3uwFDQ/xMw+pYwy+wHH4skzPP776DwAdgn C/SaVFsaXkqOXX4f+CiNJ01LmD4EOBy16LM5qE4NwLNpjQu/3EdHjNqaYfM/LCca YyQoUOsnYIRj21+oNFpKekscuEAPKG9ewyMyvfxbkk167j00lgwVwybb/2JfYvRJ i0GBY5phJnkeNUerU9SDm6RBTAjDOZ0stubTtFjugDZdrz2FmA4pBFGWjgYLiLhH 3ZCyaCAOoYW8yxxkogTzKbLx6wXb5wgS7jTHgsk+eeSSWRBTnv2sd0fn/D5m3Uw7 uBHKvauDp3zEd9MdF26QG7U6RlojEbVoyTYjnJskPsClxbch4WSpwvoEILdJRvls 1dTczxgdyw== =wlzo -----END PGP SIGNATURE----- Merge tag 'for-6.12/io_uring-discard-20240913' of git://git.kernel.dk/linux Pull io_uring async discard support from Jens Axboe: "Sitting on top of both the 6.12 block and io_uring core branches, here's support for async discard through io_uring. This allows applications to issue async discards, rather than rely on the blocking sync ioctl discards we already have. The sync support is difficult to use outside of idle/cleanup periods. On a real (but slow) device, testing shows the following results when compared to sync discard: qd64 sync discard: 21K IOPS, lat avg 3 msec (max 21 msec) qd64 async discard: 76K IOPS, lat avg 845 usec (max 2.2 msec) qd64 sync discard: 14K IOPS, lat avg 5 msec (max 25 msec) qd64 async discard: 56K IOPS, lat avg 1153 usec (max 3.6 msec) and synthetic null_blk testing with the same queue depth and block size settings as above shows: Type Trim size IOPS Lat avg (usec) Lat Max (usec) ============================================================== sync 4k 144K 444 20314 async 4k 1353K 47 595 sync 1M 56K 1136 21031 async 1M 94K 680 760" * tag 'for-6.12/io_uring-discard-20240913' of git://git.kernel.dk/linux: block: implement async io_uring discard cmd block: introduce blk_validate_byte_range() filemap: introduce filemap_invalidate_pages io_uring/cmd: give inline space in request to cmds io_uring/cmd: expose iowq to cmds	2024-09-16 13:50:14 +02:00
Linus Torvalds	3a4d319a8f	for-6.12/io_uring-20240913 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmbkST4QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpnU7D/47BmxQmTbsT9NFBeZrQVgmQ2Zap2WWx3Za 4qGuU1VxcafztqWnRChtxznheVG9ioHglcxfbZjc/D4/BiffgF4n5Z48qh1c0t8O +2pwq75j0WyJkHH9wCrrN9Jq8zSB6pBr2sMEQmSilMgYZKMzhXrXevKkYnthj/1a 7U9QzY+lfc8neZRHR7VDouPWIRjBhwaO62ANXWCL7F2uE6NQasU61x6YTzGuoDB3 0gR5PbSiLIusGxsYqIVmQUPNBUOw8nOzXXcbw8kBlRdnpadns8rNk+ivIMtAYw0m s6xVWNWFToVxO8956rBnjicD6ZzF5Txe6gWC6gvhKMFkOyxkihgMCOZUpSmw6D8G YlDHB4+lijpQMyPDw1UUPOYPVGSVRp/f2MuRcEhW/Yums5vd9eOVrUVsFjfYRQLr fg+lp3rEMoHxBnuKneMY2inuZW99+LGyO8F4IVublwXoXKFcq3TdGCvn5OZUBGDn E5x4QGq+cf9icK4kqN5mVi256fhOLnqDTtzIg4qiwhZ5h9UA3CFjGc56G7wqgp8d Bu5scCkJR5tXJEZA1hce+w2bXzrM6Xd2gym5A6D6k8S3QheHkKva60/qfIzhs/x0 6nlJYSlznyQbDOBDQIJC86OE4tcShNusjFIgIDg6ZvAX2qk7BBmbPNF4RGrI9TTM xz2dONRhlA== =ZNjL -----END PGP SIGNATURE----- Merge tag 'for-6.12/io_uring-20240913' of git://git.kernel.dk/linux Pull io_uring updates from Jens Axboe: - NAPI fixes and cleanups (Pavel, Olivier) - Add support for absolute timeouts (Pavel) - Fixes for io-wq/sqpoll affinities (Felix) - Efficiency improvements for dealing with huge pages (Chenliang) - Support for a minwait mode, where the application essentially has two timouts - one smaller one that defines the batch timeout, and the overall large one similar to what we had before. This enables efficient use of batching based on count + timeout, while still working well with periods of less intensive workloads - Use ITER_UBUF for single segment sends - Add support for incremental buffer consumption. Right now each operation will always consume a full buffer. With incremental consumption, a recv/read operation only consumes the part of the buffer that it needs to satisfy the operation - Add support for GCOV for io_uring, to help retain a high coverage of test to code ratio - Fix regression with ocfs2, where an odd -EOPNOTSUPP wasn't correctly converted to a blocking retry - Add support for cloning registered buffers from one ring to another - Misc cleanups (Anuj, me) * tag 'for-6.12/io_uring-20240913' of git://git.kernel.dk/linux: (35 commits) io_uring: add IORING_REGISTER_COPY_BUFFERS method io_uring/register: provide helper to get io_ring_ctx from 'fd' io_uring/rsrc: add reference count to struct io_mapped_ubuf io_uring/rsrc: clear 'slot' entry upfront io_uring/io-wq: inherit cpuset of cgroup in io worker io_uring/io-wq: do not allow pinning outside of cpuset io_uring/rw: drop -EOPNOTSUPP check in __io_complete_rw_common() io_uring/rw: treat -EOPNOTSUPP for IOCB_NOWAIT like -EAGAIN io_uring/sqpoll: do not allow pinning outside of cpuset io_uring/eventfd: move refs to refcount_t io_uring: remove unused rsrc_put_fn io_uring: add new line after variable declaration io_uring: add GCOV_PROFILE_URING Kconfig option io_uring/kbuf: add support for incremental buffer consumption io_uring/kbuf: pass in 'len' argument for buffer commit Revert "io_uring: Require zeroed sqe->len on provided-buffers send" io_uring/kbuf: move io_ring_head_to_buf() to kbuf.h io_uring/kbuf: add io_kbuf_commit() helper io_uring/kbuf: shrink nr_iovs/mode in struct buf_sel_arg io_uring: wire up min batch wake timeout ...	2024-09-16 13:29:00 +02:00
Jens Axboe	a09c17240b	io_uring/sqpoll: retain test for whether the CPU is valid A recent commit ensured that SQPOLL cannot be setup with a CPU that isn't in the current tasks cpuset, but it also dropped testing whether the CPU is valid in the first place. Without that, if a task passes in a CPU value that is too high, the following KASAN splat can get triggered: BUG: KASAN: stack-out-of-bounds in io_sq_offload_create+0x858/0xaa4 Read of size 8 at addr ffff800089bc7b90 by task wq-aff.t/1391 CPU: 4 UID: 1000 PID: 1391 Comm: wq-aff.t Not tainted 6.11.0-rc7-00227-g371c468f4db6 #7080 Hardware name: linux,dummy-virt (DT) Call trace: dump_backtrace.part.0+0xcc/0xe0 show_stack+0x14/0x1c dump_stack_lvl+0x58/0x74 print_report+0x16c/0x4c8 kasan_report+0x9c/0xe4 __asan_report_load8_noabort+0x1c/0x24 io_sq_offload_create+0x858/0xaa4 io_uring_setup+0x1394/0x17c4 __arm64_sys_io_uring_setup+0x6c/0x180 invoke_syscall+0x6c/0x260 el0_svc_common.constprop.0+0x158/0x224 do_el0_svc+0x3c/0x5c el0_svc+0x34/0x70 el0t_64_sync_handler+0x118/0x124 el0t_64_sync+0x168/0x16c The buggy address belongs to stack of task wq-aff.t/1391 and is located at offset 48 in frame: io_sq_offload_create+0x0/0xaa4 This frame has 1 object: [32, 40) 'allowed_mask' The buggy address belongs to the virtual mapping at [ffff800089bc0000, ffff800089bc9000) created by: kernel_clone+0x124/0x7e0 The buggy address belongs to the physical page: page: refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff0000d740af80 pfn:0x11740a memcg:ffff0000c2706f02 flags: 0xbffe00000000000(node=0\|zone=2\|lastcpupid=0x1fff) raw: 0bffe00000000000 0000000000000000 dead000000000122 0000000000000000 raw: ffff0000d740af80 0000000000000000 00000001ffffffff ffff0000c2706f02 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff800089bc7a80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff800089bc7b00: 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1 >ffff800089bc7b80: 00 f3 f3 f3 00 00 00 00 00 00 00 00 00 00 00 00 ^ ffff800089bc7c00: 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1 ffff800089bc7c80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f3 Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202409161632.cbeeca0d-lkp@intel.com Fixes: `f011c9cf04` ("io_uring/sqpoll: do not allow pinning outside of cpuset") Tested-by: Felix Moessbauer <felix.moessbauer@siemens.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-09-16 03:12:21 -06:00
Jens Axboe	9753c642a5	io_uring/rsrc: change ubuf->ubuf_end to length tracking If we change it to tracking ubuf->start + ubuf->len, then we can reduce the size of struct io_mapped_ubuf by another 4 bytes, effectively 8 bytes, as a hole is eliminated too. This shrinks io_mapped_ubuf to 32 bytes. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-09-15 09:15:22 -06:00
Jens Axboe	8b0c6025a0	io_uring/rsrc: get rid of io_mapped_ubuf->folio_mask We don't really need to cache this, let's reclaim 8 bytes from struct io_mapped_ubuf and just calculate it when we need it. The only hot path here is io_import_fixed(). Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-09-15 09:15:19 -06:00
Jens Axboe	636119af94	io_uring: rename "copy buffers" to "clone buffers" A recent commit added support for copying registered buffers from one ring to another. But that term is a bit confusing, as no copying of buffer data is done here. What is being done is simply cloning the buffer registrations from one ring to another. Rename it while we still can, so that it's more descriptive. No functional changes in this patch. Fixes: `7cc2a6eadc` ("io_uring: add IORING_REGISTER_COPY_BUFFERS method") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-09-14 08:51:15 -06:00
Jens Axboe	7cc2a6eadc	io_uring: add IORING_REGISTER_COPY_BUFFERS method Buffers can get registered with io_uring, which allows to skip the repeated pin_pages, unpin/unref pages for each O_DIRECT operation. This reduces the overhead of O_DIRECT IO. However, registrering buffers can take some time. Normally this isn't an issue as it's done at initialization time (and hence less critical), but for cases where rings can be created and destroyed as part of an IO thread pool, registering the same buffers for multiple rings become a more time sensitive proposition. As an example, let's say an application has an IO memory pool of 500G. Initial registration takes: Got 500 huge pages (each 1024MB) Registered 500 pages in 409 msec or about 0.4 seconds. If we go higher to 900 1GB huge pages being registered: Registered 900 pages in 738 msec which is, as expected, a fully linear scaling. Rather than have each ring pin/map/register the same buffer pool, provide an io_uring_register(2) opcode to simply duplicate the buffers that are registered with another ring. Adding the same 900GB of registered buffers to the target ring can then be accomplished in: Copied 900 pages in 17 usec While timing differs a bit, this provides around a 25,000-40,000x speedup for this use case. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-09-12 10:14:15 -06:00
Jens Axboe	0b6d253e08	io_uring/register: provide helper to get io_ring_ctx from 'fd' Can be done in one of two ways: 1) Regular file descriptor, just fget() 2) Registered ring, index our own table for that In preparation for adding another register use of needing to get a ctx from a file descriptor, abstract out this helper and use it in the main register syscall as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-09-12 10:14:05 -06:00
Jens Axboe	bfc0aa7a51	io_uring/rsrc: add reference count to struct io_mapped_ubuf Currently there's a single ring owner of a mapped buffer, and hence the reference count will always be 1 when it's torn down and freed. However, in preparation for being able to link io_mapped_ubuf to different spots, add a reference count to manage the lifetime of it. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-09-11 13:54:32 -06:00
Jens Axboe	021b153f7d	io_uring/rsrc: clear 'slot' entry upfront No functional changes in this patch, but clearing the slot pointer earlier will be required by a later change. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-09-11 13:52:17 -06:00
Pavel Begunkov	6746ee4c3a	io_uring/cmd: expose iowq to cmds When an io_uring request needs blocking context we offload it to the io_uring's thread pool called io-wq. We can get there off ->uring_cmd by returning -EAGAIN, but there is no straightforward way of doing that from an asynchronous callback. Add a helper that would transfer a command to a blocking context. Note, we do an extra hop via task_work before io_queue_iowq(), that's a limitation of io_uring infra we have that can likely be lifted later if that would ever become a problem. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f735f807d7c8ba50c9452c69dfe5d3e9e535037b.1726072086.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-09-11 10:44:10 -06:00
Jens Axboe	6d0f8dcb3a	Merge branch 'for-6.12/io_uring' into for-6.12/io_uring-discard * for-6.12/io_uring: (31 commits) io_uring/io-wq: inherit cpuset of cgroup in io worker io_uring/io-wq: do not allow pinning outside of cpuset io_uring/rw: drop -EOPNOTSUPP check in __io_complete_rw_common() io_uring/rw: treat -EOPNOTSUPP for IOCB_NOWAIT like -EAGAIN io_uring/sqpoll: do not allow pinning outside of cpuset io_uring/eventfd: move refs to refcount_t io_uring: remove unused rsrc_put_fn io_uring: add new line after variable declaration io_uring: add GCOV_PROFILE_URING Kconfig option io_uring/kbuf: add support for incremental buffer consumption io_uring/kbuf: pass in 'len' argument for buffer commit Revert "io_uring: Require zeroed sqe->len on provided-buffers send" io_uring/kbuf: move io_ring_head_to_buf() to kbuf.h io_uring/kbuf: add io_kbuf_commit() helper io_uring/kbuf: shrink nr_iovs/mode in struct buf_sel_arg io_uring: wire up min batch wake timeout io_uring: add support for batch wait timeout io_uring: implement our own schedule timeout handling io_uring: move schedule wait logic into helper io_uring: encapsulate extraneous wait flags into a separate struct ...	2024-09-11 10:42:40 -06:00
Felix Moessbauer	84eacf177f	io_uring/io-wq: inherit cpuset of cgroup in io worker The io worker threads are userland threads that just never exit to the userland. By that, they are also assigned to a cgroup (the group of the creating task). When creating a new io worker, this worker should inherit the cpuset of the cgroup. Fixes: `da64d6db3b` ("io_uring: One wqe per wq") Signed-off-by: Felix Moessbauer <felix.moessbauer@siemens.com> Link: https://lore.kernel.org/r/20240910171157.166423-3-felix.moessbauer@siemens.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-09-11 07:27:56 -06:00
Felix Moessbauer	0997aa5497	io_uring/io-wq: do not allow pinning outside of cpuset The io worker threads are userland threads that just never exit to the userland. By that, they are also assigned to a cgroup (the group of the creating task). When changing the affinity of the io_wq thread via syscall, we must only allow cpumasks within the limits defined by the cpuset controller of the cgroup (if enabled). Fixes: `da64d6db3b` ("io_uring: One wqe per wq") Signed-off-by: Felix Moessbauer <felix.moessbauer@siemens.com> Link: https://lore.kernel.org/r/20240910171157.166423-2-felix.moessbauer@siemens.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-09-11 07:27:56 -06:00
Jens Axboe	90bfb28d5f	io_uring/rw: drop -EOPNOTSUPP check in __io_complete_rw_common() A recent change ensured that the necessary -EOPNOTSUPP -> -EAGAIN transformation happens inline on both the reader and writer side, and hence there's no need to check for both of these anymore on the completion handler side. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-09-10 09:34:44 -06:00
Jens Axboe	c0a9d496e0	io_uring/rw: treat -EOPNOTSUPP for IOCB_NOWAIT like -EAGAIN Some file systems, ocfs2 in this case, will return -EOPNOTSUPP for an IOCB_NOWAIT read/write attempt. While this can be argued to be correct, the usual return value for something that requires blocking issue is -EAGAIN. A refactoring io_uring commit dropped calling kiocb_done() for negative return values, which is otherwise where we already do that transformation. To ensure we catch it in both spots, check it in __io_read() itself as well. Reported-by: Robert Sander <r.sander@heinlein-support.de> Link: https://fosstodon.org/@gurubert@mastodon.gurubert.de/113112431889638440 Cc: stable@vger.kernel.org Fixes: `a08d195b58` ("io_uring/rw: split io_read() into a helper") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-09-10 09:34:41 -06:00
Christian Brauner	a6711d1cd4	io_uring: port to struct kmem_cache_args Port req_cachep to struct kmem_cache_args. Reviewed-by: Kees Cook <kees@kernel.org> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2024-09-10 11:42:59 +02:00
Felix Moessbauer	f011c9cf04	io_uring/sqpoll: do not allow pinning outside of cpuset The submit queue polling threads are userland threads that just never exit to the userland. When creating the thread with IORING_SETUP_SQ_AFF, the affinity of the poller thread is set to the cpu specified in sq_thread_cpu. However, this CPU can be outside of the cpuset defined by the cgroup cpuset controller. This violates the rules defined by the cpuset controller and is a potential issue for realtime applications. In b7ed6d8ffd6 we fixed the default affinity of the poller thread, in case no explicit pinning is required by inheriting the one of the creating task. In case of explicit pinning, the check is more complicated, as also a cpu outside of the parent cpumask is allowed. We implemented this by using cpuset_cpus_allowed (that has support for cgroup cpusets) and testing if the requested cpu is in the set. Fixes: `37d1e2e364` ("io_uring: move SQPOLL thread io-wq forked worker") Cc: stable@vger.kernel.org # 6.1+ Signed-off-by: Felix Moessbauer <felix.moessbauer@siemens.com> Link: https://lore.kernel.org/r/20240909150036.55921-1-felix.moessbauer@siemens.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-09-09 09:09:08 -06:00
Jens Axboe	0e0bcf07ec	io_uring/eventfd: move refs to refcount_t atomic_t for the struct io_ev_fd references and there are no issues with it. While the ref getting and putting for the eventfd code is somewhat performance critical for cases where eventfd signaling is used (news flash, you should not...), it probably doesn't warrant using an atomic_t for this. Let's just move to it to refcount_t to get the added protection of over/underflows. Link: https://lore.kernel.org/lkml/202409082039.hnsaIJ3X-lkp@intel.com/ Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202409082039.hnsaIJ3X-lkp@intel.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-09-08 16:43:57 -06:00
Anuj Gupta	c9f9ce65c2	io_uring: remove unused rsrc_put_fn rsrc_put_fn is declared but never used, remove it. Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Link: https://lore.kernel.org/r/20240902062134.136387-3-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-09-02 09:39:57 -06:00
Anuj Gupta	6cf52b42c4	io_uring: add new line after variable declaration Fixes checkpatch warning Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Link: https://lore.kernel.org/r/20240902062134.136387-2-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-09-02 09:39:57 -06:00
Jens Axboe	1802656ef8	io_uring: add GCOV_PROFILE_URING Kconfig option If GCOV is enabled and this option is set, it enables code coverage profiling of the io_uring subsystem. Only use this for test purposes, as it will impact the runtime performance. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-08-30 10:52:02 -06:00
Jens Axboe	f274495aea	io_uring/kbuf: return correct iovec count from classic buffer peek io_provided_buffers_select() returns 0 to indicate success, but it should be returning 1 to indicate that 1 vec was mapped. This causes peeking to fail with classic provided buffers, and while that's not a use case that anyone should use, it should still work correctly. The end result is that no buffer will be selected, and hence a completion with '0' as the result will be posted, without a buffer attached. Fixes: `35c8711c8f` ("io_uring/kbuf: add helpers for getting/peeking multiple buffers") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-08-30 10:45:54 -06:00
Jens Axboe	1c47c0d601	io_uring/rsrc: ensure compat iovecs are copied correctly For buffer registration (or updates), a userspace iovec is copied in and updated. If the application is within a compat syscall, then the iovec type is compat_iovec rather than iovec. However, the type used in __io_sqe_buffers_update() and io_sqe_buffers_register() is always struct iovec, and hence the source is incremented by the size of a non-compat iovec in the loop. This misses every other iovec in the source, and will run into garbage half way through the copies and return -EFAULT to the application. Maintain the source address separately and assign to our user vec pointer, so that copies always happen from the right source address. While in there, correct a bad placement of __user which triggered the following sparse warning prior to this fix: io_uring/rsrc.c:981:33: warning: cast removes address space '__user' of expression io_uring/rsrc.c:981:30: warning: incorrect type in assignment (different address spaces) io_uring/rsrc.c:981:30: expected struct iovec const [noderef] __user uvec io_uring/rsrc.c:981:30: got struct iovec [noderef] __user Fixes: `f4eaf8eda8` ("io_uring/rsrc: Drop io_copy_iov in favor of iovec API") Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-08-30 07:52:43 -06:00
Jens Axboe	ae98dbf43d	io_uring/kbuf: add support for incremental buffer consumption By default, any recv/read operation that uses provided buffers will consume at least 1 buffer fully (and maybe more, in case of bundles). This adds support for incremental consumption, meaning that an application may add large buffers, and each read/recv will just consume the part of the buffer that it needs. For example, let's say an application registers 1MB buffers in a provided buffer ring, for streaming receives. If it gets a short recv, then the full 1MB buffer will be consumed and passed back to the application. With incremental consumption, only the part that was actually used is consumed, and the buffer remains the current one. This means that both the application and the kernel needs to keep track of what the current receive point is. Each recv will still pass back a buffer ID and the size consumed, the only difference is that before the next receive would always be the next buffer in the ring. Now the same buffer ID may return multiple receives, each at an offset into that buffer from where the previous receive left off. Example: Application registers a provided buffer ring, and adds two 32K buffers to the ring. Buffer1 address: 0x1000000 (buffer ID 0) Buffer2 address: 0x2000000 (buffer ID 1) A recv completion is received with the following values: cqe->res 0x1000 (4k bytes received) cqe->flags 0x11 (CQE_F_BUFFER\|CQE_F_BUF_MORE set, buffer ID 0) and the application now knows that 4096b of data is available at 0x1000000, the start of that buffer, and that more data from this buffer will be coming. Now the next receive comes in: cqe->res 0x2010 (8k bytes received) cqe->flags 0x11 (CQE_F_BUFFER\|CQE_F_BUF_MORE set, buffer ID 0) which tells the application that 8k is available where the last completion left off, at 0x1001000. Next completion is: cqe->res 0x5000 (20k bytes received) cqe->flags 0x1 (CQE_F_BUFFER set, buffer ID 0) and the application now knows that 20k of data is available at 0x1003000, which is where the previous receive ended. CQE_F_BUF_MORE isn't set, as no more data is available in this buffer ID. The next completion is then: cqe->res 0x1000 (4k bytes received) cqe->flags 0x10001 (CQE_F_BUFFER\|CQE_F_BUF_MORE set, buffer ID 1) which tells the application that buffer ID 1 is now the current one, hence there's 4k of valid data at 0x2000000. 0x2001000 will be the next receive point for this buffer ID. When a buffer will be reused by future CQE completions, IORING_CQE_BUF_MORE will be set in cqe->flags. This tells the application that the kernel isn't done with the buffer yet, and that it should expect more completions for this buffer ID. Will only be set by provided buffer rings setup with IOU_PBUF_RING INC, as that's the only type of buffer that will see multiple consecutive completions for the same buffer ID. For any other provided buffer type, any completion that passes back a buffer to the application is final. Once a buffer has been fully consumed, the buffer ring head is incremented and the next receive will indicate the next buffer ID in the CQE cflags. On the send side, the application can manage how much data is sent from an existing buffer by setting sqe->len to the desired send length. An application can request incremental consumption by setting IOU_PBUF_RING_INC in the provided buffer ring registration. Outside of that, any provided buffer ring setup and buffer additions is done like before, no changes there. The only change is in how an application may see multiple completions for the same buffer ID, hence needing to know where the next receive will happen. Note that like existing provided buffer rings, this should not be used with IOSQE_ASYNC, as both really require the ring to remain locked over the duration of the buffer selection and the operation completion. It will consume a buffer otherwise regardless of the size of the IO done. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-08-29 08:44:58 -06:00
Jens Axboe	6733e678ba	io_uring/kbuf: pass in 'len' argument for buffer commit In preparation for needing the consumed length, pass in the length being completed. Unused right now, but will be used when it is possible to partially consume a buffer. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-08-29 08:44:51 -06:00
Jens Axboe	641a681679	Revert "io_uring: Require zeroed sqe->len on provided-buffers send" This reverts commit `79996b45f7`. Revert the change that restricts a send provided buffer to be zero, so it will always consume the whole buffer. This is strictly needed for partial consumption, as the send may very well be a subset of the current buffer. In fact, that's the intended use case. For non-incremental provided buffer rings, an application should set sqe->len carefully to avoid the potential issue described in the reverted commit. It is recommended that '0' still be set for len for that case, if the application is set on maintaining more than 1 send inflight for the same socket. This is somewhat of a nonsensical thing to do. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-08-29 08:44:46 -06:00
Jens Axboe	2c8fa70bf3	io_uring/kbuf: move io_ring_head_to_buf() to kbuf.h In preparation for using this helper in kbuf.h as well, move it there and turn it into a macro. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-08-29 08:44:42 -06:00
Jens Axboe	ecd5c9b296	io_uring/kbuf: add io_kbuf_commit() helper Committing the selected ring buffer is currently done in three different spots, combine it into a helper and just call that. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-08-29 08:44:38 -06:00
Jens Axboe	120443321d	io_uring/kbuf: shrink nr_iovs/mode in struct buf_sel_arg nr_iovs is capped at 1024, and mode only has a few low values. We can safely make them u16, in preparation for adding a few more members. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-08-25 08:27:01 -06:00
Jens Axboe	7ed9e09e2d	io_uring: wire up min batch wake timeout Expose min_wait_usec in io_uring_getevents_arg, replacing the pad member that is currently in there. The value is in usecs, which is explained in the name as well. Note that if min_wait_usec and a normal timeout is used in conjunction, the normal timeout is still relative to the base time. For example, if min_wait_usec is set to 100 and the normal timeout is 1000, the max total time waited is still 1000. This also means that if the normal timeout is shorter than min_wait_usec, then only the min_wait_usec will take effect. See previous commit for an explanation of how this works. IORING_FEAT_MIN_TIMEOUT is added as a feature flag for this, as applications doing submit_and_wait_timeout() style operations will generally not see the -EINVAL from the wait side as they return the number of IOs submitted. Only if no IOs are submitted will the -EINVAL bubble back up to the application. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-08-25 08:27:01 -06:00
Jens Axboe	1100c4a265	io_uring: add support for batch wait timeout Waiting for events with io_uring has two knobs that can be set: 1) The number of events to wake for 2) The timeout associated with the event Waiting will abort when either of those conditions are met, as expected. This adds support for a third event, which is associated with the number of events to wait for. Applications generally like to handle batches of completions, and right now they'd set a number of events to wait for and the timeout for that. If no events have been received but the timeout triggers, control is returned to the application and it can wait again. However, if the application doesn't have anything to do until events are reaped, then it's possible to make this waiting more efficient. For example, the application may have a latency time of 50 usecs and wanting to handle a batch of 8 requests at the time. If it uses 50 usecs as the timeout, then it'll be doing 20K context switches per second even if nothing is happening. This introduces the notion of min batch wait time. If the min batch wait time expires, then we'll return to userspace if we have any events at all. If none are available, the general wait time is applied. Any request arriving after the min batch wait time will cause waiting to stop and return control to the application. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-08-25 08:27:01 -06:00

1 2 3 4 5 ...

1251 Commits