linux

Author	SHA1	Message	Date
Jens Axboe	6bf9c47a39	io_uring: defer file assignment If an application uses direct open or accept, it knows in advance what direct descriptor value it will get as it picks it itself. This allows combined requests such as: sqe = io_uring_get_sqe(ring); io_uring_prep_openat_direct(sqe, ..., file_slot); sqe->flags \|= IOSQE_IO_LINK \| IOSQE_CQE_SKIP_SUCCESS; sqe = io_uring_get_sqe(ring); io_uring_prep_read(sqe,file_slot, buf, buf_size, 0); sqe->flags \|= IOSQE_FIXED_FILE; io_uring_submit(ring); where we prepare both a file open and read, and only get a completion event for the read when both have completed successfully. Currently links are fully prepared before the head is issued, but that fails if the dependent link needs a file assigned that isn't valid until the head has completed. Conversely, if the same chain is performed but the fixed file slot is already valid, then we would be unexpectedly returning data from the old file slot rather than the newly opened one. Make sure we're consistent here. Allow deferral of file setup, which makes this documented case work. Cc: stable@vger.kernel.org # v5.15+ Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-04-07 11:17:28 -06:00
Linus Torvalds	35ce8ae9ae	Merge branch 'signal-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull signal/exit/ptrace updates from Eric Biederman: "This set of changes deletes some dead code, makes a lot of cleanups which hopefully make the code easier to follow, and fixes bugs found along the way. The end-game which I have not yet reached yet is for fatal signals that generate coredumps to be short-circuit deliverable from complete_signal, for force_siginfo_to_task not to require changing userspace configured signal delivery state, and for the ptrace stops to always happen in locations where we can guarantee on all architectures that the all of the registers are saved and available on the stack. Removal of profile_task_ext, profile_munmap, and profile_handoff_task are the big successes for dead code removal this round. A bunch of small bug fixes are included, as most of the issues reported were small enough that they would not affect bisection so I simply added the fixes and did not fold the fixes into the changes they were fixing. There was a bug that broke coredumps piped to systemd-coredump. I dropped the change that caused that bug and replaced it entirely with something much more restrained. Unfortunately that required some rebasing. Some successes after this set of changes: There are few enough calls to do_exit to audit in a reasonable amount of time. The lifetime of struct kthread now matches the lifetime of struct task, and the pointer to struct kthread is no longer stored in set_child_tid. The flag SIGNAL_GROUP_COREDUMP is removed. The field group_exit_task is removed. Issues where task->exit_code was examined with signal->group_exit_code should been examined were fixed. There are several loosely related changes included because I am cleaning up and if I don't include them they will probably get lost. The original postings of these changes can be found at: https://lkml.kernel.org/r/87a6ha4zsd.fsf@email.froward.int.ebiederm.org https://lkml.kernel.org/r/87bl1kunjj.fsf@email.froward.int.ebiederm.org https://lkml.kernel.org/r/87r19opkx1.fsf_-_@email.froward.int.ebiederm.org I trimmed back the last set of changes to only the obviously correct once. Simply because there was less time for review than I had hoped" * 'signal-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (44 commits) ptrace/m68k: Stop open coding ptrace_report_syscall ptrace: Remove unused regs argument from ptrace_report_syscall ptrace: Remove second setting of PT_SEIZED in ptrace_attach taskstats: Cleanup the use of task->exit_code exit: Use the correct exit_code in /proc/<pid>/stat exit: Fix the exit_code for wait_task_zombie exit: Coredumps reach do_group_exit exit: Remove profile_handoff_task exit: Remove profile_task_exit & profile_munmap signal: clean up kernel-doc comments signal: Remove the helper signal_group_exit signal: Rename group_exit_task group_exec_task coredump: Stop setting signal->group_exit_task signal: Remove SIGNAL_GROUP_COREDUMP signal: During coredumps set SIGNAL_GROUP_EXIT in zap_process signal: Make coredump handling explicit in complete_signal signal: Have prepare_signal detect coredumps using signal->core_state signal: Have the oom killer detect coredumps using signal->core_state exit: Move force_uaccess back into do_exit exit: Guarantee make_task_dead leaks the tsk when calling do_task_exit ...	2022-01-17 05:49:30 +02:00
Eric W. Biederman	e32cf5dfbe	kthread: Generalize pf_io_worker so it can point to struct kthread The point of using set_child_tid to hold the kthread pointer was that it already did what is necessary. There are now restrictions on when set_child_tid can be initialized and when set_child_tid can be used in schedule_tail. Which indicates that continuing to use set_child_tid to hold the kthread pointer is a bad idea. Instead of continuing to use the set_child_tid field of task_struct generalize the pf_io_worker field of task_struct and use it to hold the kthread pointer. Rename pf_io_worker (which is a void * pointer) to worker_private so it can be used to store kthreads struct kthread pointer. Update the kthread code to store the kthread pointer in the worker_private field. Remove the places where set_child_tid had to be dealt with carefully because kthreads also used it. Link: https://lkml.kernel.org/r/CAHk-=wgtFAA9SbVYg0gR1tqPMC17-NYcs0GQkaYg1bGhh1uJQQ@mail.gmail.com Link: https://lkml.kernel.org/r/87a6grvqy8.fsf_-_@email.froward.int.ebiederm.org Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2022-01-08 09:39:49 -06:00
Hao Xu	24115c4e95	io-wq: add helper to merge two wq_lists add a helper to merge two wq_lists, it will be useful in the next patches. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Link: https://lore.kernel.org/r/20211207093951.247840-2-haoxu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-12-07 15:01:57 -07:00
Pavel Begunkov	765ff496c7	io_uring: optimise INIT_WQ_LIST The invariant of io_wq_work_list is that it's empty IFF ->first is NULL, so no need to initially set ->last. With now having more users of the list it may play a role, i.e. used in each tw iteration and on every completion flushing. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c464ab5cab6e46a858c6d39c107e92b3b5291f13.1633373302.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-10-19 05:49:54 -06:00
Pavel Begunkov	5eef4e87eb	io_uring: use single linked list for iopoll Use single linked lists for keeping iopoll requests, takes less space, may be faster, but mostly will be of benefit for further patches. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/314033676b100cd485518c3bc55e1b95a0dcd71f.1632516769.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-10-19 05:49:53 -06:00
Pavel Begunkov	0d9521b9b5	io-wq: add io_wq_work_node based stack Apart from just using lists (i.e. io_wq_work_list), we also want to have stacks, which are a bit faster, and have some interoperability between them. Add a stack implementation based on io_wq_work_node and some helpers. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/5d3a412a5ac0d47e0f0499d70d2207d70a68925e.1632516769.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-10-19 05:49:53 -06:00
Jens Axboe	2e480058dd	io-wq: provide a way to limit max number of workers io-wq divides work into two categories: 1) Work that completes in a bounded time, like reading from a regular file or a block device. This type of work is limited based on the size of the SQ ring. 2) Work that may never complete, we call this unbounded work. The amount of workers here is just limited by RLIMIT_NPROC. For various uses cases, it's handy to have the kernel limit the maximum amount of pending workers for both categories. Provide a way to do with with a new IORING_REGISTER_IOWQ_MAX_WORKERS operation. IORING_REGISTER_IOWQ_MAX_WORKERS takes an array of two integers and sets the max worker count to what is being passed in for each category. The old values are returned into that same array. If 0 is being passed in for either category, it simply returns the current value. The value is capped at RLIMIT_NPROC. This actually isn't that important as it's more of a hint, if we're exceeding the value then our attempt to fork a new worker will fail. This happens naturally already if more than one node is in the system, as these values are per-node internally for io-wq. Reported-by: Johannes Lundberg <johalun0@gmail.com> Link: https://github.com/axboe/liburing/issues/420 Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-29 07:55:55 -06:00
Pavel Begunkov	8724dd8c83	io-wq: improve wq_list_add_tail() Prepare nodes that we're going to add before actually linking them, it's always safer and costs us nothing. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f7e53f0c84c02ed6748c488ed0789b98f8cc6185.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:56 -06:00
Pavel Begunkov	c10d1f986b	io_uring: move creds from io-wq work to io_kiocb io-wq now doesn't have anything to do with creds now, so move ->creds from struct io_wq_work into request (aka struct io_kiocb). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/8520c72ab8b8f4b96db12a228a2ab4c094ae64e1.1623949695.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-18 09:22:02 -06:00
Jens Axboe	fe76421d1d	io_uring: allow user configurable IO thread CPU affinity io-wq defaults to per-node masks for IO workers. This works fine by default, but isn't particularly handy for workloads that prefer more specific affinities, for either performance or isolation reasons. This adds IORING_REGISTER_IOWQ_AFF that allows the user to pass in a CPU mask that is then applied to IO thread workers, and an IORING_UNREGISTER_IOWQ_AFF that simply resets the masks back to the default of per-node. Note that no care is given to existing IO threads, they will need to go through a reschedule before the affinity is correct if they are already running or sleeping. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-06-17 10:25:50 -06:00
Pavel Begunkov	17a91051fe	io_uring/io-wq: close io-wq full-stop gap There is an old problem with io-wq cancellation where requests should be killed and are in io-wq but are not discoverable, e.g. in @next_hashed or @linked vars of io_worker_handle_work(). It adds some unreliability to individual request canellation, but also may potentially get __io_uring_cancel() stuck. For instance: 1) An __io_uring_cancel()'s cancellation round have not found any request but there are some as desribed. 2) __io_uring_cancel() goes to sleep 3) Then workers wake up and try to execute those hidden requests that happen to be unbound. As we already cancel all requests of io-wq there, set IO_WQ_BIT_EXIT in advance, so preventing 3) from executing unbound requests. The workers will initially break looping because of getting a signal as they are threads of the dying/exec()'ing user task. Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/abfcf8c54cb9e8f7bfbad7e9a0cc5433cc70bdc2.1621781238.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-05-25 19:39:58 -06:00
Jens Axboe	685fe7feed	io-wq: eliminate the need for a manager thread io-wq relies on a manager thread to create/fork new workers, as needed. But there's really no strong need for it anymore. We have the following cases that fork a new worker: 1) Work queue. This is done from the task itself always, and it's trivial to create a worker off that path, if needed. 2) All workers have gone to sleep, and we have more work. This is called off the sched out path. For this case, use a task_work items to queue a fork-worker operation. 3) Hashed work completion. Don't think we need to do anything off this case. If need be, it could just use approach 2 as well. Part of this change is incrementing the running worker count before the fork, to avoid cases where we observe we need a worker and then queue creation of one. Then new work comes in, we fork a new one. That last queue operation should have waited for the previous worker to come up, it's quite possible we don't even need it. Hence move the worker running from before we fork it off to more efficiently handle that case. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-11 17:42:00 -06:00
Stefan Metzmacher	53e043b2b4	io_uring: remove structures from include/linux/io_uring.h Link: https://lore.kernel.org/r/8c1d14f3748105f4caeda01716d47af2fa41d11c.1615809009.git.metze@samba.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-03-18 09:44:35 -06:00
Jens Axboe	003e8dccdb	io-wq: always track creds for async issue If we go async with a request, grab the creds that the task currently has assigned and make sure that the async side switches to them. This is handled in the same way that we do for registered personalities. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-03-06 10:57:01 -07:00
Jens Axboe	46fe18b16c	io_uring: move to using create_io_thread() This allows us to do task creation and setup without needing to use completions to try and synchronize with the starting thread. Get rid of the old io_wq_fork_thread() wrapper, and the 'wq' and 'worker' startup completion events - we can now do setup before the task is running. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-03-05 08:43:01 -07:00
Jens Axboe	5730b27e84	io_uring: move cred assignment into io_issue_sqe() If we move it in there, then we no longer have to care about it in io-wq. This means we can drop the cred handling in io-wq, and we can drop the REQ_F_WORK_INITIALIZED flag and async init functions as that was the last user of it since we moved to the new workers. Then we can also drop io_wq_work->creds, and just hold the personality u16 in there instead. Suggested-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-03-04 06:36:28 -07:00
Jens Axboe	afcc4015d1	io-wq: provide an io_wq_put_and_exit() helper If we put the io-wq from io_uring, we really want it to exit. Provide a helper that does that for us. Couple that with not having the manager hold a reference to the 'wq' and the normal SQPOLL exit will tear down the io-wq context appropriate. On the io-wq side, our wq context is per task, so only the task itself is manipulating ->manager and hence it's safe to check and clear without any extra locking. We just need to ensure that the manager task stays around, in case it exits. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-03-04 06:34:39 -07:00
Jens Axboe	4fb6ac3262	io-wq: improve manager/worker handling over exec exec will cancel any threads, including the ones that io-wq is using. This isn't a problem, in fact we'd prefer it to be that way since it means we know that any async work cancels naturally without having to handle it proactively. But it does mean that we need to setup a new manager, as the manager and workers are gone. Handle this at queue time, and cancel work if we fail. Since the manager can go away without us noticing, ensure that the manager itself holds a reference to the 'wq' as well. Rename io_wq_destroy() to io_wq_put() to reflect that. In the future we can now simplify exec cancelation handling, for now just leave it the same. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-25 10:17:09 -07:00
Jens Axboe	e941894eae	io-wq: make buffered file write hashed work map per-ctx Before the io-wq thread change, we maintained a hash work map and lock per-node per-ring. That wasn't ideal, as we really wanted it to be per ring. But now that we have per-task workers, the hash map ends up being just per-task. That'll work just fine for the normal case of having one task use a ring, but if you share the ring between tasks, then it's considerably worse than it was before. Make the hash map per ctx instead, which provides full per-ctx buffered write serialization on hashed writes. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-25 09:23:47 -07:00
Jens Axboe	728f13e730	io-wq: remove nr_process accounting We're now just using fork like we would from userspace, so there's no need to try and impose extra restrictions or accounting on the user side of things. That's already being done for us. That also means we don't have to pass in the user_struct anymore, that's correctly inherited through ->creds on fork. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-23 20:33:26 -07:00
Jens Axboe	843bbfd49f	io-wq: make io_wq_fork_thread() available to other users We want to use this in io_uring proper as well, for the SQPOLL thread. Rename it from fork_thread() to io_wq_fork_thread(), and make it available through the io-wq.h header. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-21 17:25:22 -07:00
Jens Axboe	4379bf8bd7	io_uring: remove io_identity We are no longer grabbing state, so no need to maintain an IO identity that we COW if there are changes. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-21 17:25:22 -07:00
Jens Axboe	44526bedc2	io_uring: remove any grabbing of context The async workers are siblings of the task itself, so by definition we have all the state that we need. Remove any of the state grabbing that we have, and requests flagging what they need. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-21 17:25:22 -07:00
Jens Axboe	3bfe610669	io-wq: fork worker threads from original task Instead of using regular kthread kernel threads, create kernel threads that are like a real thread that the task would create. This ensures that we get all the context that we need, without having to carry that state around. This greatly reduces the code complexity, and the risk of missing state for a given request type. With the move away from kthread, we can also dump everything related to assigned state to the new threads. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-21 17:25:22 -07:00
Jens Axboe	3b094e727d	io-wq: get rid of wq->use_refs We don't support attach anymore, so doesn't make sense to carry the use_refs reference count. Get rid of it. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-21 17:25:22 -07:00
Jens Axboe	7c25c0d16e	io_uring: remove the need for relying on an io-wq fallback worker We hit this case when the task is exiting, and we need somewhere to do background cleanup of requests. Instead of relying on the io-wq task manager to do this work for us, just stuff it somewhere where we can safely run it ourselves directly. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-21 17:25:22 -07:00
Jens Axboe	7cbf1722d5	io_uring: provide FIFO ordering for task_work task_work is a LIFO list, due to how it's implemented as a lockless list. For long chains of task_work, this can be problematic as the first entry added is the last one processed. Similarly, we'd waste a lot of CPU cycles reversing this list. Wrap the task_work so we have a single task_work entry per task per ctx, and use that to run it in the right order. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-10 07:28:43 -07:00
Pavel Begunkov	5280f7e530	io_uring/io-wq: return 2-step work swap scheme Saving one lock/unlock for io-wq is not super important, but adds some ugliness in the code. More important, atomic decs not turning it to zero for some archs won't give the right ordering/barriers so the io_steal_work() may pretty easily get subtly and completely broken. Return back 2-step io-wq work exchange and clean it up. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-04 08:05:46 -07:00
Jens Axboe	4014d943cb	io_uring/io-wq: kill off now unused IO_WQ_WORK_NO_CANCEL It's no longer used as IORING_OP_CLOSE got rid for the need of flagging it as uncancelable, kill it of. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 10:02:43 -07:00
Jens Axboe	446bc1c207	io-wq: kill now unused io_wq_cancel_all() io_uring no longer issues full cancelations on the io-wq, so remove any remnants of this code and the IO_WQ_BIT_CANCEL flag. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-20 10:47:42 -07:00
Xiaoguang Wang	0020ef04e4	io_uring: fix io_wqe->work_list corruption For the first time a req punted to io-wq, we'll initialize io_wq_work's list to be NULL, then insert req to io_wqe->work_list. If this req is not inserted into tail of io_wqe->work_list, this req's io_wq_work list will point to another req's io_wq_work. For splitted bio case, this req maybe inserted to io_wqe->work_list repeatedly, once we insert it to tail of io_wqe->work_list for the second time, now io_wq_work->list->next will be invalid pointer, which then result in many strang error, panic, kernel soft-lockup, rcu stall, etc. In my vm, kernel doest not have commit `cc29e1bf0d` ("block: disable iopoll for split bio"), below fio job can reproduce this bug steadily: [global] name=iouring-sqpoll-iopoll-1 ioengine=io_uring iodepth=128 numjobs=1 thread rw=randread direct=1 registerfiles=1 hipri=1 bs=4m size=100M runtime=120 time_based group_reporting randrepeat=0 [device] directory=/home/feiman.wxg/mntpoint/ # an ext4 mount point If we have commit `cc29e1bf0d` ("block: disable iopoll for split bio"), there will no splitted bio case for polled io, but I think we still to need to fix this list corruption, it also should maybe go to stable branchs. To fix this corruption, if a req is inserted into tail of io_wqe->work_list, initialize req->io_wq_work->list->next to bu NULL. Cc: stable@vger.kernel.org Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-18 08:15:10 -07:00
Pavel Begunkov	f6edbabb83	io_uring: always batch cancel in *cancel_files() Instead of iterating over each request and cancelling it individually in io_uring_cancel_files(), try to cancel all matching requests and use ->inflight_list only to check if there anything left. In many cases it should be faster, and we can reuse a lot of code from task cancellation. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:00 -07:00
Jens Axboe	69228338c9	io_uring: unify fsize with def->work_flags This one was missed in the earlier conversion, should be included like any of the other IO identity flags. Make sure we restore to RLIM_INIFITY when dropping the personality again. Fixes: `98447d65b4` ("io_uring: move io identity items into separate struct") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-20 16:03:13 -06:00
Jens Axboe	98447d65b4	io_uring: move io identity items into separate struct io-wq contains a pointer to the identity, which we just hold in io_kiocb for now. This is in preparation for putting this outside io_kiocb. The only exception is struct files_struct, which we'll need different rules for to avoid a circular dependency. No functional changes in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-17 09:25:45 -06:00
Jens Axboe	0f20376588	io_uring: pass required context in as flags We have a number of bits that decide what context to inherit. Set up io-wq flags for these instead. This is in preparation for always having the various members set, but not always needing them for all requests. No intended functional changes in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-17 09:25:45 -06:00
Dennis Zhou	91d8f5191e	io_uring: add blkcg accounting to offloaded operations There are a few operations that are offloaded to the worker threads. In this case, we lose process context and end up in kthread context. This results in ios to be not accounted to the issuing cgroup and consequently end up as issued by root. Just like others, adopt the personality of the blkcg too when issuing via the workqueues. For the SQPOLL thread, it will live and attach in the inited cgroup's context. Signed-off-by: Dennis Zhou <dennis@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:34 -06:00
Jens Axboe	9b82849215	io_uring: reference ->nsproxy for file table commands If we don't get and assign the namespace for the async work, then certain paths just don't work properly (like /dev/stdin, /proc/mounts, etc). Anything that references the current namespace of the given task should be assigned for async work on behalf of that task. Cc: stable@vger.kernel.org # v5.5+ Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:32 -06:00
Pavel Begunkov	57f1a64958	io_uring/io-wq: move RLIMIT_FSIZE to io-wq RLIMIT_SIZE in needed only for execution from an io-wq context, hence move all preparations from hot path to io-wq work setup. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-24 13:00:44 -06:00
Pavel Begunkov	f4db7182e0	io-wq: return next work from ->do_work() directly It's easier to return next work from ->do_work() than having an in-out argument. Looks nicer and easier to compile. Also, merge io_wq_assign_next() into its only user. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-26 10:34:27 -06:00
Pavel Begunkov	e883a79d8c	io-wq: compact io-wq flags numbers Renumerate IO_WQ flags, so they take adjacent bits Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-26 10:34:27 -06:00
Pavel Begunkov	801dd57bd1	io_uring: cancel by ->task not pid For an exiting process it tries to cancel all its inflight requests. Use req->task to match such instead of work.pid. We always have req->task set, and it will be valid because we're matching only current exiting task. Also, remove work.pid and everything related, it's useless now. Reported-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-15 08:51:38 -06:00
Pavel Begunkov	44e728b8aa	io_uring: cancel all task's requests on exit If a process is going away, io_uring_flush() will cancel only 1 request with a matching pid. Cancel all of them Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-15 08:51:34 -06:00
Pavel Begunkov	4f26bda152	io-wq: add an option to cancel all matched reqs This adds support for cancelling all io-wq works matching a predicate. It isn't used yet, so no change in observable behaviour. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-15 08:51:34 -06:00
Xiaoguang Wang	7cdaf587de	io_uring: avoid whole io_wq_work copy for requests completed inline If requests can be submitted and completed inline, we don't need to initialize whole io_wq_work in io_init_req(), which is an expensive operation, add a new 'REQ_F_WORK_INITIALIZED' to determine whether io_wq_work is initialized and add a helper io_req_init_async(), users must call io_req_init_async() for the first time touching any members of io_wq_work. I use /dev/nullb0 to evaluate performance improvement in my physical machine: modprobe null_blk nr_devices=1 completion_nsec=0 sudo taskset -c 60 fio -name=fiotest -filename=/dev/nullb0 -iodepth=128 -thread -rw=read -ioengine=io_uring -direct=1 -bs=4k -size=100G -numjobs=1 -time_based -runtime=120 before this patch: Run status group 0 (all jobs): READ: bw=724MiB/s (759MB/s), 724MiB/s-724MiB/s (759MB/s-759MB/s), io=84.8GiB (91.1GB), run=120001-120001msec With this patch: Run status group 0 (all jobs): READ: bw=761MiB/s (798MB/s), 761MiB/s-761MiB/s (798MB/s-798MB/s), io=89.2GiB (95.8GB), run=120001-120001msec About 5% improvement. Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-10 17:58:46 -06:00
Pavel Begunkov	f5fa38c59c	io_wq: add per-wq work handler instead of per work io_uring is the only user of io-wq, and now it uses only io-wq callback for all its requests, namely io_wq_submit_work(). Instead of storing work->runner callback in each instance of io_wq_work, keep it in io-wq itself. pros: - reduces io_wq_work size - more robust -- ->func won't be invalidated with mem{cpy,set}(req) - helps other work Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-08 13:47:37 -06:00
Jens Axboe	aa96bf8a9e	io_uring: use io-wq manager as backup task if task is exiting If the original task is (or has) exited, then the task work will not get queued properly. Allow for using the io-wq manager task to queue this work for execution, and ensure that the io-wq manager notices and runs this work if woken up (or exiting). Reported-by: Dan Melnic <dmm@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-04-03 11:35:57 -06:00
Pavel Begunkov	86f3cd1b58	io-wq: handle hashed writes in chains We always punt async buffered writes to an io-wq helper, as the core kernel does not have IOCB_NOWAIT support for that. Most buffered async writes complete very quickly, as it's just a copy operation. This means that doing multiple locking roundtrips on the shared wqe lock for each buffered write is wasteful. Additionally, buffered writes are hashed work items, which means that any buffered write to a given file is serialized. Keep identicaly hashed work items contiguously in @wqe->work_list, and track a tail for each hash bucket. On dequeue of a hashed item, splice all of the same hash in one go using the tracked tail. Until the batch is done, the caller doesn't have to synchronize with the wqe or worker locks again. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-23 14:58:07 -06:00
Pavel Begunkov	18a542ff19	io_uring: Fix ->data corruption on re-enqueue work->data and work->list are shared in union. io_wq_assign_next() sets ->data if a req having a linked_timeout, but then io-wq may want to use work->list, e.g. to do re-enqueue of a request, so corrupting ->data. ->data is not necessary, just remove it and extract linked_timeout through @link_list. Fixes: `60cf46ae60` ("io-wq: hash dependent work") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-22 19:31:27 -06:00
Pavel Begunkov	8766dd516c	io-wq: split hashing and enqueueing It's a preparation patch removing io_wq_enqueue_hashed(), which now should be done by io_wq_hash_work() + io_wq_enqueue(). Also, set hash value for dependant works, and do it as late as possible, because req->file can be unavailable before. This hash will be ignored by io-wq. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-14 17:02:28 -06:00

1 2

77 Commits