linux

Author	SHA1	Message	Date
Pavel Begunkov	c1379e247a	io_uring: move req preps out of io_issue_sqe() All request preparations are done only during submission, reflect it in the code by moving io_req_prep() much earlier into io_queue_sqe(). That's much cleaner, because it doen't expose bits to async code which it won't ever use. Also it makes the interface harder to misuse, and there are potential places for bugs. For instance, __io_queue() doesn't clear @sqe before proceeding to a next linked request, that could have been disastrous, but hopefully there are linked requests IFF sqe==NULL, so not actually a bug. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:38:46 -06:00
Pavel Begunkov	bfe7655983	io_uring: decouple issuing and req preparation io_issue_sqe() does two things at once, trying to prepare request and issuing them. Split it in two and deduplicate with io_defer_prep(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:38:46 -06:00
Pavel Begunkov	73debe68b3	io_uring: remove nonblock arg from io_{rw}_prep() All io_*_prep() functions including io_{read,write}_prep() are called only during submission where @force_nonblock is always true. Don't keep propagating it and instead remove the @force_nonblock argument from prep() altogether. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:38:46 -06:00
Pavel Begunkov	a88fc40021	io_uring: set/clear IOCB_NOWAIT into io_read/write Move setting IOCB_NOWAIT from io_prep_rw() into io_read()/io_write(), so it's set/cleared in a single place. Also remove @force_nonblock parameter from io_prep_rw(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:38:46 -06:00
Pavel Begunkov	2d199895d2	io_uring: remove F_NEED_CLEANUP check in prep() REQ_F_NEED_CLEANUP is set only by io__prep() and they're guaranteed to be called only once, so there is no one who may have set the flag before. Kill REQ_F_NEED_CLEANUP check in these *prep() handlers. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:38:46 -06:00
Pavel Begunkov	5b09e37e27	io_uring: io_kiocb_ppos() style change Put brackets around bitwise ops in a complex expression Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:38:45 -06:00
Pavel Begunkov	291b2821e0	io_uring: simplify io_alloc_req() Extract common code from if/else branches. That is cleaner and optimised even better. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:38:45 -06:00
Jens Axboe	145cc8c665	io-wq: kill unused IO_WORKER_F_EXITING This flag is no longer used, remove it. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:34 -06:00
Hillf Danton	c4068bf898	io-wq: fix use-after-free in io_wq_worker_running The smart syzbot has found a reproducer for the following issue: ================================================================== BUG: KASAN: use-after-free in instrument_atomic_write include/linux/instrumented.h:71 [inline] BUG: KASAN: use-after-free in atomic_inc include/asm-generic/atomic-instrumented.h:240 [inline] BUG: KASAN: use-after-free in io_wqe_inc_running fs/io-wq.c:301 [inline] BUG: KASAN: use-after-free in io_wq_worker_running+0xde/0x110 fs/io-wq.c:613 Write of size 4 at addr ffff8882183db08c by task io_wqe_worker-0/7771 CPU: 0 PID: 7771 Comm: io_wqe_worker-0 Not tainted 5.9.0-rc4-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x198/0x1fd lib/dump_stack.c:118 print_address_description.constprop.0.cold+0xae/0x497 mm/kasan/report.c:383 __kasan_report mm/kasan/report.c:513 [inline] kasan_report.cold+0x1f/0x37 mm/kasan/report.c:530 check_memory_region_inline mm/kasan/generic.c:186 [inline] check_memory_region+0x13d/0x180 mm/kasan/generic.c:192 instrument_atomic_write include/linux/instrumented.h:71 [inline] atomic_inc include/asm-generic/atomic-instrumented.h:240 [inline] io_wqe_inc_running fs/io-wq.c:301 [inline] io_wq_worker_running+0xde/0x110 fs/io-wq.c:613 schedule_timeout+0x148/0x250 kernel/time/timer.c:1879 io_wqe_worker+0x517/0x10e0 fs/io-wq.c:580 kthread+0x3b5/0x4a0 kernel/kthread.c:292 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294 Allocated by task 7768: kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48 kasan_set_track mm/kasan/common.c:56 [inline] __kasan_kmalloc.constprop.0+0xbf/0xd0 mm/kasan/common.c:461 kmem_cache_alloc_node_trace+0x17b/0x3f0 mm/slab.c:3594 kmalloc_node include/linux/slab.h:572 [inline] kzalloc_node include/linux/slab.h:677 [inline] io_wq_create+0x57b/0xa10 fs/io-wq.c:1064 io_init_wq_offload fs/io_uring.c:7432 [inline] io_sq_offload_start fs/io_uring.c:7504 [inline] io_uring_create fs/io_uring.c:8625 [inline] io_uring_setup+0x1836/0x28e0 fs/io_uring.c:8694 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Freed by task 21: kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48 kasan_set_track+0x1c/0x30 mm/kasan/common.c:56 kasan_set_free_info+0x1b/0x30 mm/kasan/generic.c:355 __kasan_slab_free+0xd8/0x120 mm/kasan/common.c:422 __cache_free mm/slab.c:3418 [inline] kfree+0x10e/0x2b0 mm/slab.c:3756 __io_wq_destroy fs/io-wq.c:1138 [inline] io_wq_destroy+0x2af/0x460 fs/io-wq.c:1146 io_finish_async fs/io_uring.c:6836 [inline] io_ring_ctx_free fs/io_uring.c:7870 [inline] io_ring_exit_work+0x1e4/0x6d0 fs/io_uring.c:7954 process_one_work+0x94c/0x1670 kernel/workqueue.c:2269 worker_thread+0x64c/0x1120 kernel/workqueue.c:2415 kthread+0x3b5/0x4a0 kernel/kthread.c:292 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294 The buggy address belongs to the object at ffff8882183db000 which belongs to the cache kmalloc-1k of size 1024 The buggy address is located 140 bytes inside of 1024-byte region [ffff8882183db000, ffff8882183db400) The buggy address belongs to the page: page:000000009bada22b refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x2183db flags: 0x57ffe0000000200(slab) raw: 057ffe0000000200 ffffea0008604c48 ffffea00086a8648 ffff8880aa040700 raw: 0000000000000000 ffff8882183db000 0000000100000002 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff8882183daf80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ffff8882183db000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >ffff8882183db080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff8882183db100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8882183db180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== which is down to the comment below, /* all workers gone, wq exit can proceed */ if (!nr_workers && refcount_dec_and_test(&wqe->wq->refs)) complete(&wqe->wq->done); because there might be multiple cases of wqe in a wq and we would wait for every worker in every wqe to go home before releasing wq's resources on destroying. To that end, rework wq's refcount by making it independent of the tracking of workers because after all they are two different things, and keeping it balanced when workers come and go. Note the manager kthread, like other workers, now holds a grab to wq during its lifetime. Finally to help destroy wq, check IO_WQ_BIT_EXIT upon creating worker and do nothing for exiting wq. Cc: stable@vger.kernel.org # v5.5+ Reported-by: syzbot+45fa0a195b941764e0f0@syzkaller.appspotmail.com Reported-by: syzbot+9af99580130003da82b1@syzkaller.appspotmail.com Cc: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Hillf Danton <hdanton@sina.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:34 -06:00
Joseph Qi	dbbe9c6424	io_uring: show sqthread pid and cpu in fdinfo In most cases we'll specify IORING_SETUP_SQPOLL and run multiple io_uring instances in a host. Since all sqthreads are named "io_uring-sq", it's hard to distinguish the relations between application process and its io_uring sqthread. With this patch, application can get its corresponding sqthread pid and cpu through show_fdinfo. Steps: 1. Get io_uring fd first. $ ls -l /proc/<pid>/fd \| grep -w io_uring 2. Then get io_uring instance related info, including corresponding sqthread pid and cpu. $ cat /proc/<pid>/fdinfo/<io_uring_fd> pos: 0 flags: 02000002 mnt_id: 13 SqThread: 6929 SqThreadCpu: 2 UserFiles: 1 0: testfile UserBufs: 0 PollList: Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> [axboe: fixed for new shared SQPOLL] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:34 -06:00
Jens Axboe	af9c1a44f8	io_uring: process task work in io_uring_register() We do this for CQ ring wait, in case task_work completions come in. We should do the same in io_uring_register(), to avoid spurious -EINTR if the ring quiescing ends up having to process task_work to complete the operation Reported-by: Dan Melnic <dmm@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:34 -06:00
Dennis Zhou	91d8f5191e	io_uring: add blkcg accounting to offloaded operations There are a few operations that are offloaded to the worker threads. In this case, we lose process context and end up in kthread context. This results in ios to be not accounted to the issuing cgroup and consequently end up as issued by root. Just like others, adopt the personality of the blkcg too when issuing via the workqueues. For the SQPOLL thread, it will live and attach in the inited cgroup's context. Signed-off-by: Dennis Zhou <dennis@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:34 -06:00
Jens Axboe	de2939388b	io_uring: improve registered buffer accounting for huge pages io_uring does account any registered buffer as pinned/locked memory, and checks limit and fails if the given user doesn't have a big enough limit to register the ranges specified. However, if huge pages are used, we are potentially under-accounting the memory in terms of what gets pinned on the vm side. This patch rectifies that, by ensuring that we account the full size of a compound page, regardless of how much of it is being registered. Huge pages are not accounted mulitple times - if multiple sections of a huge page is registered, then the page is only accounted once. Reported-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:34 -06:00
Zheng Bin	14db84110d	io_uring: remove unneeded semicolon Fixes coccicheck warning: fs/io_uring.c:4242:13-14: Unneeded semicolon Signed-off-by: Zheng Bin <zhengbin13@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:34 -06:00
Jens Axboe	e95eee2dee	io_uring: cap SQ submit size for SQPOLL with multiple rings In the spirit of fairness, cap the max number of SQ entries we'll submit for SQPOLL if we have multiple rings. If we don't do that, we could be submitting tons of entries for one ring, while others are waiting to get service. The value of 8 is somewhat arbitrarily chosen as something that allows a fair bit of batching, without using an excessive time per ring. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:34 -06:00
Jens Axboe	e8c2bc1fb6	io_uring: get rid of req->io/io_async_ctx union There's really no point in having this union, it just means that we're always allocating enough room to cater to any command. But that's pointless, as the ->io field is request type private anyway. This gets rid of the io_async_ctx structure, and fills in the required size in the io_op_defs[] instead. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:34 -06:00
Pavel Begunkov	4be1c61512	io_uring: kill extra user_bufs check Testing ctx->user_bufs for NULL in io_import_fixed() is not neccessary, because in that case ctx->nr_user_bufs would be zero, and the following check would fail. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:34 -06:00
Pavel Begunkov	ab0b196ce5	io_uring: fix overlapped memcpy in io_req_map_rw() When io_req_map_rw() is called from io_rw_prep_async(), it memcpy() iorw->iter into itself. Even though it doesn't lead to an error, such a memcpy()'s aliasing rules violation is considered to be a bad practise. Inline io_req_map_rw() into io_rw_prep_async(). We don't really need any remapping there, so it's much simpler than the generic implementation. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:33 -06:00
Pavel Begunkov	afb87658f8	io_uring: refactor io_req_map_rw() Set rw->free_iovec to @iovec, that gives an identical result and stresses that @iovec param rw->free_iovec play the same role. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:33 -06:00
Pavel Begunkov	f4bff104ff	io_uring: simplify io_rw_prep_async() Don't touch iter->iov and iov in between __io_import_iovec() and io_req_map_rw(), the former function aleady sets it correctly, because it creates one more case with NULL'ed iov to consider in io_req_map_rw(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:33 -06:00
Jens Axboe	9055420072	io_uring: provide IORING_ENTER_SQ_WAIT for SQPOLL SQ ring waits When using SQPOLL, applications can run into the issue of running out of SQ ring entries because the thread hasn't consumed them yet. The only option for dealing with that is checking later, or busy checking for the condition. Provide IORING_ENTER_SQ_WAIT if applications want to wait on this condition. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:33 -06:00
Jens Axboe	738277adc8	io_uring: mark io_uring_fops/io_op_defs as __read_mostly These structures are never written, move them appropriately. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:33 -06:00
Jens Axboe	aa06165de8	io_uring: enable IORING_SETUP_ATTACH_WQ to attach to SQPOLL thread too We support using IORING_SETUP_ATTACH_WQ to share async backends between rings created by the same process, this now also allows the same to happen with SQPOLL. The setup procedure remains the same, the caller sets io_uring_params->wq_fd to the 'parent' context, and then the newly created ring will attach to that async backend. This means that multiple rings can share the same SQPOLL thread, saving resources. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:33 -06:00
Jens Axboe	69fb21310f	io_uring: base SQPOLL handling off io_sq_data Remove the SQPOLL thread from the ctx, and use the io_sq_data as the data structure we pass in. io_sq_data has a list of ctx's that we can then iterate over and handle. As of now we're ready to handle multiple ctx's, though we're still just handling a single one after this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:33 -06:00
Jens Axboe	534ca6d684	io_uring: split SQPOLL data into separate structure Move all the necessary state out of io_ring_ctx, and into a new structure, io_sq_data. The latter now deals with any state or variables associated with the SQPOLL thread itself. In preparation for supporting more than one io_ring_ctx per SQPOLL thread. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:33 -06:00
Jens Axboe	c8d1ba583f	io_uring: split work handling part of SQPOLL into helper This is done in preparation for handling more than one ctx, but it also cleans up the code a bit since io_sq_thread() was a bit too unwieldy to get a get overview on. __io_sq_thread() is now the main handler, and it returns an enum sq_ret that tells io_sq_thread() what it ended up doing. The parent then makes a decision on idle, spinning, or work handling based on that. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:33 -06:00
Jens Axboe	3f0e64d054	io_uring: move SQPOLL post-wakeup ring need wakeup flag into wake handler We need to decouple the clearing on wakeup from the the inline schedule, as that is going to be required for handling multiple rings in one thread. Wrap our wakeup handler so we can clear it when we get the wakeup, by definition that is when we no longer need the flag set. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:33 -06:00
Jens Axboe	6a7793828f	io_uring: use private ctx wait queue entries for SQPOLL This is in preparation to sharing the poller thread between rings. For that we need per-ring wait_queue_entry storage, and we can't easily put that on the stack if one thread is managing multiple rings. We'll also be sharing the wait_queue_head across rings for the purposes of wakeups, provide the usual private ring wait_queue_head for now but make it a pointer so we can easily override it when sharing. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:33 -06:00
Jens Axboe	ce71bfea20	fs: align IOCB_* flags with RWF_* flags We have a set of flags that are shared between the two and inherired in kiocb_set_rw_flags(), but we check and set these individually. Reorder the IOCB flags so that the bottom part of the space is synced with the RWF flag space, and then we can do them all in one mask and set operation. The only exception is RWF_SYNC, which needs to mark IOCB_SYNC and IOCB_DSYNC. Do that one separately. This shaves 15 bytes of text from kiocb_set_rw_flags() for me. Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:33 -06:00
Jens Axboe	e35afcf912	io_uring: io_sq_thread() doesn't need to flush signals We're not handling signals by default in kernel threads, and we never use TWA_SIGNAL for the SQPOLL thread internally. Hence we can never have a signal pending, and we don't need to check for it (nor flush it). Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:33 -06:00
Sebastian Andrzej Siewior	95da846592	io_wq: Make io_wqe::lock a raw_spinlock_t During a context switch the scheduler invokes wq_worker_sleeping() with disabled preemption. Disabling preemption is needed because it protects access to `worker->sleeping'. As an optimisation it avoids invoking schedule() within the schedule path as part of possible wake up (thus preempt_enable_no_resched() afterwards). The io-wq has been added to the mix in the same section with disabled preemption. This breaks on PREEMPT_RT because io_wq_worker_sleeping() acquires a spinlock_t. Also within the schedule() the spinlock_t must be acquired after tsk_is_pi_blocked() otherwise it will block on the sleeping lock again while scheduling out. While playing with `io_uring-bench' I didn't notice a significant latency spike after converting io_wqe::lock to a raw_spinlock_t. The latency was more or less the same. In order to keep the spinlock_t it would have to be moved after the tsk_is_pi_blocked() check which would introduce a branch instruction into the hot path. The lock is used to maintain the `work_list' and wakes one task up at most. Should io_wqe_cancel_pending_work() cause latency spikes, while searching for a specific item, then it would need to drop the lock during iterations. revert_creds() is also invoked under the lock. According to debug cred::non_rcu is 0. Otherwise it should be moved outside of the locked section because put_cred_rcu()->free_uid() acquires a sleeping lock. Convert io_wqe::lock to a raw_spinlock_t.c Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:33 -06:00
Stefano Garzarella	7e84e1c756	io_uring: allow disabling rings during the creation This patch adds a new IORING_SETUP_R_DISABLED flag to start the rings disabled, allowing the user to register restrictions, buffers, files, before to start processing SQEs. When IORING_SETUP_R_DISABLED is set, SQE are not processed and SQPOLL kthread is not started. The restrictions registration are allowed only when the rings are disable to prevent concurrency issue while processing SQEs. The rings can be enabled using IORING_REGISTER_ENABLE_RINGS opcode with io_uring_register(2). Suggested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:33 -06:00
Stefano Garzarella	21b55dbc06	io_uring: add IOURING_REGISTER_RESTRICTIONS opcode The new io_uring_register(2) IOURING_REGISTER_RESTRICTIONS opcode permanently installs a feature allowlist on an io_ring_ctx. The io_ring_ctx can then be passed to untrusted code with the knowledge that only operations present in the allowlist can be executed. The allowlist approach ensures that new features added to io_uring do not accidentally become available when an existing application is launched on a newer kernel version. Currently is it possible to restrict sqe opcodes, sqe flags, and register opcodes. IOURING_REGISTER_RESTRICTIONS can only be made once. Afterwards it is not possible to change restrictions anymore. This prevents untrusted code from removing restrictions. Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:33 -06:00
Stefano Garzarella	9d4a75efa2	io_uring: use an enumeration for io_uring_register(2) opcodes The enumeration allows us to keep track of the last io_uring_register(2) opcode available. Behaviour and opcodes names don't change. Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:33 -06:00
Jens Axboe	a3ec600540	io_uring: move io_uring_get_socket() into io_uring.h Now we have a io_uring kernel header, move this definition out of fs.h and into io_uring.h where it belongs. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:33 -06:00
Jens Axboe	9b82849215	io_uring: reference ->nsproxy for file table commands If we don't get and assign the namespace for the async work, then certain paths just don't work properly (like /dev/stdin, /proc/mounts, etc). Anything that references the current namespace of the given task should be assigned for async work on behalf of that task. Cc: stable@vger.kernel.org # v5.5+ Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:32 -06:00
Jens Axboe	0f2122045b	io_uring: don't rely on weak ->files references Grab actual references to the files_struct. To avoid circular references issues due to this, we add a per-task note that keeps track of what io_uring contexts a task has used. When the tasks execs or exits its assigned files, we cancel requests based on this tracking. With that, we can grab proper references to the files table, and no longer need to rely on stashing away ring_fd and ring_file to check if the ring_fd may have been closed. Cc: stable@vger.kernel.org # v5.5+ Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:32 -06:00
Jens Axboe	e6c8aa9ac3	io_uring: enable task/files specific overflow flushing This allows us to selectively flush out pending overflows, depending on the task and/or files_struct being passed in. No intended functional changes in this patch. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:32 -06:00
Jens Axboe	76e1b6427f	io_uring: return cancelation status from poll/timeout/files handlers Return whether we found and canceled requests or not. This is in preparation for using this information, no functional changes in this patch. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:32 -06:00
Jens Axboe	e3bc8e9dad	io_uring: unconditionally grab req->task Sometimes we assign a weak reference to it, sometimes we grab a reference to it. Clean this up and make it unconditional, and drop the flag related to tracking this state. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:32 -06:00
Jens Axboe	2aede0e417	io_uring: stash ctx task reference for SQPOLL We can grab a reference to the task instead of stashing away the task files_struct. This is doable without creating a circular reference between the ring fd and the task itself. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:32 -06:00
Jens Axboe	f573d38445	io_uring: move dropping of files into separate helper No functional changes in this patch, prep patch for grabbing references to the files_struct. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:32 -06:00
Jens Axboe	f3606e3a92	io_uring: allow timeout/poll/files killing to take task into account We currently cancel these when the ring exits, and we cancel all of them. This is in preparation for killing only the ones associated with a given task. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:32 -06:00
Jens Axboe	0f07889691	Merge branch 'io_uring-5.9' into for-5.10/io_uring * io_uring-5.9: io_uring: fix async buffered reads when readahead is disabled io_uring: fix potential ABBA deadlock in ->show_fdinfo() io_uring: always delete double poll wait entry on match	2020-09-30 20:32:25 -06:00
Jean-Philippe Brucker	3effc06a4d	selftests/bpf: Fix alignment of .BTF_ids Fix a build failure on arm64, due to missing alignment information for the .BTF_ids section: resolve_btfids.test.o: in function `test_resolve_btfids': tools/testing/selftests/bpf/prog_tests/resolve_btfids.c:140:(.text+0x29c): relocation truncated to fit: R_AARCH64_LDST32_ABS_LO12_NC against `.BTF_ids' ld: tools/testing/selftests/bpf/prog_tests/resolve_btfids.c:140: warning: one possible cause of this error is that the symbol is being referenced in the indicated code as if it had a larger alignment than was declared where it was defined In vmlinux, the .BTF_ids section is aligned to 4 bytes by vmlinux.lds.h. In test_progs however, .BTF_ids doesn't have alignment constraints. The arm64 linker expects the btf_id_set.cnt symbol, a u32, to be naturally aligned but finds it misaligned and cannot apply the relocation. Enforce alignment of .BTF_ids to 4 bytes. Fixes: `cd04b04de1` ("selftests/bpf: Add set test to resolve_btfids") Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Jiri Olsa <jolsa@redhat.com> Link: https://lore.kernel.org/bpf/20200930093559.2120126-1-jean-philippe@linaro.org	2020-09-30 18:11:40 -07:00
David S. Miller	f2e834694b	Merge branch 'drop_monitor-Convert-to-use-devlink-tracepoint' Ido Schimmel says: ==================== drop_monitor: Convert to use devlink tracepoint Drop monitor is able to monitor both software and hardware originated drops. Software drops are monitored by having drop monitor register its probe on the 'kfree_skb' tracepoint. Hardware originated drops are monitored by having devlink call into drop monitor whenever it receives a dropped packet from the underlying hardware. This patch set converts drop monitor to monitor both software and hardware originated drops in the same way - by registering its probe on the relevant tracepoint. In addition to drop monitor being more consistent, it is now also possible to build drop monitor as module instead of as a builtin and still monitor hardware originated drops. Initially, CONFIG_NET_DEVLINK implied CONFIG_NET_DROP_MONITOR, but after commit `def2fbffe6` ("kconfig: allow symbols implied by y to become m") we can have CONFIG_NET_DEVLINK=y and CONFIG_NET_DROP_MONITOR=m and hardware originated drops will not be monitored. Patch set overview: Patch #1 adds a tracepoint in devlink for trap reports. Patch #2 prepares probe functions in drop monitor for the new tracepoint. Patch #3 converts drop monitor to use the new tracepoint. Patches #4-#6 perform cleanups after the conversion. Patch #7 adds a test case for drop monitor. Both software originated drops and hardware originated drops (using netdevsim) are tested. Tested: \| CONFIG_NET_DEVLINK \| CONFIG_NET_DROP_MONITOR \| Build \| SW drops \| HW drops \| \| -------------------\|-------------------------\|-------\|----------\|----------\| \| y \| y \| v \| v \| v \| \| y \| m \| v \| v \| v \| \| y \| n \| v \| x \| x \| \| n \| y \| v \| v \| x \| \| n \| m \| v \| v \| x \| \| n \| n \| v \| x \| x \| ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2020-09-30 18:01:27 -07:00
Ido Schimmel	b7cc6d3c5c	selftests: net: Add drop monitor test Test that drop monitor correctly captures both software and hardware originated packet drops. # ./drop_monitor_tests.sh Software drops test TEST: Capturing active software drops [ OK ] TEST: Capturing inactive software drops [ OK ] Hardware drops test TEST: Capturing active hardware drops [ OK ] TEST: Capturing inactive hardware drops [ OK ] Tests passed: 4 Tests failed: 0 Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-09-30 18:01:26 -07:00
Ido Schimmel	93e155967c	drop_monitor: Filter control packets in drop monitor Previously, devlink called into drop monitor in order to report hardware originated drops / exceptions. devlink intentionally filtered control packets and did not pass them to drop monitor as they were not dropped by the underlying hardware. Now drop monitor registers its probe on a generic 'devlink_trap_report' tracepoint and should therefore perform this filtering itself instead of having devlink do that. Add the trap type as metadata and have drop monitor ignore control packets. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-09-30 18:01:26 -07:00
Ido Schimmel	a848c05f4b	drop_monitor: Remove duplicate struct 'struct net_dm_hw_metadata' is a duplicate of 'struct devlink_trap_metadata'. Remove the former and simplify the code. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-09-30 18:01:26 -07:00
Ido Schimmel	de9cbb81bd	drop_monitor: Remove no longer used functions The old probe functions that were invoked by drop monitor code are no longer called and can thus be removed. They were replaced by actual probe functions that are registered on the recently introduced 'devlink_trap_report' tracepoint. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-09-30 18:01:26 -07:00

... 83 84 85 86 87 ...

966273 Commits