Commit Graph

330 Commits

Author SHA1 Message Date
Linus Torvalds
3db61221f4 io_uring-6.0-2022-09-23
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmMuVUEQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpvvDD/41UF2c5fTS7T4D4aRK4cR9UJGYe32obxee
 Li3tVvHHFfV32WfNhsGR8P6bEwNVJhngHO65eTKMHmydQ4WhUAzW4U77/dcCITRy
 c1jn9kuFTQkWJslkRs4laXS7siTSr42R650dwJZCHQk2wW96bY/QKgmoWLWFBkBw
 g7OY4O4mv/yebJq7mtBjsbOS/3zaOmf7CZIykBPlaaL6Y+7aKZ+NSjx2VUz/ZKHy
 od2wbHw6Bo7E5eNtZIM6brM4rJTATObcWR3bOoHOgxf/9Ah4pj+Tkfdm1pO1uAg6
 BEFOinAJ5naCB2u3Ycya4SvHgC6RgPGbb/u1H6AfGVPSUuMn6ggnoUGM8bg2UCKn
 U0Rr9C0g84tDA565JSQzHShA8YUwAb5X6OuG+cU04a1Nx3udRQOdGLdWCPa/gJUJ
 asBpqv8tOrpaS4ANMG+SmAWj7OIaDvEXUr2Gs3acz1FSWYPyXfVR0sUUsHuXToHX
 5JPfq5NQZBgoXKUXeYb3+FRgkp9DF+EkmrU1uGtU3718B8QAcpa10Vp+H+Vu2pnv
 59A/DjN9QjEH606KU94o1zkaC47/llkIm22BunrrQFcDKtcQUzO/Q9HO3ILm7nqH
 6oUIlCAU81rN55gH+M3pMMtm/dSroeCLxQU2tu4XaBYCsb+QfI1LXCDT/CO0Nb3I
 TzyXboIrvg==
 =F9sU
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-6.0-2022-09-23' of git://git.kernel.dk/linux

Pull io_uring fix from Jens Axboe:
 "Just a single fix for an issue with un-reaped IOPOLL requests on ring
  exit"

* tag 'io_uring-6.0-2022-09-23' of git://git.kernel.dk/linux:
  io_uring: ensure that cached task references are always put on exit
2022-09-24 08:27:08 -07:00
Jens Axboe
e775f93f2a io_uring: ensure that cached task references are always put on exit
io_uring caches task references to avoid doing atomics for each of them
per request. If a request is put from the same task that allocated it,
then we can maintain a per-ctx cache of them. This obviously relies
on io_uring always pruning caches in a reliable way, and there's
currently a case off io_uring fd release where we can miss that.

One example is a ring setup with IOPOLL, which relies on the task
polling for completions, which will free them. However, if such a task
submits a request and then exits or closes the ring without reaping
the completion, then ring release will reap and put. If release happens
from that very same task, the completed request task refs will get
put back into the cache pool. This is problematic, as we're now beyond
the point of pruning caches.

Manually drop these caches after doing an IOPOLL reap. This releases
references from the current task, which is enough. If another task
happens to be doing the release, then the caching will not be
triggered and there's no issue.

Cc: stable@vger.kernel.org
Fixes: e98e49b2bb ("io_uring: extend task put optimisations")
Reported-by: Homin Rhee <hominlab@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-23 18:51:08 -06:00
Pavel Begunkov
aa1df3a360 io_uring: fix CQE reordering
Overflowing CQEs may result in reordering, which is buggy in case of
links, F_MORE and so on. If we guarantee that we don't reorder for
the unlikely event of a CQ ring overflow, then we can further extend
this to not have to terminate multishot requests if it happens. For
other operations, like zerocopy sends, we have no choice but to honor
CQE ordering.

Reported-by: Dylan Yudaken <dylany@fb.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ec3bc55687b0768bbe20fb62d7d06cfced7d7e70.1663892031.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-23 15:04:20 -06:00
Pavel Begunkov
a75155faef io_uring/net: fix UAF in io_sendrecv_fail()
We should not assume anything about ->free_iov just from
REQ_F_ASYNC_DATA but rather rely on REQ_F_NEED_CLEANUP, as we may
allocate ->async_data but failed init would leave the field in not
consistent state. The easiest solution is to remove removing
REQ_F_NEED_CLEANUP and so ->async_data dealloc from io_sendrecv_fail()
and let io_send_zc_cleanup() do the job. The catch here is that we also
need to prevent double notif flushing, just test it for NULL and zero
where it's needed.

BUG: KASAN: use-after-free in io_sendrecv_fail+0x3b0/0x3e0 io_uring/net.c:1221
Write of size 8 at addr ffff8880771b4080 by task syz-executor.3/30199

CPU: 1 PID: 30199 Comm: syz-executor.3 Not tainted 6.0.0-rc6-next-20220923-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
 print_address_description mm/kasan/report.c:284 [inline]
 print_report+0x15e/0x45d mm/kasan/report.c:395
 kasan_report+0xbb/0x1f0 mm/kasan/report.c:495
 io_sendrecv_fail+0x3b0/0x3e0 io_uring/net.c:1221
 io_req_complete_failed+0x155/0x1b0 io_uring/io_uring.c:873
 io_drain_req io_uring/io_uring.c:1648 [inline]
 io_queue_sqe_fallback.cold+0x29f/0x788 io_uring/io_uring.c:1931
 io_submit_sqe io_uring/io_uring.c:2160 [inline]
 io_submit_sqes+0x1180/0x1df0 io_uring/io_uring.c:2276
 __do_sys_io_uring_enter+0xac6/0x2410 io_uring/io_uring.c:3216
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

Fixes: c4c0009e0b ("io_uring/net: combine fail handlers")
Reported-by: syzbot+4c597a574a3f5a251bda@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/23ab8346e407ea50b1198a172c8a97e1cf22915b.1663945875.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-23 14:57:38 -06:00
Jens Axboe
ec7fd2562f io_uring: ensure local task_work marks task as running
io_uring will run task_work from contexts that have been prepared for
waiting, and in doing so it'll implicitly set the task running again
to avoid issues with blocking conditions. The new deferred local
task_work doesn't do that, which can result in spews on this being
an invalid condition:



[  112.917576] do not call blocking ops when !TASK_RUNNING; state=1 set at [<00000000ad64af64>] prepare_to_wait_exclusive+0x3f/0xd0
[  112.983088] WARNING: CPU: 1 PID: 190 at kernel/sched/core.c:9819 __might_sleep+0x5a/0x60
[  112.987240] Modules linked in:
[  112.990504] CPU: 1 PID: 190 Comm: io_uring Not tainted 6.0.0-rc6+ #1617
[  113.053136] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
[  113.133650] RIP: 0010:__might_sleep+0x5a/0x60
[  113.136507] Code: ee 48 89 df 5b 31 d2 5d e9 33 ff ff ff 48 8b 90 30 0b 00 00 48 c7 c7 90 de 45 82 c6 05 20 8b 79 01 01 48 89 d1 e8 3a 49 77 00 <0f> 0b eb d1 66 90 0f 1f 44 00 00 9c 58 f6 c4 02 74 35 65 8b 05 ed
[  113.223940] RSP: 0018:ffffc90000537ca0 EFLAGS: 00010286
[  113.232903] RAX: 0000000000000000 RBX: ffffffff8246782c RCX: ffffffff8270bcc8
IOPS=133.15K, BW=520MiB/s, IOS/call=32/31
[  113.353457] RDX: ffffc90000537b50 RSI: 00000000ffffdfff RDI: 0000000000000001
[  113.358970] RBP: 00000000000003bc R08: 0000000000000000 R09: c0000000ffffdfff
[  113.361746] R10: 0000000000000001 R11: ffffc90000537b48 R12: ffff888103f97280
[  113.424038] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000001
[  113.428009] FS:  00007f67ae7fc700(0000) GS:ffff88842fc80000(0000) knlGS:0000000000000000
[  113.432794] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  113.503186] CR2: 00007f67b8b9b3b0 CR3: 0000000102b9b005 CR4: 0000000000770ee0
[  113.507291] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  113.512669] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  113.574374] PKRU: 55555554
[  113.576800] Call Trace:
[  113.578325]  <TASK>
[  113.579799]  set_page_dirty_lock+0x1b/0x90
[  113.582411]  __bio_release_pages+0x141/0x160
[  113.673078]  ? set_next_entity+0xd7/0x190
[  113.675632]  blk_rq_unmap_user+0xaa/0x210
[  113.678398]  ? timerqueue_del+0x2a/0x40
[  113.679578]  nvme_uring_task_cb+0x94/0xb0
[  113.683025]  __io_run_local_work+0x8a/0x150
[  113.743724]  ? io_cqring_wait+0x33d/0x500
[  113.746091]  io_run_local_work.part.76+0x2e/0x60
[  113.750091]  io_cqring_wait+0x2e7/0x500
[  113.752395]  ? trace_event_raw_event_io_uring_req_failed+0x180/0x180
[  113.823533]  __x64_sys_io_uring_enter+0x131/0x3c0
[  113.827382]  ? switch_fpu_return+0x49/0xc0
[  113.830753]  do_syscall_64+0x34/0x80
[  113.832620]  entry_SYSCALL_64_after_hwframe+0x5e/0xc8

Ensure that we mark current as TASK_RUNNING for deferred task_work
as well.

Fixes: c0e0d6ba25 ("io_uring: add IORING_SETUP_DEFER_TASKRUN")
Reported-by: Stefan Roesch <shr@fb.com>
Reviewed-by: Dylan Yudaken <dylany@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 19:39:35 -06:00
Pavel Begunkov
493108d95f io_uring/net: zerocopy sendmsg
Add a zerocopy version of sendmsg.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6aabc4bdfc0ec78df6ec9328137e394af9d4e7ef.1663668091.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 13:15:02 -06:00
Pavel Begunkov
c4c0009e0b io_uring/net: combine fail handlers
Merge io_send_zc_fail() into io_sendrecv_fail(), saves a few lines of
code and some headache for following patch.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e0eba1d577413aef5602cd45f588b9230207082d.1663668091.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 13:15:02 -06:00
Pavel Begunkov
b0e9b5517e io_uring/net: rename io_sendzc()
Simple renaming of io_sendzc*() functions in preparatio to adding
a zerocopy sendmsg variant.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/265af46829e6076dd220011b1858dc3151969226.1663668091.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 13:15:02 -06:00
Pavel Begunkov
516e82f0e0 io_uring/net: support non-zerocopy sendto
We have normal sends, but what is missing is sendto-like requests. Add
sendto() capabilities to IORING_OP_SEND by passing in addr just as we do
for IORING_OP_SEND_ZC.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/69fbd8b2cb830e57d1bf9ec351e9bf95c5b77e3f.1663668091.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 13:15:02 -06:00
Pavel Begunkov
6ae61b7aa2 io_uring/net: refactor io_setup_async_addr
Instead of passing the right address into io_setup_async_addr() only
specify local on-stack storage and let the function infer where to grab
it from. It optimises out one local variable we have to deal with.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6bfa9ab810d776853eb26ed59301e2536c3a5471.1663668091.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 13:15:02 -06:00
Pavel Begunkov
5693bcce89 io_uring/net: don't lose partial send_zc on fail
Partial zc send may end up in io_req_complete_failed(), which not only
would return invalid result but also mask out the notification leading
to lifetime issues.

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5673285b5e83e6ceca323727b4ddaa584b5cc91e.1663668091.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 13:15:02 -06:00
Pavel Begunkov
7e6b638ed5 io_uring/net: don't lose partial send/recv on fail
Just as with rw, partial send/recv may end up in
io_req_complete_failed() and loose the result, make sure we return the
number of bytes processed.

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a4ff95897b5419356fca9ea55db91ac15b2975f9.1663668091.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 13:15:02 -06:00
Pavel Begunkov
47b4c68660 io_uring/rw: don't lose partial IO result on fail
A partially done read/write may end up in io_req_complete_failed() and
loose the result, make sure we return the number of bytes processed.

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/05e0879c226bcd53b441bf92868eadd4bf04e2fc.1663668091.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 13:15:02 -06:00
Pavel Begunkov
a47b255e90 io_uring: add custom opcode hooks on fail
Sometimes we have to do a little bit of a fixup on a request failuer in
io_req_complete_failed(). Add a callback in opdef for that.

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b734cff4e67cb30cca976b9face321023f37549a.1663668091.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 13:15:02 -06:00
Jens Axboe
3b8fdd1dc3 io_uring/fdinfo: fix sqe dumping for IORING_SETUP_SQE128
If we have doubly sized SQEs, then we need to shift the sq index by 1
to account for using two entries for a single request. The CQE dumping
gets this right, but the SQE one does not.

Improve the SQE dumping in general, the information dumped is pretty
sparse and doesn't even cover the whole basic part of the SQE. Include
information on the extended part of the SQE, if doubly sized SQEs are
in use. A typical dump now looks like the following:

[...]
SQEs:	32
   32: opcode:URING_CMD, fd:0, flags:1, off:3225964160, addr:0x0, rw_flags:0x0, buf_index:0 user_data:2721, e0:0x0, e1:0xffffb8041000, e2:0x100000000000, e3:0x5500, e4:0x7, e5:0x0, e6:0x0, e7:0x0
   33: opcode:URING_CMD, fd:0, flags:1, off:3225964160, addr:0x0, rw_flags:0x0, buf_index:0 user_data:2722, e0:0x0, e1:0xffffb8043000, e2:0x100000000000, e3:0x5508, e4:0x7, e5:0x0, e6:0x0, e7:0x0
   34: opcode:URING_CMD, fd:0, flags:1, off:3225964160, addr:0x0, rw_flags:0x0, buf_index:0 user_data:2723, e0:0x0, e1:0xffffb8045000, e2:0x100000000000, e3:0x5510, e4:0x7, e5:0x0, e6:0x0, e7:0x0
[...]

Fixes: ebdeb7c01d ("io_uring: add support for 128-byte SQEs")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 13:15:02 -06:00
Jens Axboe
4f731705cc io_uring/fdinfo: get rid of unnecessary is_cqe32 variable
We already have the cq_shift, just use that to tell if we have doubly
sized CQEs or not.

While in there, cleanup the CQE32 vs normal CQE size printing.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 13:15:02 -06:00
Pavel Begunkov
c0dc995eb2 io_uring: remove unused return from io_disarm_next
We removed conditional io_commit_cqring_flush() guarding against
spurious eventfd and the io_disarm_next()'s return value is not used
anymore, just void it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9a441c9a32a58bcc586076fa9a7d0dc33f1fb3cb.1662652536.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 13:15:01 -06:00
Pavel Begunkov
7924fdfeea io_uring: add fast path for io_run_local_work()
We'll grab uring_lock and call __io_run_local_work() with several
atomics inside even if there are no task works. Skip it if ->work_llist
is empty.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/f6a885f372bad2d77d9cd87341b0a86a4000c0ff.1662652536.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 13:15:01 -06:00
Pavel Begunkov
1f8d5bbe98 io_uring/iopoll: unify tw breaking logic
Let's keep checks for whether to break the iopoll loop or not same for
normal and defer tw, this includes ->cached_cq_tail checks guarding
against polling more than asked for.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d2fa8a44f8114f55a4807528da438cde93815360.1662652536.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 13:15:01 -06:00
Pavel Begunkov
9d54bd6a3b io_uring/iopoll: fix unexpected returns
We may propagate a positive return value of io_run_task_work() out of
io_iopoll_check(), which breaks our tests. io_run_task_work() doesn't
return anything useful for us, ignore the return value.

Fixes: c0e0d6ba25 ("io_uring: add IORING_SETUP_DEFER_TASKRUN")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/c442bb87f79cea10b3f857cbd4b9a4f0a0493fa3.1662652536.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 13:14:59 -06:00
Pavel Begunkov
6567506b68 io_uring: disallow defer-tw run w/ no submitters
We try to restrict CQ waiters when IORING_SETUP_DEFER_TASKRUN is set,
but if nothing has been submitted yet it'll allow any waiter, which
violates the contract.

Fixes: c0e0d6ba25 ("io_uring: add IORING_SETUP_DEFER_TASKRUN")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/b4f0d3f14236d7059d08c5abe2661ef0b78b5528.1662652536.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 13:14:55 -06:00
Pavel Begunkov
76de6749d1 io_uring: further limit non-owner defer-tw cq waiting
In case of DEFER_TASK_WORK we try to restrict waiters to only one task,
which is also the only submitter; however, we don't do it reliably,
which might be very confusing and backfire in the future. E.g. we
currently allow multiple tasks in io_iopoll_check().

Fixes: c0e0d6ba25 ("io_uring: add IORING_SETUP_DEFER_TASKRUN")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/94c83c0a7fe468260ee2ec31bdb0095d6e874ba2.1662652536.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 13:14:46 -06:00
Pavel Begunkov
ac9e5784bb io_uring/net: use io_sr_msg for sendzc
Reuse struct io_sr_msg for zerocopy sends, which is handy. There is
only one zerocopy specific field, namely .notif, and we have enough
space for it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/408c5b1b2d8869e1a12da5f5a78ed72cac112149.1662639236.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 10:30:43 -06:00
Pavel Begunkov
0b048557db io_uring/net: refactor io_sr_msg types
In preparation for using struct io_sr_msg for zerocopy sends, clean up
types. First, flags can be u16 as it's provided by the userspace in u16
ioprio, as well as addr_len. This saves us 4 bytes. Also use unsigned
for size and done_io, both are as well limited to u32.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/42c2639d6385b8b2181342d2af3a42d3b1c5bcd2.1662639236.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 10:30:43 -06:00
Pavel Begunkov
cd9021e88f io_uring/net: add non-bvec sg chunking callback
Add a sg_from_iter() for when we initiate non-bvec zerocopy sends, which
helps us to remove some extra steps from io_sg_from_iter(). The only
thing the new function has to do before giving control away to
__zerocopy_sg_from_iter() is to check if the skb has managed frags and
downgrade them if so.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/cda3dea0d36f7931f63a70f350130f085ac3f3dd.1662639236.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 10:30:43 -06:00
Pavel Begunkov
6bf8ad25fc io_uring/net: io_async_msghdr caches for sendzc
We already keep io_async_msghdr caches for normal send/recv requests,
use them also for zerocopy send.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/42fa615b6e0be25f47a685c35d7b5e4f1b03d348.1662639236.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 10:30:43 -06:00
Pavel Begunkov
858c293e5d io_uring/net: use async caches for async prep
send/recv have async_data caches but there are only used from within
issue handlers. Extend their use also to ->prep_async, should be handy
with links and IOSQE_ASYNC.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b9a2264b807582a97ed606c5bfcdc2399384e8a5.1662639236.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 10:30:43 -06:00
Pavel Begunkov
95eafc74be io_uring/net: reshuffle error handling
We should prioritise send/recv retry cases over failures, they're more
important. Shuffle -ERESTARTSYS after we handled retries.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d9059691b30d0963b7269fa4a0c81ee7720555e6.1662639236.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 10:30:43 -06:00
Pavel Begunkov
e9a8842854 io_uring: use io_cq_lock consistently
There is one place when we forgot to change hand coded spin locking with
io_cq_lock(), change it to be more consistent. Note, the unlock part is
already __io_cq_unlock_post().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/91699b9a00a07128f7ca66136bdbbfc67a64659e.1662639236.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 10:30:43 -06:00
Pavel Begunkov
385c609f9b io_uring: kill an outdated comment
Request referencing has changed a while ago and there is no notion left
of submission/completion references, kill an outdated comment.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/38902e7229d68cecd62702436d627d4858b0d9d4.1662639236.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 10:30:43 -06:00
Dylan Yudaken
4ab9d46507 io_uring: allow buffer recycling in READV
In commit 934447a603 ("io_uring: do not recycle buffer in READV") a
temporary fix was put in io_kbuf_recycle to simply never recycle READV
buffers.

Instead of that, rather treat READV with REQ_F_BUFFER_SELECTED the same as
a READ with REQ_F_BUFFER_SELECTED. Since READV requires iov_len of 1 they
are essentially the same.
In order to do this inside io_prep_rw() add some validation to check that
it is in fact only length 1, and also extract the length of the buffer at
prep time.

This allows removal of the io_iov_buffer_select codepaths as they are only
used from the READV op.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220907165152.994979-1-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 10:30:43 -06:00
Jens Axboe
de97fcb303 fs: add batch and poll flags to the uring_cmd_iopoll() handler
We need the poll_flags to know how to poll for the IO, and we should
have the batch structure in preparation for supporting batched
completions with iopoll.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 10:30:43 -06:00
Jens Axboe
dac6a0eae7 io_uring: ensure iopoll runs local task work as well
Combine the two checks we have for task_work running and whether or not
we need to shuffle the mutex into one, so we unify how task_work is run
in the iopoll loop. This helps ensure that local task_work is run when
needed, and also optimizes that path to avoid a mutex shuffle if it's
not needed.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 10:30:43 -06:00
Jens Axboe
8ac5d85a89 io_uring: add local task_work run helper that is entered locked
We have a few spots that drop the mutex just to run local task_work,
which immediately tries to grab it again. Add a helper that just passes
in whether we're locked already.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 10:30:43 -06:00
Jens Axboe
a1119fb071 io_uring: cleanly separate request types for iopoll
After the addition of iopoll support for passthrough, there's a bit of
a mixup here. Clean it up and get rid of the casting for the passthrough
command type.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 10:30:42 -06:00
Kanchan Joshi
5756a3a7e7 io_uring: add iopoll infrastructure for io_uring_cmd
Put this up in the same way as iopoll is done for regular read/write IO.
Make place for storing a cookie into struct io_uring_cmd on submission.
Perform the completion using the ->uring_cmd_iopoll handler.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Link: https://lore.kernel.org/r/20220823161443.49436-3-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 10:30:42 -06:00
Dylan Yudaken
f75d5036d0 io_uring: trace local task work run
Add tracing for io_run_local_task_work

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220830125013.570060-8-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 10:30:42 -06:00
Dylan Yudaken
21a091b970 io_uring: signal registered eventfd to process deferred task work
Some workloads rely on a registered eventfd (via
io_uring_register_eventfd(3)) in order to wake up and process the
io_uring.

In the case of a ring setup with IORING_SETUP_DEFER_TASKRUN, that eventfd
also needs to be signalled when there are tasks to run.

This changes an old behaviour which assumed 1 eventfd signal implied at
least 1 CQE, however only when this new flag is set (and so old users will
not notice). This should be expected with the IORING_SETUP_DEFER_TASKRUN
flag as it is not guaranteed that every task will result in a CQE.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220830125013.570060-7-dylany@fb.com
[axboe: fold in call_rcu() serialization fix]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 10:30:42 -06:00
Dylan Yudaken
d8e9214f11 io_uring: move io_eventfd_put
Non functional change: move this function above io_eventfd_signal so it
can be used from there

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220830125013.570060-6-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 10:30:42 -06:00
Dylan Yudaken
c0e0d6ba25 io_uring: add IORING_SETUP_DEFER_TASKRUN
Allow deferring async tasks until the user calls io_uring_enter(2) with
the IORING_ENTER_GETEVENTS flag. Enable this mode with a flag at
io_uring_setup time. This functionality requires that the later
io_uring_enter will be called from the same submission task, and therefore
restrict this flag to work only when IORING_SETUP_SINGLE_ISSUER is also
set.

Being able to hand pick when tasks are run prevents the problem where
there is current work to be done, however task work runs anyway.

For example, a common workload would obtain a batch of CQEs, and process
each one. Interrupting this to additional taskwork would add latency but
not gain anything. If instead task work is deferred to just before more
CQEs are obtained then no additional latency is added.

The way this is implemented is by trying to keep task work local to a
io_ring_ctx, rather than to the submission task. This is required, as the
application will want to wake up only a single io_ring_ctx at a time to
process work, and so the lists of work have to be kept separate.

This has some other benefits like not having to check the task continually
in handle_tw_list (and potentially unlocking/locking those), and reducing
locks in the submit & process completions path.

There are networking cases where using this option can reduce request
latency by 50%. For example a contrived example using [1] where the client
sends 2k data and receives the same data back while doing some system
calls (to trigger task work) shows this reduction. The reason ends up
being that if sending responses is delayed by processing task work, then
the client side sits idle. Whereas reordering the sends first means that
the client runs it's workload in parallel with the local task work.

[1]:
Using https://github.com/DylanZA/netbench/tree/defer_run
Client:
./netbench  --client_only 1 --control_port 10000 --host <host> --tx "epoll --threads 16 --per_thread 1 --size 2048 --resp 2048 --workload 1000"
Server:
./netbench  --server_only 1 --control_port 10000  --rx "io_uring --defer_taskrun 0 --workload 100"   --rx "io_uring  --defer_taskrun 1 --workload 100"

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220830125013.570060-5-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 10:30:42 -06:00
Dylan Yudaken
2327337b88 io_uring: do not run task work at the start of io_uring_enter
This is not needed, and it is normally better to wait for task work until
after submissions. This will allow greater batching if either work arrives
in the meanwhile, or if the submissions cause task work to be queued up.

For SQPOLL this also no longer runs task work, but this is handled inside
the SQPOLL loop anyway.

For IOPOLL io_iopoll_check will run task work anyway

And otherwise io_cqring_wait will run task work

Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220830125013.570060-4-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 10:30:42 -06:00
Dylan Yudaken
b4c98d59a7 io_uring: introduce io_has_work
This will be used later to know if the ring has outstanding work. Right
now just if there is overflow CQEs to copy to the main CQE ring, but later
will include deferred tasks

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220830125013.570060-3-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 10:30:42 -06:00
Dylan Yudaken
32d91f0590 io_uring: remove unnecessary variable
'running' is set once and read once, so can easily just remove it

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220830125013.570060-2-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 10:30:42 -06:00
Linus Torvalds
38eddeedbb io_uring-6.0-2022-09-18
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmMnFlcQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgphm1D/0ZXgihejm59WTef8UktYzXT1B0SbN9TT1r
 CQm/5BVSTWkz5UOmpPxtiL2wT0Lj+D1i4xtKEvPS3L9nwWHgz5dM6AmdIk9jXKUz
 09Y8XnZqtjr228mxRxZ33x3YaUaJv3b/AAgdL12rzN/9Crr4V1z+vAFuW1LQpFhN
 DxXSMi+tQzyNBjD503h/buQ4eOpdkKOW/EpjqePHsz+OqSpjgoy+ddTVS7jhakun
 9B6BrDUVEMwyCzT///1Zi+TjkdiZOub26CSn38TXaQAWBkGDRo3B1Jq6D9MH8VK5
 MlHWgrkz6OSqoJw79bvLKjWR/WNA8EM4e5Myd1QGsesMa7BRPBCp/V0ooVtHeHtb
 lrN8CmGFXxt5uKRxzP0F6IxrRxo9hYxTTbH+Qy5K7c9JNNeyl6bxSP4DXtTNzLfy
 Apl343BiZFqdbFHlR6CCFcx+4YESr9UhSF5h3MFgX5TZQWwqNH/GDBYZtZ/qjg2W
 YNznGYx/xBphCeC08/LgHTdy+EhGy9WjLBP/KAzVs6rRwpiPLpn/PBAKrNHqskIa
 T6QmcTmSgfzKJtKg8ZQwkzp8QELwudNfYOyasSeHD0nY855j9zvnfnKdPHhzkx33
 Gt4goE94xas968SoQuQVF966L72JeZoAx48gMk+WTyP/3nMbwEDwtYX3cdOCte8z
 m8s04p1SQg==
 =02l7
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-6.0-2022-09-18' of git://git.kernel.dk/linux

Pull io_uring fixes from Jens Axboe:
 "Nothing really major here, but figured it'd be nicer to just get these
  flushed out for -rc6 so that the 6.1 branch will have them as well.
  That'll make our lives easier going forward in terms of development,
  and avoid trivial conflicts in this area.

   - Simple trace rename so that the returned opcode name is consistent
     with the enum definition (Stefan)

   - Send zc rsrc request vs notification lifetime fix (Pavel)"

* tag 'io_uring-6.0-2022-09-18' of git://git.kernel.dk/linux:
  io_uring/opdef: rename SENDZC_NOTIF to SEND_ZC
  io_uring/net: fix zc fixed buf lifetime
2022-09-18 09:25:27 -07:00
Stefan Metzmacher
9bd3f72822 io_uring/opdef: rename SENDZC_NOTIF to SEND_ZC
It's confusing to see the string SENDZC_NOTIF in ftrace output
when using IORING_OP_SEND_ZC.

Fixes: b48c312be0 ("io_uring/net: simplify zerocopy send user API")
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: io-uring@vger.kernel.org
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8e5cd8616919c92b6c3c7b6ea419fdffd5b97f3c.1663363798.git.metze@samba.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-18 06:59:13 -06:00
Pavel Begunkov
e3366e0234 io_uring/net: fix zc fixed buf lifetime
Notifications usually outlive requests, so we need to pin buffers with
it by assigning a rsrc to it instead of the request.

Fixed: b48c312be0 ("io_uring/net: simplify zerocopy send user API")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/dd6406ff8a90887f2b36ed6205dac9fda17c1f35.1663366886.git.asml.silence@gmail.com
Reviewed-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-18 05:07:51 -06:00
Linus Torvalds
0158137d81 io_uring-6.0-2022-09-16
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmMkPA4QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjojD/9wzCmmIa1KF/oKJ+ee7UG7fMykuuBc2S0k
 sdPGILmBy8K57hQZ1emyyfJqxMrhsvV0KHVCMsvV4xvE2lMbxydga/3EqrjiU8ad
 YnZdAeAyFEdAqiM2SL8bQSdiTMVUCN7cddGjZL4kY7CzWPP115phNt0XneFGr1Cd
 exsTRNjfPABUlGGm2q7G/ii3QCg2nbnl9wn5kbwfsHx7yKedkiO4BJ5Yl8ynL1i3
 jzLG/pcSxia9Bp8ZSLF+THFNqgYOJPvEbgygcn8tUt+F9DNE9lsEQ52/rp5TVEQz
 mcYUXzhaEbgblDaZGeTY0WI2Pa9M9f4AOlIhJ5du6rX9z3u2nIrumE4Y062VmWJP
 9Cr47Nf/bAmzvbKrZ2ZRWYaA4dMufqLvUwrFz60BxLMYX4Z6ZWkyg7altzjTbqlf
 zrODI+fDY77NNhMPFoPknUl3RpKUYSzL6N4Qod4qL3xj3PW02HNZn8GqIMnDNeT8
 5jWA1Arvqf420QOOvxiumIHiF9EgGWHoRIzg4UXnNiuRh08rJxgVVSAy0GuDA3L0
 n296x55DCGwprJtT91oB7Vbv/qg6Twetcqzw0VLASsxnUKZkcTI4R/MTFdQe4tif
 qja3A5Okq8tefqSTk90zguFVnEnHH0pWyQQsEMwUPLE4lqrur/viKgVxZQdyQ+ak
 D+kSGs3aNg==
 =BSte
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-6.0-2022-09-16' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "Two small patches:

   - Fix using an unsigned type for the return value, introduced in this
     release (Pavel)

   - Stable fix for a missing check for a fixed file on put (me)"

* tag 'io_uring-6.0-2022-09-16' of git://git.kernel.dk/linux-block:
  io_uring/msg_ring: check file type before putting
  io_uring/rw: fix error'ed retry return values
2022-09-16 06:50:25 -07:00
Jens Axboe
fc7222c3a9 io_uring/msg_ring: check file type before putting
If we're invoked with a fixed file, follow the normal rules of not
calling io_fput_file(). Fixed files are permanently registered to the
ring, and do not need putting separately.

Cc: stable@vger.kernel.org
Fixes: aa184e8671 ("io_uring: don't attempt to IOPOLL for MSG_RING requests")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-15 11:44:35 -06:00
Pavel Begunkov
62bb0647b1 io_uring/rw: fix error'ed retry return values
Kernel test robot reports that we test negativity of an unsigned in
io_fixup_rw_res() after a recent change, which masks error codes and
messes up the return value in case I/O is re-retried and failed with
an error.

Fixes: 4d9cb92ca4 ("io_uring/rw: fix short rw error handling")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9754a0970af1861e7865f9014f735c70dc60bf79.1663071587.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-13 07:47:11 -06:00
Linus Torvalds
d2b768c3d4 io_uring-6.0-2022-09-09
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmMbdjoQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpnupD/9r6uN+ZnxYs3lYHZxi5NNFJ+CHzFFfxOVK
 fTs/JQwJbKey05Rk8F77jt9np5RBoTOsrUXV3Z2B48Mr+8jp3g2tMqb/PbULL+pa
 kt5RZ1rxmD1IWARbd6nWweJh1CJbqB5Z/wijLnU4ZBmjtN76RLlJRZfEd6FTRkwY
 DPFOdmmKks2io+QvDABundj80z2Hrk15KH7dN5yt+4DPNofhMUk5XOQD/CM7tnWa
 KKVvH4MjewrXzdS6OgbBv2Z6L+nJcQniOjAJdKSC3X4DEsRlN8EkF7tPhcVksss9
 kmpilwfh0Um5ZyrxN8igzRqZ78l+lqkjG285rN8+MKsJar/X1KFfsZlG3ErfOe9T
 wEc9KHiPrGfDUJoCt6WzPnYQcTEmzqu8x4ip62UjaJ+c4wL7TIWTmcNRpxzqTto5
 2iouoBzC6nh/xrYPFSwRopglfXZZhJhM8X+yYpKUQHJ+uS/JGcsCJSihM3qRzJWy
 47IQ61GKpTFfnoJuftU8YjjjgUdYh1fAtxDhx5j6qI52Zflz3UKqmWC44yJXUuZM
 EHNzMHD2PMoSYDu6kmOcj+jTIlK1dHrA00riGySHDsjLh4MztXyVPc7pKGshneS7
 Qrq8AcqYmnlCmxfAJcmNQ3BsT1aCCJqnzfCAzpObUU/7VFMiU0Y3vSJSOKhViPxS
 e2O/H2VBBw==
 =8R3K
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-6.0-2022-09-09' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:

 - Removed function that became unused after last week's merge (Jiapeng)

 - Two small fixes for kbuf recycling (Pavel)

 - Include address copy for zc send for POLLFIRST (Pavel)

 - Fix for short IO handling in the normal read/write path (Pavel)

* tag 'io_uring-6.0-2022-09-09' of git://git.kernel.dk/linux-block:
  io_uring/rw: fix short rw error handling
  io_uring/net: copy addr for zc on POLL_FIRST
  io_uring: recycle kbuf recycle on tw requeue
  io_uring/kbuf: fix not advancing READV kbuf ring
  io_uring/notif: Remove the unused function io_notif_complete()
2022-09-09 14:57:18 -04:00
Pavel Begunkov
4d9cb92ca4 io_uring/rw: fix short rw error handling
We have a couple of problems, first reports of unexpected link breakage
for reads when cqe->res indicates that the IO was done in full. The
reason here is partial IO with retries.

TL;DR; we compare the result in __io_complete_rw_common() against
req->cqe.res, but req->cqe.res doesn't store the full length but rather
the length left to be done. So, when we pass the full corrected result
via kiocb_done() -> __io_complete_rw_common(), it fails.

The second problem is that we don't try to correct res in
io_complete_rw(), which, for instance, might be a problem for O_DIRECT
but when a prefix of data was cached in the page cache. We also
definitely don't want to pass a corrected result into io_rw_done().

The fix here is to leave __io_complete_rw_common() alone, always pass
not corrected result into it and fix it up as the last step just before
actually finishing the I/O.

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://github.com/axboe/liburing/issues/643
Reported-by: Beld Zhang <beldzhang@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-09 08:57:57 -06:00
Pavel Begunkov
3c8400532d io_uring/net: copy addr for zc on POLL_FIRST
Every time we return from an issue handler and expect the request to be
retried we should also setup it for async exec ourselves. Do that when
we return on IORING_RECVSEND_POLL_FIRST in io_sendzc(), otherwise it'll
re-read the address, which might be a surprise for the userspace.

Fixes: 092aeedb75 ("io_uring: allow to pass addr into sendzc")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ab1d0657890d6721339c56d2e161a4bba06f85d0.1662642013.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-08 08:28:38 -06:00
Pavel Begunkov
336d28a8f3 io_uring: recycle kbuf recycle on tw requeue
When we queue a request via tw for execution it's not going to be
executed immediately, so when io_queue_async() hits IO_APOLL_READY
and queues a tw but doesn't try to recycle/consume the buffer some other
request may try to use the the buffer.

Fixes: c7fb19428d ("io_uring: add support for ring mapped supplied buffers")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a19bc9e211e3184215a58e129b62f440180e9212.1662480490.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-07 10:36:10 -06:00
Pavel Begunkov
df6d3422d3 io_uring/kbuf: fix not advancing READV kbuf ring
When we don't recycle a selected ring buffer we should advance the head
of the ring, so don't just skip io_kbuf_recycle() for IORING_OP_READV
but adjust the ring.

Fixes: 934447a603 ("io_uring: do not recycle buffer in READV")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/a6d85e2611471bcb5d5dcd63a8342077ddc2d73d.1662480490.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-07 10:36:10 -06:00
Jiapeng Chong
4fa07edbb7 io_uring/notif: Remove the unused function io_notif_complete()
The function io_notif_complete() is defined in the notif.c file, but not
called elsewhere, so delete this unused function.

io_uring/notif.c:24:20: warning: unused function 'io_notif_complete' [-Wunused-function].

Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=2047
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20220905020436.51894-1-jiapeng.chong@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-05 11:42:39 -06:00
Linus Torvalds
cec53f4c8d io_uring-6.0-2022-09-02
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmMSIfUQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpmWyD/9vhN1GTy41KS+GoU4ljBCNx3ubigp9FqXc
 baCFaDEQlsUuYyuaG/DAWghs/dDJc/Qk51cpii+QOrU2yVw6vvOO32P2rbSJYW+l
 qlqxNm3Mmpr+amMc94RcL26r69av6l+6UA4HTgziWia2LR4QklvthmC+G65HqQRt
 dbkWRb7KPqlmtFyeTQSzy0mgwwra66zyUUEpcif4Fhu1puV64MD7IGvR9INsS50r
 Mey2Uo+ANvEKtqhhAFBPL7/LH49IyYJtLoJiH5PdCJzS0UTIUkb5FQoiu2gQTeF2
 /2ihT6wqe4T75Zo2Ofm8qK7OQF2+vM2KQWBc7ytiBNat32ypT/MGS3YtRd1p3zqc
 XTtfrtkzu7eoYGKCZfgZrRwKzfJuV9SjOwZEtfLgV42gTLnNkPZQCGsSPfPnVNeU
 Lc4GCxmf95blNvlkJK5SCtZ+SQQ+zOFm9sPuiDtmjJ5yMxUd28n52dzxTXqfG8su
 75Zgh+tH5rFK2Vi0tn8mWsm8/ZLwEdMoyG1OCi9eXWyfxXfR2iYj7PK9a/T9aTXn
 cr5JwuGbfW6pnnoTq9CSKFknJSar4FiJMCkQuhtzBMmhl23gpjJifqs4PXWQpX+s
 e2rL5B0/9DOKXmagTqPYtfBLvvmnNRgZ7wE+N3gLbPbtRPjb6XupOv3nudLA+rgR
 keJh/vVhqQ==
 =yXSa
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-6.0-2022-09-02' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:

 - A single fix for over-eager retries for networking (Pavel)

 - Revert the notification slot support for zerocopy sends.

   It turns out that even after more than a year or development and
   testing, there's not full agreement on whether just using plain
   ordered notifications is Good Enough to avoid the complexity of using
   the notifications slots. Because of that, we decided that it's best
   left to a future final decision.

   We can always bring back this feature, but we can't really change it
   or remove it once we've released 6.0 with it enabled. The reverts
   leave the usual CQE notifications as the primary interface for
   knowing when data was sent, and when it was acked. (Pavel)

* tag 'io_uring-6.0-2022-09-02' of git://git.kernel.dk/linux-block:
  selftests/net: return back io_uring zc send tests
  io_uring/net: simplify zerocopy send user API
  io_uring/notif: remove notif registration
  Revert "io_uring: rename IORING_OP_FILES_UPDATE"
  Revert "io_uring: add zc notification flush requests"
  selftests/net: temporarily disable io_uring zc test
  io_uring/net: fix overexcessive retries
2022-09-02 16:37:01 -07:00
Al Viro
e81f574da0 __io_setxattr(): constify path
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-09-01 17:39:05 -04:00
Pavel Begunkov
b48c312be0 io_uring/net: simplify zerocopy send user API
Following user feedback, this patch simplifies zerocopy send API. One of
the main complaints is that the current API is difficult with the
userspace managing notification slots, and then send retries with error
handling make it even worse.

Instead of keeping notification slots change it to the per-request
notifications model, which posts both completion and notification CQEs
for each request when any data has been sent, and only one CQE if it
fails. All notification CQEs will have IORING_CQE_F_NOTIF set and
IORING_CQE_F_MORE in completion CQEs indicates whether to wait a
notification or not.

IOSQE_CQE_SKIP_SUCCESS is disallowed with zerocopy sends for now.

This is less flexible, but greatly simplifies the user API and also the
kernel implementation. We reuse notif helpers in this patch, but in the
future there won't be need for keeping two requests.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/95287640ab98fc9417370afb16e310677c63e6ce.1662027856.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-01 09:13:33 -06:00
Pavel Begunkov
57f332246a io_uring/notif: remove notif registration
We're going to remove the userspace exposed zerocopy notification API,
remove notification registration.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6ff00b97be99869c386958a990593c9c31cf105b.1662027856.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-01 09:13:33 -06:00
Pavel Begunkov
d9808ceb31 Revert "io_uring: rename IORING_OP_FILES_UPDATE"
This reverts commit 4379d5f15b.

We removed notification flushing, also cleanup uapi preparation changes
to not pollute it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/89edc3905350f91e1b6e26d9dbf42ee44fd451a2.1662027856.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-01 09:13:33 -06:00
Pavel Begunkov
23c12d5fc0 Revert "io_uring: add zc notification flush requests"
This reverts commit 492dddb4f6.

Soon we won't have the very notion of notification flushing, so remove
notification flushing requests.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8850334ca56e65b413cb34fd158db81d7b2865a3.1662027856.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-01 09:13:33 -06:00
Linus Torvalds
9c9d1896fa lsm/stable-6.0 PR 20220829
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmMNEC8UHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXN6uA//Wvoj5l33ngi5p6CNAfxrZiOeeki7
 ylMO9NF4BZY+BOKtWDcrUvpZoLCEEEtLihQ8vz7Iyedtpd34KBzI+H+36JDC9jei
 dWZiXYzzmaN6JVQ2pIGWr9kTfRPbbE4X91bI2jhDOBv64zCqZu2qDoXshud5WHU1
 XhMMtAsQHKrdZa29y6nj6xHYuVA/fkpL5rg5LDrFDYwS7fV+g02ATmRnEsGefRNu
 JbjrapAnl6lWO6peRuyLNzf6NNgLLsXAmYOdyJGERKx23TSwqVMGhK6eODYBttiH
 E9OfFDz3oqbLfVrL6uBlr30T1lnns+WyRWdRvAP36L9wbQ/0o24mGsf5E20wo1T9
 rwPNsFelI66Eu2S1v/DQWtGtzeaed5IrWMtQc93x4I1PQIxwMSP4znWEKg/2zDNQ
 tBVVjs6bIzWHbeYozmKK9xvtqL08F5H6t+cS7BDVWPfb8nAfiXvyrwgCRY36xHfO
 LJWb125lbDflkPRiIgf81IAE6SZLH/PFLowNXZUSAo0CTALhlGZXmhNr6Oz7Xr2A
 NIwKvuFNqGav0Rcsk+Qy0ir6jRKOj9854U4y3kAVOAhPSyBVZAoN1Y3wtiOpmdI0
 taLNKv9W46ZxQtqQNOm31/py3N4bZl0y2JvS4lvwbDMqCjCqVE7236GjQ0vtYQQi
 8thpb268VJTby8Y=
 =/7Pp
 -----END PGP SIGNATURE-----

Merge tag 'lsm-pr-20220829' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm

Pull LSM support for IORING_OP_URING_CMD from Paul Moore:
 "Add SELinux and Smack controls to the io_uring IORING_OP_URING_CMD.

  These are necessary as without them the IORING_OP_URING_CMD remains
  outside the purview of the LSMs (Luis' LSM patch, Casey's Smack patch,
  and my SELinux patch). They have been discussed at length with the
  io_uring folks, and Jens has given his thumbs-up on the relevant
  patches (see the commit descriptions).

  There is one patch that is not strictly necessary, but it makes
  testing much easier and is very trivial: the /dev/null
  IORING_OP_URING_CMD patch."

* tag 'lsm-pr-20220829' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm:
  Smack: Provide read control for io_uring_cmd
  /dev/null: add IORING_OP_URING_CMD support
  selinux: implement the security_uring_cmd() LSM hook
  lsm,io_uring: add LSM hooks for the new uring_cmd file op
2022-08-31 09:23:16 -07:00
Pavel Begunkov
dfb58b1796 io_uring/net: fix overexcessive retries
Length parameter of io_sg_from_iter() can be smaller than the iterator's
size, as it's with TCP, so when we set from->count at the end of the
function we truncate the iterator forcing TCP to return preliminary with
a short send. It affects zerocopy sends with large payload sizes and
leads to retries and possible request failures.

Fixes: 3ff1a0d395 ("io_uring: enable managed frags with register buffers")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0bc0d5179c665b4ef5c328377c84c7a1f298467e.1661530037.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-26 10:31:42 -06:00
Luis Chamberlain
2a58401240 lsm,io_uring: add LSM hooks for the new uring_cmd file op
io-uring cmd support was added through ee692a21e9 ("fs,io_uring:
add infrastructure for uring-cmd"), this extended the struct
file_operations to allow a new command which each subsystem can use
to enable command passthrough. Add an LSM specific for the command
passthrough which enables LSMs to inspect the command details.

This was discussed long ago without no clear pointer for something
conclusive, so this enables LSMs to at least reject this new file
operation.

[0] https://lkml.kernel.org/r/8adf55db-7bab-f59d-d612-ed906b948d19@schaufler-ca.com

Cc: stable@vger.kernel.org
Fixes: ee692a21e9 ("fs,io_uring: add infrastructure for uring-cmd")
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2022-08-26 11:19:43 -04:00
Pavel Begunkov
581711c466 io_uring/net: save address for sendzc async execution
We usually copy all bits that a request needs from the userspace for
async execution, so the userspace can keep them on the stack. However,
send zerocopy violates this pattern for addresses and may reloads it
e.g. from io-wq. Save the address if any in ->async_data as usual.

Reported-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d7512d7aa9abcd36e9afe1a4d292a24cb2d157e5.1661342812.git.asml.silence@gmail.com
[axboe: fold in incremental fix]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-25 07:52:30 -06:00
Pavel Begunkov
5916943943 io_uring: conditional ->async_data allocation
There are opcodes that need ->async_data only in some cases and
allocation it unconditionally may hurt performance. Add an option to
opdef to make move the allocation part from the core io_uring to opcode
specific code.
Note, we can't just set opdef->async_size to zero because there are
other helpers that rely on it, e.g. io_alloc_async_data().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9dc62be9e88dd0ed63c48365340e8922d2498293.1661342812.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-24 08:57:28 -06:00
Pavel Begunkov
53bdc88aac io_uring/notif: order notif vs send CQEs
Currently, there is no ordering between notification CQEs and
completions of the send flushing it, this quite complicates the
userspace, especially since we don't flush notification when the
send(+flush) request fails, i.e. there will be only one CQE. What we
can do is to make sure that notification completions come only after
sends.

The easiest way to achieve this is to not try to complete a notification
inline from io_sendzc() but defer it to task_work, considering that
io-wq sendzc is disallowed CQEs will be naturally ordered because
task_works will only be executed after we're done with submission and so
inline completion.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/cddfd1c2bf91f22b9fe08e13b7dffdd8f858a151.1661342812.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-24 08:57:28 -06:00
Pavel Begunkov
986e263def io_uring/net: fix indentation
Fix up indentation before we get complaints from tooling.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/bd5754e3764215ccd7fb04cd636ea9167aaa275d.1661342812.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-24 08:57:15 -06:00
Pavel Begunkov
5a848b7c9e io_uring/net: fix zc send link failing
Failed requests should be marked with req_set_fail(), so links and cqe
skipping work correctly, which is missing in io_sendzc(). Note,
io_sendzc() return IOU_OK on failure, so the core code won't do the
cleanup for us.

Fixes: 06a5464be8 ("io_uring: wire send zc request type")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e47d46fda9db30154ce66a549bb0d3380b780520.1661342812.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-24 08:57:00 -06:00
Pavel Begunkov
2cacedc873 io_uring/net: fix must_hold annotation
Fix up the io_alloc_notif()'s __must_hold as we don't have a ctx
argument there but should get it from the slot instead.

Reported-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/cbb0a920f18e0aed590bf58300af817b9befb8a3.1661342812.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-24 08:57:00 -06:00
Kanchan Joshi
a9c3eda7ea io_uring: fix submission-failure handling for uring-cmd
If ->uring_cmd returned an error value different from -EAGAIN or
-EIOCBQUEUED, it gets overridden with IOU_OK. This invites trouble
as caller (io_uring core code) handles IOU_OK differently than other
error codes.
Fix this by returning the actual error code.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-23 09:46:17 -06:00
Jens Axboe
47abea041f io_uring: fix off-by-one in sync cancelation file check
The passed in index should be validated against the number of registered
files we have, it needs to be smaller than the index value to avoid going
one beyond the end.

Fixes: 78a861b949 ("io_uring: add sync cancelation API through io_uring_register()")
Reported-by: Luo Likang <luolikang@nsfocus.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-23 07:26:08 -06:00
Pavel Begunkov
3f743e9bbb io_uring/net: use right helpers for async_data
There is another spot where we check ->async_data directly instead of
using req_has_async_data(), which is the way to do it, fix it up.

Fixes: 43e0bbbd0b ("io_uring: add netmsg cache")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/42f33b9a81dd6ae65dda92f0372b0ff82d548517.1660822636.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-18 07:27:20 -06:00
Pavel Begunkov
5993000dc6 io_uring/notif: raise limit on notification slots
1024 notification slots is rather an arbitrary value, raise it up,
everything is accounted to memcg.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/eb78a0a5f2fa5941f8e845cdae5fb399bf7ba0be.1660566179.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-15 21:34:00 -06:00
Pavel Begunkov
86dc8f23bb io_uring/net: improve zc addr import error handling
We may account memory to a memcg of a request that didn't even got to
the network layer. It's not a bug as it'll be routinely cleaned up on
flush, but it might be confusing for the userspace.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b8aae61f4c3ddc4da97c1da876bb73871f352d50.1660566179.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-15 21:34:00 -06:00
Pavel Begunkov
063604265f io_uring/net: use right helpers for async recycle
We have a helper that checks for whether a request contains anything in
->async_data or not, namely req_has_async_data(). It's better to use it
as it might have some extra considerations.

Fixes: 43e0bbbd0b ("io_uring: add netmsg cache")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b7414da4e7c3c32c31fc02dfd1355af4ccf4ca5f.1660566179.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-15 21:34:00 -06:00
Linus Torvalds
1da8cf961b io_uring-6.0-2022-08-13
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmL3+fQQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpmXyEACfERdKYdZ/W3IvPoyK8CJ3p7f/6SOj2/p1
 DTuaa3l7/kVq2HcRUGgZwvgeWpOCFghdBm5co/4hGqSw7bT8rERGDelo41ohhTfr
 xKIiwJflK/s280VXLJFA+o7Jeoj1oTFYCmdUmU3wcKFVnQdu1rz9s0L6bwsEqq93
 y1uty96dxYZn2mENLbBah0x9yV0h2ZxRkguUm0sdnKl/tMkUVLSD1TPLHf2s6eAL
 o3Dbmo9jv4HFXoJj8YL50Oxl22zIKBHl9hZqHdLcKesFgyFTChckKUNijWyPL2vE
 zesbnd57sXgY6ghi4LDGeCOtN41WNjiVeAm/c4XK5oFhTag8Q2x0D1hTPUByHksl
 IV/116xs6pHTeZRhNlMOBVMZGLSz95zSuRUyTONAmKgc/b3if/w3zTi1W3CnJSlx
 7O5GpqQDZTQuin0jldNKImbx1aPAATb+UWDkl7O5aXkjw4FUtxT5GrYcBNswVuKX
 iybx8NyVn8kFD1hix3U8huBOPSg1JMkR+sFml+NqYRd4i2CwV8KAPPuzsPw6MRBL
 U4DfkAkpsbKqSK+mri5aUrYxmpYkJ45mgyldiewiOso9+AYg9DDp3D2iGgAiRbKm
 i3pz1Gh/3iUow0UAI5ZFlDhjHgWPlIH7IBbemivhjhFV4GrXJqTwUzsA1iDKTe14
 3lHKkAPVPA==
 =FfLf
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-6.0-2022-08-13' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:

 - Regression fix for this merge window, fixing a wrong order of
   arguments for io_req_set_res() for passthru (Dylan)

 - Fix for the audit code leaking context memory (Peilin)

 - Ensure that provided buffers are memcg accounted (Pavel)

 - Correctly handle short zero-copy sends (Pavel)

 - Sparse warning fixes for the recvmsg multishot command (Dylan)

 - Error handling fix for passthru (Anuj)

 - Remove randomization of struct kiocb fields, to avoid it growing in
   size if re-arranged in such a fashion that it grows more holes or
   padding (Keith, Linus)

 - Small series improving type safety of the sqe fields (Stefan)

* tag 'io_uring-6.0-2022-08-13' of git://git.kernel.dk/linux-block:
  io_uring: add missing BUILD_BUG_ON() checks for new io_uring_sqe fields
  io_uring: make io_kiocb_to_cmd() typesafe
  fs: don't randomize struct kiocb fields
  io_uring: consistently make use of io_notif_to_data()
  io_uring: fix error handling for io_uring_cmd
  io_uring: fix io_recvmsg_prep_multishot sparse warnings
  io_uring/net: send retry for zerocopy
  io_uring: mem-account pbuf buckets
  audit, io_uring, io-wq: Fix memory leak in io_sq_thread() and io_wqe_worker()
  io_uring: pass correct parameters to io_req_set_res
2022-08-13 13:28:54 -07:00
Stefan Metzmacher
9c71d39aa0 io_uring: add missing BUILD_BUG_ON() checks for new io_uring_sqe fields
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Link: https://lore.kernel.org/r/ffcaf8dc4778db4af673822df60dbda6efdd3065.1660201408.git.metze@samba.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-12 17:01:00 -06:00
Stefan Metzmacher
f2ccb5aed7 io_uring: make io_kiocb_to_cmd() typesafe
We need to make sure (at build time) that struct io_cmd_data is not
casted to a structure that's larger.

Signed-off-by: Stefan Metzmacher <metze@samba.org>
Link: https://lore.kernel.org/r/c024cdf25ae19fc0319d4180e2298bade8ed17b8.1660201408.git.metze@samba.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-12 17:01:00 -06:00
Stefan Metzmacher
da2634e89c io_uring: consistently make use of io_notif_to_data()
This makes the assignment typesafe. It prepares
changing io_kiocb_to_cmd() in the next commit.

Signed-off-by: Stefan Metzmacher <metze@samba.org>
Link: https://lore.kernel.org/r/8da6e9d12cf95ad4bc73274406d12bca7aabf72e.1660201408.git.metze@samba.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-11 10:56:13 -06:00
Anuj Gupta
3ed159c984 io_uring: fix error handling for io_uring_cmd
Commit 97b388d70b ("io_uring: handle completions in the core") moved the
error handling from handler to core. But for io_uring_cmd handler we end
up completing more than once (both in handler and in core) leading to
use_after_free.
Change io_uring_cmd handler to avoid calling io_uring_cmd_done in case
of error.

Fixes: 97b388d70b ("io_uring: handle completions in the core")
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20220811091459.6929-1-anuj20.g@samsung.com
[axboe: fix ret vs req typo]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-11 10:56:00 -06:00
Dylan Yudaken
d1f6222c49 io_uring: fix io_recvmsg_prep_multishot sparse warnings
Fix casts missing the __user parts. This seemed to only cause errors on
the alpha build, or if checked with sparse, but it was definitely an
oversight.

Reported-by: kernel test robot <lkp@intel.com>
Fixes: 9bb66906f2 ("io_uring: support multishot in recvmsg")
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220805115450.3921352-1-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-05 08:41:18 -06:00
Pavel Begunkov
4a933e6208 io_uring/net: send retry for zerocopy
io_uring handles short sends/recvs for stream sockets when MSG_WAITALL
is set, however new zerocopy send is inconsistent in this regard, which
might be confusing. Handle short sends.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b876a4838597d9bba4f3215db60d72c33c448ad0.1659622472.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-04 08:35:16 -06:00
Pavel Begunkov
cc18cc5e82 io_uring: mem-account pbuf buckets
Potentially, someone may create as many pbuf bucket as there are indexes
in an xarray without any other restrictions bounding our memory usage,
put memory needed for the buckets under memory accounting.

Cc: <stable@vger.kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d34c452e45793e978d26e2606211ec9070d329ea.1659622312.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-04 08:35:07 -06:00
Peilin Ye
f482aa9865 audit, io_uring, io-wq: Fix memory leak in io_sq_thread() and io_wqe_worker()
Currently @audit_context is allocated twice for io_uring workers:

  1. copy_process() calls audit_alloc();
  2. io_sq_thread() or io_wqe_worker() calls audit_alloc_kernel() (which
     is effectively audit_alloc()) and overwrites @audit_context,
     causing:

  BUG: memory leak
  unreferenced object 0xffff888144547400 (size 1024):
<...>
    hex dump (first 32 bytes):
      00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00  ................
      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    backtrace:
      [<ffffffff8135cfc3>] audit_alloc+0x133/0x210
      [<ffffffff81239e63>] copy_process+0xcd3/0x2340
      [<ffffffff8123b5f3>] create_io_thread+0x63/0x90
      [<ffffffff81686604>] create_io_worker+0xb4/0x230
      [<ffffffff81686f68>] io_wqe_enqueue+0x248/0x3b0
      [<ffffffff8167663a>] io_queue_iowq+0xba/0x200
      [<ffffffff816768b3>] io_queue_async+0x113/0x180
      [<ffffffff816840df>] io_req_task_submit+0x18f/0x1a0
      [<ffffffff816841cd>] io_apoll_task_func+0xdd/0x120
      [<ffffffff8167d49f>] tctx_task_work+0x11f/0x570
      [<ffffffff81272c4e>] task_work_run+0x7e/0xc0
      [<ffffffff8125a688>] get_signal+0xc18/0xf10
      [<ffffffff8111645b>] arch_do_signal_or_restart+0x2b/0x730
      [<ffffffff812ea44e>] exit_to_user_mode_prepare+0x5e/0x180
      [<ffffffff844ae1b2>] syscall_exit_to_user_mode+0x12/0x20
      [<ffffffff844a7e80>] do_syscall_64+0x40/0x80

Then,

  3. io_sq_thread() or io_wqe_worker() frees @audit_context using
     audit_free();
  4. do_exit() eventually calls audit_free() again, which is okay
     because audit_free() does a NULL check.

As suggested by Paul Moore, fix it by deleting audit_alloc_kernel() and
redundant audit_free() calls.

Fixes: 5bd2182d58 ("audit,io_uring,io-wq: add some basic audit support to io_uring")
Suggested-by: Paul Moore <paul@paul-moore.com>
Cc: stable@vger.kernel.org
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Link: https://lore.kernel.org/r/20220803222343.31673-1-yepeilin.cs@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-04 08:33:54 -06:00
Linus Torvalds
5264406cdb iov_iter work, part 1 - isolated cleanups and optimizations.
One of the goals is to reduce the overhead of using ->read_iter()
 and ->write_iter() instead of ->read()/->write(); new_sync_{read,write}()
 has a surprising amount of overhead, in particular inside iocb_flags().
 That's why the beginning of the series is in this pile; it's not directly
 iov_iter-related, but it's a part of the same work...
 
 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCYurGOQAKCRBZ7Krx/gZQ
 6ysyAP91lvBfMRepcxpd9kvtuzWkU8A3rfSziZZteEHANB9Q7QEAiPn2a2OjWkcZ
 uAyUWfCkHCNx+dSMkEvUgR5okQ0exAM=
 =9UCV
 -----END PGP SIGNATURE-----

Merge tag 'pull-work.iov_iter-base' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs iov_iter updates from Al Viro:
 "Part 1 - isolated cleanups and optimizations.

  One of the goals is to reduce the overhead of using ->read_iter() and
  ->write_iter() instead of ->read()/->write().

  new_sync_{read,write}() has a surprising amount of overhead, in
  particular inside iocb_flags(). That's the explanation for the
  beginning of the series is in this pile; it's not directly
  iov_iter-related, but it's a part of the same work..."

* tag 'pull-work.iov_iter-base' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  first_iovec_segment(): just return address
  iov_iter: massage calling conventions for first_{iovec,bvec}_segment()
  iov_iter: first_{iovec,bvec}_segment() - simplify a bit
  iov_iter: lift dealing with maxpages out of first_{iovec,bvec}_segment()
  iov_iter_get_pages{,_alloc}(): cap the maxsize with MAX_RW_COUNT
  iov_iter_bvec_advance(): don't bother with bvec_iter
  copy_page_{to,from}_iter(): switch iovec variants to generic
  keep iocb_flags() result cached in struct file
  iocb: delay evaluation of IS_SYNC(...) until we want to check IOCB_DSYNC
  struct file: use anonymous union member for rcuhead and llist
  btrfs: use IOMAP_DIO_NOSYNC
  teach iomap_dio_rw() to suppress dsync
  No need of likely/unlikely on calls of check_copy_size()
2022-08-03 13:50:22 -07:00
Ming Lei
ff2557b722 io_uring: pass correct parameters to io_req_set_res
The two parameters of 'res' and 'cflags' are swapped, so fix it.
Without this fix, 'ublk del' hangs forever.

Cc: Pavel Begunkov <asml.silence@gmail.com>
Fixes: de23077eda ("io_uring: set completion results upfront")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220803120757.1668278-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-03 08:45:25 -06:00
Linus Torvalds
42df1cbf6a for-5.20/io_uring-zerocopy-send-2022-07-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmLkm/MQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpoaXD/9Nevo4KQmlG83ZcZfu2d51VlGtt6/Dl7LL
 pr07RfnRFJcjeCPCwXCXmu6rrlY+inpfEWv9iCR/ImoeESOJCzm0dN/nlffO/zT1
 E0h5AlEoDv2bYrCnVkbfvxL722TZqGeLiDE4YY1jVbuUfs3TDmLQzfGbORK+Zw4y
 wPEMDZP1yWHoyeHUGWFasu6dpWiAwsZ4sTX0J631YwIBDNWKZqtienIiY15rK4dz
 GioBea6voe8Fos0VEhCBOKXMmV9mG4yVOPeaDbTWTRfuzGNF8b7t2vg7mz+PrbBY
 M8h1oEt+/+FnsCIZqfaEUzqHX6quv46OVtq/F5L3yNz/5QEsnqfv08ZFwD3sXdgZ
 /RFxXamfcn/LoxzZ9eLu3MeyzpXp6frxBcgTNGc3q2TlIwXr1WsIx2N4PxZh00GM
 ssW/ulaOZvZmOmDlbdeSC7sp3R1JmHO4qVlHowr58ce8pkishNTwlZZGr0sHyeNq
 /Wkd9NQEQEFD6AIzZ/Mz9CsmzHeHYpy6GhicFrcLuU4YF/fnQ6T4hTjlIlucGv/S
 IeqoAHrurCB0/p1ml6VfJ58xUWXNCCCkKC5+xu8Vm6/RgMlIw5KkzvVEBfflnomB
 wVJLYsLw41gnlqqpwISR39I7cDV+s6xC5P8YAA/NLz692HDIUrRX14dlbZuXIgbc
 ROeHB2N5+g==
 =vSwm
 -----END PGP SIGNATURE-----

Merge tag 'for-5.20/io_uring-zerocopy-send-2022-07-29' of git://git.kernel.dk/linux-block

Pull io_uring zerocopy support from Jens Axboe:
 "This adds support for efficient support for zerocopy sends through
  io_uring. Both ipv4 and ipv6 is supported, as well as both TCP and
  UDP.

  The core network changes to support this is in a stable branch from
  Jakub that both io_uring and net-next has pulled in, and the io_uring
  changes are layered on top of that.

  All of the work has been done by Pavel"

* tag 'for-5.20/io_uring-zerocopy-send-2022-07-29' of git://git.kernel.dk/linux-block: (34 commits)
  io_uring: notification completion optimisation
  io_uring: export req alloc from core
  io_uring/net: use unsigned for flags
  io_uring/net: make page accounting more consistent
  io_uring/net: checks errors of zc mem accounting
  io_uring/net: improve io_get_notif_slot types
  selftests/io_uring: test zerocopy send
  io_uring: enable managed frags with register buffers
  io_uring: add zc notification flush requests
  io_uring: rename IORING_OP_FILES_UPDATE
  io_uring: flush notifiers after sendzc
  io_uring: sendzc with fixed buffers
  io_uring: allow to pass addr into sendzc
  io_uring: account locked pages for non-fixed zc
  io_uring: wire send zc request type
  io_uring: add notification slot registration
  io_uring: add rsrc referencing for notifiers
  io_uring: complete notifiers in tw
  io_uring: cache struct io_notif
  io_uring: add zc notification infrastructure
  ...
2022-08-02 13:37:55 -07:00
Pavel Begunkov
14b146b688 io_uring: notification completion optimisation
We want to use all optimisations that we have for io_uring requests like
completion batching, memory caching and more but for zc notifications.
Fortunately, notification perfectly fit the request model so we can
overlay them onto struct io_kiocb and use all the infratructure.

Most of the fields of struct io_notif natively fits into io_kiocb, so we
replace struct io_notif with struct io_kiocb carrying struct
io_notif_data in the cmd cache line. Then we adapt io_alloc_notif() to
use io_alloc_req()/io_alloc_req_refill(), and kill leftovers of hand
coded caching. __io_notif_complete_tw() is converted to use io_uring's
tw infra.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9e010125175e80baf51f0ca63bdc7cc6a4a9fa56.1658913593.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-27 08:50:50 -06:00
Pavel Begunkov
bd1a3783dd io_uring: export req alloc from core
We want to do request allocation out of the core io_uring code, make the
allocation functions public for other io_uring parts.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0314fedd3a02a514210ba42d4720332538c65956.1658913593.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-27 08:50:50 -06:00
Pavel Begunkov
293402e564 io_uring/net: use unsigned for flags
Use unsigned int type for msg flags.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5cfaed13d3191337b14b8664ca68b515d9e2d1b4.1658742118.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-25 09:48:25 -06:00
Pavel Begunkov
6a9ce66f4d io_uring/net: make page accounting more consistent
Make network page accounting more consistent with how buffer
registration is working, i.e. account all memory to ctx->user.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4aacfe64bbb81b27f9ecf5d5c219c69a07e5aa56.1658742118.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-25 09:48:25 -06:00
Pavel Begunkov
2e32ba5607 io_uring/net: checks errors of zc mem accounting
mm_account_pinned_pages() may fail, don't ignore the return value.

Fixes: e29e3bd4b9 ("io_uring: account locked pages for non-fixed zc")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/dae0542ed8e6706071bb83ad3e7ad6a70d207fd9.1658742118.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-25 09:48:17 -06:00
Pavel Begunkov
cb309ae49d io_uring/net: improve io_get_notif_slot types
Don't use signed int for slot indexing.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e4d15aefebb5e55729dd9b5ec01ab16b70033343.1658742118.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-25 09:47:45 -06:00
Pavel Begunkov
3ff1a0d395 io_uring: enable managed frags with register buffers
io_uring's registered buffers infra has a good performant way of pinning
pages, so let's use SKBFL_MANAGED_FRAG_REFS when our requests are purely
register buffer backed.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/278731d3f20caf346cfc025fbee0b4c9ee4ed751.1657643355.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:41:07 -06:00
Pavel Begunkov
492dddb4f6 io_uring: add zc notification flush requests
Overlay notification control onto IORING_OP_RSRC_UPDATE (former
IORING_OP_FILES_UPDATE). It allows to flush a range of zc notifications
from slots with indexes [sqe->off, sqe->off+sqe->len). If sqe->arg is
not zero, it also copies sqe->arg as a new tag for all flushed
notifications.

Note, it doesn't flush a notification of a slot if there was no requests
attached to it (since last flush or registration).

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/df13e2363400682a73dd9e71c3b990b8d1ff0333.1657643355.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:41:07 -06:00
Pavel Begunkov
4379d5f15b io_uring: rename IORING_OP_FILES_UPDATE
IORING_OP_FILES_UPDATE will be a more generic opcode serving different
resource types, rename it into IORING_OP_RSRC_UPDATE and add subtype
handling.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0a907133907d9af3415a8a7aa1802c6aa97c03c6.1657643355.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:41:07 -06:00
Pavel Begunkov
63809137eb io_uring: flush notifiers after sendzc
Allow to flush notifiers as a part of sendzc request by setting
IORING_SENDZC_FLUSH flag. When the sendzc request succeedes it will
flush the used [active] notifier.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e0b4d9a6797e2fd6092824fe42953db7a519bbc8.1657643355.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:41:07 -06:00
Pavel Begunkov
10c7d33ecd io_uring: sendzc with fixed buffers
Allow zerocopy sends to use fixed buffers. There is an optimisation for
this case, the network layer don't need to reference the pages, see
SKBFL_MANAGED_FRAG_REFS, so io_uring have to ensure validity of fixed
buffers until the notifier is released.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e1d8bd1b5934e541d90c1824eb4020ae3f5f43f3.1657643355.git.asml.silence@gmail.com
[axboe: fold in 32-bit pointer cast warning fix]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:41:07 -06:00
Pavel Begunkov
092aeedb75 io_uring: allow to pass addr into sendzc
Allow to specify an address to zerocopy sends making it more like
sendto(2).

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/70417a8f7c5b51ab454690bae08adc0c187f89e8.1657643355.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:41:07 -06:00