Another change to the API io-wq worker limitation API added in 5.15,
apply the limit to all prior users that already registered a tctx. It
may be confusing as it's now, in particular the change covers the
following 2 cases:
TASK1 | TASK2
_________________________________________________
ring = create() |
| limit_iowq_workers()
*not limited* |
TASK1 | TASK2
_________________________________________________
ring = create() |
| issue_requests()
limit_iowq_workers() |
| *not limited*
A note on locking, it's safe to traverse ->tctx_list as we hold
->uring_lock, but do that after dropping sqd->lock to avoid possible
problems. It's also safe to access tctx->io_wq there because tasks
kill it only after removing themselves from tctx_list, see
io_uring_cancel_generic() -> io_uring_clean_tctx()
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d6e09ecc3545e4dc56e43c906ee3d71b7ae21bed.1634818641.git.asml.silence@gmail.com
Reviewed-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_unprep_linked_timeout() is broken, first it needs to return back
REQ_F_ARM_LTIMEOUT, so the linked timeout is enqueued and disarmed. But
now we refcounted it, and linked timeouts may get not executed at all,
leaking a request.
Just kill the unprep optimisation.
Fixes: 906c6caaf5 ("io_uring: optimise io_prep_linked_timeout()")
Reported-by: Beld Zhang <beldzhang@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/51b8e2bfc4bea8ee625cf2ba62b2a350cc9be031.1634719585.git.asml.silence@gmail.com
Link: https://github.com/axboe/liburing/issues/460
Reported-by: Beld Zhang <beldzhang@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, IORING_REGISTER_IOWQ_MAX_WORKERS applies only to the task
that issued it, it's unexpected for users. If one task creates a ring,
limits workers and then passes it to another task the limit won't be
applied to the other task.
Another pitfall is that a task should either create a ring or submit at
least one request for IORING_REGISTER_IOWQ_MAX_WORKERS to work at all,
furher complicating the picture.
Change the API, save the limits and apply to all future users. Note, it
should be done first before giving away the ring or submitting new
requests otherwise the result is not guaranteed.
Fixes: 2e480058dd ("io-wq: provide a way to limit max number of workers")
Link: https://github.com/axboe/liburing/issues/460
Reported-by: Beld Zhang <beldzhang@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/51d0bae97180e08ab722c0d5c93e7439cfb6f697.1634683237.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Grab uring lock when we are in io-worker rather than in the original
or system-wq context since we already hold it in these two situation.
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Fixes: b66ceaf324 ("io_uring: move iopoll reissue into regular IO path")
Link: https://lore.kernel.org/r/20211014140400.50235-1-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We have never supported fasync properly, it would only fire when there
is something polling io_uring making it useless. The original support came
in through the initial io_uring merge for 5.1. Since it's broken and
nobody has reported it, get rid of the fasync bits.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2f7ca3d344d406d34fa6713824198915c41cea86.1633080236.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
From recently open/accept are now able to manipulate fixed file table,
but it's inconsistent that close can't. Close the gap, keep API same as
with open/accept, i.e. via sqe->file_slot.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We don't retry short writes and so we would never get to async setup in
io_write() in that case. Thus ret2 > 0 is always false and
iov_iter_advance() is never used. Apparently, the same is found by
Coverity, which complains on the code.
Fixes: cd65869512 ("io_uring: use iov_iter state save/restore helpers")
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5b33e61034748ef1022766efc0fb8854cfcf749c.1632500058.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There's no reason to punt it unconditionally, we just need to ensure that
the submit lock grabbing is conditional.
Fixes: 05f3fb3c53 ("io_uring: avoid ring quiesce for fixed file set unregister and update")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For each provided buffer, we allocate a struct io_buffer to hold the
data associated with it. As a large number of buffers can be provided,
account that data with memcg.
Fixes: ddf0322db7 ("io_uring: add IORING_OP_PROVIDE_BUFFERS")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For multishot mode, there may be cases like:
iowq original context
io_poll_add
_arm_poll()
mask = vfs_poll() is not 0
if mask
(2) io_poll_complete()
compl_unlock
(interruption happens
tw queued to original
context)
io_poll_task_func()
compl_lock
(3) done = io_poll_complete() is true
compl_unlock
put req ref
(1) if (poll->flags & EPOLLONESHOT)
put req ref
EPOLLONESHOT flag in (1) may be from (2) or (3), so there are multiple
combinations that can cause ref underfow.
Let's address it by:
- check the return value in (2) as done
- change (1) to if (done)
in this way, we only do ref put in (1) if 'oneshot flag' is from
(2)
- do poll.done check in io_poll_task_func(), so that we won't put ref
for the second time.
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210922101238.7177-4-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We should set EPOLLONESHOT if cqring_fill_event() returns false since
io_poll_add() decides to put req or not by it.
Fixes: 5082620fb2 ("io_uring: terminate multishot poll for CQ ring overflow")
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210922101238.7177-3-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If poll arming and poll completion runs in parallel, there maybe races.
For instance, run io_poll_add in iowq and io_poll_task_func in original
context, then:
iowq original context
io_poll_add
vfs_poll
(interruption happens
tw queued to original
context) io_poll_task_func
generate cqe
del from cancel_hash[]
if !poll.done
insert to cancel_hash[]
The entry left in cancel_hash[], similar case for fast poll.
Fix it by set poll.done = true when del from cancel_hash[].
Fixes: 5082620fb2 ("io_uring: terminate multishot poll for CQ ring overflow")
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210922101238.7177-2-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmFEikcQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpmG4D/93W/CdNgw88WFkYPfjwICKHOcSDZhGqMzh
Ug1cp4BP8lPkiCvyC8VfM3XMBUWf9j8Ijb4X7b+wjuBWaNQdJHlcb1XSEQj4sh8/
w6MUGUz76/z1z6DE0HzzPHRZyrdog+oW9jZ+qpKCjguVBcs4eu3NdY3LbDcrVvzV
xzi3o52NbvpHdgWl6LuQqJiIq0twG/6RiguKfqZDfxZxPq6m3cSgjWRLquAV9nUJ
+S6/wyGkaRK3qPMTtphWyL9TM1pr+od8K5tfKYlgdjsAoCkqIzpIJUR62rTKz3Be
jjPLxkP0TkE3YPRCjyvZR1Eb7ZwgfuyCszWnGtmBmOt5/JXDUPXEqiQPCg7rVj47
6x2JGe/bglCnSTWwYSvOQNJDqRVBiXBr59jOvSWNTFO2Tj5v9Q0dk2etgMYwA9oS
k5vdDhFLNW5T4aibNbpJFJctZaHu9N1rFkzvW4DTdur7lj64ePRMtugaU2F9PhBt
VwQlkjcuvz5GBjpwS6QdZ78ro0oUSgGOhYiRHJ8JUHJOqDv4SChyC3Tf9sD7ELzZ
/JJNviD8/iv8ZpHNKGlbwFdive4CxqXIrOYaTycrDJ32/oQkYnEWIaLMmGHaF/F+
hasiUdS5D277DVz2/R2e0e2s8YXhkmRipoHjEdq57zk7PqRolheVQdaqYuCSmtwH
MjcJi1hi6g==
=TnwU
-----END PGP SIGNATURE-----
Merge tag 'iov_iter.3-5.15-2021-09-17' of git://git.kernel.dk/linux-block
Pull io_uring iov_iter retry fixes from Jens Axboe:
"This adds a helper to save/restore iov_iter state, and modifies
io_uring to use it.
After that is done, we can now kill the iter->truncated addition that
we added for this release. The io_uring change is being overly
cautious with the save/restore/advance, but better safe than sorry and
we can always improve that and reduce the overhead if it proves to be
of concern. The only case to be worried about in this regard is huge
IO, where iteration can take a while to iterate segments.
I spent some time writing test cases, and expanded the coverage quite
a bit from the last posting of this. liburing carries this regression
test case now:
https://git.kernel.dk/cgit/liburing/tree/test/file-verify.c
which exercises all of this. It now also supports provided buffers,
and explicitly tests for end-of-file/device truncation as well.
On top of that, Pavel sanitized the IOPOLL retry path to follow the
exact same pattern as normal IO"
* tag 'iov_iter.3-5.15-2021-09-17' of git://git.kernel.dk/linux-block:
io_uring: move iopoll reissue into regular IO path
Revert "iov_iter: track truncated size"
io_uring: use iov_iter state save/restore helpers
iov_iter: add helper to save iov_iter state
Get rid of the need to do re-expand and revert on an iterator when we
encounter a short IO, or failure that warrants a retry. Use the new
state save/restore helpers instead.
We keep the iov_iter_state persistent across retries, if we need to
restart the read or write operation. If there's a pending retry, the
operation will always exit with the state correctly saved.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A common complaint is that using O_NONBLOCK files with io_uring can be a
bit of a pain. Be a bit nicer and allow normal retry IFF the file does
support async behavior. This makes it possible to use io_uring more
reliably with O_NONBLOCK files, for use cases where it either isn't
possible or feasible to modify the file flags.
Cc: stable@vger.kernel.org
Reported-and-tested-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It might be inconvenient that direct open/accept deviates from the
update semantics and fails if the slot is taken instead of removing a
file sitting there. Implement this auto-removal.
Note that removal might need to allocate and so may fail. However, if an
empty slot is specified, it's guaraneed to not fail on the fd
installation side for valid userspace programs. It's needed for users
who can't tolerate such failures, e.g. accept where the other end
never retries.
Suggested-by: Franz-B. Tuneke <franz-bernhard.tuneke@tu-dortmund.de>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c896f14ea46b0eaa6c09d93149e665c2c37979b4.1631632300.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move get_timespec() section in io_cqring_wait() before the sigmask
saving, otherwise we'll fail to restore sigmask once get_timespec()
returns error.
Fixes: c73ebb685f ("io_uring: add timeout support for io_uring_enter()")
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210914143852.9663-1-xiaoguang.wang@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We need to re-check sqd->thread after we've dropped the lock. Pin
the sqd before doing the lockdep lock dance, and check if the thread
is alive after that. It's either NULL or alive, as the SQPOLL thread
cannot exit without holding the same sqd->lock.
Reported-and-tested-by: syzbot+337de45f13a4fd54d708@syzkaller.appspotmail.com
Fixes: fa84693b3c ("io_uring: ensure IORING_REGISTER_IOWQ_MAX_WORKERS works with SQPOLL")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When setting up the next segment, we check what type the iter is and
handle it accordingly. However, when incrementing and processed amount
we do not, and both iter advance and addr/len are adjusted, regardless
of type. Split the increment side just like we do on the setup side.
Fixes: 4017eb91a9 ("io_uring: make loop_rw_iter() use original user supplied pointers")
Cc: stable@vger.kernel.org
Reported-by: Valentina Palmiotti <vpalmiotti@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmE8uxgQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgplL6EADYVpaEI9gIkSFsfkxvZ/akY8BfpTj48fP9
4zxNbchvtX+NcAuXjby6c/CvIO9QnViqgkSS9zxqZYJGYrYbsXsGV+fSZ6Vzc5tQ
bX2avxFa5iXhRVTRwxxml+m+trSKYPi2b2ETJbTwOavxDoic9BUs21/VwsW38CBU
8/JZXOOIPQUpjZ5ifhaLKZOxV8UWy5azrJNCkjHbW/oV2Od43b1zKPwI6/g15hfp
GVWvZ2u/QoDURicr5KjWcpj+XmWuevO07xysLZ49GeJncWjUbG+7lxpvhIOKaIFP
x7UYAkmzjKLS2PcO/M8fMHboIR0RiGvytHXK3rTa3TaL65sz6ZuM70fcokTT5jeZ
WSdKTCGKVT7JtHyk8CH+HH+00o2ecetGomC/3Mx+OrbpIEXUUQMfCNHak+lswmVl
Zn6HhU1Eb6nWCj6Oj09y2yWAuDb+WcOaLtI4PqQNOqsFTJAmTWqiO1qeYv+2d1YL
8i0xpRUi022Ai3bQdrmNDSsLBCAHpAxqaY//VROC+tDbHHeYchcf/Tl9m4CddQ4A
x8+iIfmgGB8nwVqWSz0zrFOV30csztnRnmGUOspSTvoL2j1lq7G2LX08sJ2uIEhB
vzddZJwnvM2uFYxCq3Vo/Y54CEwL6i6BG1bacwaM8Fp9Xufqfl5QanUAjYAvjUG0
zcvyIqznEw==
=aNr5
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.15-2021-09-11' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
- Fix an off-by-one in a BUILD_BUG_ON() check. Not a real issue right
now as we have plenty of flags left, but could become one. (Hao)
- Fix lockdep issue introduced in this merge window (me)
- Fix a few issues with the worker creation (me, Pavel, Qiang)
- Fix regression with wq_has_sleeper() for IOPOLL (Pavel)
- Timeout link error propagation fix (Pavel)
* tag 'io_uring-5.15-2021-09-11' of git://git.kernel.dk/linux-block:
io_uring: fix off-by-one in BUILD_BUG_ON check of __REQ_F_LAST_BIT
io_uring: fail links of cancelled timeouts
io-wq: fix memory leak in create_io_worker()
io-wq: fix silly logic error in io_task_work_match()
io_uring: drop ctx->uring_lock before acquiring sqd->lock
io_uring: fix missing mb() before waitqueue_active
io-wq: fix cancellation on create-worker failure
Build check of __REQ_F_LAST_BIT should be larger than, not equal or larger
than. It's perfectly valid to have __REQ_F_LAST_BIT be 32, as that means
that the last valid bit is 31 which does fit in the type.
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210907032243.114190-1-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull iov_iter fixes from Al Viro:
"Fixes for io-uring handling of iov_iter reexpands"
* 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
io_uring: reexpand under-reexpanded iters
iov_iter: track truncated size
The SQPOLL thread dictates the lock order, and we hold the ctx->uring_lock
for all the registration opcodes. We also hold a ref to the ctx, and we
do drop the lock for other reasons to quiesce, so it's fine to drop the
ctx lock temporarily to grab the sqd->lock. This fixes the following
lockdep splat:
======================================================
WARNING: possible circular locking dependency detected
5.14.0-syzkaller #0 Not tainted
------------------------------------------------------
syz-executor.5/25433 is trying to acquire lock:
ffff888023426870 (&sqd->lock){+.+.}-{3:3}, at: io_register_iowq_max_workers fs/io_uring.c:10551 [inline]
ffff888023426870 (&sqd->lock){+.+.}-{3:3}, at: __io_uring_register fs/io_uring.c:10757 [inline]
ffff888023426870 (&sqd->lock){+.+.}-{3:3}, at: __do_sys_io_uring_register+0x10aa/0x2e70 fs/io_uring.c:10792
but task is already holding lock:
ffff8880885b40a8 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_register+0x2e1/0x2e70 fs/io_uring.c:10791
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&ctx->uring_lock){+.+.}-{3:3}:
__mutex_lock_common kernel/locking/mutex.c:596 [inline]
__mutex_lock+0x131/0x12f0 kernel/locking/mutex.c:729
__io_sq_thread fs/io_uring.c:7291 [inline]
io_sq_thread+0x65a/0x1370 fs/io_uring.c:7368
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
-> #0 (&sqd->lock){+.+.}-{3:3}:
check_prev_add kernel/locking/lockdep.c:3051 [inline]
check_prevs_add kernel/locking/lockdep.c:3174 [inline]
validate_chain kernel/locking/lockdep.c:3789 [inline]
__lock_acquire+0x2a07/0x54a0 kernel/locking/lockdep.c:5015
lock_acquire kernel/locking/lockdep.c:5625 [inline]
lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5590
__mutex_lock_common kernel/locking/mutex.c:596 [inline]
__mutex_lock+0x131/0x12f0 kernel/locking/mutex.c:729
io_register_iowq_max_workers fs/io_uring.c:10551 [inline]
__io_uring_register fs/io_uring.c:10757 [inline]
__do_sys_io_uring_register+0x10aa/0x2e70 fs/io_uring.c:10792
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&ctx->uring_lock);
lock(&sqd->lock);
lock(&ctx->uring_lock);
lock(&sqd->lock);
*** DEADLOCK ***
Fixes: 2e480058dd ("io-wq: provide a way to limit max number of workers")
Reported-by: syzbot+97fa56483f69d677969f@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In case of !SQPOLL, io_cqring_ev_posted_iopoll() doesn't provide a
memory barrier required by waitqueue_active(&ctx->poll_wait). There is
a wq_has_sleeper(), which does smb_mb() inside, but it's called only for
SQPOLL.
Fixes: 5fd4617840 ("io_uring: be smarter about waking multiple CQ ring waiters")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2982e53bcea2274006ed435ee2a77197107d8a29.1631130542.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmEz5eEQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpmk1D/wML8Im2erR5s0PaWZgYxXlgEKrJDwJm/p+
2Uixrn/9kQAhwH+0kJnCiI+HwlL3LU+5/iAdeGtdYMcVaotPPmm5V3jfud8+RuAi
E+uIOdULXgQKj8pkiQ2h5mvYd0BxGkGH38gUqilSwFrY2HTpbfxreCHhYoQaE/7o
DiGNgbhJglSFIBuIgS4cfpLkI3FdaAmrCydZ9zaqEv/G/bx9aA9lwSbAJadhTbmt
Qc1vvbh2FB9YvgZX8qfaneyDKzQbwqTvKxCe2SOVMOp/X0feJym7WZUvrPr04EoZ
zBaLDkmn44re4iWPbide7+KQJ8NMQQDBiuxwF5WxdF3hrcsiwqmKgDtBEGWXFMeV
CUZ9Osrfb480UKsDExtxLhQqGz1JZqIPZdtDvSJb8MunPZtvTz27NNFyyb9aBrlX
WiwEHqAOE1W33buPCNyuYLGDVYis4/TkwF0NZpMwsyPdN0Iz/M8Z5F5BHhC7BYoP
U8KMsX3XvddxB113U+IMVqI/SuvT125U65brklQlQeLEHnH57ceII9mNGfNic6LR
bcIu7Fb5J1U5nAMeeLCSXsEYXs+peYgI1UOWXaWgSVixUAyU8H+OqsBVIl8eiMjr
TTbdIMmfWqENE3wBM709FQQLoMmGl1YjBkGmBXKZjNHcDrf9X56rimSxRD2i2okg
r2JczxQ5uQ==
=QoQg
-----END PGP SIGNATURE-----
Merge tag 'for-5.15/io_uring-2021-09-04' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"As sometimes happens, two reports came in around the merge window open
that led to some fixes. Hence this one is a bit bigger than usual
followup fixes, but most of it will be going towards stable, outside
of the fixes that are addressing regressions from this merge window.
In detail:
- postgres is a heavy user of signals between tasks, and if we're
unlucky this can interfere with io-wq worker creation. Make sure
we're resilient against unrelated signal handling. This set of
changes also includes hardening against allocation failures, which
could previously had led to stalls.
- Some use cases that end up having a mix of bounded and unbounded
work would have starvation issues related to that. Split the
pending work lists to handle that better.
- Completion trace int -> unsigned -> long fix
- Fix issue with REGISTER_IOWQ_MAX_WORKERS and SQPOLL
- Fix regression with hash wait lock in this merge window
- Fix retry issued on block devices (Ming)
- Fix regression with links in this merge window (Pavel)
- Fix race with multi-shot poll and completions (Xiaoguang)
- Ensure regular file IO doesn't inadvertently skip completion
batching (Pavel)
- Ensure submissions are flushed after running task_work (Pavel)"
* tag 'for-5.15/io_uring-2021-09-04' of git://git.kernel.dk/linux-block:
io_uring: io_uring_complete() trace should take an integer
io_uring: fix possible poll event lost in multi shot mode
io_uring: prolong tctx_task_work() with flushing
io_uring: don't disable kiocb_done() CQE batching
io_uring: ensure IORING_REGISTER_IOWQ_MAX_WORKERS works with SQPOLL
io-wq: make worker creation resilient against signals
io-wq: get rid of FIXED worker flag
io-wq: only exit on fatal signals
io-wq: split bounded and unbounded work into separate lists
io-wq: fix queue stalling race
io_uring: don't submit half-prepared drain request
io_uring: fix queueing half-created requests
io-wq: ensure that hash wait lock is IRQ disabling
io_uring: retry in case of short read on block device
io_uring: IORING_OP_WRITE needs hash_reg_file set
io-wq: fix race between adding work and activating a free worker
[ 74.211232] BUG: KASAN: stack-out-of-bounds in iov_iter_revert+0x809/0x900
[ 74.212778] Read of size 8 at addr ffff888025dc78b8 by task
syz-executor.0/828
[ 74.214756] CPU: 0 PID: 828 Comm: syz-executor.0 Not tainted
5.14.0-rc3-next-20210730 #1
[ 74.216525] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 74.219033] Call Trace:
[ 74.219683] dump_stack_lvl+0x8b/0xb3
[ 74.220706] print_address_description.constprop.0+0x1f/0x140
[ 74.224226] kasan_report.cold+0x7f/0x11b
[ 74.226085] iov_iter_revert+0x809/0x900
[ 74.227960] io_write+0x57d/0xe40
[ 74.232647] io_issue_sqe+0x4da/0x6a80
[ 74.242578] __io_queue_sqe+0x1ac/0xe60
[ 74.245358] io_submit_sqes+0x3f6e/0x76a0
[ 74.248207] __do_sys_io_uring_enter+0x90c/0x1a20
[ 74.257167] do_syscall_64+0x3b/0x90
[ 74.257984] entry_SYSCALL_64_after_hwframe+0x44/0xae
old_size = iov_iter_count();
...
iov_iter_revert(old_size - iov_iter_count());
If iov_iter_revert() is done base on the initial size as above, and the
iter is truncated and not reexpanded in the middle, it miscalculates
borders causing problems. This trace is due to no one reexpanding after
generic_write_checks().
Now iters store how many bytes has been truncated, so reexpand them to
the initial state right before reverting.
Cc: stable@vger.kernel.org
Reported-by: Palash Oswal <oswalpalash@gmail.com>
Reported-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Reported-and-tested-by: syzbot+9671693590ef5aad8953@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
IIUC, IORING_POLL_ADD_MULTI is similar to epoll's edge-triggered mode,
that means once one pure poll request returns one event(cqe), we'll
need to read or write continually until EAGAIN is returned, then I think
there is a possible poll event lost race in multi shot mode:
t1 poll request add | |
t2 | |
t3 event happens | |
t4 task work add | |
t5 | task work run |
t6 | commit one cqe |
t7 | | user app handles cqe
t8 | new event happen |
t9 | add back to waitqueue |
t10 |
After t6 but before t9, if new event happens, there'll be no wakeup
operation, and if user app has picked up this cqe in t7, read or write
until EAGAIN is returned. In t8, new event happens and will be lost,
though this race window maybe small.
To fix this possible race, add poll request back to waitqueue before
committing cqe.
Fixes: 88e41cf928 ("io_uring: add multishot mode for IORING_OP_POLL_ADD")
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210903142436.5767-1-xiaoguang.wang@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
SQPOLL has a different thread doing submissions, we need to check for
that and use the right task context when updating the worker values.
Just hold the sqd->lock across the operation, this ensures that the
thread cannot go away while we poke at ->io_uring.
Link: https://github.com/axboe/liburing/issues/420
Fixes: 2e480058dd ("io-wq: provide a way to limit max number of workers")
Reported-by: Johannes Lundberg <johalun0@gmail.com>
Tested-by: Johannes Lundberg <johalun0@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In case of buffered reading from block device, when short read happens,
we should retry to read more, otherwise the IO will be completed
partially, for example, the following fio expects to read 2MB, but it
can only read 1M or less bytes:
fio --name=onessd --filename=/dev/nvme0n1 --filesize=2M \
--rw=randread --bs=2M --direct=0 --overwrite=0 --numjobs=1 \
--iodepth=1 --time_based=0 --runtime=2 --ioengine=io_uring \
--registerfiles --fixedbufs --gtod_reduce=1 --group_reporting
Fix the issue by allowing short read retry for block device, which sets
FMODE_BUF_RASYNC really.
Fixes: 9a173346bd ("io_uring: fix short read retries for non-reg files")
Cc: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20210821150751.1290434-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
During some testing, it became evident that using IORING_OP_WRITE doesn't
hash buffered writes like the other writes commands do. That's simply
an oversight, and can cause performance regressions when doing buffered
writes with this command.
Correct that and add the flag, so that buffered writes are correctly
hashed when using the non-iovec based write command.
Cc: stable@vger.kernel.org
Fixes: 3a6820f2bb ("io_uring: add non-vectored read/write commands")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmEs8fUQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpio4D/9cGrHIbbZsuDIHzhaK2JIUrSG7G4GkcaG/
NAqbOp7KvF+1elMY08DWLT0nnFqHM7REHIS4Lv55KCNtktTFfdYmxso4lPrRu67o
MNbMJcEAglgIDw0xP4MfP/vZ0ftXJv8+OXSfL51pD4U40nWIZVpqn8WbWKRqjhGf
nQhiANbl2mO2Ec7I/UgAIqwczQnF5HveCkX5106dAppma8yEH+v2TkvZyZp/TCU3
h0ec26hLi+4QRBFm4O0yrVWj1gMS7yfHuEFSGw+jhp/WNTpH9A5pXFQjn7pIyJNi
uqrwM7knrod9ZH2pE1825w0TrbqkOdcZCo+/NvJHOAy03LUBJ/9qDc+JJUWsEmLZ
cpd8auaCfuAFx6ForHmKd+Pw1bANebWBMsClyQSh38+fsJ9myci3c3tkkzmO+dSW
G+rZZochiG4nFSl+CvlUoFfztuu8rdbOLKI/9usPMHNcDiY4yAAmz80B9uQdtQp7
tRLqegplsDODefLNvl0/Uj7WFJl6w5furchTXPmc+GSPFc+mpW08Olh7ScaCyD8c
a8YXaQi5hwuUR1N7uW65Df/HGMbIDvxOStcurIakP0mOSvRKrojZgQhbJ8zuCG4y
cRCwRUzvreNIoKK2ZxEvhLjhE5POaWgy6AtN/UI9k9BeVGQdboKVBGvub5Mv+ZKE
HpchbANk8Q==
=T7Zv
-----END PGP SIGNATURE-----
Merge tag 'for-5.15/io_uring-vfs-2021-08-30' of git://git.kernel.dk/linux-block
Pull io_uring mkdirat/symlinkat/linkat support from Jens Axboe:
"This adds io_uring support for mkdirat, symlinkat, and linkat"
* tag 'for-5.15/io_uring-vfs-2021-08-30' of git://git.kernel.dk/linux-block:
io_uring: add support for IORING_OP_LINKAT
io_uring: add support for IORING_OP_SYMLINKAT
io_uring: add support for IORING_OP_MKDIRAT
namei: update do_*() helpers to return ints
namei: make do_linkat() take struct filename
namei: add getname_uflags()
namei: make do_symlinkat() take struct filename
namei: make do_mknodat() take struct filename
namei: make do_mkdirat() take struct filename
namei: change filename_parentat() calling conventions
namei: ignore ERR/NULL names in putname()
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmEs8QQQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpgAgD/wP9gGxrFE5oxtdozDPkEYTXn5e0QKseDyV
cNxLmSb3wc4WIEPwjCavdQHpy0fnbjaYwGveHf9ygQwDZPj9WBgEL3ipPYXCCzFA
ysoV86kBRxKDI476r2InxI8WaW7hV0IWxPlScUTA1QeeNAzRJDymQvRuwg5KvVRS
Jt6R58khzWpEGYO2CqFTpGsA7x01R0kvZ54xmFgKZ+Pxo+Bk03fkO32YUFC49Wm8
Zy+JMsaiIlLgucDTJ4zAKjQUXiwP2GMEw5Vk/lLUFGBvyw0AN2rO9g18L7QW2ZUu
vnkaJQwBbMUbgveXlI/y6GG/vuKUG2i4AmzNJH17qFCnimO3JY6vgzUOg5dqOiwx
bx7ZzmnBWgQp95/cSAlZ4QwRYf3z0hvVFKPj9U3X9wKGmuxUKHiLResQwp7bzRdd
4L4Jo1WFDDHR/1MOOzzW0uxE3uTm0LKcncsi4hJL20dl+16RXCIbzHWUTAd8yyMV
9QeUAumc4GHOeswa1Ms8jLPAgXyEoAkec7ca7cRIY/NW+DXGLG9tYBgCw1eLe6BN
M7LwMsPNlS2v2dMUbiuw8XxkA+uYso728e2vd/edca2jxXj8+SVnm020aYBnxIzh
nmjbf69+QddBPEnk/EPvRj8tXOhr3k7FklI4R7qlei/+IGTujGPvM4kn3p6fnHrx
d7bsu/jtaQ==
=izfH
-----END PGP SIGNATURE-----
Merge tag 'io_uring-bio-cache.5-2021-08-30' of git://git.kernel.dk/linux-block
Pull support for struct bio recycling from Jens Axboe:
"This adds bio recycling support for polled IO, allowing quick reuse of
a bio for high IOPS scenarios via a percpu bio_set list.
It's good for almost a 10% improvement in performance, bumping our
per-core IO limit from ~3.2M IOPS to ~3.5M IOPS"
* tag 'io_uring-bio-cache.5-2021-08-30' of git://git.kernel.dk/linux-block:
bio: improve kerneldoc documentation for bio_alloc_kiocb()
block: provide bio_clear_hipri() helper
block: use the percpu bio cache in __blkdev_direct_IO
io_uring: enable use of bio alloc cache
block: clear BIO_PERCPU_CACHE flag if polling isn't supported
bio: add allocation cache abstraction
fs: add kiocb alloc cache flag
bio: optimize initialization of a bio
We allow updating normal timeouts, add support for adjusting timings of
linked timeouts as well.
Reported-by: Victor Stewart <v@nametag.social>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A preparation patch. Keep all queued linked timeout in a list, so they
may be found and updated.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Certain use cases want to use CLOCK_BOOTTIME or CLOCK_REALTIME rather than
CLOCK_MONOTONIC, instead of the default CLOCK_MONOTONIC.
Add an IORING_TIMEOUT_BOOTTIME and IORING_TIMEOUT_REALTIME flag that
allows timeouts and linked timeouts to use the selected clock source.
Only one clock source may be selected, and we -EINVAL the request if more
than one is given. If neither BOOTIME nor REALTIME are selected, the
previous default of MONOTONIC is used.
Link: https://github.com/axboe/liburing/issues/369
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io-wq divides work into two categories:
1) Work that completes in a bounded time, like reading from a regular file
or a block device. This type of work is limited based on the size of
the SQ ring.
2) Work that may never complete, we call this unbounded work. The amount
of workers here is just limited by RLIMIT_NPROC.
For various uses cases, it's handy to have the kernel limit the maximum
amount of pending workers for both categories. Provide a way to do with
with a new IORING_REGISTER_IOWQ_MAX_WORKERS operation.
IORING_REGISTER_IOWQ_MAX_WORKERS takes an array of two integers and sets
the max worker count to what is being passed in for each category. The
old values are returned into that same array. If 0 is being passed in for
either category, it simply returns the current value.
The value is capped at RLIMIT_NPROC. This actually isn't that important
as it's more of a hint, if we're exceeding the value then our attempt
to fork a new worker will fail. This happens naturally already if more
than one node is in the system, as these values are per-node internally
for io-wq.
Reported-by: Johannes Lundberg <johalun0@gmail.com>
Link: https://github.com/axboe/liburing/issues/420
Signed-off-by: Jens Axboe <axboe@kernel.dk>
req->buf_index is u16 and so we rely on registered buffers indexes
fitting into it. Add a build check, so when the upper limit for the
number of buffers is lifted we get a compliation fail but not lurking
problems.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/787e8e1a17cea51ca6301426b1c4c4887b8bd676.1629920396.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Given a linkchain like this:
req0(link_flag)-->req1(link_flag)-->...-->reqn(no link_flag)
There is a problem:
- if some intermediate linked req like req1 's submittion fails, reqs
after it won't be cancelled.
- sqpoll disabled: maybe it's ok since users can get the error info
of req1 and stop submitting the following sqes.
- sqpoll enabled: definitely a problem, the following sqes will be
submitted in the next round.
The solution is to refactor the code logic to:
- if a linked req's submittion fails, just mark it and the head(if it
exists) as REQ_F_FAIL. Leverage req->result to indicate whether it
is failed or cancelled.
- submit or fail the whole chain when we come to the end of it.
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20210827094609.36052-3-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It's not necessary to free the request back to slab when we fail to
get sqe, just move it to state->free_list.
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20210825175856.194299-1-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
As done with open opcodes, allow accept to skip installing fd into
processes' file tables and put it directly into io_uring's fixed file
table. Same restrictions and design as for open.
Suggested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/6d16163f376fac7ac26a656de6b42199143e9721.1629888991.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>