Commit Graph

74929 Commits

Author SHA1 Message Date
Linus Torvalds
af472a9efd for-5.18/io_uring-2022-03-18
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmI09cgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjI1D/4uo1FnG+mF9ZEPAy/FKBh6I74gHj7eBJyw
 obKyH2oszN19VI+HLYYQgBrPwlc4lOzxmYM0kbAQWfB+PCgxAExX+OkJrGrxWW3R
 Hlc1RSQkravh+CBzEc1VIYIWK3XF+guZIwqLPa3gS0hwea8LYBae9h3gKjhvb2kH
 4BbSh7Ax5k9klVUcNJuUXnG7nytyaiaQAbMl88iGV3S2bpseWNi0WTHvJ3Eizbyz
 sLponJcgUb0/H2G00MXf3M3+V40s7AiLXeEMwFcv29PZUQIEyp5P8Rx43//48nbE
 0Rms88Uz4tFHRyFe/48IKa4tjJgk6fy1Cjd3+Thyn8TeJaYhD6vZu4LA/uGo8fkm
 EZEmLFHw5mqn7Xz3OA31ErnnpSmwT/T164vMPcvrlLao/1ZSHD1fK2PjyDMTX+aO
 1wec5+mMsTbOFgzYJRH3tyEH/GfWxpsNJIPfdCjV5BRPd5HwFBBTvRwFEVTX6+Go
 CS4JAGWkTgmr2u1OeyzaQ1lYPZWD1GTt/HJj7rEvuqPzlreX9nRlvTtIcVubw39a
 dgx7nWLLCOCN3o4tFHEvmf05/92TiQd1Vw8wEQT0NdmkLuqUuJJOnVo53YWzHA5Y
 qqMwPcciiYMi92btrJdsZOgj3GCKxP6rLxyHJm1mY5kFSjEB5e0swV7xryme40r4
 ZJ6sUvEyHg==
 =OBma
 -----END PGP SIGNATURE-----

Merge tag 'for-5.18/io_uring-2022-03-18' of git://git.kernel.dk/linux-block

Pull io_uring updates from Jens Axboe:

 - Fixes for current file position. Still doesn't have the f_pos_lock
   sorted, but it's a step in the right direction (Dylan)

 - Tracing updates (Dylan, Stefan)

 - Improvements to io-wq locking (Hao)

 - Improvements for provided buffers (me, Pavel)

 - Support for registered file descriptors (me, Xiaoguang)

 - Support for ring messages (me)

 - Poll improvements (me)

 - Fix for fixed buffers and non-iterator reads/writes (me)

 - Support for NAPI on sockets (Olivier)

 - Ring quiesce improvements (Usama)

 - Misc fixes (Olivier, Pavel)

* tag 'for-5.18/io_uring-2022-03-18' of git://git.kernel.dk/linux-block: (42 commits)
  io_uring: terminate manual loop iterator loop correctly for non-vecs
  io_uring: don't check unrelated req->open.how in accept request
  io_uring: manage provided buffers strictly ordered
  io_uring: fold evfd signalling under a slower path
  io_uring: thin down io_commit_cqring()
  io_uring: shuffle io_eventfd_signal() bits around
  io_uring: remove extra barrier for non-sqpoll iopoll
  io_uring: fix provided buffer return on failure for kiocb_done()
  io_uring: extend provided buf return to fails
  io_uring: refactor timeout cancellation cqe posting
  io_uring: normilise naming for fill_cqe*
  io_uring: cache poll/double-poll state with a request flag
  io_uring: cache req->apoll->events in req->cflags
  io_uring: move req->poll_refs into previous struct hole
  io_uring: make tracing format consistent
  io_uring: recycle apoll_poll entries
  io_uring: remove duplicated member check for io_msg_ring_prep()
  io_uring: allow submissions to continue on error
  io_uring: recycle provided buffers if request goes async
  io_uring: ensure reads re-import for selected buffers
  ...
2022-03-21 16:24:45 -07:00
Linus Torvalds
356a1adca8 arm64 updates for 5.18
- Support for including MTE tags in ELF coredumps
 
 - Instruction encoder updates, including fixes to 64-bit immediate
   generation and support for the LSE atomic instructions
 
 - Improvements to kselftests for MTE and fpsimd
 
 - Symbol aliasing and linker script cleanups
 
 - Reduce instruction cache maintenance performed for user mappings
   created using contiguous PTEs
 
 - Support for the new "asymmetric" MTE mode, where stores are checked
   asynchronously but loads are checked synchronously
 
 - Support for the latest pointer authentication algorithm ("QARMA3")
 
 - Support for the DDR PMU present in the Marvell CN10K platform
 
 - Support for the CPU PMU present in the Apple M1 platform
 
 - Use the RNDR instruction for arch_get_random_{int,long}()
 
 - Update our copy of the Arm optimised string routines for str{n}cmp()
 
 - Fix signal frame generation for CPUs which have foolishly elected to
   avoid building in support for the fpsimd instructions
 
 - Workaround for Marvell GICv3 erratum #38545
 
 - Clarification to our Documentation (booting reqs. and MTE prctl())
 
 - Miscellanous cleanups and minor fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmIvta8QHHdpbGxAa2Vy
 bmVsLm9yZwAKCRC3rHDchMFjNAIhB/oDSva5FryAFExVuIB+mqRkbZO9kj6fy/5J
 ctN9LEVO2GI/U1TVAUWop1lXmP8Kbq5UCZOAuY8sz7dAZs7NRUWkwTrXVhaTpi6L
 oxCfu5Afu76d/TGgivNz+G7/ewIJRFj5zCPmHezLF9iiWPUkcAsP0XCp4a0iOjU4
 04O4d7TL/ap9ujEes+U0oEXHnyDTPrVB2OVE316FKD1fgztcjVJ2U+TxX5O4xitT
 PPIfeQCjQBq1B2OC1cptE3wpP+YEr9OZJbx+Ieweidy1CSInEy0nZ13tLoUnGPGU
 KPhsvO9daUCbhbd5IDRBuXmTi/sHU4NIB8LNEVzT1mUPnU8pCizv
 =ziGg
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Will Deacon:

 - Support for including MTE tags in ELF coredumps

 - Instruction encoder updates, including fixes to 64-bit immediate
   generation and support for the LSE atomic instructions

 - Improvements to kselftests for MTE and fpsimd

 - Symbol aliasing and linker script cleanups

 - Reduce instruction cache maintenance performed for user mappings
   created using contiguous PTEs

 - Support for the new "asymmetric" MTE mode, where stores are checked
   asynchronously but loads are checked synchronously

 - Support for the latest pointer authentication algorithm ("QARMA3")

 - Support for the DDR PMU present in the Marvell CN10K platform

 - Support for the CPU PMU present in the Apple M1 platform

 - Use the RNDR instruction for arch_get_random_{int,long}()

 - Update our copy of the Arm optimised string routines for str{n}cmp()

 - Fix signal frame generation for CPUs which have foolishly elected to
   avoid building in support for the fpsimd instructions

 - Workaround for Marvell GICv3 erratum #38545

 - Clarification to our Documentation (booting reqs. and MTE prctl())

 - Miscellanous cleanups and minor fixes

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (90 commits)
  docs: sysfs-devices-system-cpu: document "asymm" value for mte_tcf_preferred
  arm64/mte: Remove asymmetric mode from the prctl() interface
  arm64: Add cavium_erratum_23154_cpus missing sentinel
  perf/marvell: Fix !CONFIG_OF build for CN10K DDR PMU driver
  arm64: mm: Drop 'const' from conditional arm64_dma_phys_limit definition
  Documentation: vmcoreinfo: Fix htmldocs warning
  kasan: fix a missing header include of static_keys.h
  drivers/perf: Add Apple icestorm/firestorm CPU PMU driver
  drivers/perf: arm_pmu: Handle 47 bit counters
  arm64: perf: Consistently make all event numbers as 16-bits
  arm64: perf: Expose some Armv9 common events under sysfs
  perf/marvell: cn10k DDR perf event core ownership
  perf/marvell: cn10k DDR perfmon event overflow handling
  perf/marvell: CN10k DDR performance monitor support
  dt-bindings: perf: marvell: cn10k ddr performance monitor
  arm64: clean up tools Makefile
  perf/arm-cmn: Update watchpoint format
  perf/arm-cmn: Hide XP PUB events for CMN-600
  arm64: drop unused includes of <linux/personality.h>
  arm64: Do not defer reserve_crashkernel() for platforms with no DMA memory zones
  ...
2022-03-21 10:46:39 -07:00
Linus Torvalds
6e4069881a fix for regression in multiuser mount parm
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmI0tLYACgkQiiy9cAdy
 T1GllAv/W+4jMNZguBeAfCxSSuIjYEJI4Pz3QvBMto5gb4rgTEJrOLyz5n+Zj2+/
 VpLnTt0GNOcKdstfO2bql3P3m+gWF6UdiJA8/LQ6vR4fLJIGyX6VcxtDXhI3+PaA
 FoIRrNAEhY3knOaz8gCEVA18+F2QPGdmfFdMUziPf2vTqXtqsPXAUKJLQCwMWHXe
 8aWTbfvhFR1b+xeZy3Rb9rOjzd1qRIgSZYDfpmkLHejckhSMyR5i0NiwikTxIA2A
 IpHA48b/jfvOOyg7ABjZrAIumNZxZE7mYSjMGNhuEdiR17oVwtgsXn8c8DLLtNkY
 +95dKXuuGIPiGGTRk2OnNotHdC0GOj5VT/Z4akSnHGZHGEBoNe35QkUPWFpusc8L
 Wjt2hzcNErYnFo2Qi4QTI2i7NTFoTig0UYPd1KUnsvjeLQWo/DtQluu+FR7GEl1F
 MDQDyaDgcm10nTDy3xA8zGQRdJORKM0Qzq9dTZs/sWUUN7trynk0KTCT8CgsGqH1
 3jHXYrRR
 =rhQe
 -----END PGP SIGNATURE-----

Merge tag '5.17-rc8-smb3-fix' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fix from Steve French:
 "Small fix for regression in multiuser mounts.

  The additional improvements suggested by Ronnie to make the server and
  session status handling code easier to read can wait for the 5.18
  merge window."

* tag '5.17-rc8-smb3-fix' of git://git.samba.org/sfrench/cifs-2.6:
  smb3: fix incorrect session setup check for multiuser mounts
2022-03-18 12:22:15 -07:00
Jens Axboe
5e92936746 io_uring: terminate manual loop iterator loop correctly for non-vecs
The fix for not advancing the iterator if we're using fixed buffers is
broken in that it can hit a condition where we don't terminate the loop.
This results in io-wq looping forever, asking to read (or write) 0 bytes
for every subsequent loop.

Reported-by: Joel Jaeschke <joel.jaeschke@gmail.com>
Link: https://github.com/axboe/liburing/issues/549
Fixes: 16c8d2df7e ("io_uring: ensure symmetry in handling iter types in loop_rw_iter()")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-18 11:42:48 -06:00
Jens Axboe
adf3a9e9f5 io_uring: don't check unrelated req->open.how in accept request
Looks like a victim of too much copy/paste, we should not be looking
at req->open.how in accept. The point is to check CLOEXEC and error
out, which we don't invalid direct descriptors on exec. Hence any
attempt to get a direct descriptor with CLOEXEC is invalid.

No harm is done here, as req->open.how.flags overlaps with
req->accept.flags, but it's very confusing and might change if either of
those command structs are modified.

Fixes: aaa4db12ef ("io_uring: accept directly into fixed file table")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-18 10:57:19 -06:00
Jens Axboe
dbc7d452e7 io_uring: manage provided buffers strictly ordered
Workloads using provided buffers benefit from using and returning buffers
in the right order, and so does TLBs for that matter. Manage the internal
buffer list in a straight list, rather than use the head buffer as the
insertion node. Use a hashed list for the buffer group IDs instead of
xarray, the overhead is much lower this way. xarray provides internal
locking and other trickery that is handy for some uses cases, but
io_uring already locks internally for the buffer manipulation and needs
none of that.

This is good for about a 2% reduction in overhead, combination of the
improved management and the fact that the workload has an easier time
bundling back provided buffers.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-17 17:20:10 -06:00
Joseph Qi
7b0b1332cf ocfs2: fix crash when initialize filecheck kobj fails
Once s_root is set, genric_shutdown_super() will be called if
fill_super() fails.  That means, we will call ocfs2_dismount_volume()
twice in such case, which can lead to kernel crash.

Fix this issue by initializing filecheck kobj before setting s_root.

Link: https://lkml.kernel.org/r/20220310081930.86305-1-joseph.qi@linux.alibaba.com
Fixes: 5f483c4abb ("ocfs2: add kobject for online file check")
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-17 11:02:13 -07:00
Steve French
e3ee9fb226 smb3: fix incorrect session setup check for multiuser mounts
A recent change to how the SMB3 server (socket) and session status
is managed regressed multiuser mounts by changing the check
for whether session setup is needed to the socket (TCP_Server_info)
structure instead of the session struct (cifs_ses). Add additional
check in cifs_setup_sesion to fix this.

Fixes: 73f9bfbe3d ("cifs: maintain a state machine for tcp/smb/tcon sessions")
Reported-by: Ronnie Sahlberg <lsahlber@redhat.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-03-16 22:48:55 -05:00
Pavel Begunkov
9aa8dfde48 io_uring: fold evfd signalling under a slower path
Add ->has_evfd flag, which is true IFF there is an eventfd attached, and
use it to hide io_eventfd_signal() into __io_commit_cqring_flush() and
combine fast checks in a single if. Also, gcc 11.2 wasn't inlining
io_cqring_ev_posted() without this change, so helps with that as well.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f6168471997decded475a063f92915787975a30b.1647481208.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-16 20:26:32 -06:00
Pavel Begunkov
9333f6b462 io_uring: thin down io_commit_cqring()
io_commit_cqring() is currently always under spinlock section, so it's
always better to keep it as slim as possible. Move
__io_commit_cqring_flush() out of it into ev_posted*(). If fast checks
do fail and this post-processing is required, we'll reacquire
->completion_lock, which is fine as we don't care about performance of
draining and offset timeouts.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ec4e81fd720d3bc7bca8cb9152e080dad1a052f1.1647481208.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-16 20:26:32 -06:00
Pavel Begunkov
66fc25ca6b io_uring: shuffle io_eventfd_signal() bits around
A preparation patch, which moves a fast ->io_ev_fd check out of
io_eventfd_signal() into ev_posted*(). Compilers are smart enough for it
to not change anything, but will need it later.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ec4091ac76d43912b73917e8db651c2dac4b7b01.1647481208.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-16 20:26:32 -06:00
Pavel Begunkov
0f84747177 io_uring: remove extra barrier for non-sqpoll iopoll
smp_mb() in io_cqring_ev_posted_iopoll() is only there because of
waitqueue_active(). However, non-SQPOLL IOPOLL ring doesn't wake the CQ
and so the barrier there is useless. Kill it, it's usually pretty
expensive.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d72e8ef6f7a3f6a72e18fad8409f7d47afc8da7d.1647481208.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-16 20:26:32 -06:00
Pavel Begunkov
b91ef18728 io_uring: fix provided buffer return on failure for kiocb_done()
Use io_req_complete_failed() in kiocb_done(). This cleans up the code,
but also ensures that a provided buffers is correctly freed on failure.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a4880106fcf199d5810707fe2d17126fcdf18bc4.1647481208.git.asml.silence@gmail.com
[axboe: split from previous patch]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-16 20:24:47 -06:00
Pavel Begunkov
3b2b78a8eb io_uring: extend provided buf return to fails
It's never a good idea to put provided buffers without notifying the
userspace, it'll lead to userspace leaks, so add io_put_kbuf() in
io_req_complete_failed(). The fail helper is called by all sorts of
requests, but it's still safe to do as io_put_kbuf() will return 0 in
for all requests that don't support and so don't expect provided buffers.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a4880106fcf199d5810707fe2d17126fcdf18bc4.1647481208.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-16 20:24:28 -06:00
Pavel Begunkov
6695490dc8 io_uring: refactor timeout cancellation cqe posting
io_fill_cqe*() is not always the best way to post CQEs just because
there is enough of infrastructure on top. Replace a raw call to a
variant of it inside of io_timeout_cancel(), which also saves us some
bloating and might help with batching later.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/46113ec4345764b4aef3b384ce38cceabaeedcbb.1647481208.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-16 20:11:15 -06:00
Pavel Begunkov
ae4da18941 io_uring: normilise naming for fill_cqe*
Restore consistency in __io_fill_cqe* like helpers, always honouring
"io_" prefix and adding "req" when we're passing in a request.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/bd016ff5c1a4f74687828069d2619d8a65e0c6d7.1647481208.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-16 20:11:00 -06:00
Jens Axboe
91eac1c69c io_uring: cache poll/double-poll state with a request flag
With commit "io_uring: cache req->apoll->events in req->cflags" applied,
we now have just io_poll_remove_entries() dipping into req->apoll when
it isn't strictly necessary.

Mark poll and double-poll with a flag, so we know if we need to look
at apoll->double_poll. This avoids pulling in those cachelines if we
don't need them. The common case is that the poll wake handler already
removed these entries while hot off the completion path.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-16 16:59:10 -06:00
Jens Axboe
81459350d5 io_uring: cache req->apoll->events in req->cflags
When we arm poll on behalf of a different type of request, like a network
receive, then we allocate req->apoll as our poll entry. Running network
workloads shows io_poll_check_events() as the most expensive part of
io_uring, and it's all due to having to pull in req->apoll instead of
just the request which we have hot already.

Cache poll->events in req->cflags, which isn't used until the request
completes anyway. This isn't strictly needed for regular poll, where
req->poll.events is used and thus already hot, but for the sake of
unification we do it all around.

This saves 3-4% of overhead in certain request workloads.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-16 16:55:05 -06:00
Jens Axboe
521d61fc76 io_uring: move req->poll_refs into previous struct hole
This serves two purposes:

- We now have the last cacheline mostly unused for generic workloads,
  instead of having to pull in the poll refs explicitly for workloads
  that rely on poll arming.

- It shrinks the io_kiocb from 232 to 224 bytes.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-16 12:53:23 -06:00
Jens Axboe
4d9237e32c io_uring: recycle apoll_poll entries
Particularly for networked workloads, io_uring intensively uses its
poll based backend to get a notification when data/space is available.
Profiling workloads, we see 3-4% of alloc+free that is directly attributed
to just the apoll allocation and free (and the rest being skb alloc+free).

For the fast path, we have ctx->uring_lock held already for both issue
and the inline completions, and we can utilize that to avoid any extra
locking needed to have a basic recycling cache for the apoll entries on
both the alloc and free side.

Double poll still requires an allocation. But those are rare and not
a fast path item.

With the simple cache in place, we see a 3-4% reduction in overhead for
the workload.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-15 10:54:08 -06:00
Jens Axboe
f3b6a41eb2 io_uring: remove duplicated member check for io_msg_ring_prep()
Julia and the kernel test robot report that the prep handling for this
command inadvertently checks one field twice:

fs/io_uring.c:4338:42-56: duplicated argument to && or ||

Get rid of it.

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Julia Lawall <julia.lawall@lip6.fr>
Fixes: 4f57f06ce2 ("io_uring: add support for IORING_OP_MSG_RING command")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-12 06:50:13 -07:00
Linus Torvalds
93ce93587d Merge branch 'davidh' (fixes from David Howells)
Merge misc fixes from David Howells:
 "A set of patches for watch_queue filter issues noted by Jann. I've
  added in a cleanup patch from Christophe Jaillet to convert to using
  formal bitmap specifiers for the note allocation bitmap.

  Also two filesystem fixes (afs and cachefiles)"

* emailed patches from David Howells <dhowells@redhat.com>:
  cachefiles: Fix volume coherency attribute
  afs: Fix potential thrashing in afs writeback
  watch_queue: Make comment about setting ->defunct more accurate
  watch_queue: Fix lack of barrier/sync/lock between post and read
  watch_queue: Free the alloc bitmap when the watch_queue is torn down
  watch_queue: Fix the alloc bitmap size to reflect notes allocated
  watch_queue: Use the bitmap API when applicable
  watch_queue: Fix to always request a pow-of-2 pipe ring size
  watch_queue: Fix to release page in ->release()
  watch_queue, pipe: Free watchqueue state after clearing pipe ring
  watch_queue: Fix filter limit check
2022-03-11 10:28:32 -08:00
David Howells
413a4a6b0b cachefiles: Fix volume coherency attribute
A network filesystem may set coherency data on a volume cookie, and if
given, cachefiles will store this in an xattr on the directory in the
cache corresponding to the volume.

The function that sets the xattr just stores the contents of the volume
coherency buffer directly into the xattr, with nothing added; the
checking function, on the other hand, has a cut'n'paste error whereby it
tries to interpret the xattr contents as would be the xattr on an
ordinary file (using the cachefiles_xattr struct).  This results in a
failure to match the coherency data because the buffer ends up being
shifted by 18 bytes.

Fix this by defining a structure specifically for the volume xattr and
making both the setting and checking functions use it.

Since the volume coherency doesn't work if used, take the opportunity to
insert a reserved field for future use, set it to 0 and check that it is
0.  Log mismatch through the appropriate tracepoint.

Note that this only affects cifs; 9p, afs, ceph and nfs don't use the
volume coherency data at the moment.

Fixes: 32e150037d ("fscache, cachefiles: Store the volume coherency data")
Reported-by: Rohith Surabattula <rohiths.msft@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Steve French <smfrench@gmail.com>
cc: linux-cifs@vger.kernel.org
cc: linux-cachefs@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-11 10:24:37 -08:00
David Howells
173ce1ca47 afs: Fix potential thrashing in afs writeback
In afs_writepages_region(), if the dirty page we find is undergoing
writeback or write to cache, but the sync_mode is WB_SYNC_NONE, we go
round the loop trying the same page again and again with no pausing or
waiting unless and until another thread manages to clear the writeback
and fscache flags.

Fix this with three measures:

 (1) Advance start to after the page we found.

 (2) Break out of the loop and return if rescheduling is requested.

 (3) Arbitrarily give up after a maximum of 5 skips.

Fixes: 31143d5d51 ("AFS: implement basic file write support")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Marc Dionne <marc.dionne@auristor.com>
Acked-by: Marc Dionne <marc.dionne@auristor.com>
Link: https://lore.kernel.org/r/164692725757.2097000.2060513769492301854.stgit@warthog.procyon.org.uk/ # v1
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-11 10:24:37 -08:00
David Howells
2ed147f015 watch_queue: Fix lack of barrier/sync/lock between post and read
There's nothing to synchronise post_one_notification() versus
pipe_read().  Whilst posting is done under pipe->rd_wait.lock, the
reader only takes pipe->mutex which cannot bar notification posting as
that may need to be made from contexts that cannot sleep.

Fix this by setting pipe->head with a barrier in post_one_notification()
and reading pipe->head with a barrier in pipe_read().

If that's not sufficient, the rd_wait.lock will need to be taken,
possibly in a ->confirm() op so that it only applies to notifications.
The lock would, however, have to be dropped before copy_page_to_iter()
is invoked.

Fixes: c73be61ced ("pipe: Add general notification queue support")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-11 10:17:13 -08:00
David Howells
db8facfc9f watch_queue, pipe: Free watchqueue state after clearing pipe ring
In free_pipe_info(), free the watchqueue state after clearing the pipe
ring as each pipe ring descriptor has a release function, and in the
case of a notification message, this is watch_queue_pipe_buf_release()
which tries to mark the allocation bitmap that was previously released.

Fix this by moving the put of the pipe's ref on the watch queue to after
the ring has been cleared.  We still need to call watch_queue_clear()
before doing that to make sure that the pipe is disconnected from any
notification sources first.

Fixes: c73be61ced ("pipe: Add general notification queue support")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-11 10:17:12 -08:00
Jens Axboe
bcbb7bf6cc io_uring: allow submissions to continue on error
By default, io_uring will stop submitting a batch of requests if we run
into an error submitting a request. This isn't strictly necessary, as
the error result is passed out-of-band via a CQE anyway. And it can be
a bit confusing for some applications.

Provide a way to setup a ring that will continue submitting on error,
when the error CQE has been posted.

There's still one case that will break out of submission. If we fail
allocating a request, then we'll still return -ENOMEM. We could in theory
post a CQE for that condition too even if we never got a request. Leave
that for a potential followup.

Reported-by: Dylan Yudaken <dylany@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 13:05:25 -07:00
Jens Axboe
b1c6264575 io_uring: recycle provided buffers if request goes async
If we are using provided buffers, it's less than useful to have a buffer
selected and pinned if a request needs to go async or arms poll for
notification trigger on when we can process it.

Recycle the buffer in those events, so we don't pin it for the duration
of the request.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 09:55:01 -07:00
Jens Axboe
2be2eb02e2 io_uring: ensure reads re-import for selected buffers
If we drop buffers between scheduling a retry, then we need to re-import
when we start the request again.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 09:55:01 -07:00
Jens Axboe
9af177ee3e io_uring: retry early for reads if we can poll
Most of the logic in io_read() deals with regular files, and in some ways
it would make sense to split the handling into S_IFREG and others. But
at least for retry, we don't need to bother setting up a bunch of state
just to abort in the loop later. In particular, don't bother forcing
setup of async data for a normal non-vectored read when we don't need it.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 09:45:57 -07:00
Olivier Langlois
adc8682ec6 io_uring: Add support for napi_busy_poll
The sqpoll thread can be used for performing the napi busy poll in a
similar way that it does io polling for file systems supporting direct
access bypassing the page cache.

The other way that io_uring can be used for napi busy poll is by
calling io_uring_enter() to get events.

If the user specify a timeout value, it is distributed between polling
and sleeping by using the systemwide setting
/proc/sys/net/core/busy_poll.

The changes have been tested with this program:
https://github.com/lano1106/io_uring_udp_ping

and the result is:
Without sqpoll:
NAPI busy loop disabled:
rtt min/avg/max/mdev = 40.631/42.050/58.667/1.547 us
NAPI busy loop enabled:
rtt min/avg/max/mdev = 30.619/31.753/61.433/1.456 us

With sqpoll:
NAPI busy loop disabled:
rtt min/avg/max/mdev = 42.087/44.438/59.508/1.533 us
NAPI busy loop enabled:
rtt min/avg/max/mdev = 35.779/37.347/52.201/0.924 us

Co-developed-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Olivier Langlois <olivier@trillion01.com>
Link: https://lore.kernel.org/r/810bd9408ffc510ff08269e78dca9df4af0b9e4e.1646777484.git.olivier@trillion01.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 09:18:30 -07:00
Olivier Langlois
950e79dd73 io_uring: minor io_cqring_wait() optimization
Move up the block manipulating the sig variable to execute code
that may encounter an error and exit first before continuing
executing the rest of the function and avoid useless computations

Signed-off-by: Olivier Langlois <olivier@trillion01.com>
Link: https://lore.kernel.org/r/84513f7cc1b1fb31d8f4cb910aee033391d036b4.1646777484.git.olivier@trillion01.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 09:18:30 -07:00
Jens Axboe
4f57f06ce2 io_uring: add support for IORING_OP_MSG_RING command
This adds support for IORING_OP_MSG_RING, which allows an SQE to signal
another ring. That allows either waking up someone waiting on the ring,
or even passing a 64-bit value via the user_data field in the CQE.

sqe->fd must contain the fd of a ring that should receive the CQE.
sqe->off will be propagated to the cqe->user_data on the target ring,
and sqe->len will be propagated to cqe->res. The results CQE will have
IORING_CQE_F_MSG set in its flags, to indicate that this CQE was generated
from a messaging request rather than a SQE issued locally on that ring.
This effectively allows passing a 64-bit and a 32-bit quantify between
the two rings.

This request type has the following request specific error cases:

- -EBADFD. Set if the sqe->fd doesn't point to a file descriptor that is
  of the io_uring type.
- -EOVERFLOW. Set if we were not able to deliver a request to the target
  ring.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 09:16:04 -07:00
Jens Axboe
cc3cec8367 io_uring: speedup provided buffer handling
In testing high frequency workloads with provided buffers, we spend a
lot of time in allocating and freeing the buffer units themselves.
Rather than repeatedly free and alloc them, add a recycling cache
instead. There are two caches:

- ctx->io_buffers_cache. This is the one we grab from in the submission
  path, and it's protected by ctx->uring_lock. For inline completions,
  we can recycle straight back to this cache and not need any extra
  locking.

- ctx->io_buffers_comp. If we're not under uring_lock, then we use this
  list to recycle buffers. It's protected by the completion_lock.

On adding a new buffer, check io_buffers_cache. If it's empty, check if
we can splice entries from the io_buffers_comp_cache.

This reduces about 5-10% of overhead from provided buffers, bringing it
pretty close to the non-provided path.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:33:14 -07:00
Jens Axboe
e7a6c00dc7 io_uring: add support for registering ring file descriptors
Lots of workloads use multiple threads, in which case the file table is
shared between them. This makes getting and putting the ring file
descriptor for each io_uring_enter(2) system call more expensive, as it
involves an atomic get and put for each call.

Similarly to how we allow registering normal file descriptors to avoid
this overhead, add support for an io_uring_register(2) API that allows
to register the ring fds themselves:

1) IORING_REGISTER_RING_FDS - takes an array of io_uring_rsrc_update
   structs, and registers them with the task.
2) IORING_UNREGISTER_RING_FDS - takes an array of io_uring_src_update
   structs, and unregisters them.

When a ring fd is registered, it is internally represented by an offset.
This offset is returned to the application, and the application then
uses this offset and sets IORING_ENTER_REGISTERED_RING for the
io_uring_enter(2) system call. This works just like using a registered
file descriptor, rather than a real one, in an SQE, where
IOSQE_FIXED_FILE gets set to tell io_uring that we're using an internal
offset/descriptor rather than a real file descriptor.

In initial testing, this provides a nice bump in performance for
threaded applications in real world cases where the batch count (eg
number of requests submitted per io_uring_enter(2) invocation) is low.
In a microbenchmark, submitting NOP requests, we see the following
increases in performance:

Requests per syscall	Baseline	Registered	Increase
----------------------------------------------------------------
1			 ~7030K		 ~8080K		+15%
2			~13120K		~14800K		+13%
4			~22740K		~25300K		+11%

Co-developed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:32:49 -07:00
Dylan Yudaken
63c3654973 io_uring: documentation fixup
Fix incorrect name reference in comment. ki_filp does not exist in the
struct, but file does.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220224105157.1332353-1-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:32:49 -07:00
Dylan Yudaken
b4aec40015 io_uring: do not recalculate ppos unnecessarily
There is a slight optimisation to be had by calculating the correct pos
pointer inside io_kiocb_update_pos and then using that later.

It seems code size drops by a bit:
000000000000a1b0 0000000000000400 t io_read
000000000000a5b0 0000000000000319 t io_write

vs
000000000000a1b0 00000000000003f6 t io_read
000000000000a5b0 0000000000000310 t io_write

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:32:49 -07:00
Dylan Yudaken
d34e1e5b39 io_uring: update kiocb->ki_pos at execution time
Update kiocb->ki_pos at execution time rather than in io_prep_rw().
io_prep_rw() happens before the job is enqueued to a worker and so the
offset might be read multiple times before being executed once.

Ensures that the file position in a set of _linked_ SQEs will be only
obtained after earlier SQEs have completed, and so will include their
incremented file position.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:32:49 -07:00
Dylan Yudaken
af9c45eceb io_uring: remove duplicated calls to io_kiocb_ppos
io_kiocb_ppos is called in both branches, and it seems that the compiler
does not fuse this. Fusing removes a few bytes from loop_rw_iter.

Before:
$ nm -S fs/io_uring.o | grep loop_rw_iter
0000000000002430 0000000000000124 t loop_rw_iter

After:
$ nm -S fs/io_uring.o | grep loop_rw_iter
0000000000002430 000000000000010d t loop_rw_iter

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:32:49 -07:00
Olivier Langlois
c5020bc8d9 io_uring: Remove unneeded test in io_run_task_work_sig()
Avoid testing TIF_NOTIFY_SIGNAL twice by calling task_sigpending()
directly from io_run_task_work_sig()

Signed-off-by: Olivier Langlois <olivier@trillion01.com>
Link: https://lore.kernel.org/r/bd7c0495f7656e803e5736708591bb665e6eaacd.1645041650.git.olivier@trillion01.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:32:49 -07:00
Stefan Roesch
502c87d655 io-uring: Make tracepoints consistent.
This makes the io-uring tracepoints consistent. Where it makes sense
the tracepoints start with the following four fields:
- context (ring)
- request
- user_data
- opcode.

Signed-off-by: Stefan Roesch <shr@fb.com>
Link: https://lore.kernel.org/r/20220214180430.70572-3-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:32:49 -07:00
Stefan Roesch
d5ec1dfaf5 io-uring: add __fill_cqe function
This introduces the __fill_cqe function. This is necessary
to correctly issue the io_uring_complete tracepoint.

Signed-off-by: Stefan Roesch <shr@fb.com>
Link: https://lore.kernel.org/r/20220214180430.70572-2-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:32:49 -07:00
Hao Xu
86127bb18a io-wq: use IO_WQ_ACCT_NR rather than hardcoded number
It's better to use the defined enum stuff not the hardcoded number to
define array.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220206095241.121485-4-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:32:49 -07:00
Hao Xu
e13fb1fe14 io-wq: reduce acct->lock crossing functions lock/unlock
reduce acct->lock lock and unlock in different functions to make the
code clearer.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220206095241.121485-3-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:32:49 -07:00
Hao Xu
42abc95f05 io-wq: decouple work_list protection from the big wqe->lock
wqe->lock is abused, it now protects acct->work_list, hash stuff,
nr_workers, wqe->free_list and so on. Lets first get the work_list out
of the wqe-lock mess by introduce a specific lock for work list. This
is the first step to solve the huge contension between work insertion
and work consumption.
good thing:
  - split locking for bound and unbound work list
  - reduce contension between work_list visit and (worker's)free_list.

For the hash stuff, since there won't be a work with same file in both
bound and unbound work list, thus they won't visit same hash entry. it
works well to use the new lock to protect hash stuff.

Results:
set max_unbound_worker = 4, test with echo-server:
nice -n -15 ./io_uring_echo_server -p 8081 -f -n 1000 -l 16
(-n connection, -l workload)
before this patch:
Samples: 2M of event 'cycles:ppp', Event count (approx.): 1239982111074
Overhead  Command          Shared Object         Symbol
  28.59%  iou-wrk-10021    [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
   8.89%  io_uring_echo_s  [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
   6.20%  iou-wrk-10021    [kernel.vmlinux]      [k] _raw_spin_lock
   2.45%  io_uring_echo_s  [kernel.vmlinux]      [k] io_prep_async_work
   2.36%  iou-wrk-10021    [kernel.vmlinux]      [k] _raw_spin_lock_irqsave
   2.29%  iou-wrk-10021    [kernel.vmlinux]      [k] io_worker_handle_work
   1.29%  io_uring_echo_s  [kernel.vmlinux]      [k] io_wqe_enqueue
   1.06%  iou-wrk-10021    [kernel.vmlinux]      [k] io_wqe_worker
   1.06%  io_uring_echo_s  [kernel.vmlinux]      [k] _raw_spin_lock
   1.03%  iou-wrk-10021    [kernel.vmlinux]      [k] __schedule
   0.99%  iou-wrk-10021    [kernel.vmlinux]      [k] tcp_sendmsg_locked

with this patch:
Samples: 1M of event 'cycles:ppp', Event count (approx.): 708446691943
Overhead  Command          Shared Object         Symbol
  16.86%  iou-wrk-10893    [kernel.vmlinux]      [k] native_queued_spin_lock_slowpat
   9.10%  iou-wrk-10893    [kernel.vmlinux]      [k] _raw_spin_lock
   4.53%  io_uring_echo_s  [kernel.vmlinux]      [k] native_queued_spin_lock_slowpat
   2.87%  iou-wrk-10893    [kernel.vmlinux]      [k] io_worker_handle_work
   2.57%  iou-wrk-10893    [kernel.vmlinux]      [k] _raw_spin_lock_irqsave
   2.56%  io_uring_echo_s  [kernel.vmlinux]      [k] io_prep_async_work
   1.82%  io_uring_echo_s  [kernel.vmlinux]      [k] _raw_spin_lock
   1.33%  iou-wrk-10893    [kernel.vmlinux]      [k] io_wqe_worker
   1.26%  io_uring_echo_s  [kernel.vmlinux]      [k] try_to_wake_up

spin_lock failure from 25.59% + 8.89% =  34.48% to 16.86% + 4.53% = 21.39%
TPS is similar, while cpu usage is from almost 400% to 350%

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220206095241.121485-2-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:32:49 -07:00
Nathan Chancellor
f0a4e62bb5 io_uring: Fix use of uninitialized ret in io_eventfd_register()
Clang warns:

  fs/io_uring.c:9396:9: warning: variable 'ret' is uninitialized when used here [-Wuninitialized]
          return ret;
                 ^~~
  fs/io_uring.c:9373:13: note: initialize the variable 'ret' to silence this warning
          int fd, ret;
                     ^
                      = 0
  1 warning generated.

Just return 0 directly and reduce the scope of ret to the if statement,
as that is the only place that it is used, which is how the function was
before the fixes commit.

Fixes: 1a75fac9a0f9 ("io_uring: avoid ring quiesce while registering/unregistering eventfd")
Link: https://github.com/ClangBuiltLinux/linux/issues/1579
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Link: https://lore.kernel.org/r/20220207162410.1013466-1-nathan@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:32:49 -07:00
Usama Arif
8bb649ee1d io_uring: remove ring quiesce for io_uring_register
None of the opcodes in io_uring_register use ring quiesce anymore. Hence
io_register_op_must_quiesce always returns false and io_ctx_quiesce is
never called.

Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Link: https://lore.kernel.org/r/20220204145117.1186568-6-usama.arif@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:32:49 -07:00
Usama Arif
ff16cfcfda io_uring: avoid ring quiesce while registering restrictions and enabling rings
IORING_SETUP_R_DISABLED prevents submitting requests and so there will be
no requests until IORING_REGISTER_ENABLE_RINGS is called. And
IORING_REGISTER_RESTRICTIONS works only before
IORING_REGISTER_ENABLE_RINGS is called. Hence ring quiesce is not needed
for these opcodes.

Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Link: https://lore.kernel.org/r/20220204145117.1186568-5-usama.arif@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:32:49 -07:00
Usama Arif
c75312dd59 io_uring: avoid ring quiesce while registering async eventfd
This is done using the RCU data structure (io_ev_fd). eventfd_async is
moved from io_ring_ctx to io_ev_fd which is RCU protected hence avoiding
ring quiesce which is much more expensive than an RCU lock. The place
where eventfd_async is read is already under rcu_read_lock so there is no
extra RCU read-side critical section needed.

Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Link: https://lore.kernel.org/r/20220204145117.1186568-4-usama.arif@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:32:49 -07:00
Usama Arif
77bc59b498 io_uring: avoid ring quiesce while registering/unregistering eventfd
This is done by creating a new RCU data structure (io_ev_fd) as part of
io_ring_ctx that holds the eventfd_ctx.

The function io_eventfd_signal is executed under rcu_read_lock with a
single rcu_dereference to io_ev_fd so that if another thread unregisters
the eventfd while io_eventfd_signal is still being executed, the
eventfd_signal for which io_eventfd_signal was called completes
successfully.

The process of registering/unregistering eventfd is already done under
uring_lock so multiple threads won't enter a race condition while
registering/unregistering eventfd.

With the above approach ring quiesce can be avoided which is much more
expensive then using RCU lock. On the system tested, io_uring_register
with IORING_REGISTER_EVENTFD takes less than 1ms with RCU lock, compared
to 15ms before with ring quiesce.

Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Link: https://lore.kernel.org/r/20220204145117.1186568-3-usama.arif@bytedance.com
[axboe: long line fixups]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:32:49 -07:00