This serves two purposes:
- We now have the last cacheline mostly unused for generic workloads,
instead of having to pull in the poll refs explicitly for workloads
that rely on poll arming.
- It shrinks the io_kiocb from 232 to 224 bytes.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Make the tracing formatting for user_data and flags consistent.
Having consistent formatting allows one for example to grep for a specific
user_data/flags and be able to trace a single sqe through easily.
Change user_data to 0x%llx and flags to 0x%x everywhere. The '0x' is
useful to disambiguate for example "user_data 100".
Additionally remove the '=' for flags in io_uring_req_failed, again for consistency.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220316095204.2191498-1-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Particularly for networked workloads, io_uring intensively uses its
poll based backend to get a notification when data/space is available.
Profiling workloads, we see 3-4% of alloc+free that is directly attributed
to just the apoll allocation and free (and the rest being skb alloc+free).
For the fast path, we have ctx->uring_lock held already for both issue
and the inline completions, and we can utilize that to avoid any extra
locking needed to have a basic recycling cache for the apoll entries on
both the alloc and free side.
Double poll still requires an allocation. But those are rare and not
a fast path item.
With the simple cache in place, we see a 3-4% reduction in overhead for
the workload.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Julia and the kernel test robot report that the prep handling for this
command inadvertently checks one field twice:
fs/io_uring.c:4338:42-56: duplicated argument to && or ||
Get rid of it.
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Julia Lawall <julia.lawall@lip6.fr>
Fixes: 4f57f06ce2 ("io_uring: add support for IORING_OP_MSG_RING command")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
By default, io_uring will stop submitting a batch of requests if we run
into an error submitting a request. This isn't strictly necessary, as
the error result is passed out-of-band via a CQE anyway. And it can be
a bit confusing for some applications.
Provide a way to setup a ring that will continue submitting on error,
when the error CQE has been posted.
There's still one case that will break out of submission. If we fail
allocating a request, then we'll still return -ENOMEM. We could in theory
post a CQE for that condition too even if we never got a request. Leave
that for a potential followup.
Reported-by: Dylan Yudaken <dylany@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we are using provided buffers, it's less than useful to have a buffer
selected and pinned if a request needs to go async or arms poll for
notification trigger on when we can process it.
Recycle the buffer in those events, so we don't pin it for the duration
of the request.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Most of the logic in io_read() deals with regular files, and in some ways
it would make sense to split the handling into S_IFREG and others. But
at least for retry, we don't need to bother setting up a bunch of state
just to abort in the loop later. In particular, don't bother forcing
setup of async data for a normal non-vectored read when we don't need it.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The sqpoll thread can be used for performing the napi busy poll in a
similar way that it does io polling for file systems supporting direct
access bypassing the page cache.
The other way that io_uring can be used for napi busy poll is by
calling io_uring_enter() to get events.
If the user specify a timeout value, it is distributed between polling
and sleeping by using the systemwide setting
/proc/sys/net/core/busy_poll.
The changes have been tested with this program:
https://github.com/lano1106/io_uring_udp_ping
and the result is:
Without sqpoll:
NAPI busy loop disabled:
rtt min/avg/max/mdev = 40.631/42.050/58.667/1.547 us
NAPI busy loop enabled:
rtt min/avg/max/mdev = 30.619/31.753/61.433/1.456 us
With sqpoll:
NAPI busy loop disabled:
rtt min/avg/max/mdev = 42.087/44.438/59.508/1.533 us
NAPI busy loop enabled:
rtt min/avg/max/mdev = 35.779/37.347/52.201/0.924 us
Co-developed-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Olivier Langlois <olivier@trillion01.com>
Link: https://lore.kernel.org/r/810bd9408ffc510ff08269e78dca9df4af0b9e4e.1646777484.git.olivier@trillion01.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This adds support for IORING_OP_MSG_RING, which allows an SQE to signal
another ring. That allows either waking up someone waiting on the ring,
or even passing a 64-bit value via the user_data field in the CQE.
sqe->fd must contain the fd of a ring that should receive the CQE.
sqe->off will be propagated to the cqe->user_data on the target ring,
and sqe->len will be propagated to cqe->res. The results CQE will have
IORING_CQE_F_MSG set in its flags, to indicate that this CQE was generated
from a messaging request rather than a SQE issued locally on that ring.
This effectively allows passing a 64-bit and a 32-bit quantify between
the two rings.
This request type has the following request specific error cases:
- -EBADFD. Set if the sqe->fd doesn't point to a file descriptor that is
of the io_uring type.
- -EOVERFLOW. Set if we were not able to deliver a request to the target
ring.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In testing high frequency workloads with provided buffers, we spend a
lot of time in allocating and freeing the buffer units themselves.
Rather than repeatedly free and alloc them, add a recycling cache
instead. There are two caches:
- ctx->io_buffers_cache. This is the one we grab from in the submission
path, and it's protected by ctx->uring_lock. For inline completions,
we can recycle straight back to this cache and not need any extra
locking.
- ctx->io_buffers_comp. If we're not under uring_lock, then we use this
list to recycle buffers. It's protected by the completion_lock.
On adding a new buffer, check io_buffers_cache. If it's empty, check if
we can splice entries from the io_buffers_comp_cache.
This reduces about 5-10% of overhead from provided buffers, bringing it
pretty close to the non-provided path.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Lots of workloads use multiple threads, in which case the file table is
shared between them. This makes getting and putting the ring file
descriptor for each io_uring_enter(2) system call more expensive, as it
involves an atomic get and put for each call.
Similarly to how we allow registering normal file descriptors to avoid
this overhead, add support for an io_uring_register(2) API that allows
to register the ring fds themselves:
1) IORING_REGISTER_RING_FDS - takes an array of io_uring_rsrc_update
structs, and registers them with the task.
2) IORING_UNREGISTER_RING_FDS - takes an array of io_uring_src_update
structs, and unregisters them.
When a ring fd is registered, it is internally represented by an offset.
This offset is returned to the application, and the application then
uses this offset and sets IORING_ENTER_REGISTERED_RING for the
io_uring_enter(2) system call. This works just like using a registered
file descriptor, rather than a real one, in an SQE, where
IOSQE_FIXED_FILE gets set to tell io_uring that we're using an internal
offset/descriptor rather than a real file descriptor.
In initial testing, this provides a nice bump in performance for
threaded applications in real world cases where the batch count (eg
number of requests submitted per io_uring_enter(2) invocation) is low.
In a microbenchmark, submitting NOP requests, we see the following
increases in performance:
Requests per syscall Baseline Registered Increase
----------------------------------------------------------------
1 ~7030K ~8080K +15%
2 ~13120K ~14800K +13%
4 ~22740K ~25300K +11%
Co-developed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is a slight optimisation to be had by calculating the correct pos
pointer inside io_kiocb_update_pos and then using that later.
It seems code size drops by a bit:
000000000000a1b0 0000000000000400 t io_read
000000000000a5b0 0000000000000319 t io_write
vs
000000000000a1b0 00000000000003f6 t io_read
000000000000a5b0 0000000000000310 t io_write
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Update kiocb->ki_pos at execution time rather than in io_prep_rw().
io_prep_rw() happens before the job is enqueued to a worker and so the
offset might be read multiple times before being executed once.
Ensures that the file position in a set of _linked_ SQEs will be only
obtained after earlier SQEs have completed, and so will include their
incremented file position.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_kiocb_ppos is called in both branches, and it seems that the compiler
does not fuse this. Fusing removes a few bytes from loop_rw_iter.
Before:
$ nm -S fs/io_uring.o | grep loop_rw_iter
0000000000002430 0000000000000124 t loop_rw_iter
After:
$ nm -S fs/io_uring.o | grep loop_rw_iter
0000000000002430 000000000000010d t loop_rw_iter
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This makes the io-uring tracepoints consistent. Where it makes sense
the tracepoints start with the following four fields:
- context (ring)
- request
- user_data
- opcode.
Signed-off-by: Stefan Roesch <shr@fb.com>
Link: https://lore.kernel.org/r/20220214180430.70572-3-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This introduces the __fill_cqe function. This is necessary
to correctly issue the io_uring_complete tracepoint.
Signed-off-by: Stefan Roesch <shr@fb.com>
Link: https://lore.kernel.org/r/20220214180430.70572-2-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
wqe->lock is abused, it now protects acct->work_list, hash stuff,
nr_workers, wqe->free_list and so on. Lets first get the work_list out
of the wqe-lock mess by introduce a specific lock for work list. This
is the first step to solve the huge contension between work insertion
and work consumption.
good thing:
- split locking for bound and unbound work list
- reduce contension between work_list visit and (worker's)free_list.
For the hash stuff, since there won't be a work with same file in both
bound and unbound work list, thus they won't visit same hash entry. it
works well to use the new lock to protect hash stuff.
Results:
set max_unbound_worker = 4, test with echo-server:
nice -n -15 ./io_uring_echo_server -p 8081 -f -n 1000 -l 16
(-n connection, -l workload)
before this patch:
Samples: 2M of event 'cycles:ppp', Event count (approx.): 1239982111074
Overhead Command Shared Object Symbol
28.59% iou-wrk-10021 [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
8.89% io_uring_echo_s [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
6.20% iou-wrk-10021 [kernel.vmlinux] [k] _raw_spin_lock
2.45% io_uring_echo_s [kernel.vmlinux] [k] io_prep_async_work
2.36% iou-wrk-10021 [kernel.vmlinux] [k] _raw_spin_lock_irqsave
2.29% iou-wrk-10021 [kernel.vmlinux] [k] io_worker_handle_work
1.29% io_uring_echo_s [kernel.vmlinux] [k] io_wqe_enqueue
1.06% iou-wrk-10021 [kernel.vmlinux] [k] io_wqe_worker
1.06% io_uring_echo_s [kernel.vmlinux] [k] _raw_spin_lock
1.03% iou-wrk-10021 [kernel.vmlinux] [k] __schedule
0.99% iou-wrk-10021 [kernel.vmlinux] [k] tcp_sendmsg_locked
with this patch:
Samples: 1M of event 'cycles:ppp', Event count (approx.): 708446691943
Overhead Command Shared Object Symbol
16.86% iou-wrk-10893 [kernel.vmlinux] [k] native_queued_spin_lock_slowpat
9.10% iou-wrk-10893 [kernel.vmlinux] [k] _raw_spin_lock
4.53% io_uring_echo_s [kernel.vmlinux] [k] native_queued_spin_lock_slowpat
2.87% iou-wrk-10893 [kernel.vmlinux] [k] io_worker_handle_work
2.57% iou-wrk-10893 [kernel.vmlinux] [k] _raw_spin_lock_irqsave
2.56% io_uring_echo_s [kernel.vmlinux] [k] io_prep_async_work
1.82% io_uring_echo_s [kernel.vmlinux] [k] _raw_spin_lock
1.33% iou-wrk-10893 [kernel.vmlinux] [k] io_wqe_worker
1.26% io_uring_echo_s [kernel.vmlinux] [k] try_to_wake_up
spin_lock failure from 25.59% + 8.89% = 34.48% to 16.86% + 4.53% = 21.39%
TPS is similar, while cpu usage is from almost 400% to 350%
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220206095241.121485-2-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Clang warns:
fs/io_uring.c:9396:9: warning: variable 'ret' is uninitialized when used here [-Wuninitialized]
return ret;
^~~
fs/io_uring.c:9373:13: note: initialize the variable 'ret' to silence this warning
int fd, ret;
^
= 0
1 warning generated.
Just return 0 directly and reduce the scope of ret to the if statement,
as that is the only place that it is used, which is how the function was
before the fixes commit.
Fixes: 1a75fac9a0f9 ("io_uring: avoid ring quiesce while registering/unregistering eventfd")
Link: https://github.com/ClangBuiltLinux/linux/issues/1579
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Link: https://lore.kernel.org/r/20220207162410.1013466-1-nathan@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
None of the opcodes in io_uring_register use ring quiesce anymore. Hence
io_register_op_must_quiesce always returns false and io_ctx_quiesce is
never called.
Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Link: https://lore.kernel.org/r/20220204145117.1186568-6-usama.arif@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
IORING_SETUP_R_DISABLED prevents submitting requests and so there will be
no requests until IORING_REGISTER_ENABLE_RINGS is called. And
IORING_REGISTER_RESTRICTIONS works only before
IORING_REGISTER_ENABLE_RINGS is called. Hence ring quiesce is not needed
for these opcodes.
Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Link: https://lore.kernel.org/r/20220204145117.1186568-5-usama.arif@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is done using the RCU data structure (io_ev_fd). eventfd_async is
moved from io_ring_ctx to io_ev_fd which is RCU protected hence avoiding
ring quiesce which is much more expensive than an RCU lock. The place
where eventfd_async is read is already under rcu_read_lock so there is no
extra RCU read-side critical section needed.
Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Link: https://lore.kernel.org/r/20220204145117.1186568-4-usama.arif@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is done by creating a new RCU data structure (io_ev_fd) as part of
io_ring_ctx that holds the eventfd_ctx.
The function io_eventfd_signal is executed under rcu_read_lock with a
single rcu_dereference to io_ev_fd so that if another thread unregisters
the eventfd while io_eventfd_signal is still being executed, the
eventfd_signal for which io_eventfd_signal was called completes
successfully.
The process of registering/unregistering eventfd is already done under
uring_lock so multiple threads won't enter a race condition while
registering/unregistering eventfd.
With the above approach ring quiesce can be avoided which is much more
expensive then using RCU lock. On the system tested, io_uring_register
with IORING_REGISTER_EVENTFD takes less than 1ms with RCU lock, compared
to 15ms before with ring quiesce.
Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Link: https://lore.kernel.org/r/20220204145117.1186568-3-usama.arif@bytedance.com
[axboe: long line fixups]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The information on whether eventfd is registered is not very useful and
would result in the tracepoint being enclosed in an rcu_readlock in a
later patch that tries to avoid ring quiesce for registering eventfd.
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Link: https://lore.kernel.org/r/20220204145117.1186568-2-usama.arif@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmIk1isACgkQxWXV+ddt
WDsAVQ//TvkKObQLL/BJ4TFSxr1ZLs83z4vTcss2W/MrMjGWUut1fhUTGlhkqgC6
RE03VBuUV983k09/Tn3Q0AHSXcMAmxEv/t1QweJNKiVv7YKT3Nj7VF3kHioFz9g/
gZ5q9FVbTXkrl4tgcwiQXbLJ1BLWBfXTAMatKgsIQBYsYg0ec3GGem/tx3OlvdNt
9My6EJhNo5X7vrTMjRUygDgHDhcAgp/gYMa2VmnPhK5qcPzmIYbt4CJGLQDwiiiB
KSsXnsHCXKm/BRPgtgnMBH6O8YruaxUn0nEQMjntGx8RHbZrkdXk90PaK7pmWz1W
KkbHTM98zclAOWUx6JmGw8mb9aZQo6aGpou2Pa98aBtHhvbhiKYS2W2OOnHbAshK
2bj6W2o89eYHKgX+5fICWHt7efUoWUh1KPC+TeaV8DKl8q0a9DC3KfIL/v7PZacA
pIyyy4uyXh3finzI+Q+fW7QVKQhpcQKLuq5EVGCMEotlfsn+SJBselAdwUl9ChUp
ALAiYn1T8W1Mrt8P2mxB29btGrdckHtpoWTgr++OAZaX4PABF3GAvIxXwmFg2aMK
zfXKwTxjwKM42H3AWaLHttk4OA7FJhY9sgOproON/3Tn9cBSK2jiO0HSk1dBn/dL
WQbOKh4Z+VDXi5niF8hmTANTNO0wS0JdiKZX86tYyhcCl0ZBr/w=
=Bd5z
-----END PGP SIGNATURE-----
Merge tag 'for-5.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more fixes for various problems that have user visible effects
or seem to be urgent:
- fix corruption when combining DIO and non-blocking io_uring over
multiple extents (seen on MariaDB)
- fix relocation crash due to premature return from commit
- fix quota deadlock between rescan and qgroup removal
- fix item data bounds checks in tree-checker (found on a fuzzed
image)
- fix fsync of prealloc extents after EOF
- add missing run of delayed items after unlink during log replay
- don't start relocation until snapshot drop is finished
- fix reversed condition for subpage writers locking
- fix warning on page error"
* tag 'for-5.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fallback to blocking mode when doing async dio over multiple extents
btrfs: add missing run of delayed items after unlink during log replay
btrfs: qgroup: fix deadlock between rescan worker and remove qgroup
btrfs: fix relocation crash due to premature return from btrfs_commit_transaction()
btrfs: do not start relocation until in progress drops are done
btrfs: tree-checker: use u64 for item data end to avoid overflow
btrfs: do not WARN_ON() if we have PageError set
btrfs: fix lost prealloc extents beyond eof after full fsync
btrfs: subpage: fix a wrong check on subpage->writers
* Tweaks to the paravirtualization code, to avoid using them
when they're pointless or harmful
x86 host:
* Fix for SRCU lockdep splat
* Brown paper bag fix for the propagation of errno
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmIkkdsUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroP15Qf7B8BXNMlNkret5WN/4pGf06gNdIY6
ZqC8t/Lx1+fCkzGk+VtAw0bxRscOF4z1XzvfywO5ZI5bxQB/b2xTyBkVY90SqhsB
shug5QpikejpmvVZJXxwD3+loCUah2T6FUT6QJa0sKVhW+XiqOva8fAmYLG5agaa
VGvqFXTXiVmbiw/O9ZI/CfUC0WNrn+I1iDO+oGWyhv/22tePxGCizVczRFJn6DAD
Vh5P6AfOqXjmzdpUeOiU544FQZPHAZehb7/xYc0T9GSW4fPnTmHwRzwhUqgJnx7d
3E+eWGwny+Q/OrpKf7SbxtB65yn7lHRmdN/YtCHygl4sjs6CdjSPY8/9jQ==
=PPz1
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"x86 guest:
- Tweaks to the paravirtualization code, to avoid using them when
they're pointless or harmful
x86 host:
- Fix for SRCU lockdep splat
- Brown paper bag fix for the propagation of errno"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86: pull kvm->srcu read-side to kvm_arch_vcpu_ioctl_run
KVM: x86/mmu: Passing up the error state of mmu_alloc_shadow_roots()
KVM: x86: Yield to IPI target vCPU only if it is busy
x86/kvmclock: Fix Hyper-V Isolated VM's boot issue when vCPUs > 64
x86/kvm: Don't waste memory if kvmclock is disabled
x86/kvm: Don't use PV TLB/yield when mwait is advertised
Fix build failure when CONFIG_PPC_64S_HASH_MMU is not set.
Thanks to: Murilo Opsfelder Araujo, Erhard F.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmIkY/gTHG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgIQfD/997ouPSpuJCyG7nFY35R8IIqJESqqO
RhMrI1b/HjiTHI3+Ha3htnGWa258Klllwr6zerTFYIp9kRzoO8rskgqeTYM2aOXF
rLGMUz2b6BjsboxOGowd2Z9JB5U0sItpt1MQZrXVnaVVx3PWQUV4PjksdxmqwC4W
+DtmYisO38FVQey9kC3V12J+KMkm0J0PWqhh+m7w1zkhNvNlcZp+g0gODWRfo3ic
QBqTyN3mUXnVKqVNXJZqWCkMp2ek8ZxL1plhwdQtbh9Uwttooc/QNYURepjTVglT
sHusO8CwLKd1hQlMDD+eqZ0pMSYHE1sWxoaiBLZbaC6Qdu/+arTayHOLJi8QGwtt
g2jDOklXP8rsXA7Tp/qafWDV61YSJP+O8KJsEpnuluUP/SePSk3jdgDoztCe72M+
f8Xu5AZ5+2x3NaVmNoOOvvvsxlS3ywl2nDTO205Tz6W55ZCWafSf1vG11lRKU3G8
We0hzDlJNNajNjnBpiiXgyHu4vi2cfh8gWWDxKJhjZV9pomJ1zFU2+IOpN4CA/6D
qolgraeLLNVtmNMxcwMdpcBXnG7rwzTuJXSXMPM/tPhLFl1bJmQCiVYfEVOLFMtJ
2+uUyfbbjaf1IDAfBLrIgN1YzIWc3fEPG+bdmulhhHeFN1XY6tfj++JF/DiRxGhn
wWc9TB8BCY+uPw==
=U4bW
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.17-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fix from Michael Ellerman:
"Fix build failure when CONFIG_PPC_64S_HASH_MMU is not set.
Thanks to Murilo Opsfelder Araujo, and Erhard F"
* tag 'powerpc-5.17-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/64s: Fix build failure when CONFIG_PPC_64S_HASH_MMU is not set
- Fix sorting on old "cpu" value in histograms
- Fix return value of __setup() boot parameter handlers.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYiQOghQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6quMXAP0TVq+FvVroN42ZS/UpiynnJ0uW1ibV
93i3M12QQL2zSQEA6a+aWHywTl1tU2F/I4frH5RkIwTulfP/RwBVJG0MFQc=
=ccPg
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.17-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix sorting on old "cpu" value in histograms
- Fix return value of __setup() boot parameter handlers
* tag 'trace-v5.17-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix return value of __setup handlers
tracing/histogram: Fix sorting on old "cpu" value
Pull input updates from Dmitry Torokhov:
- a fixup for Goodix touchscreen driver allowing it to work on certain
Cherry Trail devices
- a fix for imbalanced enable/disable regulator in Elam touchpad driver
that became apparent when used with Asus TF103C 2-in-1 dock
- a couple new input keycodes used on newer keyboards
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
HID: add mapping for KEY_ALL_APPLICATIONS
HID: add mapping for KEY_DICTATE
Input: elan_i2c - fix regulator enable count imbalance after suspend/resume
Input: elan_i2c - move regulator_[en|dis]able() out of elan_[en|dis]able_power()
Input: goodix - workaround Cherry Trail devices with a bogus ACPI Interrupt() resource
Input: goodix - use the new soc_intel_is_byt() helper
Input: samsung-keypad - properly state IOMEM dependency
Merge misc fixes from Andrew Morton:
"8 patches.
Subsystems affected by this patch series: mm (hugetlb, pagemap, and
userfaultfd), memfd, selftests, and kconfig"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
configs/debug: set CONFIG_DEBUG_INFO=y properly
proc: fix documentation and description of pagemap
kselftest/vm: fix tests build with old libc
memfd: fix F_SEAL_WRITE after shmem huge page allocated
mm: fix use-after-free when anon vma name is used after vma is freed
mm: prevent vm_area_struct::anon_name refcount saturation
mm: refactor vm_area_struct::anon_vma_name usage code
selftests/vm: cleanup hugetlb file after mremap test
- Fix HAVE_DYNAMIC_FTRACE_WITH_ARGS implementation by providing correct
switching between ftrace_caller/ftrace_regs_caller and supplying pt_regs
only when ftrace_regs_caller is activated.
- Fix exception table sorting.
- Fix breakage of kdump tooling by preserving metadata it cannot function
without.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEE3QHqV+H2a8xAv27vjYWKoQLXFBgFAmIjRzsACgkQjYWKoQLX
FBj2sQf7BOz0tu+mn/c/Fy/jQt0D+18yOt43SG4+yusQRo7Qa/Q+sub910rRl9N+
NwB5Z6Lxyv1O9QVD3sYBBCzLVbDNEVt/qeQIMCyNFuphgflLPFAAELs2Qi0hdVtc
ZM73VaqX7KS3Ts52aJZ2/tpschngLP9aGrxw2Aa56Ylv1Q04eOB+vWAJZVVREVts
r7nlcghkyBJQIoHVxUTOO8MrUA4FhvPRHQ/OehJt1EkVOJ4l54IOf74aoPsKE3Ma
AykSs/CeTIhIhtfTK+iE/JqkD9P/TjNULWdnk8lJYFFuVC9KrxytItoVMtu7IjGM
HDAp4FDyI7j3gx/vA7M4o3SMgLu13A==
=vjGY
-----END PGP SIGNATURE-----
Merge tag 's390-5.17-5' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Vasily Gorbik:
- Fix HAVE_DYNAMIC_FTRACE_WITH_ARGS implementation by providing correct
switching between ftrace_caller/ftrace_regs_caller and supplying
pt_regs only when ftrace_regs_caller is activated.
- Fix exception table sorting.
- Fix breakage of kdump tooling by preserving metadata it cannot
function without.
* tag 's390-5.17-5' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/extable: fix exception table sorting
s390/ftrace: fix arch_ftrace_get_regs implementation
s390/ftrace: fix ftrace_caller/ftrace_regs_caller generation
s390/setup: preserve memory at OLDMEM_BASE and OLDMEM_SIZE
CONFIG_DEBUG_INFO can't be set by user directly, so set
CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y instead.
Otherwise, we end up with no debuginfo in vmlinux which is a big no-no
for kernel debugging.
Link: https://lkml.kernel.org/r/20220301202920.18488-1-quic_qiancai@quicinc.com
Signed-off-by: Qian Cai <quic_qiancai@quicinc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since bit 57 was exported for uffd-wp write-protected (commit
fb8e37f35a: "mm/pagemap: export uffd-wp protection information"),
fixing it can reduce some unnecessary confusion.
Link: https://lkml.kernel.org/r/20220301044538.3042713-1-yun.zhou@windriver.com
Fixes: fb8e37f35a ("mm/pagemap: export uffd-wp protection information")
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Tiberiu A Georgescu <tiberiu.georgescu@nutanix.com>
Cc: Florian Schmidt <florian.schmidt@nutanix.com>
Cc: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Colin Cross <ccross@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The error message when I build vm tests on debian10 (GLIBC 2.28):
userfaultfd.c: In function `userfaultfd_pagemap_test':
userfaultfd.c:1393:37: error: `MADV_PAGEOUT' undeclared (first use
in this function); did you mean `MADV_RANDOM'?
if (madvise(area_dst, test_pgsize, MADV_PAGEOUT))
^~~~~~~~~~~~
MADV_RANDOM
This patch includes these newer definitions from UAPI linux/mman.h, is
useful to fix tests build on systems without these definitions in glibc
sys/mman.h.
Link: https://lkml.kernel.org/r/20220227055330.43087-2-zhouchengming@bytedance.com
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wangyong reports: after enabling tmpfs filesystem to support transparent
hugepage with the following command:
echo always > /sys/kernel/mm/transparent_hugepage/shmem_enabled
the docker program tries to add F_SEAL_WRITE through the following
command, but it fails unexpectedly with errno EBUSY:
fcntl(5, F_ADD_SEALS, F_SEAL_WRITE) = -1.
That is because memfd_tag_pins() and memfd_wait_for_pins() were never
updated for shmem huge pages: checking page_mapcount() against
page_count() is hopeless on THP subpages - they need to check
total_mapcount() against page_count() on THP heads only.
Make memfd_tag_pins() (compared > 1) as strict as memfd_wait_for_pins()
(compared != 1): either can be justified, but given the non-atomic
total_mapcount() calculation, it is better now to be strict. Bear in
mind that total_mapcount() itself scans all of the THP subpages, when
choosing to take an XA_CHECK_SCHED latency break.
Also fix the unlikely xa_is_value() case in memfd_wait_for_pins(): if a
page has been swapped out since memfd_tag_pins(), then its refcount must
have fallen, and so it can safely be untagged.
Link: https://lkml.kernel.org/r/a4f79248-df75-2c8c-3df-ba3317ccb5da@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Zeal Robot <zealci@zte.com.cn>
Reported-by: wangyong <wang.yong12@zte.com.cn>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: CGEL ZTE <cgel.zte@gmail.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Song Liu <songliubraving@fb.com>
Cc: Yang Yang <yang.yang29@zte.com.cn>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When adjacent vmas are being merged it can result in the vma that was
originally passed to madvise_update_vma being destroyed. In the current
implementation, the name parameter passed to madvise_update_vma points
directly to vma->anon_name and it is used after the call to vma_merge.
In the cases when vma_merge merges the original vma and destroys it,
this might result in UAF. For that the original vma would have to hold
the anon_vma_name with the last reference. The following vma would need
to contain a different anon_vma_name object with the same string. Such
scenario is shown below:
madvise_vma_behavior(vma)
madvise_update_vma(vma, ..., anon_name == vma->anon_name)
vma_merge(vma)
__vma_adjust(vma) <-- merges vma with adjacent one
vm_area_free(vma) <-- frees the original vma
replace_vma_anon_name(anon_name) <-- UAF of vma->anon_name
Fix this by raising the name refcount and stabilizing it.
Link: https://lkml.kernel.org/r/20220224231834.1481408-3-surenb@google.com
Link: https://lkml.kernel.org/r/20220223153613.835563-3-surenb@google.com
Fixes: 9a10064f56 ("mm: add a field to store names for private anonymous memory")
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reported-by: syzbot+aa7b3d4b35f9dc46a366@syzkaller.appspotmail.com
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexey Gladkov <legion@kernel.org>
Cc: Chris Hyser <chris.hyser@oracle.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Colin Cross <ccross@google.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xiaofeng Cao <caoxiaofeng@yulong.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A deep process chain with many vmas could grow really high. With
default sysctl_max_map_count (64k) and default pid_max (32k) the max
number of vmas in the system is 2147450880 and the refcounter has
headroom of 1073774592 before it reaches REFCOUNT_SATURATED
(3221225472).
Therefore it's unlikely that an anonymous name refcounter will overflow
with these defaults. Currently the max for pid_max is PID_MAX_LIMIT
(4194304) and for sysctl_max_map_count it's INT_MAX (2147483647). In
this configuration anon_vma_name refcount overflow becomes theoretically
possible (that still require heavy sharing of that anon_vma_name between
processes).
kref refcounting interface used in anon_vma_name structure will detect a
counter overflow when it reaches REFCOUNT_SATURATED value but will only
generate a warning and freeze the ref counter. This would lead to the
refcounted object never being freed. A determined attacker could leak
memory like that but it would be rather expensive and inefficient way to
do so.
To ensure anon_vma_name refcount does not overflow, stop anon_vma_name
sharing when the refcount reaches REFCOUNT_MAX (2147483647), which still
leaves INT_MAX/2 (1073741823) values before the counter reaches
REFCOUNT_SATURATED. This should provide enough headroom for raising the
refcounts temporarily.
Link: https://lkml.kernel.org/r/20220223153613.835563-2-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexey Gladkov <legion@kernel.org>
Cc: Chris Hyser <chris.hyser@oracle.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Colin Cross <ccross@google.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xiaofeng Cao <caoxiaofeng@yulong.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The hugepage-mremap test will create a file in a hugetlb filesystem. In
a default 'run_vmtests' run, the file will contain all the hugetlb
pages. After the test, the file remains and there are no free hugetlb
pages for subsequent tests. This causes those hugetlb tests to fail.
Change hugepage-mremap to take the name of the hugetlb file as an
argument. Unlink the file within the test, and just to be sure remove
the file in the run_vmtests script.
Link: https://lkml.kernel.org/r/20220201033459.156944-1-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Shuah Khan <skhan@linuxfoundation.org>
Acked-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The following build failure occurs when CONFIG_PPC_64S_HASH_MMU is not
set:
arch/powerpc/kernel/setup_64.c: In function ‘setup_per_cpu_areas’:
arch/powerpc/kernel/setup_64.c:811:21: error: ‘mmu_linear_psize’ undeclared (first use in this function); did you mean ‘mmu_virtual_psize’?
811 | if (mmu_linear_psize == MMU_PAGE_4K)
| ^~~~~~~~~~~~~~~~
| mmu_virtual_psize
arch/powerpc/kernel/setup_64.c:811:21: note: each undeclared identifier is reported only once for each function it appears in
Move the declaration of mmu_linear_psize outside of
CONFIG_PPC_64S_HASH_MMU ifdef.
After the above is fixed, it fails later with the following error:
ld: arch/powerpc/kexec/file_load_64.o: in function `.arch_kexec_kernel_image_probe':
file_load_64.c:(.text+0x1c1c): undefined reference to `.add_htab_mem_range'
Fix that, too, by conditioning add_htab_mem_range() symbol to
CONFIG_PPC_64S_HASH_MMU.
Fixes: 387e220a2e ("powerpc/64s: Move hash MMU support code under CONFIG_PPC_64S_HASH_MMU")
Reported-by: Erhard F. <erhard_f@mailbox.org>
Signed-off-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=215567
Link: https://lore.kernel.org/r/20220301204743.45133-1-muriloo@linux.ibm.com
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmIihP0QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpvWwD/4/Rwu4a7plr7HHYKfS5MaTS62edwWIf2Li
zMZaGS0kuS4DSV3Lk5Y4AlGyz7FrWjbV+hWlotQNZuvmGntlLeBmscuYlSdN55NL
afRjwhFRmLOfhOJXCsAE2dSqDvReuRdSn9XkDTL/ViByb35UZUaxGR+nTrGQ8B6J
DyoA2JVpTVs9B7jtnWoCXKz6TgjFIqT7v29Zd2xE5BrJ/vKpvq0z/4BdJlMBSSKT
FJ5IQjuE1dyudxJAVYc7X4+t7HRw0afRItZIxrn294COoMmdazhBelnES65CMLfN
u309J2/HGL0hIRI7tb1Gljp2U8oxYgKeg66VPx1LYFoQ0sUqC9rW+sqU8zZky7SG
oTzG6ZppSrhTSFhgMYIobChIOKmBRW+tj2BvO6ipKwNJVZbMMFmZogf9K75MJ5U7
L52RdFxf8D5t7lYzl22puBRgzq5G4m2yi6gbV2EMUfWb2SkbbngdVzuG/uJRQv+D
7zE8XqqevOgLsUgS71+1oAgc1h07j4b2ihe1UIY2Zo0rZ27y9MV66cbllG8s0R3y
la5xSSi+HuMNcUpmCeERWLf8uXB3Jzwrwo5l7UvpJuPEGSes4jmE+dHsN3r79bV4
I5Td7wjBASFu7LKEJlP1OinKdQWJvbJhahNN+pqQtNMxyK6IvNlQRgqh0EwGJhH+
dqwVNNgkIQ==
=drk+
-----END PGP SIGNATURE-----
Merge tag 'block-5.17-2022-03-04' of git://git.kernel.dk/linux-block
Pull block fix from Jens Axboe:
"Just a small UAF fix for blktrace"
* tag 'block-5.17-2022-03-04' of git://git.kernel.dk/linux-block:
blktrace: fix use after free for struct blk_trace
* Fixes for a handful of KASAN-related crashes.
* A fix to avoid a crash during boot for SPARSEMEM && !SPARSEMEM_VMEMMAP
configurations.
* A fix to stop reporting some incorrect errors under DEBUG_VIRTUAL.
* A fix for the K210's device tree to properly populate the interrupt
map, so hart1 will get interrupts again.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEAM520YNJYN/OiG3470yhUCzLq0EFAmIiNtYTHHBhbG1lckBk
YWJiZWx0LmNvbQAKCRDvTKFQLMurQf8cD/92NMaclwHMVjQ07svZloQcgDp+JSA5
JP2EYHuDy3UCZsJSdJY8zJZ+Ct81MxNSNDDpLCLQCZe8fD8hA+FOOVlt8a21SqNH
Pc96ycqIhD/QrfBlcYw5+8N3n5zNTpPSMjazrBphKj56qNWcAXdvQwQTh56pXGj+
3J5vf3L8xlnx8mlTUMYqHivHKl4cJhYOY/ICwXjpZnRYx0NRF32cquo5A4Uh65ls
qQjeKL2WXZd44avWK9IkDcBLpjyxr+pJmCsbIntvwK23bz37/SXmk4G2f5/8sBtH
RK6RDLU1LIH8YNCq5KvAv9/qZZPkuvOKig//lWfcsOLYv43+bp2cGVlO4Z4gvUw3
qRsrQxXxS+FQFxH5Fxre7UWqLlM9EUHUdbx/aXyGSF5e1DXuD8GcDSt0pOwQboiu
xKqRxuMozr6ZiHlug3mUcEwzeDAHOwPWrIDSXNELMj+5r/8QogkcPaFUFFqmvigj
gIwGMiPKe0nQ9XfAUAsjVTL3ozlGXa6nabbVNnA4N05a/scToy3hnFkYo2iEpjyH
0sxyQ96AaKnN4ydWBsy+y/HA13CbWRP+dgfgaG1BaWQCQ4kh/FN3A3FYpvubBjIm
5rslXvsmWEkCt/U/K0BY3t6Pvw9GNryAXWDsyPACaVFjMErZPqwwkRtp1PqoKpd6
XiYQ1nJxgZPCRQ==
=Ff5J
-----END PGP SIGNATURE-----
Merge tag 'riscv-for-linus-5.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:
- Fixes for a handful of KASAN-related crashes.
- A fix to avoid a crash during boot for SPARSEMEM &&
!SPARSEMEM_VMEMMAP configurations.
- A fix to stop reporting some incorrect errors under DEBUG_VIRTUAL.
- A fix for the K210's device tree to properly populate the interrupt
map, so hart1 will get interrupts again.
* tag 'riscv-for-linus-5.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: dts: k210: fix broken IRQs on hart1
riscv: Fix kasan pud population
riscv: Move high_memory initialization to setup_bootmem
riscv: Fix config KASAN && DEBUG_VIRTUAL
riscv: Fix DEBUG_VIRTUAL false warnings
riscv: Fix config KASAN && SPARSEMEM && !SPARSE_VMEMMAP
riscv: Fix is_linear_mapping with recent move of KASAN region