- The final step to get rid of auto-rearming posix-timers
posix-timers are currently auto-rearmed by the kernel when the signal
of the timer is ignored so that the timer signal can be delivered once
the corresponding signal is unignored.
This requires to throttle the timer to prevent a DoS by small intervals
and keeps the system pointlessly out of low power states for no value.
This is a long standing non-trivial problem due to the lock order of
posix-timer lock and the sighand lock along with life time issues as
the timer and the sigqueue have different life time rules.
Cure this by:
* Embedding the sigqueue into the timer struct to have the same life
time rules. Aside of that this also avoids the lookup of the timer
in the signal delivery and rearm path as it's just a always valid
container_of() now.
* Queuing ignored timer signals onto a seperate ignored list.
* Moving queued timer signals onto the ignored list when the signal is
switched to SIG_IGN before it could be delivered.
* Walking the ignored list when SIG_IGN is lifted and requeue the
signals to the actual signal lists. This allows the signal delivery
code to rearm the timer.
This also required to consolidate the signal delivery rules so they are
consistent across all situations. With that all self test scenarios
finally succeed.
- Core infrastructure for VFS multigrain timestamping
This is required to allow the kernel to use coarse grained time stamps
by default and switch to fine grained time stamps when inode attributes
are actively observed via getattr().
These changes have been provided to the VFS tree as well, so that the
VFS specific infrastructure could be built on top.
- Cleanup and consolidation of the sleep() infrastructure
* Move all sleep and timeout functions into one file
* Rework udelay() and ndelay() into proper documented inline functions
and replace the hardcoded magic numbers by proper defines.
* Rework the fsleep() implementation to take the reality of the timer
wheel granularity on different HZ values into account. Right now the
boundaries are hard coded time ranges which fail to provide the
requested accuracy on different HZ settings.
* Update documentation for all sleep/timeout related functions and fix
up stale documentation links all over the place
* Fixup a few usage sites
- Rework of timekeeping and adjtimex(2) to prepare for multiple PTP clocks
A system can have multiple PTP clocks which are participating in
seperate and independent PTP clock domains. So far the kernel only
considers the PTP clock which is based on CLOCK TAI relevant as that's
the clock which drives the timekeeping adjustments via the various user
space daemons through adjtimex(2).
The non TAI based clock domains are accessible via the file descriptor
based posix clocks, but their usability is very limited. They can't be
accessed fast as they always go all the way out to the hardware and
they cannot be utilized in the kernel itself.
As Time Sensitive Networking (TSN) gains traction it is required to
provide fast user and kernel space access to these clocks.
The approach taken is to utilize the timekeeping and adjtimex(2)
infrastructure to provide this access in a similar way how the kernel
provides access to clock MONOTONIC, REALTIME etc.
Instead of creating a duplicated infrastructure this rework converts
timekeeping and adjtimex(2) into generic functionality which operates
on pointers to data structures instead of using static variables.
This allows to provide time accessors and adjtimex(2) functionality for
the independent PTP clocks in a subsequent step.
- Consolidate hrtimer initialization
hrtimers are set up by initializing the data structure and then
seperately setting the callback function for historical reasons.
That's an extra unnecessary step and makes Rust support less straight
forward than it should be.
Provide a new set of hrtimer_setup*() functions and convert the core
code and a few usage sites of the less frequently used interfaces over.
The bulk of the htimer_init() to hrtimer_setup() conversion is already
prepared and scheduled for the next merge window.
- Drivers:
* Ensure that the global timekeeping clocksource is utilizing the
cluster 0 timer on MIPS multi-cluster systems.
Otherwise CPUs on different clusters use their cluster specific
clocksource which is not guaranteed to be synchronized with other
clusters.
* Mostly boring cleanups, fixes, improvements and code movement
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmc7kPITHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoZKkD/9OUL6fOJrDUmOYBa4QVeMyfTef4EaL
tvwIMM/29XQFeiq3xxCIn+EMnHjXn2lvIhYGQ7GKsbKYwvJ7ZBDpQb+UMhZ2nKI9
6D6BP6WomZohKeH2fZbJQAdqOi3KRYdvQdIsVZUexkqiaVPphRvOH9wOr45gHtZM
EyMRSotPlQTDqcrbUejDMEO94GyjDCYXRsyATLxjmTzL/N4xD4NRIiotjM2vL/a9
8MuCgIhrKUEyYlFoOxxeokBsF3kk3/ez2jlG9b/N8VLH3SYIc2zgL58FBgWxlmgG
bY71nVG3nUgEjxBd2dcXAVVqvb+5widk8p6O7xxOAQKTLMcJ4H0tQDkMnzBtUzvB
DGAJDHAmAr0g+ja9O35Pkhunkh4HYFIbq0Il4d1HMKObhJV0JumcKuQVxrXycdm3
UZfq3seqHsZJQbPgCAhlFU0/2WWScocbee9bNebGT33KVwSp5FoVv89C/6Vjb+vV
Gusc3thqrQuMAZW5zV8g4UcBAA/xH4PB0I+vHib+9XPZ4UQ7/6xKl2jE0kd5hX7n
AAUeZvFNFqIsY+B6vz+Jx/yzyM7u5cuXq87pof5EHVFzv56lyTp4ToGcOGYRgKH5
JXeYV1OxGziSDrd5vbf9CzdWMzqMvTefXrHbWrjkjhNOe8E1A8O88RZ5uRKZhmSw
hZZ4hdM9+3T7cg==
=2VC6
-----END PGP SIGNATURE-----
Merge tag 'timers-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"A rather large update for timekeeping and timers:
- The final step to get rid of auto-rearming posix-timers
posix-timers are currently auto-rearmed by the kernel when the
signal of the timer is ignored so that the timer signal can be
delivered once the corresponding signal is unignored.
This requires to throttle the timer to prevent a DoS by small
intervals and keeps the system pointlessly out of low power states
for no value. This is a long standing non-trivial problem due to
the lock order of posix-timer lock and the sighand lock along with
life time issues as the timer and the sigqueue have different life
time rules.
Cure this by:
- Embedding the sigqueue into the timer struct to have the same
life time rules. Aside of that this also avoids the lookup of
the timer in the signal delivery and rearm path as it's just a
always valid container_of() now.
- Queuing ignored timer signals onto a seperate ignored list.
- Moving queued timer signals onto the ignored list when the
signal is switched to SIG_IGN before it could be delivered.
- Walking the ignored list when SIG_IGN is lifted and requeue the
signals to the actual signal lists. This allows the signal
delivery code to rearm the timer.
This also required to consolidate the signal delivery rules so they
are consistent across all situations. With that all self test
scenarios finally succeed.
- Core infrastructure for VFS multigrain timestamping
This is required to allow the kernel to use coarse grained time
stamps by default and switch to fine grained time stamps when inode
attributes are actively observed via getattr().
These changes have been provided to the VFS tree as well, so that
the VFS specific infrastructure could be built on top.
- Cleanup and consolidation of the sleep() infrastructure
- Move all sleep and timeout functions into one file
- Rework udelay() and ndelay() into proper documented inline
functions and replace the hardcoded magic numbers by proper
defines.
- Rework the fsleep() implementation to take the reality of the
timer wheel granularity on different HZ values into account.
Right now the boundaries are hard coded time ranges which fail
to provide the requested accuracy on different HZ settings.
- Update documentation for all sleep/timeout related functions
and fix up stale documentation links all over the place
- Fixup a few usage sites
- Rework of timekeeping and adjtimex(2) to prepare for multiple PTP
clocks
A system can have multiple PTP clocks which are participating in
seperate and independent PTP clock domains. So far the kernel only
considers the PTP clock which is based on CLOCK TAI relevant as
that's the clock which drives the timekeeping adjustments via the
various user space daemons through adjtimex(2).
The non TAI based clock domains are accessible via the file
descriptor based posix clocks, but their usability is very limited.
They can't be accessed fast as they always go all the way out to
the hardware and they cannot be utilized in the kernel itself.
As Time Sensitive Networking (TSN) gains traction it is required to
provide fast user and kernel space access to these clocks.
The approach taken is to utilize the timekeeping and adjtimex(2)
infrastructure to provide this access in a similar way how the
kernel provides access to clock MONOTONIC, REALTIME etc.
Instead of creating a duplicated infrastructure this rework
converts timekeeping and adjtimex(2) into generic functionality
which operates on pointers to data structures instead of using
static variables.
This allows to provide time accessors and adjtimex(2) functionality
for the independent PTP clocks in a subsequent step.
- Consolidate hrtimer initialization
hrtimers are set up by initializing the data structure and then
seperately setting the callback function for historical reasons.
That's an extra unnecessary step and makes Rust support less
straight forward than it should be.
Provide a new set of hrtimer_setup*() functions and convert the
core code and a few usage sites of the less frequently used
interfaces over.
The bulk of the htimer_init() to hrtimer_setup() conversion is
already prepared and scheduled for the next merge window.
- Drivers:
- Ensure that the global timekeeping clocksource is utilizing the
cluster 0 timer on MIPS multi-cluster systems.
Otherwise CPUs on different clusters use their cluster specific
clocksource which is not guaranteed to be synchronized with
other clusters.
- Mostly boring cleanups, fixes, improvements and code movement"
* tag 'timers-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (140 commits)
posix-timers: Fix spurious warning on double enqueue versus do_exit()
clocksource/drivers/arm_arch_timer: Use of_property_present() for non-boolean properties
clocksource/drivers/gpx: Remove redundant casts
clocksource/drivers/timer-ti-dm: Fix child node refcount handling
dt-bindings: timer: actions,owl-timer: convert to YAML
clocksource/drivers/ralink: Add Ralink System Tick Counter driver
clocksource/drivers/mips-gic-timer: Always use cluster 0 counter as clocksource
clocksource/drivers/timer-ti-dm: Don't fail probe if int not found
clocksource/drivers:sp804: Make user selectable
clocksource/drivers/dw_apb: Remove unused dw_apb_clockevent functions
hrtimers: Delete hrtimer_init_on_stack()
alarmtimer: Switch to use hrtimer_setup() and hrtimer_setup_on_stack()
io_uring: Switch to use hrtimer_setup_on_stack()
sched/idle: Switch to use hrtimer_setup_on_stack()
hrtimers: Delete hrtimer_init_sleeper_on_stack()
wait: Switch to use hrtimer_setup_sleeper_on_stack()
timers: Switch to use hrtimer_setup_sleeper_on_stack()
net: pktgen: Switch to use hrtimer_setup_sleeper_on_stack()
futex: Switch to use hrtimer_setup_sleeper_on_stack()
fs/aio: Switch to use hrtimer_setup_sleeper_on_stack()
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmc7S3kQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpjHVEAC+CITBEcGy+S0IK0BpIAhuA+A621LtqBwy
0z/4MZKXMqvWxcFGQJ9Zr8MvxUnY4KFcssiaR5zk+I9TczNu7mLMuPYD1Gb0Klgz
mwuFOylo1CAAC41IABYZZ/0qWbTaW0p8tpaGsTbTNk3tBxuMLB550+APAqC1OE9U
bb7rP+FHc5+YGI9/7JNWt7NNTSHvVSO6oxjltCxHr1dRg93Jtr2jaY6letY3epFz
TCFyfJlDtK8fPwtYRyG51M4g2Vdp9/4qsfPqvnXwUr9MdWaVh5/TFkyvqDi5sCKM
zdK/sjRiimYzvqqKg6bzgYscITUPNk2TG6ZJq5U1L7lrglzVY69c7GIUnNzPrL/y
AxQsR5Guxz3bRNYWZ4BKJDH+NNB+cgIFEXDsv72qoUy3HTzA6wOPZYxfjhZhKuG/
DjRwM7NGx5oPiKtpK99IulZttXdmtkH0csuLwKmOzrQskQdTuWyrEtU7UQql7oQ5
Rt3DhMXouzYZMicB8U5Q9gO2I3WN+2VVxXl4sa00LG8KsT6PzLnz4Q2k/1c83S6J
rRivRbZAbZ1+BqKvF8T7GgzLCeaLgzbeoxmxj6xr87pf3SYEs2KhQeQ+n/C0HTOt
GOcG1+bvh7t2aSvlBPKVCExWI4erwG6wXFhfGKsLW9CmwIMqRNxdePpRWe3Cueyp
M3QRJuvTxQ==
=bDvp
-----END PGP SIGNATURE-----
Merge tag 'for-6.13/io_uring-20241118' of git://git.kernel.dk/linux
Pull io_uring updates from Jens Axboe:
- Cleanups of the eventfd handling code, making it fully private.
- Support for sending a sync message to another ring, without having a
ring available to send a normal async message.
- Get rid of the separate unlocked hash table, unify everything around
the single locked one.
- Add support for ring resizing. It can be hard to appropriately size
the CQ ring upfront, if the application doesn't know how busy it will
be. This results in applications sizing rings for the most busy case,
which can be wasteful. With ring resizing, they can start small and
grow the ring, if needed.
- Add support for fixed wait regions, rather than needing to copy the
same wait data tons of times for each wait operation.
- Rewrite the resource node handling, which before was serialized per
ring. This caused issues with particularly fixed files, where one
file waiting on IO could hold up putting and freeing of other
unrelated files. Now each node is handled separately. New code is
much simpler too, and was a net 250 line reduction in code.
- Add support for just doing partial buffer clones, rather than always
cloning the entire buffer table.
- Series adding static NAPI support, where a specific NAPI instance is
used rather than having a list of them available that need lookup.
- Add support for mapped regions, and also convert the fixed wait
support mentioned above to that concept. This avoids doing special
mappings for various planned features, and folds the existing
registered wait into that too.
- Add support for hybrid IO polling, which is a variant of strict
IOPOLL but with an initial sleep delay to avoid spinning too early
and wasting resources on devices that aren't necessarily in the < 5
usec category wrt latencies.
- Various cleanups and little fixes.
* tag 'for-6.13/io_uring-20241118' of git://git.kernel.dk/linux: (79 commits)
io_uring/region: fix error codes after failed vmap
io_uring: restore back registered wait arguments
io_uring: add memory region registration
io_uring: introduce concept of memory regions
io_uring: temporarily disable registered waits
io_uring: disable ENTER_EXT_ARG_REG for IOPOLL
io_uring: fortify io_pin_pages with a warning
switch io_msg_ring() to CLASS(fd)
io_uring: fix invalid hybrid polling ctx leaks
io_uring/uring_cmd: fix buffer index retrieval
io_uring/rsrc: add & apply io_req_assign_buf_node()
io_uring/rsrc: remove '->ctx_ptr' of 'struct io_rsrc_node'
io_uring/rsrc: pass 'struct io_ring_ctx' reference to rsrc helpers
io_uring: avoid normal tw intermediate fallback
io_uring/napi: add static napi tracking strategy
io_uring/napi: clean up __io_napi_do_busy_loop
io_uring/napi: Use lock guards
io_uring/napi: improve __io_napi_add
io_uring/napi: fix io_napi_entry RCU accesses
io_uring/napi: protect concurrent io_napi_entry timeout accesses
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmc7S40QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpjHVD/43rDZ8ehs+IAAr6S0RemNX1SRG0mK2UOEb
kMoNogS7StO/c4JYW3JuzCyLRn5ZsgeWV/muqxwDEWQrmTGrvi+V45KikrZPwm3k
p0ump33qV9EU2jiR1MKZjtwK2P0CI7/DD3W8ww6IOvKbTT7RcqQcdHznvXArFBtc
xCuQPpayFG7ZasC+N9VaBwtiUEVgU3Ek9AFT7UVZRWajjHPNalQwaooJWayO0rEG
KdoW5yG0ryLrgCY2ACSvRLS+2s14EJtb8hgT08WKHTNgd5LxhSKxfsTapamua+7U
FdVS6Ij0tEkgu2jpvgj7QKO0Uw10Cnep2gj7RHts/LVewvkliS6XcheOzqRS1jWU
I2EI+UaGOZ11OUiw52VIveEVS5zV/NWhgy5BSP9LYEvXw0BUAHRDYGMem8o5G1V1
SWqjIM1UWvcQDlAnMF9FDVzojvjVUmYWvcAlFFztO8J0B7SavHR3NcfHwEf57reH
rNoUbi/9c4/wjJJF33gejiR5pU+ewy/Mk75GrtX3xpEqlztfRbf9/FbPCMEAO1KR
DF/b3lkUV9i2/BRW6a0SpZ5RDSmSYMnateel6TrPyVSRnpiSSFO8FrbynwUOa17b
6i49YDFWzzXOrR1YWDg6IEtTrcmBEmvi7F6aoDs020qUnL0hwLn1ZuoIxuiFEpor
Z0iFF1B/nw==
=PWTH
-----END PGP SIGNATURE-----
Merge tag 'for-6.13/block-20241118' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
- NVMe updates via Keith:
- Use uring_cmd helper (Pavel)
- Host Memory Buffer allocation enhancements (Christoph)
- Target persistent reservation support (Guixin)
- Persistent reservation tracing (Guixen)
- NVMe 2.1 specification support (Keith)
- Rotational Meta Support (Matias, Wang, Keith)
- Volatile cache detection enhancment (Guixen)
- MD updates via Song:
- Maintainers update
- raid5 sync IO fix
- Enhance handling of faulty and blocked devices
- raid5-ppl atomic improvement
- md-bitmap fix
- Support for manually defining embedded partition tables
- Zone append fixes and cleanups
- Stop sending the queued requests in the plug list to the driver
->queue_rqs() handle in reverse order.
- Zoned write plug cleanups
- Cleanups disk stats tracking and add support for disk stats for
passthrough IO
- Add preparatory support for file system atomic writes
- Add lockdep support for queue freezing. Already found a bunch of
issues, and some fixes for that are in here. More will be coming.
- Fix race between queue stopping/quiescing and IO queueing
- ublk recovery improvements
- Fix ublk mmap for 64k pages
- Various fixes and cleanups
* tag 'for-6.13/block-20241118' of git://git.kernel.dk/linux: (118 commits)
MAINTAINERS: Update git tree for mdraid subsystem
block: make struct rq_list available for !CONFIG_BLOCK
block/genhd: use seq_put_decimal_ull for diskstats decimal values
block: don't reorder requests in blk_mq_add_to_batch
block: don't reorder requests in blk_add_rq_to_plug
block: add a rq_list type
block: remove rq_list_move
virtio_blk: reverse request order in virtio_queue_rqs
nvme-pci: reverse request order in nvme_queue_rqs
btrfs: validate queue limits
block: export blk_validate_limits
nvmet: add tracing of reservation commands
nvme: parse reservation commands's action and rtype to string
nvmet: report ns's vwc not present
md/raid5: Increase r5conf.cache_name size
block: remove the ioprio field from struct request
block: remove the write_hint field from struct request
nvme: check ns's volatile write cache not present
nvme: add rotational support
nvme: use command set independent id ns if available
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmc0zT4ACgkQxWXV+ddt
WDtThRAAhzSSiHcJqTfCL5nHh7w85MNEVw28o1ETgXSYJmx0JOWLE7Znlp2FV7jj
IbYkFfF2gXJzYvRZkcXB/TAHV9KJG5yZIBZfccbM+9db9f8xkImVKMuqQRXPU41R
ppSCmqZTeujtt8ucsaJkMpm6pzECKJCJaGOsMJ8fiqKpo89dKO3eGAVboSbpPF4C
r0YmppiBwSP/cCXQCqWxZRbqPGN+lUgZpIGNRi157kehfmRHlVVJTO1pgqK8PCXb
uIT09Kulppfez8+1A10CPcniDTyinLik/qLTNlzdWoDBL4iNJMg0A0wsA04AJVf0
PdOS0REusiv3QcEIO6PefuRFRRfXcSLPpPDUceltJT5O0uM2gUqf2C7dEHXUGU3o
TdgYlbQpsJWpZ7VGWQDZeGGV04lOPQvu0LGLPgEerUQd5H9ABa0dX8Fn0sPhKsa8
whpAcdfE4rdNxB2OJFnqQeFq0z3cSjP/rvKlluCmAj97QYI+kiu3QyhemcT1YSC9
U7n5Ya9IzIYCN3ml54q3hEgyD0IVGGG20GuUmqC9XSP9mrQRC8I1g7v26AiOTrrk
VhgSdtMmphDxXudifsnYMaQ0Z1QqiUrW1SM/prAEOnBYCo75+HDsTgrq9ithgHoI
4xz4YXJyMRs18qfTJctXC1wmGuz5plTdQrwarHdNsELN5HEyqX4=
=aAcf
-----END PGP SIGNATURE-----
Merge tag 'for-6.13-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"Changes outside of btrfs: add io_uring command flag to track a dying
task (the rest will go via the block git tree).
User visible changes:
- wire encoded read (ioctl) to io_uring commands, this can be used on
itself, in the future this will allow 'send' to be asynchronous. As
a consequence, the encoded read ioctl can also work in non-blocking
mode
- new ioctl to wait for cleaned subvolumes, no need to use the
generic and root-only SEARCH_TREE ioctl, will be used by "btrfs
subvol sync"
- recognize different paths/symlinks for the same devices and don't
report them during rescanning, this can be observed with LVM or DM
- seeding device use case change, the sprout device (the one
capturing new writes) will not clear the read-only status of the
super block; this prevents accumulating space from deleted
snapshots
Performance improvements:
- reduce lock contention when traversing extent buffers
- reduce extent tree lock contention when searching for inline
backref
- switch from rb-trees to xarray for delayed ref tracking,
improvements due to better cache locality, branching factors and
more compact data structures
- enable extent map shrinker again (prevent memory exhaustion under
some types of IO load), reworked to run in a single worker thread
(there used to be problems causing long stalls under memory
pressure)
Core changes:
- raid-stripe-tree feature updates:
- make device replace and scrub work
- implement partial deletion of stripe extents
- new selftests
- split the config option BTRFS_DEBUG and add EXPERIMENTAL for
features that are experimental or with known problems so we don't
misuse debugging config for that
- subpage mode updates (sector < page):
- update compression implementations
- update writepage, writeback
- continued folio API conversions:
- buffered writes
- make buffered write copy one page at a time, preparatory work for
future integration with large folios, may cause performance drop
- proper locking of root item regarding starting send
- error handling improvements
- code cleanups and refactoring:
- dead code removal
- unused parameter reduction
- lockdep assertions"
* tag 'for-6.13-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (119 commits)
btrfs: send: check for read-only send root under critical section
btrfs: send: check for dead send root under critical section
btrfs: remove check for NULL fs_info at btrfs_folio_end_lock_bitmap()
btrfs: fix warning on PTR_ERR() against NULL device at btrfs_control_ioctl()
btrfs: fix a typo in btrfs_use_zone_append
btrfs: avoid superfluous calls to free_extent_map() in btrfs_encoded_read()
btrfs: simplify logic to decrement snapshot counter at btrfs_mksnapshot()
btrfs: remove hole from struct btrfs_delayed_node
btrfs: update stale comment for struct btrfs_delayed_ref_node::add_list
btrfs: add new ioctl to wait for cleaned subvolumes
btrfs: simplify range tracking in cow_file_range()
btrfs: remove conditional path allocation in btrfs_read_locked_inode()
btrfs: push cleanup into btrfs_read_locked_inode()
io_uring/cmd: let cmds to know about dying task
btrfs: add struct io_btrfs_cmd as type for io_uring_cmd_to_pdu()
btrfs: add io_uring command for encoded reads (ENCODED_READ ioctl)
btrfs: move priv off stack in btrfs_encoded_read_regular_fill_pages()
btrfs: don't sleep in btrfs_encoded_read() if IOCB_NOWAIT is set
btrfs: change btrfs_encoded_read() so that reading of extent is done by caller
btrfs: remove pointless iocb::ki_pos addition in btrfs_encoded_read()
...
and friends
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZzdpZQAKCRBZ7Krx/gZQ
6whMAQDhlGFV+nGRetwe4t60mVRpxIoc71GLC7b6V8FmyfTI5AEAkAigkJ8KCZDP
mfGsN/3PtzoxnIkIqdk7Y7q4/fowyAw=
=4DWZ
-----END PGP SIGNATURE-----
Merge tag 'pull-statx' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull statx updates from Al Viro:
"Sanitize struct filename and lookup flags handling in statx and
friends"
* tag 'pull-statx' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
libfs: kill empty_dir_getattr()
fs: Simplify getattr interface function checking AT_GETATTR_NOSEC flag
fs/stat.c: switch to CLASS(fd_raw)
kill getname_statx_lookup_flags()
io_statx_prep(): use getname_uflags()
add *xattrat() syscalls, sanitize struct filename handling in there.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZzdj4gAKCRBZ7Krx/gZQ
6/02AQC8ndn9i1wLGRb5DdZYGNWUDhXCdPrZCF2nyvU2swCIPwEAm1H5F/bxBXeT
6qCLHThVw4KTJOT2aDY03ELrxbi8Vg4=
=35Oj
-----END PGP SIGNATURE-----
Merge tag 'pull-xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull xattr updates from Al Viro:
"Sanitize xattr and io_uring interactions with it, add *xattrat()
syscalls, sanitize struct filename handling in there"
* tag 'pull-xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
xattr: remove redundant check on variable err
fs/xattr: add *at family syscalls
new helpers: file_removexattr(), filename_removexattr()
new helpers: file_listxattr(), filename_listxattr()
replace do_getxattr() with saner helpers.
replace do_setxattr() with saner helpers.
new helper: import_xattr_name()
fs: rename struct xattr_ctx to kernel_xattr_ctx
xattr: switch to CLASS(fd)
io_[gs]etxattr_prep(): just use getname()
io_uring: IORING_OP_F[GS]ETXATTR is fine with REQ_F_FIXED_FILE
getname_maybe_null() - the third variant of pathname copy-in
teach filename_lookup() to treat NULL filename as ""
Making sure that struct fd instances are destroyed in the same
scope where they'd been created, getting rid of reassignments
and passing them by reference, converting to CLASS(fd{,_pos,_raw}).
We are getting very close to having the memory safety of that stuff
trivial to verify.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZzdikAAKCRBZ7Krx/gZQ
69nJAQCmbQHK3TGUbQhOw6MJXOK9ezpyEDN3FZb4jsu38vTIdgEA6OxAYDO2m2g9
CN18glYmD3wRyU6Bwl4vGODouSJvDgA=
=gVH3
-----END PGP SIGNATURE-----
Merge tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull 'struct fd' class updates from Al Viro:
"The bulk of struct fd memory safety stuff
Making sure that struct fd instances are destroyed in the same scope
where they'd been created, getting rid of reassignments and passing
them by reference, converting to CLASS(fd{,_pos,_raw}).
We are getting very close to having the memory safety of that stuff
trivial to verify"
* tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (28 commits)
deal with the last remaing boolean uses of fd_file()
css_set_fork(): switch to CLASS(fd_raw, ...)
memcg_write_event_control(): switch to CLASS(fd)
assorted variants of irqfd setup: convert to CLASS(fd)
do_pollfd(): convert to CLASS(fd)
convert do_select()
convert vfs_dedupe_file_range().
convert cifs_ioctl_copychunk()
convert media_request_get_by_fd()
convert spu_run(2)
switch spufs_calls_{get,put}() to CLASS() use
convert cachestat(2)
convert do_preadv()/do_pwritev()
fdget(), more trivial conversions
fdget(), trivial conversions
privcmd_ioeventfd_assign(): don't open-code eventfd_ctx_fdget()
o2hb_region_dev_store(): avoid goto around fdget()/fdput()
introduce "fd_pos" class, convert fdget_pos() users to it.
fdget_raw() users: switch to CLASS(fd_raw)
convert vmsplice() to CLASS(fd)
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZzcW4gAKCRCRxhvAZXjc
okF+AP9xTMb2SlnRPBOBd9yFcmVXmQi86TSCUPAEVb+wIldGYwD/RIOdvXYJlp9v
RgJkU1DC3ddkXtONNDY6gFaP+siIWA0=
=gMc7
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.13.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs file updates from Christian Brauner:
"This contains changes the changes for files for this cycle:
- Introduce a new reference counting mechanism for files.
As atomic_inc_not_zero() is implemented with a try_cmpxchg() loop
it has O(N^2) behaviour under contention with N concurrent
operations and it is in a hot path in __fget_files_rcu().
The rcuref infrastructures remedies this problem by using an
unconditional increment relying on safe- and dead zones to make
this work and requiring rcu protection for the data structure in
question. This not just scales better it also introduces overflow
protection.
However, in contrast to generic rcuref, files require a memory
barrier and thus cannot rely on *_relaxed() atomic operations and
also require to be built on atomic_long_t as having massive amounts
of reference isn't unheard of even if it is just an attack.
This adds a file specific variant instead of making this a generic
library.
This has been tested by various people and it gives consistent
improvement up to 3-5% on workloads with loads of threads.
- Add a fastpath for find_next_zero_bit(). Skip 2-levels searching
via find_next_zero_bit() when there is a free slot in the word that
contains the next fd. This improves pts/blogbench-1.1.0 read by 8%
and write by 4% on Intel ICX 160.
- Conditionally clear full_fds_bits since it's very likely that a bit
in full_fds_bits has been cleared during __clear_open_fds(). This
improves pts/blogbench-1.1.0 read up to 13%, and write up to 5% on
Intel ICX 160.
- Get rid of all lookup_*_fdget_rcu() variants. They were used to
lookup files without taking a reference count. That became invalid
once files were switched to SLAB_TYPESAFE_BY_RCU and now we're
always taking a reference count. Switch to an already existing
helper and remove the legacy variants.
- Remove pointless includes of <linux/fdtable.h>.
- Avoid cmpxchg() in close_files() as nobody else has a reference to
the files_struct at that point.
- Move close_range() into fs/file.c and fold __close_range() into it.
- Cleanup calling conventions of alloc_fdtable() and expand_files().
- Merge __{set,clear}_close_on_exec() into one.
- Make __set_open_fd() set cloexec as well instead of doing it in two
separate steps"
* tag 'vfs-6.13.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
selftests: add file SLAB_TYPESAFE_BY_RCU recycling stressor
fs: port files to file_ref
fs: add file_ref
expand_files(): simplify calling conventions
make __set_open_fd() set cloexec state as well
fs: protect backing files with rcu
file.c: merge __{set,clear}_close_on_exec()
alloc_fdtable(): change calling conventions.
fs/file.c: add fast path in find_next_fd()
fs/file.c: conditionally clear full_fds
fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()
move close_range(2) into fs/file.c, fold __close_range() into it
close_files(): don't bother with xchg()
remove pointless includes of <linux/fdtable.h>
get rid of ...lookup...fdget_rcu() family
Now we've got a more generic region registration API, place
IORING_ENTER_EXT_ARG_REG and re-enable it.
First, the user has to register a region with the
IORING_MEM_REGION_REG_WAIT_ARG flag set. It can only be done for a
ring in a disabled state, aka IORING_SETUP_R_DISABLED, to avoid races
with already running waiters. With that we should have stable constant
values for ctx->cq_wait_{size,arg} in io_get_ext_arg_reg() and hence no
READ_ONCE required.
The other API difference is that we're now passing byte offsets instead
of indexes. The user _must_ align all offsets / pointers to the native
word size, failing to do so might but not necessarily has to lead to a
failure usually returned as -EFAULT. liburing will be hiding this
details from users.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/81822c1b4ffbe8ad391b4f9ad1564def0d26d990.1731689588.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Regions will serve multiple purposes. First, with it we can decouple
ring/etc. object creation from registration / mapping of the memory they
will be placed in. We already have hacks that allow to put both SQ and
CQ into the same huge page, in the future we should be able to:
region = create_region(io_ring);
create_pbuf_ring(io_uring, region, offset=0);
create_pbuf_ring(io_uring, region, offset=N);
The second use case is efficiently passing parameters. The following
patch enables back on top of regions IORING_ENTER_EXT_ARG_REG, which
optimises wait arguments. It'll also be useful for request arguments
replacing iovecs, msghdr, etc. pointers. Eventually it would also be
handy for BPF as well if it comes to fruition.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0798cf3a14fad19cfc96fc9feca5f3e11481691d.1731689588.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We've got a good number of mappings we share with the userspace, that
includes the main rings, provided buffer rings, upcoming rings for
zerocopy rx and more. All of them duplicate user argument parsing and
some internal details as well (page pinnning, huge page optimisations,
mmap'ing, etc.)
Introduce a notion of regions. For userspace for now it's just a new
structure called struct io_uring_region_desc which is supposed to
parameterise all such mapping / queue creations. A region either
represents a user provided chunk of memory, in which case the user_addr
field should point to it, or a request for the kernel to allocate the
memory, in which case the user would need to mmap it after using the
offset returned in the mmap_offset field. With a uniform userspace API
we can avoid additional boiler plate code and apply future optimisation
to all of them at once.
Internally, there is a new structure struct io_mapped_region holding all
relevant runtime information and some helpers to work with it. This
patch limits it to user provided regions.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0e6fe25818dfbaebd1bd90b870a6cac503fe1a24.1731689588.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We're a bit too frivolous with types of nr_pages arguments, converting
it to long and back to int, passing an unsigned int pointer as an int
pointer and so on. Shouldn't cause any problem but should be carefully
reviewed, but until then let's add a WARN_ON_ONCE check to be more
confident callers don't pass poorely checked arguents.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d48e0c097cbd90fb47acaddb6c247596510d8cfc.1731689588.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use CLASS(fd) to get the file for sync message ring requests, rather
than open-code the file retrieval dance.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/r/20241115034902.GP3387508@ZenIV
[axboe: make a more coherent commit message]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace the semi-open coded request list helpers with a proper rq_list
type that mirrors the bio_list and has head and tail pointers. Besides
better type safety this actually allows to insert at the tail of the
list, which will be useful soon.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241113152050.157179-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
the only thing in flags getname_flags() ever cares about is
LOOKUP_EMPTY; anything else is none of its damn business.
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When the taks that submitted a request is dying, a task work for that
request might get run by a kernel thread or even worse by a half
dismantled task. We can't just cancel the task work without running the
callback as the cmd might need to do some clean up, so pass a flag
instead. If set, it's not safe to access any task resources and the
callback is expected to cancel the cmd ASAP.
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The following pattern becomes more and more:
+ io_req_assign_rsrc_node(&req->buf_node, node);
+ req->flags |= REQ_F_BUF_NODE;
so make it a helper, which is less fragile to use than above code, for
example, the BUF_NODE flag is even missed in current io_uring_cmd_prep().
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241107110149.890530-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
`io_rsrc_node` instance won't be shared among different io_uring ctxs,
and its allocation 'ctx' is always same with the user's 'ctx', so it is
safe to pass user 'ctx' reference to rsrc helpers. Even in io_clone_buffers(),
`io_rsrc_node` instance is allocated actually for destination io_uring_ctx.
Then io_rsrc_node_ctx() can be removed, and the 8 bytes `ctx` pointer will be
removed from `io_rsrc_node` in the following patch.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241107110149.890530-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
hrtimer_setup_on_stack() takes the callback function pointer as argument
and initializes the timer completely.
Replace hrtimer_init_on_stack() and the open coded initialization of
hrtimer::function with the new setup mechanism.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/f0d4ac32ec4050710a656cee8385fa4427be33aa.1730386209.git.namcao@linutronix.de
The IORING_OP_TIMEOUT command uses hrtimer underneath. The timer's callback
function is setup in io_timeout(), and then the callback function is setup
again when the timer is rearmed.
Since the callback function is the same for both cases, the latter setup is
redundant, therefore remove it.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk:
Link: https://lore.kernel.org/all/07b28dfd5691478a2d250f379c8b90dd37f9bb9a.1730386209.git.namcao@linutronix.de
When a DEFER_TASKRUN io_uring is terminating it requeues deferred task
work items as normal tw, which can further fallback to kthread
execution. Avoid this extra step and always push them to the fallback
kthread.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d1cd472cec2230c66bd1c8d412a5833f0af75384.1730772720.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
__io_napi_do_busy_loop now requires to have loop_end in its parameters.
This makes the code cleaner and also has the benefit of removing a
branch since the only caller not passing NULL for loop_end_arg is also
setting the value conditionally.
Signed-off-by: Olivier Langlois <olivier@trillion01.com>
Link: https://lore.kernel.org/r/d5b9bb91b1a08fff50525e1c18d7b4709b9ca100.1728828877.git.olivier@trillion01.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
1. move the sock->sk pointer validity test outside the function to
avoid the function call overhead and to make the function more
more reusable
2. change its name to __io_napi_add_id to be more precise about it is
doing
3. return an error code to report errors
Signed-off-by: Olivier Langlois <olivier@trillion01.com>
Link: https://lore.kernel.org/r/d637fa3b437d753c0f4e44ff6a7b5bf2c2611270.1728828877.git.olivier@trillion01.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The SQ index array consists of user provided indexes, which io_uring
then uses to index the SQ, and so it's susceptible to speculation. For
all other queues io_uring tracks heads and tails in kernel, and they
shouldn't need any special care.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c6c7a25962924a55869e317e4fdb682dfdc6b279.1730687889.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Rather than store the task_struct itself in struct io_kiocb, store
the io_uring specific task_struct. The life times are the same in terms
of io_uring, and this avoids doing some dereferences through the
task_struct. For the hot path of putting local task references, we can
deref req->tctx instead, which we'll need anyway in that function
regardless of whether it's local or remote references.
This is mostly straight forward, except the original task PF_EXITING
check needs a bit of tweaking. task_work is _always_ run from the
originating task, except in the fallback case, where it's run from a
kernel thread. Replace the potentially racy (in case of fallback work)
checks for req->task->flags with current->flags. It's either the still
the original task, in which case PF_EXITING will be sane, or it has
PF_KTHREAD set, in which case it's fallback work. Both cases should
prevent moving forward with the given request.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Right now the task_struct pointer is used as the key to match a task,
but in preparation for some io_kiocb changes, move it to using struct
io_uring_task instead. No functional changes intended in this patch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently the io_rsrc_node assignment in io_kiocb is an array of two
pointers, as two nodes may be assigned to a request - one file node,
and one buffer node. However, the buffer node can co-exist with the
provided buffers, as currently it's not supported to use both provided
and registered buffers at the same time.
This crucially brings struct io_kiocb down to 4 cache lines again, as
before it spilled into the 5th cacheline.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Rather than keep the type field separate rom ctx, use the fact that we
can encode up to 4 types of nodes in the LSB of the ctx pointer. Doesn't
reclaim any space right now on 64-bit archs, but it leaves a full int
for future use.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
similar to do_setxattr() in the previous commit...
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
io_uring setxattr logics duplicates stuff from fs/xattr.c; provide
saner helpers (filename_setxattr() and file_setxattr() resp.) and
use them.
NB: putname(ERR_PTR()) is a no-op
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Rename the struct xattr_ctx to increase distinction with the about to be
added user API struct xattr_args.
No functional change.
Suggested-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Christian Göttsche <cgzones@googlemail.com>
Link: https://lore.kernel.org/r/20240426162042.191916-2-cgoettsche@seltendoof.de
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
getname_flags(pathname, LOOKUP_FOLLOW) is obviously bogus - following
trailing symlinks has no impact on how to copy the pathname from userland...
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
fdget() is the first thing done in scope, all matching fdput() are
immediately followed by leaving the scope.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
A new hybrid poll is implemented on the io_uring layer. Once an IO is
issued, it will not poll immediately, but rather block first and re-run
before IO complete, then poll to reap IO. While this poll method could
be a suboptimal solution when running on a single thread, it offers
performance lower than regular polling but higher than IRQ, and CPU
utilization is also lower than polling.
To use hybrid polling, the ring must be setup with both the
IORING_SETUP_IOPOLL and IORING_SETUP_HYBRID)IOPOLL flags set. Hybrid
polling has the same restrictions as IOPOLL, in that commands must
explicitly support it.
Signed-off-by: hexue <xue01.he@samsung.com>
Link: https://lore.kernel.org/r/20241101091957.564220-2-xue01.he@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently cloning a buffer table will fail if the destination already has
a table. But it should be possible to use it to replace existing elements.
Add a IORING_REGISTER_DST_REPLACE cloning flag, which if set, will allow
the destination to already having a buffer table. If that is the case,
then entries designated by offset + nr buffers will be replaced if they
already exist.
Note that it's allowed to use IORING_REGISTER_DST_REPLACE and not have
an existing table, in which case it'll work just like not having the
flag set and an empty table - it'll just assign the newly created table
for that case.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Right now buffer cloning is an all-or-nothing kind of thing - either the
whole table is cloned from a source to a destination ring, or nothing at
all.
However, it's not always desired to clone the whole thing. Allow for
the application to specify a source and destination offset, and a
number of buffers to clone. If the destination offset is non-zero, then
allocate sparse nodes upfront.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The empty node was used as a placeholder for a sparse entry, but it
didn't really solve any issues. The caller still has to check for
whether it's the empty node or not, it may as well just check for a NULL
return instead.
The dummy_ubuf was used for a sparse buffer entry, but NULL will serve
the same purpose there of ensuring an -EFAULT on attempted import.
Just use NULL for a sparse node, regardless of whether or not it's a
file or buffer resource.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Puts and reset an existing node in a slot, if one exists. Returns true
if a node was there, false if not. This helps cleanup some of the code
that does a lookup just to clear an existing node.
Signed-off-by: Jens Axboe <axboe@kernel.dk>