Commit 23baf831a3 ("mm, treewide: redefine MAX_ORDER sanely")
changed the meaning of MAX_ORDER from exclusive to inclusive. So, we
can allocate compound pages with up to 1 << MAX_ORDER pages.
Reflect this change in dm-flakey and start trying to allocate compound
pages with MAX_ORDER.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
dm_verity_fec_io is placed after the end of two hash digests. If the hash
digest has unaligned length, struct dm_verity_fec_io could be unaligned.
This commit fixes the placement of struct dm_verity_fec_io, so that it's
aligned.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Fixes: a739ff3f54 ("dm verity: add support for forward error correction")
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
We found an issue under Android OTA scenario that many BIOs have to do
FEC where the data under dm-verity is 100% complete and no corruption.
Android OTA has many dm-block layers, from upper to lower:
dm-verity
dm-snapshot
dm-origin & dm-cow
dm-linear
ufs
DM tables have to change 2 times during Android OTA merging process.
When doing table change, the dm-snapshot will be suspended for a while.
During this interval, many readahead IOs are submitted to dm_verity
from filesystem. Then the kverity works are busy doing FEC process
which cost too much time to finish dm-verity IO. This causes needless
delay which feels like system is hung.
After adding debugging it was found that each readahead IO needed
around 10s to finish when this situation occurred. This is due to IO
amplification:
dm-snapshot suspend
erofs_readahead // 300+ io is submitted
dm_submit_bio (dm_verity)
dm_submit_bio (dm_snapshot)
bio return EIO
bio got nothing, it's empty
verity_end_io
verity_verify_io
forloop range(0, io->n_blocks) // each io->nblocks ~= 20
verity_fec_decode
fec_decode_rsb
fec_read_bufs
forloop range(0, v->fec->rsn) // v->fec->rsn = 253
new_read
submit_bio (dm_snapshot)
end loop
end loop
dm-snapshot resume
Readahead BIOs get nothing while dm-snapshot is suspended, so all of
them will cause verity's FEC.
Each readahead BIO needs to verify ~20 (io->nblocks) blocks.
Each block needs to do FEC, and every block needs to do 253
(v->fec->rsn) reads.
So during the suspend interval(~200ms), 300 readahead BIOs trigger
~1518000 (300*20*253) IOs to dm-snapshot.
As readahead IO is not required by userspace, and to fix this issue,
it is best to pass readahead errors to upper layer to handle it.
Cc: stable@vger.kernel.org
Fixes: a739ff3f54 ("dm verity: add support for forward error correction")
Signed-off-by: Wu Bo <bo.wu@vivo.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
It is nontrivial to derive the role of the two attribute groups in source
file block/blk-sysfs.c. Hence add a comment that explains their roles. See
also commit 6d85ebf95c ("blk-sysfs: add a new attr_group for blk_mq").
Cc: Christoph Hellwig <hch@lst.de>
Cc: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20231128194019.72762-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
struct ynl_req_state carries reply-related info from generated code
into generic YNL code. While we don't need reply info to execute
a request without a reply, we still need to pass in the struct, because
it's also where we get the pointer to struct ynl_sock from. Passing NULL
results in crashes if kernel returns an error or an unexpected reply.
Fixes: dc0956c98f ("tools: ynl-gen: move the response reading logic into YNL")
Link: https://lore.kernel.org/r/20231126225858.2144136-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The default dump handler needs to clear ret before returning.
Otherwise if the last interface returns an inconsequential
error this error will propagate to user space.
This may confuse user space (ethtool CLI seems to ignore it,
but YNL doesn't). It will also terminate the dump early
for mutli-skb dump, because netlink core treats EOPNOTSUPP
as a real error.
Fixes: 728480f124 ("ethtool: default handlers for GET requests")
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20231126225806.2143528-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When amd_pstate is running, writing to scaling_min_freq and
scaling_max_freq has no effect. These values are only passed to the
policy level, but not to the platform level. This means that the
platform does not know about the frequency limits set by the user.
To fix this, update the min_perf and max_perf values at the platform
level whenever the user changes the scaling_min_freq and scaling_max_freq
values.
Fixes: ffa5096a7c ("cpufreq: amd-pstate: implement Pstate EPP support for the AMD processors")
Acked-by: Huang Rui <ray.huang@amd.com>
Signed-off-by: Wyes Karny <wyes.karny@amd.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
- Fix a really interesting potential core bug in the list iterator
requireing the use of READ_ONCE() discovered when testing kernel
compiles with clang.
- Check devm_kcalloc() return value and an array bounds in the STM32
driver.
- Fix an exotic string truncation issue in the s32cc driver, found
by the kernel test robot (impressive!)
- Fix an undocumented struct member in the cy8c95x0 driver.
- Fix a symbol overlap with MIPS in the Lochnagar driver, MIPS
defines a global symbol "RST" which is a bit too generic and
collide with stuff. OK this one should be renamed too, we will
fix that as well.
- Fix erroneous branch taking in the Realtek driver.
- Fix the mail address in MAINTAINERS for the s32g2 driver.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEElDRnuGcz/wPCXQWMQRCzN7AZXXMFAmVnKK8ACgkQQRCzN7AZ
XXPr1BAAuG5cy4goUT+ycz9WOH65D3gaxqyNxMOHSI7XrbyTQWzPkR+HCcFHORQH
hdCRsTy8WoSnN3UJ/ySaEHl8D8nKl51M8KKqARNpaLnROyKWmGlQQdFZZScIUcpH
3YXKMngBg8buYKL9ZVDJrgCKxpAKVIXxkPDLRFMtfDIWBivvZcx4g7yeuX8SkZyT
hQEkx4VrN9mdA30ZY6s6avhDWyxL/CF669fk7qaQUaT3dZVQdFQ9lx/3yzk7Fm5D
z2hXwQ5ghLRQS/ZbxNS2oHixU9sA5xUVzaj3Rn3Of43NSAmrj8JWuqC2pyZ6V70e
jJSEldBqqY5tiP/0mh2irqi/Uo2CQqVbYtmEtqWxwxWcTv21orO5Hl8CAldXAQZt
v0kVbdsPtePiXfwP/EqsE3qBd3GFeLJKxtK3qZsYSdcDjbqPoxHqGBGSJUHoNfB0
ZFh0PDrFknRGO7ZHJKUdYJ4PQklFSp/JmVtbbZI+6/yafNX/bO9w1RpHOZQ5Je/s
7k9G1dfmJOmP+PprxuqjZTyZ0b+JPqfsiZje3LAnLCEyezJtKqlZwmDGjzhWM2Gv
5yTDIuxmtgI1ynuQVlePPuFZw6hdGCrAZYvKnzb0Wjw582P/XsSyXC4mnyq/dD4S
wxZrsyJFbUcn5LoST1fmio3u6M/tV8eYzLO5Zvg7oWtPAv0Kqlk=
=oW3z
-----END PGP SIGNATURE-----
Merge tag 'pinctrl-v6.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
Pull pin control fixes from Linus Walleij:
- Fix a really interesting potential core bug in the list iterator
requireing the use of READ_ONCE() discovered when testing kernel
compiles with clang.
- Check devm_kcalloc() return value and an array bounds in the STM32
driver.
- Fix an exotic string truncation issue in the s32cc driver, found by
the kernel test robot (impressive!)
- Fix an undocumented struct member in the cy8c95x0 driver.
- Fix a symbol overlap with MIPS in the Lochnagar driver, MIPS defines
a global symbol "RST" which is a bit too generic and collide with
stuff. OK this one should be renamed too, we will fix that as well.
- Fix erroneous branch taking in the Realtek driver.
- Fix the mail address in MAINTAINERS for the s32g2 driver.
* tag 'pinctrl-v6.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
dt-bindings: pinctrl: s32g2: change a maintainer email address
pinctrl: realtek: Fix logical error when finding descriptor
pinctrl: lochnagar: Don't build on MIPS
pinctrl: avoid reload of p state in list iteration
pinctrl: cy8c95x0: Fix doc warning
pinctrl: s32cc: Avoid possible string truncation
pinctrl: stm32: fix array read out of bound
pinctrl: stm32: Add check for devm_kcalloc
Before running a guest, the host process (e.g., QEMU) FP/VEC registers
are saved if they were being used, similarly to when the kernel uses FP
registers. The guest values are then loaded into regs, and the host
process registers will be restored lazily when it uses FP/VEC.
KVM HV has a bug here: the host process registers do get saved, but the
user MSR bits remain enabled, which indicates the registers are valid
for the process. After they are clobbered by running the guest, this
valid indication causes the host process to take on the FP/VEC register
values of the guest.
Fixes: 34e119c96b ("KVM: PPC: Book3S HV P9: Reduce mtmsrd instructions required to save host SPRs")
Cc: stable@vger.kernel.org # v5.17+
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20231122025811.2973-1-npiggin@gmail.com
We used to call intel_pre_plane_updates() for any pipe going through
a modeset whether the pipe was previously enabled or not. This in
fact needed to apply all the necessary clock gating workarounds/etc.
Restore the correct behaviour.
Fixes: 3991999732 ("drm/i915: Disable all planes before modesetting any pipes")
Reviewed-by: Jani Nikula <jani.nikula@intel.com>
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20231121054324.9988-3-ville.syrjala@linux.intel.com
(cherry picked from commit e0d5ce11ed)
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Unfortunately even the HPD based detection added in
commit cfe5bdfb27 ("drm/i915: Check HPD live state during eDP probe")
fails to detect that the VBT's eDP/DDI-A is a ghost on
Asus B360M-A (CFL+CNP). On that board eDP/DDI-A has its HPD
asserted despite nothing being actually connected there :(
The straps/fuses also indicate that the eDP port is present.
So if one boots with a VGA monitor connected the eDP probe will
mistake the DP->VGA converter hooked to DDI-E for an eDP panel
on DDI-A.
As a last resort check what kind of DP device we've detected,
and if it looks like a DP->VGA converter then conclude that
the eDP port should be ignored.
Cc: stable@vger.kernel.org
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/9636
Fixes: cfe5bdfb27 ("drm/i915: Check HPD live state during eDP probe")
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20231114142333.15799-1-ville.syrjala@linux.intel.com
Reviewed-by: Jani Nikula <jani.nikula@intel.com>
(cherry picked from commit fcd479a791)
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
The GSC CS is not exposed to the user, so we skipped assigning a uabi
class number for it. However, the trace logs use the uabi class and
instance to identify the engine, so leaving uabi class unset makes the
GSC CS show up as the RCS in those logs.
Given that the engine is not exposed to the user, we can't add a new
case in the uabi enum, so we insted internally define a kernel
internal class as -1.
At the same time remove special handling for the name and complete
the uabi_classes array so internal class is automatically correctly
assigned.
Engine will show as 65535:0 other0 in the logs/traces which should
be unique enough.
v2:
* Fix uabi class u8 vs u16 type confusion.
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Fixes: 194babe26b ("drm/i915/mtl: don't expose GSC command streamer to the user")
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Alan Previn <alan.previn.teres.alexis@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20231116084456.291533-1-tvrtko.ursulin@linux.intel.com
(cherry picked from commit dfed6b58d5)
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
This fixes a bug where going read-only was taking longer than it should
have due to copygc forgetting to check kthread_should_stop()
Additionally: fix a missing is_kthread check in bch2_move_ratelimit().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This eliminates some SRCU warnings: for_each_btree_key2() runs every
loop iteration in a distinct transaction context.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
btree writes update the btree node key after every write, in order to
update sectors_written, and they also might need to drop pointers if one
of the writes failed in a replicated btree node.
But the btree node might also have had a pointer dropped while the write
was in flight, by bch2_dev_metadata_drop(), and thus there was a bug
where the btree node write would ovewrite the btree node's key with what
it had at the start of the write.
Fix this by dropping pointers not currently in the btree node key.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
journal_cur_seq() can legitimately be used outside of the journal lock,
where this assert can race
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The automated tests check if we've hit too many slowpath/error path
events and fail the test - if we're just shutting down, that naturally
shouldn't count.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Fix races between ravb_tx_timeout_work() and functions of net_device_ops
and ethtool_ops by using rtnl_trylock() and rtnl_unlock(). Note that
since ravb_close() is under the rtnl lock and calls cancel_work_sync(),
ravb_tx_timeout_work() should calls rtnl_trylock(). Otherwise, a deadlock
may happen in ravb_tx_timeout_work() like below:
CPU0 CPU1
ravb_tx_timeout()
schedule_work()
...
__dev_close_many()
// Under rtnl lock
ravb_close()
cancel_work_sync()
// Waiting
ravb_tx_timeout_work()
rtnl_lock()
// This is possible to cause a deadlock
If rtnl_trylock() fails, rescheduling the work with sleep for 1 msec.
Fixes: c156633f13 ("Renesas Ethernet AVB driver proper")
Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Link: https://lore.kernel.org/r/20231127122420.3706751-1-yoshihiro.shimoda.uh@renesas.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We can't rely on FILE_STANDARD_INFORMATION::EndOfFile for reparse
points as they will be always zero. Set it to symlink target's length
as specified by POSIX.
This will make stat() family of syscalls return the correct st_size
for such files.
Cc: stable@vger.kernel.org
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
When instantiating inodes for SMB symlinks, add the mode bits from
@cifs_sb->ctx->file_mode as we already do for the other special files.
Cc: stable@vger.kernel.org
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Fake flexible arrays (zero-length and one-element arrays) are deprecated,
and should be replaced by flexible-array members. So, replace
zero-length array with a flexible-array member in `struct
PACKED_REGISTRY_TABLE`.
Also annotate array `entries` with `__counted_by()` to prepare for the
coming implementation by GCC and Clang of the `__counted_by` attribute.
Flexible array members annotated with `__counted_by` can have their
accesses bounds-checked at run-time via `CONFIG_UBSAN_BOUNDS` (for array
indexing) and `CONFIG_FORTIFY_SOURCE` (for strcpy/memcpy-family functions).
This fixes multiple -Warray-bounds warnings:
drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c:1069:29: warning: array subscript 0 is outside array bounds of 'PACKED_REGISTRY_ENTRY[0]' [-Warray-bounds=]
drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c:1070:29: warning: array subscript 0 is outside array bounds of 'PACKED_REGISTRY_ENTRY[0]' [-Warray-bounds=]
drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c:1071:29: warning: array subscript 0 is outside array bounds of 'PACKED_REGISTRY_ENTRY[0]' [-Warray-bounds=]
drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c:1072:29: warning: array subscript 0 is outside array bounds of 'PACKED_REGISTRY_ENTRY[0]' [-Warray-bounds=]
While there, also make use of the struct_size() helper, and address
checkpatch.pl warning:
WARNING: please, no spaces at the start of a line
This results in no differences in binary output.
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Danilo Krummrich <dakr@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/ZVZbX7C5suLMiBf+@work
With the new uapi we don't have the comp flags on the allocation,
so we shouldn't be using the first size that works, we should be
iterating until we get the correct one.
This reduces allocations from 2MB to 64k in lots of places.
Fixes dEQP-VK.memory.allocation.basic.size_8KiB.forward.count_4000
on my ampere/gsp system.
Cc: stable@vger.kernel.org # v6.6
Signed-off-by: Dave Airlie <airlied@redhat.com>
Reviewed-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Signed-off-by: Danilo Krummrich <dakr@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230811031520.248341-1-airlied@gmail.com
Renamed from trace_move_extent_alloc_mem_fail, because there are other
reasons we colud fail (disk space allocation failure).
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_btree_update_start() calculates which nodes are going to have to be
split/rewritten, so that we know how many nodes to reserve and how deep
in the tree we have to take locks.
But btree node merges require inserting two keys into the parent node,
not just splits.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Validation was completely missing for replicas entries in the journal
(not the superblock replicas section) - we can't have replicas entries
pointing to invalid devices.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
zstd apparently lies about the size of the compression workspace it
requires; if we double it compression succeeds.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmVmHvEACgkQxWXV+ddt
WDtczA//UdxSPxQwJY1oOQj3k5Zgb/zOThfyD4x5wrDxFAYwVh1XhuyxV2XT8qyq
ipp+mi9doVykZKLSY+oRAeqjGlF0nIYm1z2PGSU7JXz4VsEj970rsDs3ePwH5TW6
V8/6VraOIIvmOnTft7aiuM8CXUjyndalNl7RvHu+v6grgAAgQAaly/3CmRIsm/Ui
2Wb6/J8ciAOBZ8TkFMr0PiTJd+CjUL+1Y9IaYEywujf0nVNJSgHp+R2CpwLDvM/5
1x6LdRrUKmnY7mhOvC7QwWfQmgsPnj3OuR+3+L+8jULTvcpwka2KEcpCH8/s6mUK
+4XhKQ4xXOJr8M+KmAUpy1yZZ30G6cDSwnfCWbWRCfR03396tTb08kb6G21fR+NL
o2qEUOe4DoMbYX/5zd9xEVqbwyGhAIXB0fJ7KJ0RqbaNBh/roRALBVCseP2CFwJE
P0DE9phjeIGQf3ybdfP7XqnMfk520bqoeV49Akbn2us2SrV1+O9Yjqmj2pbTnljE
M30Jh/btaiTFtsGB3MBDRRnGhf7F2l1dsmdmMVhdOK8HMY6obcJUdv6YXVLAjBDn
ATWtUVVizOpHvSZL0G/+1fXqHhLqOnHLY4A97uMjcElK5WJfuYZv8vZK7GVKC/jW
y5F4w/FPxU8dmhorMGksya2CLMvUsv5dikyAzGHirjEAdyrK1jg=
=85Pb
-----END PGP SIGNATURE-----
Merge tag 'for-6.7-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few fixes and message updates:
- for simple quotas, handle the case when a snapshot is created and
the target qgroup already exists
- fix a warning when file descriptor given to send ioctl is not
writable
- fix off-by-one condition when checking chunk maps
- free pages when page array allocation fails during compression
read, other cases were handled
- fix memory leak on error handling path in ref-verify debugging
feature
- copy missing struct member 'version' in 64/32bit compat send ioctl
- tree-checker verifies inline backref ordering
- print messages to syslog on first mount and last unmount
- update error messages when reading chunk maps"
* tag 'for-6.7-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: send: ensure send_fd is writable
btrfs: free the allocated memory if btrfs_alloc_page_array() fails
btrfs: fix 64bit compat send ioctl arguments not initializing version member
btrfs: make error messages more clear when getting a chunk map
btrfs: fix off-by-one when checking chunk map includes logical address
btrfs: ref-verify: fix memory leaks in btrfs_ref_tree_mod()
btrfs: add dmesg output for first mount and last unmount of a filesystem
btrfs: do not abort transaction if there is already an existing qgroup
btrfs: tree-checker: add type and sequence check for inline backrefs
Commit 1b0a151c10 ("blk-core: use pr_warn_ratelimited() in
bio_check_ro()") fix message storm by limit the rate, however, there
will still be lots of message in the long term. Fix it better by warn
once for each partition.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231128123027.971610-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The .bd_inode field of block_device is used in IO fast path of
blkdev_write_iter() and blkdev_llseek(), so it is more efficient to keep
it into the 1st cacheline.
.bd_openers is only touched in open()/close(), and .bd_size_lock is only
for updating bdev capacity, which is in slow path too.
So swap .bd_inode layout with .bd_openers & .bd_size_lock to move
.bd_inode into the 1st cache line.
Cc: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231128123027.971610-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Normally within a syscall it's fine to use fdget/fdput for grabbing a
file from the file table, and it's fine within io_uring as well. We do
that via io_uring_enter(2), io_uring_register(2), and then also for
cancel which is invoked from the latter. io_uring cannot close its own
file descriptors as that is explicitly rejected, and for the cancel
side of things, the file itself is just used as a lookup cookie.
However, it is more prudent to ensure that full references are always
grabbed. For anything threaded, either explicitly in the application
itself or through use of the io-wq worker threads, this is what happens
anyway. Generalize it and use fget/fput throughout.
Also see the below link for more details.
Link: https://lore.kernel.org/io-uring/CAG48ez1htVSO3TqmrF8QcX2WFuYTRM-VZ_N10i-VZgbtg=NNqw@mail.gmail.com/
Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
mmap_lock nests under uring_lock out of necessity, as we may be doing
user copies with uring_lock held. However, for mmap of provided buffer
rings, we attempt to grab uring_lock with mmap_lock already held from
do_mmap(). This makes lockdep, rightfully, complain:
WARNING: possible circular locking dependency detected
6.7.0-rc1-00009-gff3337ebaf94-dirty #4438 Not tainted
------------------------------------------------------
buf-ring.t/442 is trying to acquire lock:
ffff00020e1480a8 (&ctx->uring_lock){+.+.}-{3:3}, at: io_uring_validate_mmap_request.isra.0+0x4c/0x140
but task is already holding lock:
ffff0000dc226190 (&mm->mmap_lock){++++}-{3:3}, at: vm_mmap_pgoff+0x124/0x264
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&mm->mmap_lock){++++}-{3:3}:
__might_fault+0x90/0xbc
io_register_pbuf_ring+0x94/0x488
__arm64_sys_io_uring_register+0x8dc/0x1318
invoke_syscall+0x5c/0x17c
el0_svc_common.constprop.0+0x108/0x130
do_el0_svc+0x2c/0x38
el0_svc+0x4c/0x94
el0t_64_sync_handler+0x118/0x124
el0t_64_sync+0x168/0x16c
-> #0 (&ctx->uring_lock){+.+.}-{3:3}:
__lock_acquire+0x19a0/0x2d14
lock_acquire+0x2e0/0x44c
__mutex_lock+0x118/0x564
mutex_lock_nested+0x20/0x28
io_uring_validate_mmap_request.isra.0+0x4c/0x140
io_uring_mmu_get_unmapped_area+0x3c/0x98
get_unmapped_area+0xa4/0x158
do_mmap+0xec/0x5b4
vm_mmap_pgoff+0x158/0x264
ksys_mmap_pgoff+0x1d4/0x254
__arm64_sys_mmap+0x80/0x9c
invoke_syscall+0x5c/0x17c
el0_svc_common.constprop.0+0x108/0x130
do_el0_svc+0x2c/0x38
el0_svc+0x4c/0x94
el0t_64_sync_handler+0x118/0x124
el0t_64_sync+0x168/0x16c
From that mmap(2) path, we really just need to ensure that the buffer
list doesn't go away from underneath us. For the lower indexed entries,
they never go away until the ring is freed and we can always sanely
reference those as long as the caller has a file reference. For the
higher indexed ones in our xarray, we just need to ensure that the
buffer list remains valid while we return the address of it.
Free the higher indexed io_buffer_list entries via RCU. With that we can
avoid needing ->uring_lock inside mmap(2), and simply hold the RCU read
lock around the buffer list lookup and address check.
To ensure that the arrayed lookup either returns a valid fully formulated
entry via RCU lookup, add an 'is_ready' flag that we access with store
and release memory ordering. This isn't needed for the xarray lookups,
but doesn't hurt either. Since this isn't a fast path, retain it across
both types. Similarly, for the allocated array inside the ctx, ensure
we use the proper load/acquire as setup could in theory be running in
parallel with mmap.
While in there, add a few lockdep checks for documentation purposes.
Cc: stable@vger.kernel.org
Fixes: c56e022c0a ("io_uring: add support for user mapped provided buffer ring")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We used to just use our page list for final teardown, which would ensure
that we got all the buffers, even the ones that were not on the normal
cached list. But while moving to slab for the io_buffers, we know only
prune this list, not the deferred locked list that we have. This can
cause a leak of memory, if the workload ends up using the intermediate
locked list.
Fix this by always pruning both lists when tearing down.
Fixes: b3a4dbc89d ("io_uring/kbuf: Use slab for struct io_buffer objects")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Right now we stash any potentially mmap'ed provided ring buffer range
for freeing at release time, regardless of when they get unregistered.
Since we're keeping track of these ranges anyway, keep track of their
registration state as well, and use that to recycle ranges when
appropriate rather than always allocate new ones.
The lookup is a basic scan of entries, checking for the best matching
free entry.
Fixes: c392cbecd8 ("io_uring/kbuf: defer release of mapped buffer rings")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If a provided buffer ring is setup with IOU_PBUF_RING_MMAP, then the
kernel allocates the memory for it and the application is expected to
mmap(2) this memory. However, io_uring uses remap_pfn_range() for this
operation, so we cannot rely on normal munmap/release on freeing them
for us.
Stash an io_buf_free entry away for each of these, if any, and provide
a helper to free them post ->release().
Cc: stable@vger.kernel.org
Fixes: c56e022c0a ("io_uring: add support for user mapped provided buffer ring")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The power values coming from the Energy Model are already in uW.
The PowerCap and DTPM frameworks operate on uW, so all places should
just use the values from the EM.
Fix the code by removing all of the conversion to uW still present in it.
Fixes: ae6ccaa650 (PM: EM: convert power field to micro-Watts precision and align drivers)
Cc: 5.19+ <stable@vger.kernel.org> # v5.19+
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
[ rjw: Changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
cpufreq_driver->fast_switch() callback expects a frequency as a return
value. amd_pstate_fast_switch() was returning the return value of
amd_pstate_update_freq(), which only indicates a success or failure.
Fix this by making amd_pstate_fast_switch() return the target_freq
when the call to amd_pstate_update_freq() is successful, and return
the current frequency from policy->cur when the call to
amd_pstate_update_freq() is unsuccessful.
Fixes: 4badf2eb1e ("cpufreq: amd-pstate: Add ->fast_switch() callback")
Acked-by: Huang Rui <ray.huang@amd.com>
Reviewed-by: Wyes Karny <wyes.karny@amd.com>
Reviewed-by: Perry Yuan <perry.yuan@amd.com>
Cc: 6.4+ <stable@vger.kernel.org> # v6.4+
Signed-off-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
During floating point and vector save to thread data f0/vs0 are
clobbered by the FPSCR/VSCR store routine. This has been obvserved to
lead to userspace register corruption and application data corruption
with io-uring.
Fix it by restoring f0/vs0 after FPSCR/VSCR store has completed for
all the FP, altivec, VMX register save paths.
Tested under QEMU in kvm mode, running on a Talos II workstation with
dual POWER9 DD2.2 CPUs.
Additional detail (mpe):
Typically save_fpu() is called from __giveup_fpu() which saves the FP
regs and also *turns off FP* in the tasks MSR, meaning the kernel will
reload the FP regs from the thread struct before letting the task use FP
again. So in that case save_fpu() is free to clobber f0 because the FP
regs no longer hold live values for the task.
There is another case though, which is the path via:
sys_clone()
...
copy_process()
dup_task_struct()
arch_dup_task_struct()
flush_all_to_thread()
save_all()
That path saves the FP regs but leaves them live. That's meant as an
optimisation for a process that's using FP/VSX and then calls fork(),
leaving the regs live means the parent process doesn't have to take a
fault after the fork to get its FP regs back. The optimisation was added
in commit 8792468da5 ("powerpc: Add the ability to save FPU without
giving it up").
That path does clobber f0, but f0 is volatile across function calls,
and typically programs reach copy_process() from userspace via a syscall
wrapper function. So in normal usage f0 being clobbered across a
syscall doesn't cause visible data corruption.
But there is now a new path, because io-uring can call copy_process()
via create_io_thread() from the signal handling path. That's OK if the
signal is handled as part of syscall return, but it's not OK if the
signal is handled due to some other interrupt.
That path is:
interrupt_return_srr_user()
interrupt_exit_user_prepare()
interrupt_exit_user_prepare_main()
do_notify_resume()
get_signal()
task_work_run()
create_worker_cb()
create_io_worker()
copy_process()
dup_task_struct()
arch_dup_task_struct()
flush_all_to_thread()
save_all()
if (tsk->thread.regs->msr & MSR_FP)
save_fpu()
# f0 is clobbered and potentially live in userspace
Note the above discussion applies equally to save_altivec().
Fixes: 8792468da5 ("powerpc: Add the ability to save FPU without giving it up")
Cc: stable@vger.kernel.org # v4.6+
Closes: https://lore.kernel.org/all/480932026.45576726.1699374859845.JavaMail.zimbra@raptorengineeringinc.com/
Closes: https://lore.kernel.org/linuxppc-dev/480221078.47953493.1700206777956.JavaMail.zimbra@raptorengineeringinc.com/
Tested-by: Timothy Pearson <tpearson@raptorengineering.com>
Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Timothy Pearson <tpearson@raptorengineering.com>
[mpe: Reword change log to describe exact path of corruption & other minor tweaks]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/1921539696.48534988.1700407082933.JavaMail.zimbra@raptorengineeringinc.com
ndo_stop() is RTNL-protected by net core, and the worker function takes
RTNL as well. Therefore we will deadlock when trying to execute a
pending work synchronously. To fix this execute any pending work
asynchronously. This will do no harm because netif_running() is false
in ndo_stop(), and therefore the work function is effectively a no-op.
However we have to ensure that no task is running or pending after
rtl_remove_one(), therefore add a call to cancel_work_sync().
Fixes: abe5fc42f9 ("r8169: use RTNL to protect critical sections")
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://lore.kernel.org/r/12395867-1d17-4cac-aa7d-c691938fcddf@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The original change results in a deadlock if jumbo mtu mode is used.
Reason is that the phydev lock is held when rtl_reset_work() is called
here, and rtl_jumbo_config() calls phy_start_aneg() which also tries
to acquire the phydev lock. Fix this by calling rtl_reset_work()
asynchronously.
Fixes: 621735f590 ("r8169: fix rare issue with broken rx after link-down on RTL8125")
Reported-by: Ian Chen <free122448@hotmail.com>
Tested-by: Ian Chen <free122448@hotmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://lore.kernel.org/r/caf6a487-ef8c-4570-88f9-f47a659faf33@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When a task needs to accept memory it will scan the accepting_list
to see if any ranges already being processed by other tasks overlap
with its range. Due to an off-by-one in the range comparisons, a task
might falsely determine that an overlapping range is being accepted,
leading to an unnecessary delay before it begins processing the range.
Fix the off-by-one in the range comparison to prevent this and slightly
improve performance.
Fixes: 50e782a86c ("efi/unaccepted: Fix soft lockups caused by parallel memory acceptance")
Link: https://lore.kernel.org/linux-mm/20231101004523.vseyi5bezgfaht5i@amd.com/T/#me2eceb9906fcae5fe958b3fe88e41f920f8335b6
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Return -ENOMEM if xen_irq_init() fails. currently the code returns an
uninitialized variable or zero.
Fixes: 5dd9ad32d7 ("xen/events: drop xen_allocate_irqs_dynamic()")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Juergen Gross <jgross@ssue.com>
Link: https://lore.kernel.org/r/3b9ab040-a92e-4e35-b687-3a95890a9ace@moroto.mountain
Signed-off-by: Juergen Gross <jgross@suse.com>
Today the percpu struct vcpu_info is allocated via DEFINE_PER_CPU(),
meaning that it could cross a page boundary. In this case registering
it with the hypervisor will fail, resulting in a panic().
This can easily be fixed by using DEFINE_PER_CPU_ALIGNED() instead,
as struct vcpu_info is guaranteed to have a size of 64 bytes, matching
the cache line size of x86 64-bit processors (Xen doesn't support
32-bit processors).
Fixes: 5ead97c84f ("xen: Core Xen implementation")
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.con>
Link: https://lore.kernel.org/r/20231124074852.25161-1-jgross@suse.com
Signed-off-by: Juergen Gross <jgross@suse.com>
Previously, one-element and zero-length arrays were treated as true
flexible arrays, even though they are actually "fake" flex arrays.
The __randomize_layout would leave them untouched at the end of the
struct, similarly to proper C99 flex-array members.
However, this approach changed with commit 1ee60356c2 ("gcc-plugins:
randstruct: Only warn about true flexible arrays"). Now, only C99
flexible-array members will remain untouched at the end of the struct,
while one-element and zero-length arrays will be subject to randomization.
Fix a `__randomize_layout` crash in `struct neighbour` by transforming
zero-length array `primary_key` into a proper C99 flexible-array member.
Fixes: 1ee60356c2 ("gcc-plugins: randstruct: Only warn about true flexible arrays")
Closes: https://lore.kernel.org/linux-hardening/20231124102458.GB1503258@e124191.cambridge.arm.com/
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/ZWJoRsJGnCPdJ3+2@work
Signed-off-by: Paolo Abeni <pabeni@redhat.com>