The main service scheme of BFQ for sync I/O is serving one sync
bfq_queue at a time, for a while. In particular, BFQ enforces this
scheme when it deems the latter necessary to boost throughput or
to preserve service guarantees. Unfortunately, when BFQ enforces
this policy, only one actuator at a time gets served for a while,
because each bfq_queue contains I/O only for one actuator. The
other actuators may remain underutilized.
Actually, BFQ may serve (inject) extra I/O, taken from other
bfq_queues, in parallel with that of the in-service queue. This
injection mechanism may provide the ground for dealing also with
the above actuator-underutilization problem. Yet BFQ does not take
the actuator load into account when choosing which queue to pick
extra I/O from. In addition, BFQ may happen to inject extra I/O
only when the in-service queue is temporarily empty.
In view of these facts, this commit extends the
injection mechanism in such a way that the latter:
(1) takes into account also the actuator load;
(2) checks such a load on each dispatch, and injects I/O for an
underutilized actuator, if there is one and there is I/O for it.
To perform the check in (2), this commit introduces a load
threshold, currently set to 4. A linear scan of each actuator is
performed, until an actuator is found for which the following two
conditions hold: the load of the actuator is below the threshold,
and there is at least one non-in-service queue that contains I/O
for that actuator. If such a pair (actuator, queue) is found, then
the head request of that queue is returned for dispatch, instead
of the head request of the in-service queue.
We have set the threshold, empirically, to the minimum possible
value for which an actuator is fully utilized, or close to be
fully utilized. By doing so, injected I/O 'steals' as few
drive-queue slots as possibile to the in-service queue. This
reduces as much as possible the probability that the service of
I/O from the in-service bfq_queue gets delayed because of slot
exhaustion, i.e., because all the slots of the drive queue are
filled with I/O injected from other queues (NCQ provides for 32
slots).
This new mechanism also counters actuator underutilization in the
case of asymmetric configurations of bfq_queues. Namely if there
are few bfq_queues containing I/O for some actuators and many
bfq_queues containing I/O for other actuators. Or if the
bfq_queues containing I/O for some actuators have lower weights
than the other bfq_queues.
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Davide Zini <davidezini2@gmail.com>
Link: https://lore.kernel.org/r/20230103145503.71712-8-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch implements the code to gather the content of the
independent_access_ranges structure from the request_queue and copy
it into the queue's bfq_data. This copy is done at queue initialization.
We copy the access ranges into the bfq_data to avoid taking the queue
lock each time we access the ranges.
This implementation, however, puts a limit to the maximum independent
ranges supported by the scheduler. Such a limit is equal to the constant
BFQ_MAX_ACTUATORS. This limit was placed to avoid the allocation of
dynamic memory.
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Co-developed-by: Rory Chen <rory.c.chen@seagate.com>
Signed-off-by: Rory Chen <rory.c.chen@seagate.com>
Signed-off-by: Federico Gavioli <f.gavioli97@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20230103145503.71712-7-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Similarly to sync bfq_queues, also async bfq_queues need to be split
on a per-actuator basis.
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Davide Zini <davidezini2@gmail.com>
Link: https://lore.kernel.org/r/20230103145503.71712-6-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When a bfq_queue Q is merged with another queue, several pieces of
information are saved about Q. These pieces are stored in the
bfqq_data field in the bfq_io_cq data structure of the process
associated with Q.
Yet, with a multi-actuator drive, a process may get associated with
multiple bfq_queues: one queue for each of the N actuators. Each of
these queues may undergo a merge. So, the bfq_io_cq data structure
must be able to accommodate the above information for N queues.
This commit solves this problem by turning the bfqq_data scalar field
into an array of N elements (and by changing code so as to handle
this array).
This solution is written under the assumption that bfq_queues
associated with different actuators cannot be cross-merged. This
assumption holds naturally with basic queue merging: the latter is
triggered by spatial locality, and sectors for different actuators are
not close to each other (apart from the corner case of the last
sectors served by a given actuator and the first sectors served by the
next actuator). As for stable cross-merging, the assumption here is
that it is disabled.
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Gabriele Felici <felicigb@gmail.com>
Signed-off-by: Gianmarco Lusvardi <glusvardi@posteo.net>
Signed-off-by: Giulio Barabino <giuliobarabino99@gmail.com>
Signed-off-by: Emiliano Maccaferri <inbox@emilianomaccaferri.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20230103145503.71712-5-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With a multi-actuator drive, a process may get associated with multiple
bfq_queues: one queue for each of the N actuators. So, the bfq_io_cq
data structure must be able to accommodate its per-queue persistent
information for N queues. Currently it stores this information for
just one queue, in several scalar fields.
This is a preparatory commit for moving to accommodating persistent
information for N queues. In particular, this commit packs all the
above scalar fields into a single data structure. Then there is now
only one field, in bfq_io_cq, that stores all the above information. This
scalar field will then be turned into an array by a following commit.
Suggested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Gianmarco Lusvardi <glusvardi@posteo.net>
Signed-off-by: Giulio Barabino <giuliobarabino99@gmail.com>
Signed-off-by: Emiliano Maccaferri <inbox@emilianomaccaferri.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20230103145503.71712-4-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If queues associated with different actuators are merged, then control
is lost on each actuator. Therefore some actuator may be
underutilized, and throughput may decrease. This problem cannot occur
with basic queue merging, because the latter is triggered by spatial
locality, and sectors for different actuators are not close to each
other. Yet it may happen with stable merging. To address this issue,
this commit prevents stable merging from occurring among queues
associated with different actuators.
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20230103145503.71712-3-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Single-LUN multi-actuator SCSI drives, as well as all multi-actuator
SATA drives appear as a single device to the I/O subsystem [1]. Yet
they address commands to different actuators internally, as a function
of Logical Block Addressing (LBAs). A given sector is reachable by
only one of the actuators. For example, Seagate’s Serial Advanced
Technology Attachment (SATA) version contains two actuators and maps
the lower half of the SATA LBA space to the lower actuator and the
upper half to the upper actuator.
Evidently, to fully utilize actuators, no actuator must be left idle
or underutilized while there is pending I/O for it. The block layer
must somehow control the load of each actuator individually. This
commit lays the ground for allowing BFQ to provide such a per-actuator
control.
BFQ associates an I/O-request sync bfq_queue with each process doing
synchronous I/O, or with a group of processes, in case of queue
merging. Then BFQ serves one bfq_queue at a time. While in service, a
bfq_queue is emptied in request-position order. Yet the same process,
or group of processes, may generate I/O for different actuators. In
this case, different streams of I/O (each for a different actuator)
get all inserted into the same sync bfq_queue. So there is basically
no individual control on when each stream is served, i.e., on when the
I/O requests of the stream are picked from the bfq_queue and
dispatched to the drive.
This commit enables BFQ to control the service of each actuator
individually for synchronous I/O, by simply splitting each sync
bfq_queue into N queues, one for each actuator. In other words, a sync
bfq_queue is now associated to a pair (process, actuator). As a
consequence of this split, the per-queue proportional-share policy
implemented by BFQ will guarantee that the sync I/O generated for each
actuator, by each process, receives its fair share of service.
This is just a preparatory patch. If the I/O of the same process
happens to be sent to different queues, then each of these queues may
undergo queue merging. To handle this event, the bfq_io_cq data
structure must be properly extended. In addition, stable merging must
be disabled to avoid loss of control on individual actuators. Finally,
also async queues must be split. These issues are described in detail
and addressed in next commits. As for this commit, although multiple
per-process bfq_queues are provided, the I/O of each process or group
of processes is still sent to only one queue, regardless of the
actuator the I/O is for. The forwarding to distinct bfq_queues will be
enabled after addressing the above issues.
[1] https://www.linaro.org/blog/budget-fair-queueing-bfq-linux-io-scheduler-optimizations-for-multi-actuator-sata-hard-drives/
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Gabriele Felici <felicigb@gmail.com>
Signed-off-by: Carmine Zaccagnino <carmine@carminezacc.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20230103145503.71712-2-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmPK8NUQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgptS/EADT+m0n7jjonp7NoENoZT2y4o5ayESuEmBV
X8QUg/Ji1P3VG3QzI+yCqevGa2Rkkd8EenlokpjLliuqPdb/aZ56G7rsebotzWu3
zOV3XNvKvD0thiMIjmXABvmUKdb3lcrM5tpC9Uqq6L52SqbtkSsPUVO+rWE/tTZk
u97dUmyQcaD2brGfn4AcR0wgQoxrcLbmUpa/TKhFIDPDl+4PFi2ePoSQSsdDJT8R
PTvQhud1dl/wJ3733vj8S8s4Sxkbm5xXt50oDaTSmdOWSNOuMNuyW3WqkZ/SPdyK
LDmtOXEfuiokJK/l+DZ9SKt6jONW6ShdEaUo37/8yjYCnZFvWkcfn+6mWaDygjqS
eI3Mwb91w8K9krTZU1tGq3qOtxEJwbtLHCM96nh8SHLjNrYYrkZQZHOcea9CgX8h
iMzI5ylP2t6RofwHwwFoZYGOxrRz/R5LS+pCFIv720QnBjb9ZpO9zoDQaDl5tOS6
UpuL3XPzs9rZZizY00NG6+vQeSdSLRyyjs4XIWYxrZy2wuC2EjM0HstMfefldQcJ
uEfgrVgd/pcUTNzCG8uH8cZbmeflivm18J6OX86l2X9d3m62HD5gULHFOFxbDwsC
zoQOsyaGVRLpO0+/0MKs7aLaZlk40VDb4XdRsM6qbd4+x+J7yicvGrkUxS6cZMwT
VlQm3YUc0g==
=L12Q
-----END PGP SIGNATURE-----
Merge tag 'block-6.2-2023-01-20' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
"Various little tweaks all over the place:
- NVMe pull request via Christoph:
- fix controller shutdown regression in nvme-apple (Janne Grunau)
- fix a polling on timeout regression in nvme-pci (Keith Busch)
- Fix a bug in the read request side request allocation caching
(Pavel)
- pktcdvd was brought back after we configured a NULL return on bio
splits, make it consistent with the others (me)
- BFQ refcount fix (Yu)
- Block cgroup policy activation fix (Yu)
- Fix for an md regression introduced in the 6.2 cycle (Adrian)"
* tag 'block-6.2-2023-01-20' of git://git.kernel.dk/linux:
nvme-pci: fix timeout request state check
nvme-apple: only reset the controller when RTKit is running
nvme-apple: reset controller during shutdown
block: fix hctx checks for batch allocation
block/rnbd-clt: fix wrong max ID in ida_alloc_max
blk-cgroup: fix missing pd_online_fn() while activating policy
pktcdvd: check for NULL returna fter calling bio_split_to_limits()
block, bfq: switch 'bfqg->ref' to use atomic refcount apis
md: fix incorrect declaration about claim_rdev in md_import_device
When there are no read queues read requests will be assigned a
default queue on allocation. However, blk_mq_get_cached_request() is not
prepared for that and will fail all attempts to grab read requests from
the cache. Worst case it doubles the number of requests allocated,
roughly half of which will be returned by blk_mq_free_plug_rqs().
It only affects batched allocations and so is io_uring specific.
For reference, QD8 t/io_uring benchmark improves by 20-35%.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/80d4511011d7d4751b4cf6375c4e38f237d935e3.1673955390.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If the policy defines pd_online_fn(), it should be called after
pd_init_fn(), like blkg_create().
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230103112833.2013432-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The updating of 'bfqg->ref' should be protected by 'bfqd->lock', however,
during code review, we found that bfq_pd_free() update 'bfqg->ref'
without holding the lock, which is problematic:
1) bfq_pd_free() triggered by removing cgroup is called asynchronously;
2) bfqq will grab bfqg reference, and exit bfqq will drop the reference,
which can concurrent with 1).
Unfortunately, 'bfqd->lock' can't be held here because 'bfqd' might already
be freed in bfq_pd_free(). Fix the problem by using atomic refcount apis.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230103084755.1256479-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmPBsFAQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpqgnEAC0OqxnMsOPNbkLO7k6FsSrG7ZoENkOIMCt
Grk3D1cPkM13I0xc+WiaOBezMriPzfdXvt5AGDn9fd53Ih47qpSY4eU6pCqoCk5y
HWdn8KXZvhJGZsSy0Nz+cfPDW/8diJON8YBpJwWM/DfDdP8XibtjlIMTVTtJab6h
aGWjmy3leNfghOJ0cZ1wjL6maWFoowQASs52PZfajSc0mQ5X0i8BgQb1WOHNu89C
vEir9PYlTmdMnYlAKLsyEL3KoGUPm++zSLtJeyWYavlCMGK5WTyNkzmeXqsQhAGf
b1LjovQASe//1t2wvCzQviRf4cae0pE9JhiaYt2oxoDdHrfQj/WPndVS4yE9c+0O
BnLVTCFHNv86TRXNCbEUzI+Ftj6m9qt4MrHz8YpstX7FxGxYC+T5RqTwYClWZQ0j
llBuJUHj+kkAv6kBMJCHTyat6pxIDgcb52QMJr5mFWuEaTloraBIJC70hMtxBQV/
j5mrBYqCngCHVs+hAl9UQ4zqQVSvkeT11QFvwFolxIfs7qtfLqeGzYxvaeomqO3V
sA+H5NY50OEuPfFFmCpcNUJXeUKg7wP39iNHdz6P5cCDBCfUwbNbgKKKNmBovaC+
KhPd8Xo1MmzDuF+cylvTcjOBDte4425GN7PBj4vP1xbuHYcjg6AEFLawgqE9Y4XX
xyNlgJXPOg==
=ujiw
-----END PGP SIGNATURE-----
Merge tag 'block-6.2-2023-01-13' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
"Nothing major in here, just a collection of NVMe fixes and dropping a
wrong might_sleep() that static checkers tripped over but which isn't
valid"
* tag 'block-6.2-2023-01-13' of git://git.kernel.dk/linux:
MAINTAINERS: stop nvme matching for nvmem files
nvme: don't allow unprivileged passthrough on partitions
nvme: replace the "bool vec" arguments with flags in the ioctl path
nvme: remove __nvme_ioctl
nvme-pci: fix error handling in nvme_pci_enable()
nvme-pci: add NVME_QUIRK_IDENTIFY_CNS quirk to Apple T2 controllers
nvme-apple: add NVME_QUIRK_IDENTIFY_CNS quirk to fix regression
block: Drop spurious might_sleep() from blk_put_queue()
Dan reports the following smatch detected the following:
block/blk-cgroup.c:1863 blkcg_schedule_throttle() warn: sleeping in atomic context
caused by blkcg_schedule_throttle() calling blk_put_queue() in an
non-sleepable context.
blk_put_queue() acquired might_sleep() in 63f93fd6fa ("block: mark
blk_put_queue as potentially blocking") which transferred the might_sleep()
from blk_free_queue().
blk_free_queue() acquired might_sleep() in e8c7d14ac6 ("block: revert back
to synchronous request_queue removal") while turning request_queue removal
synchronous. However, this isn't necessary as nothing in the free path
actually requires sleeping.
It's pretty unusual to require a sleeping context in a put operation and
it's not needed in the first place. Let's drop it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Dan Carpenter <error27@gmail.com>
Link: https://lkml.kernel.org/r/Y7g3L6fntnTtOm63@kili
Cc: Christoph Hellwig <hch@lst.de>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Fixes: e8c7d14ac6 ("block: revert back to synchronous request_queue removal") # v5.9+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/Y7iFwjN+XzWvLv3y@slm.duckdns.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmO4SiAQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpgc9D/0XJufUgHsLeFCF5G+q6iL5Bz+d7ymw+VFv
xrNjOz8wUKYKXcJqxLrPdkmL1tcd1+fESNGgyBidn4P53BWoHB9dtbs8+Lova08t
I4lQmZHgxgbAMhSOwGvHlOTkdlBIw/fBgQ6XdI+1qmpxzma5+gjImjyp7oH+pODP
zqsg3DKRQmDApKWtvB6D5iItsWc1Jx5TEuOfU5/JjLuVZWl6O2qynNVUccF5T89O
jkt624yO+r70CVfX3NAdFTm/mOEUiGH97l4l/8OkekJ40pf73xzvNRF/S8z8nHb/
QUGY1tKvr08xfPusl3epmQ5aO938F0aFpKi2x6P+z3G6Uq+dqMMrjJl8XMDG+J+d
+yBow5yRH7o6oBb0YPPz/6S5zBjslsHtuKFd/rs4mCDfjp9GHiIIiIpdLxZEWawJ
WaYlc5WlzSdopT/IxfaRZ9HMHzscdKadjiFngSKdpEdCUw7wxdIey+/9xbKR+xh0
Es13MzyCCurj4OnyDl5cnetGJUNNiL1JvQmIaFVndyxnMfvOaZBBmKW7h9RYBIU/
nqi4vZwYoafnGUIfLFL6uq9F627lF/EhodDuLheqz0G2pWhmFJITOJUAakGNFf83
22CiKY2GyTrOy5tKqkNzv7BG/KyJZGP+CxyyQ/7xm0k2C9wEjYSpZHKcjaNZygU5
eswPKbZMkw==
=LJ5Q
-----END PGP SIGNATURE-----
Merge tag 'block-2023-01-06' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
"The big change here is obviously the revert of the pktcdvd driver
removal. Outside of that, just minor tweaks. In detail:
- Re-instate the pktcdvd driver, which necessitates adding back
bio_copy_data_iter() and the fops->devnode() hook for now (me)
- Fix for splitting of a bio marked as NOWAIT, causing either nowait
reads or writes to error with EAGAIN even if parts of the IO
completed (me)
- Fix for ublk, punting management commands to io-wq as they can all
easily block for extended periods of time (Ming)
- Removal of SRCU dependency for the block layer (Paul)"
* tag 'block-2023-01-06' of git://git.kernel.dk/linux:
block: Remove "select SRCU"
Revert "pktcdvd: remove driver."
Revert "block: remove devnode callback from struct block_device_operations"
Revert "block: bio_copy_data_iter"
ublk: honor IO_URING_F_NONBLOCK for handling control command
block: don't allow splitting of a REQ_NOWAIT bio
block: handle bio_split_to_limits() NULL return
Now that the SRCU Kconfig option is unconditionally selected, there is
no longer any point in selecting it. Therefore, remove the "select SRCU"
Kconfig statements.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we split a bio marked with REQ_NOWAIT, then we can trigger spurious
EAGAIN if constituent parts of that split bio end up failing request
allocations. Parts will complete just fine, but just a single failure
in one of the chained bios will yield an EAGAIN final result for the
parent bio.
Return EAGAIN early if we end up needing to split such a bio, which
allows for saner recovery handling.
Cc: stable@vger.kernel.org # 5.15+
Link: https://github.com/axboe/liburing/issues/766
Reported-by: Michael Kelley <mikelley@microsoft.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This can't happen right now, but in preparation for allowing
bio_split_to_limits() returning NULL if it ended the bio, check for it
in all the callers.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmOt35IQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpq4QD/9nGWlCdLRLyyiysUWhLwTmsZt7PSebG3KD
CmCyEt+o8n2PdxBe7Xq8glppvuQTwJwOYynMXcAWd0IBxYUnAkCDF4PTbmdiIiVY
fzci1UydIXw/HOVft/2IbIC0+Apo+UJ9WVqqhwm7ya0lAkLQYuT7iWmn1pxFdbcI
hi9ZbaghxtZXSQP4ZtKG+a8tQ99HTsf76xqCM6DdMCVOUH6/V1f5g67iSkYLCL3Q
V9bAq7U2VEXFdRC9m5yPG7KGUBRllE4etBvVAIIcAQBAgEktyvgvas5luwu5j+W0
R2z8KXp2X4BWGW+R45hpt2cdyfcJy24+6QnAGNQAs/3Muq1IfEMwmJ5tyR/y8HiS
0RvIv/BOwDMDOaM9YuW0beyHQMu+bwhtf+C453r1gsKmnL912+ElMzuqUpditkjr
d4nL5aUTk5iM38jzJpQylZSY+20wyUnOmxCxETpeSMaRrYY75PLOVCJLNncJuZtQ
GFtqUzMPVURLMGnxyJZLiG+qbGVXh9f7B7OStKDPhBJvqoZ2cQpwTzywmYxQOv+0
OO1DdmMDtUWNpuBN2U4HOzLElmB034OM3Fcia529IhLoXK/x57n9mXW0D0HeOd84
/EYSsmsT+spv7psKBNjhXkZwgVpVgsYOu8eUjRKYUmrYLEbTk+fGUtia3rBd4wjl
uNMuRhRtUA==
=cqhz
-----END PGP SIGNATURE-----
Merge tag 'block-6.2-2022-12-29' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
"Mostly just NVMe, but also a single fixup for BFQ for a regression
that happened during the merge window. In detail:
- NVMe pull requests via Christoph:
- Fix doorbell buffer value endianness (Klaus Jensen)
- Fix Linux vs NVMe page size mismatch (Keith Busch)
- Fix a potential use memory access beyong the allocation limit
(Keith Busch)
- Fix a multipath vs blktrace NULL pointer dereference (Yanjun
Zhang)
- Fix various problems in handling the Command Supported and
Effects log (Christoph Hellwig)
- Don't allow unprivileged passthrough of commands that don't
transfer data but modify logical block content (Christoph
Hellwig)
- Add a features and quirks policy document (Christoph Hellwig)
- Fix some really nasty code that was correct but made smatch
complain (Sagi Grimberg)
- Use-after-free regression in BFQ from this merge window (Yu)"
* tag 'block-6.2-2022-12-29' of git://git.kernel.dk/linux:
nvme-auth: fix smatch warning complaints
nvme: consult the CSE log page for unprivileged passthrough
nvme: also return I/O command effects from nvme_command_effects
nvmet: don't defer passthrough commands with trivial effects to the workqueue
nvmet: set the LBCC bit for commands that modify data
nvmet: use NVME_CMD_EFFECTS_CSUPP instead of open coding it
nvme: fix the NVME_CMD_EFFECTS_CSE_MASK definition
docs, nvme: add a feature and quirk policy document
nvme-pci: update sqsize when adjusting the queue depth
nvme: fix setting the queue depth in nvme_alloc_io_tag_set
block, bfq: fix uaf for bfqq in bfq_exit_icq_bfqq
nvme: fix multipath crash caused by flush request when blktrace is enabled
nvme-pci: fix page size checks
nvme-pci: fix mempool alloc size
nvme-pci: fix doorbell buffer value endianness
Commit 64dc8c732f ("block, bfq: fix possible uaf for 'bfqq->bic'")
will access 'bic->bfqq' in bic_set_bfqq(), however, bfq_exit_icq_bfqq()
can free bfqq first, and then call bic_set_bfqq(), which will cause uaf.
Fix the problem by moving bfq_exit_bfqq() behind bic_set_bfqq().
Fixes: 64dc8c732f ("block, bfq: fix possible uaf for 'bfqq->bic'")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20221226030605.1437081-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Due to several bugs caused by timers being re-armed after they are
shutdown and just before they are freed, a new state of timers was added
called "shutdown". After a timer is set to this state, then it can no
longer be re-armed.
The following script was run to find all the trivial locations where
del_timer() or del_timer_sync() is called in the same function that the
object holding the timer is freed. It also ignores any locations where
the timer->function is modified between the del_timer*() and the free(),
as that is not considered a "trivial" case.
This was created by using a coccinelle script and the following
commands:
$ cat timer.cocci
@@
expression ptr, slab;
identifier timer, rfield;
@@
(
- del_timer(&ptr->timer);
+ timer_shutdown(&ptr->timer);
|
- del_timer_sync(&ptr->timer);
+ timer_shutdown_sync(&ptr->timer);
)
... when strict
when != ptr->timer
(
kfree_rcu(ptr, rfield);
|
kmem_cache_free(slab, ptr);
|
kfree(ptr);
)
$ spatch timer.cocci . > /tmp/t.patch
$ patch -p1 < /tmp/t.patch
Link: https://lore.kernel.org/lkml/20221123201306.823305113@linutronix.de/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Pavel Machek <pavel@ucw.cz> [ LED ]
Acked-by: Kalle Valo <kvalo@kernel.org> [ wireless ]
Acked-by: Paolo Abeni <pabeni@redhat.com> [ networking ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmOgp5AQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpm5SD/9tduSZQW00aDm83HbEikWdCgQm0w37tyYl
C2+IwRwLF8pnAoSb6yaO7LZM9ZUYfoIfIlkHXkKhT1xNJ/XdeGDgwjOHi106iaEx
kG08DcFnUjyJ4Yh6hnnpnSepIo0ckwa18pSaE4smvmKZirj3it3O6xSspyBxtUcv
q6PvJDMN15aG6uLHq3xNZPzoI2KYXBDgwanyImRhdvLoOTiS9rok+F9e2ob3lzAa
PB+FOipQoKb7M6jbyfZe4KbeTiJh4EYEl5Qa6ebrDIkOTm7zjc8sQbCkNeI7osh+
D0FvEQ1Vsrjj5Bp6N9CmZcrmNagjEcAPbzguxAilrgw2/XvA8d0fymziGXvuyUEv
bSAx6lyJzfMLrvtubSqMhIF+8DlccQnnXz2ccacwvAfayytzNJjC9serU+czHA4O
ZkPTwZFjAmbn6q6SK3qaOCB9IgITHipj8R/ncGu9KjNvM2QgzM+OIrP0xGxtk6uI
ZGrt9nGMUmgjtaliQjiDVZomMewru1lRWPRAjfQ995gmVkejgapUHYoaDtDzaLKZ
Q9BaK5CC2jltGUuuoFEnXnwu/Eyvp9y++pKkz4Esb+/Wkst4qyGtr9DOSTnv1wKN
W20h3Z5vOAXXquvUJ5S3mQl8TNJHiBz+/CRB9PZG8XFtn8ubGo8XttGdgjQgyLM3
6FHzcZgeWw==
=TSec
-----END PGP SIGNATURE-----
Merge tag 'block-6.2-2022-12-19' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- Various fixes for BFQ (Yu, Yuwei)
- Fix for loop command line parsing (Isaac)
- No need to specifically clear REQ_ALLOC_CACHE on IOPOLL downgrade
anymore (me)
- blk-iocost enum fix for newer gcc (Jiri)
- UAF fix for queue release (Ming)
- blk-iolatency error handling memory leak fix (Tejun)
* tag 'block-6.2-2022-12-19' of git://git.kernel.dk/linux:
block: don't clear REQ_ALLOC_CACHE for non-polled requests
block: fix use-after-free of q->q_usage_counter
block, bfq: only do counting of pending-request for BFQ_GROUP_IOSCHED
blk-iolatency: Fix memory leak on add_disk() failures
loop: Fix the max_loop commandline argument treatment when it is set to 0
block/blk-iocost (gcc13): keep large values in a new enum
block, bfq: replace 0/1 with false/true in bic apis
block, bfq: don't return bfqg from __bfq_bic_change_cgroup()
block, bfq: fix possible uaf for 'bfqq->bic'
Here is the set of driver core and kernfs changes for 6.2-rc1.
The "big" change in here is the addition of a new macro,
container_of_const() that will preserve the "const-ness" of a pointer
passed into it.
The "problem" of the current container_of() macro is that if you pass in
a "const *", out of it can comes a non-const pointer unless you
specifically ask for it. For many usages, we want to preserve the
"const" attribute by using the same call. For a specific example, this
series changes the kobj_to_dev() macro to use it, allowing it to be used
no matter what the const value is. This prevents every subsystem from
having to declare 2 different individual macros (i.e.
kobj_const_to_dev() and kobj_to_dev()) and having the compiler enforce
the const value at build time, which having 2 macros would not do
either.
The driver for all of this have been discussions with the Rust kernel
developers as to how to properly mark driver core, and kobject, objects
as being "non-mutable". The changes to the kobject and driver core in
this pull request are the result of that, as there are lots of paths
where kobjects and device pointers are not modified at all, so marking
them as "const" allows the compiler to enforce this.
So, a nice side affect of the Rust development effort has been already
to clean up the driver core code to be more obvious about object rules.
All of this has been bike-shedded in quite a lot of detail on lkml with
different names and implementations resulting in the tiny version we
have in here, much better than my original proposal. Lots of subsystem
maintainers have acked the changes as well.
Other than this change, included in here are smaller stuff like:
- kernfs fixes and updates to handle lock contention better
- vmlinux.lds.h fixes and updates
- sysfs and debugfs documentation updates
- device property updates
All of these have been in the linux-next tree for quite a while with no
problems, OTHER than some merge issues with other trees that should be
obvious when you hit them (block tree deletes a driver that this tree
modifies, iommufd tree modifies code that this tree also touches). If
there are merge problems with these trees, please let me know.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCY5wz3A8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+yks0ACeKYUlVgCsER8eYW+x18szFa2QTXgAn2h/VhZe
1Fp53boFaQkGBjl8mGF8
=v+FB
-----END PGP SIGNATURE-----
Merge tag 'driver-core-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the set of driver core and kernfs changes for 6.2-rc1.
The "big" change in here is the addition of a new macro,
container_of_const() that will preserve the "const-ness" of a pointer
passed into it.
The "problem" of the current container_of() macro is that if you pass
in a "const *", out of it can comes a non-const pointer unless you
specifically ask for it. For many usages, we want to preserve the
"const" attribute by using the same call. For a specific example, this
series changes the kobj_to_dev() macro to use it, allowing it to be
used no matter what the const value is. This prevents every subsystem
from having to declare 2 different individual macros (i.e.
kobj_const_to_dev() and kobj_to_dev()) and having the compiler enforce
the const value at build time, which having 2 macros would not do
either.
The driver for all of this have been discussions with the Rust kernel
developers as to how to properly mark driver core, and kobject,
objects as being "non-mutable". The changes to the kobject and driver
core in this pull request are the result of that, as there are lots of
paths where kobjects and device pointers are not modified at all, so
marking them as "const" allows the compiler to enforce this.
So, a nice side affect of the Rust development effort has been already
to clean up the driver core code to be more obvious about object
rules.
All of this has been bike-shedded in quite a lot of detail on lkml
with different names and implementations resulting in the tiny version
we have in here, much better than my original proposal. Lots of
subsystem maintainers have acked the changes as well.
Other than this change, included in here are smaller stuff like:
- kernfs fixes and updates to handle lock contention better
- vmlinux.lds.h fixes and updates
- sysfs and debugfs documentation updates
- device property updates
All of these have been in the linux-next tree for quite a while with
no problems"
* tag 'driver-core-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (58 commits)
device property: Fix documentation for fwnode_get_next_parent()
firmware_loader: fix up to_fw_sysfs() to preserve const
usb.h: take advantage of container_of_const()
device.h: move kobj_to_dev() to use container_of_const()
container_of: add container_of_const() that preserves const-ness of the pointer
driver core: fix up missed drivers/s390/char/hmcdrv_dev.c class.devnode() conversion.
driver core: fix up missed scsi/cxlflash class.devnode() conversion.
driver core: fix up some missing class.devnode() conversions.
driver core: make struct class.devnode() take a const *
driver core: make struct class.dev_uevent() take a const *
cacheinfo: Remove of_node_put() for fw_token
device property: Add a blank line in Kconfig of tests
device property: Rename goto label to be more precise
device property: Move PROPERTY_ENTRY_BOOL() a bit down
device property: Get rid of __PROPERTY_ENTRY_ARRAY_EL*SIZE*()
kernfs: fix all kernel-doc warnings and multiple typos
driver core: pass a const * into of_device_uevent()
kobject: kset_uevent_ops: make name() callback take a const *
kobject: kset_uevent_ops: make filter() callback take a const *
kobject: make kobject_namespace take a const *
...
For blk-mq, queue release handler is usually called after
blk_mq_freeze_queue_wait() returns. However, the
q_usage_counter->release() handler may not be run yet at that time, so
this can cause a use-after-free.
Fix the issue by moving percpu_ref_exit() into blk_free_queue_rcu().
Since ->release() is called with rcu read lock held, it is agreed that
the race should be covered in caller per discussion from the two links.
Reported-by: Zhang Wensheng <zhangwensheng@huaweicloud.com>
Reported-by: Zhong Jinghua <zhongjinghua@huawei.com>
Link: https://lore.kernel.org/linux-block/Y5prfOjyyjQKUrtH@T590/T/#u
Link: https://lore.kernel.org/lkml/Y4%2FmzMd4evRg9yDi@fedora/
Cc: Hillf Danton <hdanton@sina.com>
Cc: Yu Kuai <yukuai3@huawei.com>
Cc: Dennis Zhou <dennis@kernel.org>
Fixes: 2b0d3d3e4f ("percpu_ref: reduce memory footprint of percpu_ref in fast path")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20221215021629.74870-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The 'bfqd->num_groups_with_pending_reqs' is used when
CONFIG_BFQ_GROUP_IOSCHED is enabled, so let the variables and processes
take effect when CONFIG_BFQ_GROUP_IOSCHED is enabled.
Cc: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Yuwei Guan <Yuwei.Guan@zeekrlife.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20221110112622.389332-1-Yuwei.Guan@zeekrlife.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When a gendisk is successfully initialized but add_disk() fails such as when
a loop device has invalid number of minor device numbers specified,
blkcg_init_disk() is called during init and then blkcg_exit_disk() during
error handling. Unfortunately, iolatency gets initialized in the former but
doesn't get cleaned up in the latter.
This is because, in non-error cases, the cleanup is performed by
del_gendisk() calling rq_qos_exit(), the assumption being that rq_qos
policies, iolatency being one of them, can only be activated once the disk
is fully registered and visible. That assumption is true for wbt and iocost,
but not so for iolatency as it gets initialized before add_disk() is called.
It is desirable to lazy-init rq_qos policies because they are optional
features and add to hot path overhead once initialized - each IO has to walk
all the registered rq_qos policies. So, we want to switch iolatency to lazy
init too. However, that's a bigger change. As a fix for the immediate
problem, let's just add an extra call to rq_qos_exit() in blkcg_exit_disk().
This is safe because duplicate calls to rq_qos_exit() become noop's.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: darklight2357@icloud.com
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Fixes: d706751215 ("block: introduce blk-iolatency io controller")
Cc: stable@vger.kernel.org # v4.19+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/Y5TQ5gm3O4HXrXR3@slm.duckdns.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since gcc13, each member of an enum has the same type as the enum [1]. And
that is inherited from its members. Provided:
VTIME_PER_SEC_SHIFT = 37,
VTIME_PER_SEC = 1LLU << VTIME_PER_SEC_SHIFT,
...
AUTOP_CYCLE_NSEC = 10LLU * NSEC_PER_SEC,
the named type is unsigned long.
This generates warnings with gcc-13:
block/blk-iocost.c: In function 'ioc_weight_prfill':
block/blk-iocost.c:3037:37: error: format '%u' expects argument of type 'unsigned int', but argument 4 has type 'long unsigned int'
block/blk-iocost.c: In function 'ioc_weight_show':
block/blk-iocost.c:3047:34: error: format '%u' expects argument of type 'unsigned int', but argument 3 has type 'long unsigned int'
So split the anonymous enum with large values to a separate enum, so
that they don't affect other members.
[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=36113
Cc: Martin Liska <mliska@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: cgroups@vger.kernel.org
Cc: linux-block@vger.kernel.org
Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
Link: https://lore.kernel.org/r/20221213120826.17446-1-jirislaby@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Our test report a uaf for 'bfqq->bic' in 5.10:
==================================================================
BUG: KASAN: use-after-free in bfq_select_queue+0x378/0xa30
CPU: 6 PID: 2318352 Comm: fsstress Kdump: loaded Not tainted 5.10.0-60.18.0.50.h602.kasan.eulerosv2r11.x86_64 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-20220320_160524-szxrtosci10000 04/01/2014
Call Trace:
bfq_select_queue+0x378/0xa30
bfq_dispatch_request+0xe8/0x130
blk_mq_do_dispatch_sched+0x62/0xb0
__blk_mq_sched_dispatch_requests+0x215/0x2a0
blk_mq_sched_dispatch_requests+0x8f/0xd0
__blk_mq_run_hw_queue+0x98/0x180
__blk_mq_delay_run_hw_queue+0x22b/0x240
blk_mq_run_hw_queue+0xe3/0x190
blk_mq_sched_insert_requests+0x107/0x200
blk_mq_flush_plug_list+0x26e/0x3c0
blk_finish_plug+0x63/0x90
__iomap_dio_rw+0x7b5/0x910
iomap_dio_rw+0x36/0x80
ext4_dio_read_iter+0x146/0x190 [ext4]
ext4_file_read_iter+0x1e2/0x230 [ext4]
new_sync_read+0x29f/0x400
vfs_read+0x24e/0x2d0
ksys_read+0xd5/0x1b0
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x61/0xc6
Commit 3bc5e683c6 ("bfq: Split shared queues on move between cgroups")
changes that move process to a new cgroup will allocate a new bfqq to
use, however, the old bfqq and new bfqq can point to the same bic:
1) Initial state, two process with io in the same cgroup.
Process 1 Process 2
(BIC1) (BIC2)
| Λ | Λ
| | | |
V | V |
bfqq1 bfqq2
2) bfqq1 is merged to bfqq2.
Process 1 Process 2
(BIC1) (BIC2)
| |
\-------------\|
V
bfqq1 bfqq2(coop)
3) Process 1 exit, then issue new io(denoce IOA) from Process 2.
(BIC2)
| Λ
| |
V |
bfqq2(coop)
4) Before IOA is completed, move Process 2 to another cgroup and issue io.
Process 2
(BIC2)
Λ
|\--------------\
| V
bfqq2 bfqq3
Now that BIC2 points to bfqq3, while bfqq2 and bfqq3 both point to BIC2.
If all the requests are completed, and Process 2 exit, BIC2 will be
freed while there is no guarantee that bfqq2 will be freed before BIC2.
Fix the problem by clearing bfqq->bic while bfqq is detached from bic.
Fixes: 3bc5e683c6 ("bfq: Split shared queues on move between cgroups")
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20221214030430.3304151-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmOScsgQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpi5ID/9pLXFYOq1+uDjU0KO/MdjMjK8Ukr34lCnk
WkajRLheE8JBKOFDE54XJk56sQSZHX9bTWqziar0h1fioh7FlQR/tVvzsERCm2M9
2y9THJNJygC68wgybStyiKlshFjl7TD7Kv5N9Y3xP3mkQygT+D6o8fXZk5xQbYyH
YdFSoq4rJVHxRL03yzQiReGGIYdOUEQQh8l1FiLwLlKa3lXAey1KuxWIzksVN0KK
aZB4QhiBpOiPgDHUVisq2XtyQjpZ2byoCImPzgrcqk9Jo4esvm/e6esrg4xlsvII
LKFFkTmbVqjUZtFjqakFHmfuzVor4nU5f+xb90ZHExuuODYckkxWp5rWhf9QwqqI
0ik6WYgI1/5vnHnX8f2DYzOFQf9qa/rLgg0CshyUODlD6RfHa9vntqYvlIFkmOBd
Q7KblIoK8YTzUS1M+v7X8JQ7gDR2KwygH37Da2KJS+vgvfIb8kJGr1ZORuhJuJJ7
Bl69gaNkHTHrqufp7UI64YXfueeuNu2J9z3zwzGoxeaFaofF/phDn0/2gCQE1fQI
XBhsMw+ETqI6B2SPHMnzYDu2DM1S8ZTOYQlaD4G3uqgWnAM1tG707395uAy5yu4n
D5azU1fVG4UocoNIyPujpaoSRs2zWZycEFEeUQkhyDDww/j4hlHi6H33eOnk0zsr
wxzFGfvHfw==
=k/vv
-----END PGP SIGNATURE-----
Merge tag 'for-6.2/block-2022-12-08' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
- NVMe pull requests via Christoph:
- Support some passthrough commands without CAP_SYS_ADMIN (Kanchan
Joshi)
- Refactor PCIe probing and reset (Christoph Hellwig)
- Various fabrics authentication fixes and improvements (Sagi
Grimberg)
- Avoid fallback to sequential scan due to transient issues (Uday
Shankar)
- Implement support for the DEAC bit in Write Zeroes (Christoph
Hellwig)
- Allow overriding the IEEE OUI and firmware revision in configfs
for nvmet (Aleksandr Miloserdov)
- Force reconnect when number of queue changes in nvmet (Daniel
Wagner)
- Minor fixes and improvements (Uros Bizjak, Joel Granados, Sagi
Grimberg, Christoph Hellwig, Christophe JAILLET)
- Fix and cleanup nvme-fc req allocation (Chaitanya Kulkarni)
- Use the common tagset helpers in nvme-pci driver (Christoph
Hellwig)
- Cleanup the nvme-pci removal path (Christoph Hellwig)
- Use kstrtobool() instead of strtobool (Christophe JAILLET)
- Allow unprivileged passthrough of Identify Controller (Joel
Granados)
- Support io stats on the mpath device (Sagi Grimberg)
- Minor nvmet cleanup (Sagi Grimberg)
- MD pull requests via Song:
- Code cleanups (Christoph)
- Various fixes
- Floppy pull request from Denis:
- Fix a memory leak in the init error path (Yuan)
- Series fixing some batch wakeup issues with sbitmap (Gabriel)
- Removal of the pktcdvd driver that was deprecated more than 5 years
ago, and subsequent removal of the devnode callback in struct
block_device_operations as no users are now left (Greg)
- Fix for partition read on an exclusively opened bdev (Jan)
- Series of elevator API cleanups (Jinlong, Christoph)
- Series of fixes and cleanups for blk-iocost (Kemeng)
- Series of fixes and cleanups for blk-throttle (Kemeng)
- Series adding concurrent support for sync queues in BFQ (Yu)
- Series bringing drbd a bit closer to the out-of-tree maintained
version (Christian, Joel, Lars, Philipp)
- Misc drbd fixes (Wang)
- blk-wbt fixes and tweaks for enable/disable (Yu)
- Fixes for mq-deadline for zoned devices (Damien)
- Add support for read-only and offline zones for null_blk
(Shin'ichiro)
- Series fixing the delayed holder tracking, as used by DM (Yu,
Christoph)
- Series enabling bio alloc caching for IRQ based IO (Pavel)
- Series enabling userspace peer-to-peer DMA (Logan)
- BFQ waker fixes (Khazhismel)
- Series fixing elevator refcount issues (Christoph, Jinlong)
- Series cleaning up references around queue destruction (Christoph)
- Series doing quiesce by tagset, enabling cleanups in drivers
(Christoph, Chao)
- Series untangling the queue kobject and queue references (Christoph)
- Misc fixes and cleanups (Bart, David, Dawei, Jinlong, Kemeng, Ye,
Yang, Waiman, Shin'ichiro, Randy, Pankaj, Christoph)
* tag 'for-6.2/block-2022-12-08' of git://git.kernel.dk/linux: (247 commits)
blktrace: Fix output non-blktrace event when blk_classic option enabled
block: sed-opal: Don't include <linux/kernel.h>
sed-opal: allow using IOC_OPAL_SAVE for locking too
blk-cgroup: Fix typo in comment
block: remove bio_set_op_attrs
nvmet: don't open-code NVME_NS_ATTR_RO enumeration
nvme-pci: use the tagset alloc/free helpers
nvme: add the Apple shared tag workaround to nvme_alloc_io_tag_set
nvme: only set reserved_tags in nvme_alloc_io_tag_set for fabrics controllers
nvme: consolidate setting the tagset flags
nvme: pass nr_maps explicitly to nvme_alloc_io_tag_set
block: bio_copy_data_iter
nvme-pci: split out a nvme_pci_ctrl_is_dead helper
nvme-pci: return early on ctrl state mismatch in nvme_reset_work
nvme-pci: rename nvme_disable_io_queues
nvme-pci: cleanup nvme_suspend_queue
nvme-pci: remove nvme_pci_disable
nvme-pci: remove nvme_disable_admin_queue
nvme: merge nvme_shutdown_ctrl into nvme_disable_ctrl
nvme: use nvme_wait_ready in nvme_shutdown_ctrl
...
This release adds SM4 encryption support, contributed by Tianjia Zhang.
SM4 is a Chinese block cipher that is an alternative to AES.
I recommend against using SM4, but (according to Tianjia) some people
are being required to use it. Since SM4 has been turning up in many
other places (crypto API, wireless, TLS, OpenSSL, ARMv8 CPUs, etc.), it
hasn't been very controversial, and some people have to use it, I don't
think it would be fair for me to reject this optional feature.
Besides the above, there are a couple cleanups.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCY5auyBQcZWJpZ2dlcnNA
Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK1u4AP4lhLxaEJ9upkHZrPAvEdF7QjLhO/ju
h1LrvWHcEbvr6AEA/8ptc5RA1BAoSTDcqIWxIAWRztvptP4gUETb1b9C/ws=
=An5w
-----END PGP SIGNATURE-----
Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt
Pull fscrypt updates from Eric Biggers:
"This release adds SM4 encryption support, contributed by Tianjia
Zhang. SM4 is a Chinese block cipher that is an alternative to AES.
I recommend against using SM4, but (according to Tianjia) some people
are being required to use it. Since SM4 has been turning up in many
other places (crypto API, wireless, TLS, OpenSSL, ARMv8 CPUs, etc.),
it hasn't been very controversial, and some people have to use it, I
don't think it would be fair for me to reject this optional feature.
Besides the above, there are a couple cleanups"
* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
fscrypt: add additional documentation for SM4 support
fscrypt: remove unused Speck definitions
fscrypt: Add SM4 XTS/CTS symmetric algorithm support
blk-crypto: Add support for SM4-XTS blk crypto mode
fscrypt: add comment for fscrypt_valid_enc_modes_v1()
fscrypt: pass super_block to fscrypt_put_master_key_activeref()
Usually when closing a crypto device (eg: dm-crypt with LUKS) the
volume key is not required, as it requires root privileges anyway, and
root can deny access to a disk in many ways regardless. Requiring the
volume key to lock the device is a peculiarity of the OPAL
specification.
Given we might already have saved the key if the user requested it via
the 'IOC_OPAL_SAVE' ioctl, we can use that key to lock the device if no
key was provided here and the locking range matches, and the user sets
the appropriate flag with 'IOC_OPAL_SAVE'. This allows integrating OPAL
with tools and libraries that are used to the common behaviour and do
not ask for the volume key when closing a device.
Callers can always pass a non-zero key and it will be used regardless,
as before.
Suggested-by: Štěpán Horáček <stepan.horacek@gmail.com>
Signed-off-by: Luca Boccassi <bluca@debian.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20221206092913.4625-1-luca.boccassi@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With the pktcdvdv removal, bio_copy_data_iter is unused now. Fold the
logic into bio_copy_data and remove the separate lower level function.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221206144407.722049-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is no need to update tg->slice_start[rw] to start when they are
equal already. So remove "eq" part of check before update slice_start.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Kemeng Shi <shikemeng@huawei.com>
Link: https://lore.kernel.org/r/20221205115709.251489-10-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is no need to check elapsed time from last upgrade for each node in
hierarchy. Move this check before traversing as throtl_can_upgrade do
to remove repeat check.
Reported-by: kernel test robot <lkp@intel.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Kemeng Shi <shikemeng@huawei.com>
Link: https://lore.kernel.org/r/20221205115709.251489-9-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Function tg_last_low_overflow_time is called with intermediate node as
following:
throtl_hierarchy_can_downgrade
throtl_tg_can_downgrade
tg_last_low_overflow_time
throtl_hierarchy_can_upgrade
throtl_tg_can_upgrade
tg_last_low_overflow_time
throtl_hierarchy_can_downgrade/throtl_hierarchy_can_upgrade will traverse
from leaf node to sub-root node and pass traversed intermediate node
to tg_last_low_overflow_time.
No such limit could be found from context and implementation of
tg_last_low_overflow_time, so remove this limit in comment.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Kemeng Shi <shikemeng@huawei.com>
Link: https://lore.kernel.org/r/20221205115709.251489-8-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit c79892c557 ("blk-throttle: add upgrade logic for LIMIT_LOW
state") added upgrade logic for low limit and methioned that
1. "To determine if a cgroup exceeds its limitation, we check if the cgroup
has pending request. Since cgroup is throttled according to the limit,
pending request means the cgroup reaches the limit."
2. "If a cgroup has limit set for both read and write, we consider the
combination of them for upgrade. The reason is read IO and write IO can
interfere with each other. If we do the upgrade based in one direction IO,
the other direction IO could be severly harmed."
Besides, we also determine that cgroup reaches low limit if low limit is 0,
see comment in throtl_tg_can_upgrade.
Collect the information above, the desgin of upgrade check is as following:
1.The low limit is reached if limit is zero or io is already queued.
2.Cgroup will pass upgrade check if low limits of READ and WRITE are both
reached.
Simpfy the check code described above to removce repeat check and improve
readability. There is no functional change.
Detail equivalence proof is as following:
All replaced conditions to return true are as following:
condition 1
(!read_limit && !write_limit)
condition 2
read_limit && sq->nr_queued[READ] &&
(!write_limit || sq->nr_queued[WRITE])
condition 3
write_limit && sq->nr_queued[WRITE] &&
(!read_limit || sq->nr_queued[READ])
Transferring condition 2 as following:
(read_limit && sq->nr_queued[READ]) &&
(!write_limit || sq->nr_queued[WRITE])
is equivalent to
(read_limit && sq->nr_queued[READ]) &&
(!write_limit || (write_limit && sq->nr_queued[WRITE]))
is equivalent to
condition 2.1
(read_limit && sq->nr_queued[READ] &&
!write_limit) ||
condition 2.2
(read_limit && sq->nr_queued[READ] &&
(write_limit && sq->nr_queued[WRITE]))
Transferring condition 3 as following:
write_limit && sq->nr_queued[WRITE] &&
(!read_limit || sq->nr_queued[READ])
is equivalent to
(write_limit && sq->nr_queued[WRITE]) &&
(!read_limit || (read_limit && sq->nr_queued[READ]))
is equivalent to
condition 3.1
((write_limit && sq->nr_queued[WRITE]) &&
!read_limit) ||
condition 3.2
((write_limit && sq->nr_queued[WRITE]) &&
(read_limit && sq->nr_queued[READ]))
Condition 3.2 is the same as condition 2.2, so all conditions we get to
return are as following:
(!read_limit && !write_limit) (1)
(!read_limit && (write_limit && sq->nr_queued[WRITE])) (3.1)
((read_limit && sq->nr_queued[READ]) && !write_limit) (2.1)
((write_limit && sq->nr_queued[WRITE]) &&
(read_limit && sq->nr_queued[READ])) (2.2)
As we can extract conditions "(a1 || a2) && (b1 || b2)" to:
a1 && b1
a1 && b2
a2 && b1
ab && b2
Considering that:
a1 = !read_limit
a2 = read_limit && sq->nr_queued[READ]
b1 = !write_limit
b2 = write_limit && sq->nr_queued[WRITE]
We can pack replaced conditions to
(!read_limit || (read_limit && sq->nr_queued[READ])) &&
(!write_limit || (write_limit && sq->nr_queued[WRITE]))
which is equivalent to
(!read_limit || sq->nr_queued[READ]) &&
(!write_limit || sq->nr_queued[WRITE])
Reported-by: kernel test robot <lkp@intel.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Kemeng Shi <shikemeng@huawei.com>
Link: https://lore.kernel.org/r/20221205115709.251489-6-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In C language, When executing "if (expression1 && expression2)" and
expression1 return false, the expression2 may not be executed.
For "tg_within_bps_limit(tg, bio, bps_limit, &bps_wait) &&
tg_within_iops_limit(tg, bio, iops_limit, &iops_wait))", if bps is
limited, tg_within_bps_limit will return false and
tg_within_iops_limit will not be called. So even bps and iops are
both limited, iops_wait will not be calculated and is always zero.
So wait time of iops is always ignored.
Fix this by always calling tg_within_bps_limit and tg_within_iops_limit
to get wait time for both bps and iops.
Observed that:
1. Wait time in tg_within_iops_limit/tg_within_bps_limit need always
be stored as wait argument is always passed.
2. wait time is stored to zero if iops/bps is limited otherwise non-zero
is stored.
Simpfy tg_within_iops_limit/tg_within_bps_limit by removing wait argument
and return wait time directly. Caller tg_may_dispatch checks if wait time
is zero to find if iops/bps is limited.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Kemeng Shi <shikemeng@huawei.com>
Link: https://lore.kernel.org/r/20221205115709.251489-5-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ignore cgroup without io queued in blk_throtl_cancel_bios for two
reasons:
1. Save cpu cycle for trying to dispatch cgroup which is no io queued.
2. Avoid non-consistent state that cgroup is inserted to service queue
without THROTL_TG_PENDING set as tg_update_disptime will unconditional
re-insert cgroup to service queue. If we are on the default hierarchy,
IO dispatched from child in tg_dispatch_one_bio will trigger inserting
cgroup to service queue without erase first and ruin the tree.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Kemeng Shi <shikemeng@huawei.com>
Link: https://lore.kernel.org/r/20221205115709.251489-4-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Consider situation as following (on the default hierarchy):
HDD
|
root (bps limit: 4k)
|
child (bps limit :8k)
|
fio bs=8k
Rate of fio is supposed to be 4k, but result is 8k. Reason is as
following:
Size of single IO from fio is larger than bytes allowed in one
throtl_slice in child, so IOs are always queued in child group first.
When queued IOs in child are dispatched to parent group, BIO_BPS_THROTTLED
is set and these IOs will not be limited by tg_within_bps_limit anymore.
Fix this by only set BIO_BPS_THROTTLED when the bio traversed the entire
tree.
There patch has no influence on situation which is not on the default
hierarchy as each group is a single root group without parent.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Kemeng Shi <shikemeng@huawei.com>
Link: https://lore.kernel.org/r/20221205115709.251489-3-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
On the default hierarchy (cgroup2), the throttle interface files don't
exist in the root cgroup, so the ablity to limit the whole system
by configuring root group is not existing anymore. In general, cgroup
doesn't wanna be in the business of restricting resources at the
system level, so correct the stale comment that we can limit whole
system to we can only limit subtree.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Kemeng Shi <shikemeng@huawei.com>
Link: https://lore.kernel.org/r/20221205115709.251489-2-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With the removal of the pktcdvd driver, there are no in-kernel users of
the devnode callback in struct block_device_operations, so it can be
safely removed. If it is needed for new block drivers in the future, it
can be brought back.
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20221203140747.1942969-1-gregkh@linuxfoundation.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Make the description of @gendisk to @disk in blkcg_schedule_throttle()
to clear the below warnings:
block/blk-cgroup.c:1850: warning: Function parameter or member 'disk' not described in 'blkcg_schedule_throttle'
block/blk-cgroup.c:1850: warning: Excess function parameter 'gendisk' description in 'blkcg_schedule_throttle'
Fixes: de185b56e8 ("blk-cgroup: pass a gendisk to blkcg_schedule_throttle")
Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=3338
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Link: https://lore.kernel.org/r/20221202011713.14834-1-yang.lee@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
SM4 is a symmetric cipher algorithm widely used in China. The SM4-XTS
variant is used to encrypt length-preserving data. This is the
mandatory algorithm in some special scenarios.
Add support for the algorithm to block inline encryption. This is needed
for the inlinecrypt mount option to be supported via
blk-crypto-fallback, as it is for the other fscrypt modes.
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20221201125819.36932-2-tianjia.zhang@linux.alibaba.com
Use only one hyphen in kernel-doc notation between the function name
and its short description.
The is the documented kerenl-doc format. It also fixes the HTML
presentation to be consistent with other functions.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Link: https://lore.kernel.org/r/20221201070331.25685-1-rdunlap@infradead.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is no iocg_pd_init function. The pd_alloc_fn function pointer of
iocost policy is set with ioc_pd_init. Just correct it.
Signed-off-by: Kemeng Shi <shikemeng@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20221018121932.10792-6-shikemeng@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>