When the process is migrated to a different cgroup (or in case of
writeback just starts submitting bios associated with a different
cgroup) bfq_merge_bio() can operate with stale cgroup information in
bic. Thus the bio can be merged to a request from a different cgroup or
it can result in merging of bfqqs for different cgroups or bfqqs of
already dead cgroups and causing possible use-after-free issues. Fix the
problem by updating cgroup information in bfq_merge_bio().
CC: stable@vger.kernel.org
Fixes: e21b7a0b98 ("block, bfq: add full hierarchical scheduling and cgroups support")
Tested-by: "yukuai (C)" <yukuai3@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220401102752.8599-4-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When bfqq is shared by multiple processes it can happen that one of the
processes gets moved to a different cgroup (or just starts submitting IO
for different cgroup). In case that happens we need to split the merged
bfqq as otherwise we will have IO for multiple cgroups in one bfqq and
we will just account IO time to wrong entities etc.
Similarly if the bfqq is scheduled to merge with another bfqq but the
merge didn't happen yet, cancel the merge as it need not be valid
anymore.
CC: stable@vger.kernel.org
Fixes: e21b7a0b98 ("block, bfq: add full hierarchical scheduling and cgroups support")
Tested-by: "yukuai (C)" <yukuai3@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220401102752.8599-3-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It can happen that the parent of a bfqq changes between the moment we
decide two queues are worth to merge (and set bic->stable_merge_bfqq)
and the moment bfq_setup_merge() is called. This can happen e.g. because
the process submitted IO for a different cgroup and thus bfqq got
reparented. It can even happen that the bfqq we are merging with has
parent cgroup that is already offline and going to be destroyed in which
case the merge can lead to use-after-free issues such as:
BUG: KASAN: use-after-free in __bfq_deactivate_entity+0x9cb/0xa50
Read of size 8 at addr ffff88800693c0c0 by task runc:[2:INIT]/10544
CPU: 0 PID: 10544 Comm: runc:[2:INIT] Tainted: G E 5.15.2-0.g5fb85fd-default #1 openSUSE Tumbleweed (unreleased) f1f3b891c72369aebecd2e43e4641a6358867c70
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014
Call Trace:
<IRQ>
dump_stack_lvl+0x46/0x5a
print_address_description.constprop.0+0x1f/0x140
? __bfq_deactivate_entity+0x9cb/0xa50
kasan_report.cold+0x7f/0x11b
? __bfq_deactivate_entity+0x9cb/0xa50
__bfq_deactivate_entity+0x9cb/0xa50
? update_curr+0x32f/0x5d0
bfq_deactivate_entity+0xa0/0x1d0
bfq_del_bfqq_busy+0x28a/0x420
? resched_curr+0x116/0x1d0
? bfq_requeue_bfqq+0x70/0x70
? check_preempt_wakeup+0x52b/0xbc0
__bfq_bfqq_expire+0x1a2/0x270
bfq_bfqq_expire+0xd16/0x2160
? try_to_wake_up+0x4ee/0x1260
? bfq_end_wr_async_queues+0xe0/0xe0
? _raw_write_unlock_bh+0x60/0x60
? _raw_spin_lock_irq+0x81/0xe0
bfq_idle_slice_timer+0x109/0x280
? bfq_dispatch_request+0x4870/0x4870
__hrtimer_run_queues+0x37d/0x700
? enqueue_hrtimer+0x1b0/0x1b0
? kvm_clock_get_cycles+0xd/0x10
? ktime_get_update_offsets_now+0x6f/0x280
hrtimer_interrupt+0x2c8/0x740
Fix the problem by checking that the parent of the two bfqqs we are
merging in bfq_setup_merge() is the same.
Link: https://lore.kernel.org/linux-block/20211125172809.GC19572@quack2.suse.cz/
CC: stable@vger.kernel.org
Fixes: 430a67f9d6 ("block, bfq: merge bursts of newly-created queues")
Tested-by: "yukuai (C)" <yukuai3@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220401102752.8599-2-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bfq_setup_cooperator() can mark bic as stably merged even though it
decides to not merge its bfqqs (when bfq_setup_merge() returns NULL).
Make sure to mark bic as stably merged only if we are really going to
merge bfqqs.
CC: stable@vger.kernel.org
Tested-by: "yukuai (C)" <yukuai3@huawei.com>
Fixes: 430a67f9d6 ("block, bfq: merge bursts of newly-created queues")
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220401102752.8599-1-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove the magic autofree semantics and require the callers to explicitly
call bio_init to initialize the bio.
This allows bio_free to catch accidental bio_put calls on bio_init()ed
bios as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Coly Li <colyli@suse.de>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Link: https://lore.kernel.org/r/20220406061228.410163-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove pscsi_get_bio and simplify the code flow in the only caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220406061228.410163-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If a plain kmalloc that is not backed by a mempool is safe here for a
large read (and the actual page allocations), it must also be for a
small one, so simplify the code a bit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Phillip Lougher <phillip@squashfs.org.uk>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220406061228.410163-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use and embedded bios that is initialized when used instead of
bio_kmalloc plus bio_reset.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220406061228.410163-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Refine per-cpu bio alloc cache interfaces so that DM and other block
drivers can properly create and use the cache:
DM uses bioset_init_from_src() to do its final bioset initialization,
so must update bioset_init_from_src() to set BIOSET_PERCPU_CACHE if
%src bioset has a cache.
Also move bio_clear_polled() to include/linux/bio.h to allow users
outside of block core.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220324203526.62306-3-snitzer@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace the BIO_PERCPU_CACHE bio-internal flag with a REQ_ALLOC_CACHE
one that can be passed to bio_alloc / bio_alloc_bioset, and implement
the percpu cache allocation logic in a helper called from
bio_alloc_bioset. This allows any bio_alloc_bioset user to use the
percpu caches instead of having the functionality tied to struct kiocb.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
[hch: refactored a bit]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220324203526.62306-2-snitzer@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCYlvMdwAKCRCAXGG7T9hj
vueqAQCeJiUt0Adhs7ACqzvTsBc1TGXD44J6AAfwedMgtdgtvAD+LvWXLhTcgiCb
DT03AIpI1Z/40QgPYuJ3o4yAZN7eUg4=
=dV8F
-----END PGP SIGNATURE-----
Merge tag 'for-linus-5.18-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixlet from Juergen Gross:
"A single cleanup patch for the Xen balloon driver"
* tag 'for-linus-5.18-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/balloon: don't use PV mode extra memory for zone device allocations
- Use either MSR_TSX_FORCE_ABORT or MSR_IA32_TSX_CTRL to disable TSX to
cover all CPUs which allow to disable it.
- Disable TSX development mode at boot so that a microcode update which
provides TSX development mode does not suddenly make the system
vulnerable to TSX Asynchronous Abort.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmJb5LYTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoVVbD/9cxZWkFctCiymedUZqLabkfpYSki65
MngdpCPzCNaaIdlp44lwCido5+gJsY9unXdm3OAUzLjv6SsxxpDr5njz1/C6TM1l
XmWjlkLEbG2QDPd1Ybd/lpYQORBmiukyo8v8x0yFT7ZzwvSddoDZAbeUtkQBrIin
sDTeExsewKzL2X5qXhttrHLHu1PYgurn4ThIrrG+eg2e4FNk6UUFUS3TOyMvzJDg
NWJ7N5pGy9YkR7CISq1q+qdnH55pGaUrgonDi2qBTt3EaH0fQtZP2ZtIOYr3O4nI
YCx6isrIiGUB6kSygofxmk4B+22CaUJXd2OcUxMZ/Th/a2aCK+35BtGVPXQGi6nU
d7m+ZWB7dShOiejFygS59ty+5L5kliKXYZfUASsq1CLoXH8K1xUwBMkbY5FQ2WH1
Ue4KUvjguNqsgSRAfeHdOi6B36oot0Xf9JO013Wm3V/r9hsGPtSOjWwFuVvT/euw
a9iFtruATxDssBxH/l0djCKnwwm5yuOt1OpyizcIMFnlCgRD06h/6zgAvsJK7c8d
dh6lC4D2mXP1e2wtEyZelve1tmRJ/FeReyG2V5FNU7m1mWYGm1rJZ4AEvnbrzcbC
ePwFva0lPu8GVKG6HRgHfR8PjuQ7TFmKPKytT7fboIqQpTIY+1Q75wYD4eXkSu8Q
/ltzXQz/8lz7bA==
=UQaW
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2022-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"Two x86 fixes related to TSX:
- Use either MSR_TSX_FORCE_ABORT or MSR_IA32_TSX_CTRL to disable TSX
to cover all CPUs which allow to disable it.
- Disable TSX development mode at boot so that a microcode update
which provides TSX development mode does not suddenly make the
system vulnerable to TSX Asynchronous Abort"
* tag 'x86-urgent-2022-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/tsx: Disable TSX development mode at boot
x86/tsx: Use MSR_TSX_CTRL to clear CPUID bits
- Fix the warning condition in __run_timers() which does not take into
account, that a CPU base (especially the deferrable base) has never a
timer armed on it and therefore the next_expiry value can become stale.
- Replace a WARN_ON() in the NOHZ code with a WARN_ON_ONCE() to prevent
endless spam in dmesg.
- Remove the double star from a comment which is not meant to be in
kernel-doc format.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmJb44gTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoXptD/4tKI9W4q0CAZnXw/G9cJdUK2u2VQGb
QfB0DMABuDLMlHUSO92T49IcfjQUR7hz36kHXY0OxwALaEB8ftap6yZLfI1864mZ
RFatWcTl71ZYY0SSio0rMnbDawTeN9N69WLPR+A6fDH0ahBqTwpeJoa1zj5w99LF
kFoEtI5vEBtf+m94PljBTclxOKTXmRRCzEZRpJ6L4Ze9SHR7aMKtBgp84M7a3IAw
IzXq0Kw0MRIm4P26uOpsVzi2KOq6+F1O1zy4L8cHVeU+Y6HC1qWLPXsyhipb0liE
6STFs1L7ANX+dpMW8kmorXc3RcN4xCuCY4Sb17W6Hqiq3syQ+Ho2G0CTheL0yOpM
GVqf7SVes3EGAJwWyKaFmhpgWGKcS93p18XGUrf5V+rVIU+ggQcc2vM9E8FiM03d
TA6T9+1u0TvtPlmd/KQ3OPLMEtu6/3R0fOLEXnv+Cb/PYIYMvPpb9cu6LjJkNqGv
6R+3dWORI7mYLZD+vS3Y5mrmB5OECGXTPNJ+aBr46PfdvSVLxvmOhFlNR7eQKHZ8
wFwOC6n4wkD5Y1VOVGEn0GVGWVkZXdlQzZOQGqBhJPjsfPfOaGacyLqCcsZH4V6w
kitKI8z7A7evLozZMQqKZskwLG1BTQ13YeElb5jJMpG3aTG7lSmPk+iGgLFMDWkc
Z2k4YD0Uw+92Tw==
=SffK
-----END PGP SIGNATURE-----
Merge tag 'timers-urgent-2022-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
"A small set of fixes for the timers core:
- Fix the warning condition in __run_timers() which does not take
into account that a CPU base (especially the deferrable base) never
has a timer armed on it and therefore the next_expiry value can
become stale.
- Replace a WARN_ON() in the NOHZ code with a WARN_ON_ONCE() to
prevent endless spam in dmesg.
- Remove the double star from a comment which is not meant to be in
kernel-doc format"
* tag 'timers-urgent-2022-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tick/sched: Fix non-kernel-doc comment
tick/nohz: Use WARN_ON_ONCE() to prevent console saturation
timers: Fix warning condition in __run_timers()
- Make the warning condition in flush_smp_call_function_queue() correct,
which checks a just emptied list head for being empty instead of
validating that there was no pending entry on the offlined CPU at all.
- The @cpu member of struct cpuhp_cpu_state is initialized when the CPU
hotplug thread for the upcoming CPU is created. That's too late because
the creation of the thread can fail and then the following rollback
operates on CPU0. Get rid of the CPU member and hand the CPU number to
the involved functions directly.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmJb4skTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoUYaD/9SWzwEqwKICAhWWSMtRg6cvtVTAKGL
xWXNb5ZxLA8MdG9CYu7WaUF7mU3E8o7kF/dItvkdj0/9mg/IqcVnP1CEOfp0ioHc
1cbAZShORfFlOowO1WR9Xe3MSZB2mXPHsjh3FWPeGTDxnSByMDXROMFyotxji+7q
g1ZsyVbq3+hSVTOhnhpxrh8MS3qcCAsgYtHKRnPjOE/tqbNXSmbhmeuN7QBKLVp6
AD6DaWj8jGtZ8yzbpm0Ve3ZsjH+1SdZAzm4yhh3FHMKsSOwB8yuNwq1oNbbEjxWu
mTg+LCBBMGYybSTa9sUpDG5Xfc3/dyWnzkGnmp7M8bfElrI3x4d+sMdVhfFRjEXB
xlXmNqRoExgZwifyoEJbmSbG2SgLiWqyLqgpUvLS/yHud6IurFbu36TrByYRMMsG
CyaiLMdWIlBibo+xGkVgnLyc2io98KeoLJoc35HM7VYyn9oNI4hUU9OJDIkG5D2i
lyS+qMTFInFSOm7L5u/SUTYe4I0sXYJ6uEXxhR/Gi3ITXb7aLOJyxhNtq//u1dmg
6vZHgs8HZWylG3LxVn1QERWotVlPuNPRilsUGCVMsCHjOFnoPy+eM3ANtR2x/uaE
XpQ58jnhkhPXZfmnsjDEILkWB67J0/9FujbcPvixH8SRvUgBB7xRRbuo9zXPchhw
VDKy1BCIcZ93gA==
=XNLP
-----END PGP SIGNATURE-----
Merge tag 'smp-urgent-2022-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull SMP fixes from Thomas Gleixner:
"Two fixes for the SMP core:
- Make the warning condition in flush_smp_call_function_queue()
correct, which checked a just emptied list head for being empty
instead of validating that there was no pending entry on the
offlined CPU at all.
- The @cpu member of struct cpuhp_cpu_state is initialized when the
CPU hotplug thread for the upcoming CPU is created. That's too late
because the creation of the thread can fail and then the following
rollback operates on CPU0. Get rid of the CPU member and hand the
CPU number to the involved functions directly"
* tag 'smp-urgent-2022-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu/hotplug: Remove the 'cpu' member of cpuhp_cpu_state
smp: Fix offline cpu check in flush_smp_call_function_queue()
account that there can be an imbalance between present and possible CPUs,
which causes already assigned bits to be overwritten.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmJb4cETHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoWrDD/wOd9X4zQz6cZci2Xl091IhMlS2/sBm
afY98mTK56rX/BEgybMhzXZtBOSdDL2UDnBYznuYiPWof09+mqJGp4cLJAtVvK7E
cDoLdpqmFuQzO8Ka1lUgdgwR+d32pFmaBASbl/0COdl8qgNskQp+jD6BaztNxYbZ
HjkN2chMXqRlCOkKCthNajPEB3thmW2H0gDiZT/VEGYkQfp+HdoH7QhpIXt8HkWd
lb1g1bpDBUwsGrVJ2PlI+XvwmYdW/WOyebIq2D1XBhSS1RXVTt5xfXh/ZfhULzgn
2bnhvXU7XwDZRJNte1r8EFHSufXZUwhZIa9DRKHoy1fxNaxufRy0VJLEqcRHbcBO
xGIvXFfjmn+vmO6hwC1/DeP7Dz6WiTYoxTwJVpvgURqcEdr1HvnSy1CdMEOsAMUa
ZQDJzudV47a2qj80MRO1TSovW2gtDPdSWEUp5oOx2nYUZbqME146ubglZEcXLzq7
Y7BeMaO1VzXkfyk6ImtrBYY+qrXaT1ZQvCOGPrOltnlU0cBLkQMSUpjTstIcaOAp
iIvPLjVR1fDdgEBRck+ZAsM9snefz160wEbx23PzGuh2BenLpQfKZHIYhDO8wo2l
4uUmnp7VY0FS5I38f+rWNvqhFfJV1oda0dQfprP+s3Usz5oSegl764as5hjGcu+y
0Z/mCvaPODGctg==
=wSdp
-----END PGP SIGNATURE-----
Merge tag 'irq-urgent-2022-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fix from Thomas Gleixner:
"A single fix for the interrupt affinity spreading logic to take into
account that there can be an imbalance between present and possible
CPUs, which causes already assigned bits to be overwritten"
* tag 'irq-urgent-2022-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq/affinity: Consider that CPUs on nodes can be unbalanced
Regression fix for the 5.18 cycle:
* Fix a regression with battery data failing to load from DT
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE72YNB0Y/i3JqeVQT2O7X88g7+poFAmJWoKMACgkQ2O7X88g7
+ppmlBAAjfWLNNjsx6oMp71CCVm0KyubJZSYO9dGgCCutQhASUMAmfAnAg6V3rr1
fABm0UYfT3uMwkWw3vth8o6GMAYAiu4nZ1s/dLzObDzZVz4HHl/rEBYE5iEUZ47B
CXPYL/kRXoJZ0jAn7aNHvA6Bb/qEKRPCcrooae73927fT35Pk1dYqJ9rxpWRDR8b
f46Am26xuSlC6mzaJUVQwVvg3pwjJT5lCdBIbQ6FSqvVLDpSg/vdJtXlJHtya1ec
l6j6I9+JB4YtBLOtmaGxLLt9cpow6SRI8yTzpm5qF5wIIuFpzbx5c/sdTwVeMJ+Y
OM5LHwOyusYzQN90nse6xMzUpTNZ7Rq1wYQXH4QtvD2Xm9LrMN4arAw7s25PScQE
DBF8qVnxVA7EseHN8H4NNrgWGEcg+eByEa9m20uUN0H5mkKa5Qrn/Yk8/qOcZ/ge
oCPkvOnX6wALbmtcoaSgS7GQLcIW2GwN6HA3scQNV/8oDica+gX0hMk2ZThwMBGi
ygPjcc2qUnTD3QliXi8Rh85iTfnVHnntsMjkQRdCrxcCY1jV0zIAMp1f7I22ToQG
qD/CUvsmZc0x/dx8O8U3LgZRW3lyXG/VCgM0Axbh41HKpEVi3ayMmTkiPR0+dMpe
CRh75/1iEdwt0p0m/LIkCPC1VwK1loPmTlyS/ycAabP3W/SDHjI=
=S4f6
-----END PGP SIGNATURE-----
Merge tag 'for-v5.18-rc' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply
Pull power supply fixes from Sebastian Reichel:
- Fix a regression with battery data failing to load from DT
* tag 'for-v5.18-rc' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply:
power: supply: Reset err after not finding static battery
power: supply: samsung-sdi-battery: Add missing charge restart voltages
Pull i2c fixes from Wolfram Sang:
"Regular set of fixes for drivers and the dev-interface"
* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: ismt: Fix undefined behavior due to shift overflowing the constant
i2c: dev: Force case user pointers in compat_i2cdev_ioctl()
i2c: dev: check return value when calling dev_set_name()
i2c: qcom-geni: Use dev_err_probe() for GPI DMA error
i2c: imx: Implement errata ERR007805 or e7805 bus frequency limit
i2c: pasemi: Wait for write xfers to finish
- fix the set/get_multiple() callbacks in gpio-sim
- use correct format characters in gpiolib-acpi
- use an unsigned type for pins in gpiolib-acpi
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEFp3rbAvDxGAT0sefEacuoBRx13IFAmJbI2UACgkQEacuoBRx
13Iu1hAAmPpHwe64eECqWdCR0b6L/XZg4gihDaXa4EObL91gBNQL/SR+CnhCGmjm
8PJwoKSUIHyUGI9qQ83Na4jtDrcNV+9H2EgdgA7PHMDX6P5kS6NuaxzvJ7UPi+xB
tBVFgqGwx25JAE71/22bmH3CDWTdPDXm9FSTPItQiOXweBBzJKNGQegJYzo3Z1M7
3jLwkvU0LUI7YvjZHRG8vSp2MV3PGWUsDnaBOQ+ajhBQLnEgTudQSsh9Z2e/zZVR
MJ26FG5pkzbJ+fY4ern9S84PR/IDcPPMcjO63VHO8AY0gR78aNNBuOj7TTOdsbk6
Rs+z9LPFsNsJBNxqD6iWXMKK35ox6jThqja31w3v9C41uf1sfMFidaUqrBX+k6uN
d8oX2VcvsXVtnh9KqKsAijpQQxkZbrfkfh/O4xy/IOZv/r6WHUUtzae/DelfmNhu
eM+dVfaPywxQ+HSiqWjklgz6vKzAxNEJJbklFpbwfyJPeWucpD5HGhyWftI4QzsP
1kx0HxcQ20sE5Euiw8yqcR0ZaBc+Sdu7TU8kt2NbMxpBDf58Rs6bd19MsD/p9+Ge
JF09kNTh3iWTbtpcK7yww1TccbuzvKUVImsUWWnjLIk8e3Ar0XP3LGWCs0WgqYEf
z39CQJ8iVN2/xOtsm0WLUDnce6FATqc2O75sLB3qkf1TaMYQj/s=
=Yt3p
-----END PGP SIGNATURE-----
Merge tag 'gpio-fixes-for-v5.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
Pull gpio fixes from Bartosz Golaszewski:
"A single fix for gpio-sim and two patches for GPIO ACPI pulled from
Andy:
- fix the set/get_multiple() callbacks in gpio-sim
- use correct format characters in gpiolib-acpi
- use an unsigned type for pins in gpiolib-acpi"
* tag 'gpio-fixes-for-v5.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
gpio: sim: fix setting and getting multiple lines
gpiolib: acpi: Convert type for pin to be unsigned
gpiolib: acpi: use correct format characters
There are a number of SoC bugfixes that came in since the merge window,
and more of them are already pending. This batch includes
- A boot time regression fix for davinci that triggered on
multi_v5_defconfig when booting any platform
- Defconfig updates to address removed features, changed symbol
names or dependencies, for gemini, ux500, and pxa
- Email address changes for Krzysztof Kozlowski
- Build warning fixes for ep93xx and iop32x
- Devicetree warning fixes across many platforms
- Minor bugfixes for the reset controller, memory controller
and SCMI firmware subsystems plus the versatile-express board
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEo6/YBQwIrVS28WGKmmx57+YAGNkFAmJbNdgACgkQmmx57+YA
GNlqag/+MyNA0d4VWqxv/5KScfM1TB/oF+G55BwkoDQRGAsfon8ocZHx7dnGk+k8
lVOYrgx1FOwBLpYmJ34SVKNznNV1x7cJB6XwwK8vDj1SievjScz8E5fx1rdO5Ayu
YQFlrLjOqSXucObQgbviHACc5uv7RB1bKYKESN/idklbY9TgNS5TIEHZxeldDkxY
bSSu52RSdvklf5XjYAMLph0hEmhY9N090C3ftBP5WTaHVDuniquS2ubSRxyomVia
WQsRFi7haXZrXFw7B20dz/nrq89yibBxHqiOAvvC09Ce2woo5sSvwxeRstls4IVt
bXwQNg7EsezZvZ+MSnNlHk6kPLG51ECm1dB3cCk++N23NLbd34GYzbK/TwbRBzyw
jeBrsLD5lzENBNBG5mfAlpDMq7HoPLRshEV+5FIGcQZtDKHZnA3c2ARHNFfAikma
3ozasK6BzRsnSQIUwWaoli9w3pj79/DOvdEoSdCVTk+RQ5Fm1aWoZXtiPin/yvsa
MOMkJOwdo42+kAi79PRVfR2JRPCC/P1JcmKykvn7Tb3AphkZBdRGjll6ZYdzt2hR
tynfPiBxXT+r61lgPM5Fs3NBZSZ2IPDePlYs5W2fHCIhof9XQrziPmHmM+OiXj2a
JwXLX6ymLFgtFRgK2ChtRgzxjHCyrk7pRGneWHxQlM7yqeliepg=
=Y3N8
-----END PGP SIGNATURE-----
Merge tag 'soc-fixes-5.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM SoC fixes from Arnd Bergmann:
"There are a number of SoC bugfixes that came in since the merge
window, and more of them are already pending.
This batch includes:
- A boot time regression fix for davinci that triggered on
multi_v5_defconfig when booting any platform
- Defconfig updates to address removed features, changed symbol names
or dependencies, for gemini, ux500, and pxa
- Email address changes for Krzysztof Kozlowski
- Build warning fixes for ep93xx and iop32x
- Devicetree warning fixes across many platforms
- Minor bugfixes for the reset controller, memory controller and SCMI
firmware subsystems plus the versatile-express board"
* tag 'soc-fixes-5.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (34 commits)
ARM: config: Update Gemini defconfig
arm64: dts: qcom/sdm845-shift-axolotl: Fix boolean properties with values
ARM: dts: align SPI NOR node name with dtschema
ARM: dts: Fix more boolean properties with values
arm/arm64: dts: qcom: Fix boolean properties with values
arm64: dts: imx: Fix imx8*-var-som touchscreen property sizes
arm: dts: imx: Fix boolean properties with values
arm64: dts: tegra: Fix boolean properties with values
arm: dts: at91: Fix boolean properties with values
arm: configs: imote2: Drop defconfig as board support dropped.
ep93xx: clock: Don't use plain integer as NULL pointer
ep93xx: clock: Fix UAF in ep93xx_clk_register_gate()
ARM: vexpress/spc: Fix all the kernel-doc build warnings
ARM: vexpress/spc: Fix kernel-doc build warning for ve_spc_cpu_in_wfi
ARM: config: u8500: Re-enable AB8500 battery charging
ARM: config: u8500: Add some common hardware
memory: fsl_ifc: populate child nodes of buses and mfd devices
ARM: config: Refresh U8500 defconfig
firmware: arm_scmi: Fix sparse warnings in OPTEE transport driver
firmware: arm_scmi: Replace zero-length array with flexible-array member
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEq5lC5tSkz8NBJiCnSfxwEqXeA64FAmJavk0ACgkQSfxwEqXe
A640KA/9FHngPjK8AOuFJJRfdNYu5CWcn4PD/xmYGhP1QUuKOfIF6rUGrl7w4jCh
+mKICOt2Rt2ZqSscS1ccpddiwAcoURrra6pW8Oo5IiuHlPBWR/jow9FzeLZpP4II
5mcWSZX4/MJFs9o6T+25FE9tPjsVcEi1hgYEU0ecXtXYK+mIUbPobF1pWlI+C61F
8sYQ4GrpJHLWreou7SYfzbI3siaXmewie8aYgrqvU8Bt7U2UVTA0j8VxeX7+r87/
xZ+/n+E3WOiEt2h0UyOB6+5bIL0t6qJI/plc8kQN/R5UHSoMrT06MdwrThI4exI3
YDDf488aKiYPdeQt3kFXH4o1PVK9R072Z+ZK63jfUxGOQs1DXI2DxCSqgO3fPfIR
v9uy0kG6zWG+C8RuX3VV12gIL6/XOFRx7UHVwSnt+A1Li/DPEdNoYlKesSGOKuel
vbXBmed2z5v02KMd1Y9LtrioLco8JDFYD0OEtbEaGjv7Kt+EcVtapo1e7N2VSmk6
IKxp7soEorJSR13rR5vyeF+yOZmxn6BOePc7m48O0Wqx76DHlyXiMPNTJUrHIC2z
XfD8+P1mHA2Iz71T7YI9Dmzw9uIVZuUteiEpvc0o/uy8z3YzOUftnqQBsAKwbxz5
HKrIdkocQNbXK46GKpjK6OY8BCwOeRM+z/XdJwYDek7G1+1ULpg=
=2JCC
-----END PGP SIGNATURE-----
Merge tag 'random-5.18-rc3-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random
Pull random number generator fixes from Jason Donenfeld:
- Per your suggestion, random reads now won't fail if there's a page
fault after some non-zero amount of data has been read, which makes
the behavior consistent with all other reads in the kernel.
- Rather than an inconsistent mix of random_get_entropy() returning an
unsigned long or a cycles_t, now it just returns an unsigned long.
- A memcpy() was replaced with an memmove(), because the addresses are
sometimes overlapping. In practice the destination is always before
the source, so not really an issue, but better to be correct than
not.
* tag 'random-5.18-rc3-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random:
random: use memmove instead of memcpy for remaining 32 bytes
random: make random_get_entropy() return an unsigned long
random: allow partial reads if later user copies fail
13 fixes, all in drivers. The most extensive changes are in the iscsi
series (affecting drivers qedi, cxgbi and bnx2i), the next most is
scsi_debug, but that's just a simple revert and then minor updates to
pm80xx.
Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
-----BEGIN PGP SIGNATURE-----
iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCYlseaSYcamFtZXMuYm90
dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishRlWAP9ygp0e
i9eU3ZXsiVJbi/b1UrQwBj1z2oO579J4f286cwEA5ko+q8eAzvj3jxkQarBv79tt
RvYEBYVBXc5igl3VnuI=
=aO3T
-----END PGP SIGNATURE-----
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"13 fixes, all in drivers.
The most extensive changes are in the iscsi series (affecting drivers
qedi, cxgbi and bnx2i), the next most is scsi_debug, but that's just a
simple revert and then minor updates to pm80xx"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: iscsi: MAINTAINERS: Add Mike Christie as co-maintainer
scsi: qedi: Fix failed disconnect handling
scsi: iscsi: Fix NOP handling during conn recovery
scsi: iscsi: Merge suspend fields
scsi: iscsi: Fix unbound endpoint error handling
scsi: iscsi: Fix conn cleanup and stop race during iscsid restart
scsi: iscsi: Fix endpoint reuse regression
scsi: iscsi: Release endpoint ID when its freed
scsi: iscsi: Fix offload conn cleanup when iscsid restarts
scsi: iscsi: Move iscsi_ep_disconnect()
scsi: pm80xx: Enable upper inbound, outbound queues
scsi: pm80xx: Mask and unmask upper interrupt vectors 32-63
Revert "scsi: scsi_debug: Address races following module load"
* Couple of fixes related to handling unsigned value of the pin from ACPI
The following is an automated git shortlog grouped by driver:
gpiolib:
- acpi: Convert type for pin to be unsigned
- acpi: use correct format characters
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEqaflIX74DDDzMJJtb7wzTHR8rCgFAmJUN7cACgkQb7wzTHR8
rCjeMQ/+KZHhBgPHE2Rl928bI/IQYuhcGjFmsLUt/KRFFsjVpzFQro9QI0IFRava
/8j6cUy8mTV6v5cxScljusS7++j8K3C6kazsXr2ieGPKpaGDHtKs93QjbC52jF/T
x/CI7NbnkvJHPWmpIYSuc/vGHHCHAAsnI62OHAIyEARRYBbd5VAkyrvnchH2skaK
bHH0+1La4l2fBkmfA6fpnwLVnRNXEeRTUdQdGk/14Snv1zGNeciGR2/wbg0YtEtf
CvEPi9U7d7RJCjBiege9kcb8R4ABOngypL1C9r7XvtFy17glxkc+6nrJx3NgW3bY
Eksi4naML2oe4R/PyZ740wztJCUKdR6oV/OmyRz5FhTA4+ZWRz3pL3o2Rv+DTZ5g
AcugQZCMe8mNGerFLPxBHpD1s0eD5Gxrnq7CeEpQ0wpBqBj90iZPAvE2mMTghqjH
Vm9OsLX/X8/vAPLeMLgHy9tDwl5FRqkM7BxV3qBp14MBpPfzDmQc3VS5SGcw/car
eZI7mQH85iSpnYjkqfDAwQtr3Jbxt8OuEnADWPiJknuTwGigQm9MLh37cW9xih0r
BQWsHMO9ephWhDSwpKVl+7fbQpIUd2TQFFBOnzIW69ALevD16FkEdnn08HEI2Kd6
mEB60/aA04PNyS6EtbD4ZbCNDjdB3S7EITPPrVPeGycD2JAp5sA=
=i+RD
-----END PGP SIGNATURE-----
Merge tag 'intel-gpio-v5.18-2' of gitolite.kernel.org:pub/scm/linux/kernel/git/andy/linux-gpio-intel into gpio/for-current
intel-gpio for v5.18-2
* Couple of fixes related to handling unsigned value of the pin from ACPI
gpiolib:
- acpi: Convert type for pin to be unsigned
- acpi: use correct format characters
In order to immediately overwrite the old key on the stack, before
servicing a userspace request for bytes, we use the remaining 32 bytes
of block 0 as the key. This means moving indices 8,9,a,b,c,d,e,f ->
4,5,6,7,8,9,a,b. Since 4 < 8, for the kernel implementations of
memcpy(), this doesn't actually appear to be a problem in practice. But
relying on that characteristic seems a bit brittle. So let's change that
to a proper memmove(), which is the by-the-books way of handling
overlapping memory copies.
Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Merge misc fixes from Andrew Morton:
"14 patches.
Subsystems affected by this patch series: MAINTAINERS, binfmt, and
mm (tmpfs, secretmem, kasan, kfence, pagealloc, zram, compaction,
hugetlb, vmalloc, and kmemleak)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm: kmemleak: take a full lowmem check in kmemleak_*_phys()
mm/vmalloc: fix spinning drain_vmap_work after reading from /proc/vmcore
revert "fs/binfmt_elf: use PT_LOAD p_align values for static PIE"
revert "fs/binfmt_elf: fix PT_LOAD p_align values for loaders"
hugetlb: do not demote poisoned hugetlb pages
mm: compaction: fix compiler warning when CONFIG_COMPACTION=n
mm: fix unexpected zeroed page mapping with zram swap
mm, page_alloc: fix build_zonerefs_node()
mm, kfence: support kmem_dump_obj() for KFENCE objects
kasan: fix hw tags enablement when KUNIT tests are disabled
irq_work: use kasan_record_aux_stack_noalloc() record callstack
mm/secretmem: fix panic when growing a memfd_secret
tmpfs: fix regressions from wider use of ZERO_PAGE
MAINTAINERS: Broadcom internal lists aren't maintainers
than digest size.
- Fix DM multipath's historical-service-time path selector to not use
sched_clock() and ktime_get_ns(); only use ktime_get_ns().
- Fix dm_io->orig_bio NULL pointer dereference in dm_zone_map_bio()
due to 5.18 changes that overlooked DM zone's use of ->orig_bio
- Fix for regression that broke the use of dm_accept_partial_bio() for
"abnormal" IO (e.g. WRITE ZEROES) that does not need duplicate bios
- Fix DM's issuing of empty flush bio so that it's size is 0.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmJZ17MACgkQxSPxCi2d
A1pC9Qf+NXc9u0Amf7GeORpQqAWPpogdMma3TIO3F1WaG2U5AspwaJpeZueVfmCg
z5IV4UcPtH5nJQ4BXre7r0yRqKuAX3b1EaxSApetwvs3AZJpqnFnoyFx/Bv9aU74
ZTtniNwrC+Z/TgZKk9hz+pdtEzq9p9shPDNnI32nYCMTxOFdH3ep+dBSs4/zaByb
3zSxFIEIaElP0rQJEGofgzlInIvzKOpKhWqnMBMILmoVkctdgiOpB817E/BkaMgS
hfRDjTZjgdwpXzJJd0MMUlde0MJfV7eo6pMw7e0nbO1I2E1aOtcY+XrlMJTTc0nW
py07slHJIzcoI9kBioDq7askbNO0oQ==
=V+AB
-----END PGP SIGNATURE-----
Merge tag 'for-5.18/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
- Fix memory corruption in DM integrity target when tag_size is less
than digest size.
- Fix DM multipath's historical-service-time path selector to not use
sched_clock() and ktime_get_ns(); only use ktime_get_ns().
- Fix dm_io->orig_bio NULL pointer dereference in dm_zone_map_bio() due
to 5.18 changes that overlooked DM zone's use of ->orig_bio
- Fix for regression that broke the use of dm_accept_partial_bio() for
"abnormal" IO (e.g. WRITE ZEROES) that does not need duplicate bios
- Fix DM's issuing of empty flush bio so that it's size is 0.
* tag 'for-5.18/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm: fix bio length of empty flush
dm: allow dm_accept_partial_bio() for dm_io without duplicate bios
dm zone: fix NULL pointer dereference in dm_zone_map_bio
dm mpath: only use ktime_get_ns() in historical selector
dm integrity: fix memory corruption when tag_size is less than digest size
Commit 3ee48b6af4 ("mm, x86: Saving vmcore with non-lazy freeing of
vmas") introduced set_iounmap_nonlazy(), which sets vmap_lazy_nr to
lazy_max_pages() + 1, ensuring that any future vunmaps() immediately
purge the vmap areas instead of doing it lazily.
Commit 690467c81b ("mm/vmalloc: Move draining areas out of caller
context") moved the purging from the vunmap() caller to a worker thread.
Unfortunately, set_iounmap_nonlazy() can cause the worker thread to spin
(possibly forever). For example, consider the following scenario:
1. Thread reads from /proc/vmcore. This eventually calls
__copy_oldmem_page() -> set_iounmap_nonlazy(), which sets
vmap_lazy_nr to lazy_max_pages() + 1.
2. Then it calls free_vmap_area_noflush() (via iounmap()), which adds 2
pages (one page plus the guard page) to the purge list and
vmap_lazy_nr. vmap_lazy_nr is now lazy_max_pages() + 3, so the
drain_vmap_work is scheduled.
3. Thread returns from the kernel and is scheduled out.
4. Worker thread is scheduled in and calls drain_vmap_area_work(). It
frees the 2 pages on the purge list. vmap_lazy_nr is now
lazy_max_pages() + 1.
5. This is still over the threshold, so it tries to purge areas again,
but doesn't find anything.
6. Repeat 5.
If the system is running with only one CPU (which is typicial for kdump)
and preemption is disabled, then this will never make forward progress:
there aren't any more pages to purge, so it hangs. If there is more
than one CPU or preemption is enabled, then the worker thread will spin
forever in the background. (Note that if there were already pages to be
purged at the time that set_iounmap_nonlazy() was called, this bug is
avoided.)
This can be reproduced with anything that reads from /proc/vmcore
multiple times. E.g., vmcore-dmesg /proc/vmcore.
It turns out that improvements to vmap() over the years have obsoleted
the need for this "optimization". I benchmarked `dd if=/proc/vmcore
of=/dev/null` with 4k and 1M read sizes on a system with a 32GB vmcore.
The test was run on 5.17, 5.18-rc1 with a fix that avoided the hang, and
5.18-rc1 with set_iounmap_nonlazy() removed entirely:
|5.17 |5.18+fix|5.18+removal
4k|40.86s| 40.09s| 26.73s
1M|24.47s| 23.98s| 21.84s
The removal was the fastest (by a wide margin with 4k reads). This
patch removes set_iounmap_nonlazy().
Link: https://lkml.kernel.org/r/52f819991051f9b865e9ce25605509bfdbacadcd.1649277321.git.osandov@fb.com
Fixes: 690467c81b ("mm/vmalloc: Move draining areas out of caller context")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Acked-by: Chris Down <chris@chrisdown.name>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 925346c129 ("fs/binfmt_elf: fix PT_LOAD p_align values for
loaders") was an attempt to fix regressions due to 9630f0d60f
("fs/binfmt_elf: use PT_LOAD p_align values for static PIE").
But regressionss continue to be reported:
https://lore.kernel.org/lkml/cb5b81bd-9882-e5dc-cd22-54bdbaaefbbc@leemhuis.info/https://bugzilla.kernel.org/show_bug.cgi?id=215720https://lkml.kernel.org/r/b685f3d0-da34-531d-1aa9-479accd3e21b@leemhuis.info
This patch reverts the fix, so the original can also be reverted.
Fixes: 925346c129 ("fs/binfmt_elf: fix PT_LOAD p_align values for loaders")
Cc: H.J. Lu <hjl.tools@gmail.com>
Cc: Chris Kennelly <ckennelly@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: Fangrui Song <maskray@google.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thorsten Leemhuis <regressions@leemhuis.info>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is possible for poisoned hugetlb pages to reside on the free lists.
The huge page allocation routines which dequeue entries from the free
lists make a point of avoiding poisoned pages. There is no such check
and avoidance in the demote code path.
If a hugetlb page on the is on a free list, poison will only be set in
the head page rather then the page with the actual error. If such a
page is demoted, then the poison flag may follow the wrong page. A page
without error could have poison set, and a page with poison could not
have the flag set.
Check for poison before attempting to demote a hugetlb page. Also,
return -EBUSY to the caller if only poisoned pages are on the free list.
Link: https://lkml.kernel.org/r/20220307215707.50916-1-mike.kravetz@oracle.com
Fixes: 8531fc6f52 ("hugetlb: add hugetlb demote page support")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The below warning is reported when CONFIG_COMPACTION=n:
mm/compaction.c:56:27: warning: 'HPAGE_FRAG_CHECK_INTERVAL_MSEC' defined but not used [-Wunused-const-variable=]
56 | static const unsigned int HPAGE_FRAG_CHECK_INTERVAL_MSEC = 500;
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Fix it by moving 'HPAGE_FRAG_CHECK_INTERVAL_MSEC' under
CONFIG_COMPACTION defconfig.
Also since this is just a 'static const int' type, use #define for it.
Link: https://lkml.kernel.org/r/1647608518-20924-1-git-send-email-quic_charante@quicinc.com
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
Reported-by: kernel test robot <lkp@intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Nitin Gupta <nigupta@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Two processes under CLONE_VM cloning, user process can be corrupted by
seeing zeroed page unexpectedly.
CPU A CPU B
do_swap_page do_swap_page
SWP_SYNCHRONOUS_IO path SWP_SYNCHRONOUS_IO path
swap_readpage valid data
swap_slot_free_notify
delete zram entry
swap_readpage zeroed(invalid) data
pte_lock
map the *zero data* to userspace
pte_unlock
pte_lock
if (!pte_same)
goto out_nomap;
pte_unlock
return and next refault will
read zeroed data
The swap_slot_free_notify is bogus for CLONE_VM case since it doesn't
increase the refcount of swap slot at copy_mm so it couldn't catch up
whether it's safe or not to discard data from backing device. In the
case, only the lock it could rely on to synchronize swap slot freeing is
page table lock. Thus, this patch gets rid of the swap_slot_free_notify
function. With this patch, CPU A will see correct data.
CPU A CPU B
do_swap_page do_swap_page
SWP_SYNCHRONOUS_IO path SWP_SYNCHRONOUS_IO path
swap_readpage original data
pte_lock
map the original data
swap_free
swap_range_free
bd_disk->fops->swap_slot_free_notify
swap_readpage read zeroed data
pte_unlock
pte_lock
if (!pte_same)
goto out_nomap;
pte_unlock
return
on next refault will see mapped data by CPU B
The concern of the patch would increase memory consumption since it
could keep wasted memory with compressed form in zram as well as
uncompressed form in address space. However, most of cases of zram uses
no readahead and do_swap_page is followed by swap_free so it will free
the compressed form from in zram quickly.
Link: https://lkml.kernel.org/r/YjTVVxIAsnKAXjTd@google.com
Fixes: 0bcac06f27 ("mm, swap: skip swapcache for swapin of synchronous device")
Reported-by: Ivan Babrou <ivan@cloudflare.com>
Tested-by: Ivan Babrou <ivan@cloudflare.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: David Hildenbrand <david@redhat.com>
Cc: <stable@vger.kernel.org> [4.14+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit 6aa303defb ("mm, vmscan: only allocate and reclaim from
zones with pages managed by the buddy allocator") only zones with free
memory are included in a built zonelist. This is problematic when e.g.
all memory of a zone has been ballooned out when zonelists are being
rebuilt.
The decision whether to rebuild the zonelists when onlining new memory
is done based on populated_zone() returning 0 for the zone the memory
will be added to. The new zone is added to the zonelists only, if it
has free memory pages (managed_zone() returns a non-zero value) after
the memory has been onlined. This implies, that onlining memory will
always free the added pages to the allocator immediately, but this is
not true in all cases: when e.g. running as a Xen guest the onlined new
memory will be added only to the ballooned memory list, it will be freed
only when the guest is being ballooned up afterwards.
Another problem with using managed_zone() for the decision whether a
zone is being added to the zonelists is, that a zone with all memory
used will in fact be removed from all zonelists in case the zonelists
happen to be rebuilt.
Use populated_zone() when building a zonelist as it has been done before
that commit.
There was a report that QubesOS (based on Xen) is hitting this problem.
Xen has switched to use the zone device functionality in kernel 5.9 and
QubesOS wants to use memory hotplugging for guests in order to be able
to start a guest with minimal memory and expand it as needed. This was
the report leading to the patch.
Link: https://lkml.kernel.org/r/20220407120637.9035-1-jgross@suse.com
Fixes: 6aa303defb ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator")
Signed-off-by: Juergen Gross <jgross@suse.com>
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Calling kmem_obj_info() via kmem_dump_obj() on KFENCE objects has been
producing garbage data due to the object not actually being maintained
by SLAB or SLUB.
Fix this by implementing __kfence_obj_info() that copies relevant
information to struct kmem_obj_info when the object was allocated by
KFENCE; this is called by a common kmem_obj_info(), which also calls the
slab/slub/slob specific variant now called __kmem_obj_info().
For completeness, kmem_dump_obj() now displays if the object was
allocated by KFENCE.
Link: https://lore.kernel.org/all/20220323090520.GG16885@xsang-OptiPlex-9020/
Link: https://lkml.kernel.org/r/20220406131558.3558585-1-elver@google.com
Fixes: b89fb5ef0c ("mm, kfence: insert KFENCE hooks for SLUB")
Fixes: d3fb45f370 ("mm, kfence: insert KFENCE hooks for SLAB")
Signed-off-by: Marco Elver <elver@google.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz> [slab]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kasan enables hw tags via kasan_enable_tagging() which based on the mode
passed via kernel command line selects the correct hw backend.
kasan_enable_tagging() is meant to be invoked indirectly via the cpu
features framework of the architectures that support these backends.
Currently the invocation of this function is guarded by
CONFIG_KASAN_KUNIT_TEST which allows the enablement of the correct backend
only when KUNIT tests are enabled in the kernel.
This inconsistency was introduced in commit:
ed6d74446c ("kasan: test: support async (again) and asymm modes for HW_TAGS")
... and prevents to enable MTE on arm64 when KUNIT tests for kasan hw_tags are
disabled.
Fix the issue making sure that the CONFIG_KASAN_KUNIT_TEST guard does not
prevent the correct invocation of kasan_enable_tagging().
Link: https://lkml.kernel.org/r/20220408124323.10028-1-vincenzo.frascino@arm.com
Fixes: ed6d74446c ("kasan: test: support async (again) and asymm modes for HW_TAGS")
Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On PREEMPT_RT kernel and KASAN is enabled. the kasan_record_aux_stack()
may call alloc_pages(), and the rt-spinlock will be acquired, if currently
in atomic context, will trigger warning:
BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:46
in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 239, name: bootlogd
Preemption disabled at:
[<ffffffffbab1a531>] rt_mutex_slowunlock+0xa1/0x4e0
CPU: 3 PID: 239 Comm: bootlogd Tainted: G W 5.17.1-rt17-yocto-preempt-rt+ #105
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
Call Trace:
__might_resched.cold+0x13b/0x173
rt_spin_lock+0x5b/0xf0
get_page_from_freelist+0x20c/0x1610
__alloc_pages+0x25e/0x5e0
__stack_depot_save+0x3c0/0x4a0
kasan_save_stack+0x3a/0x50
__kasan_record_aux_stack+0xb6/0xc0
kasan_record_aux_stack+0xe/0x10
irq_work_queue_on+0x6a/0x1c0
pull_rt_task+0x631/0x6b0
do_balance_callbacks+0x56/0x80
__balance_callbacks+0x63/0x90
rt_mutex_setprio+0x349/0x880
rt_mutex_slowunlock+0x22a/0x4e0
rt_spin_unlock+0x49/0x80
uart_write+0x186/0x2b0
do_output_char+0x2e9/0x3a0
n_tty_write+0x306/0x800
file_tty_write.isra.0+0x2af/0x450
tty_write+0x22/0x30
new_sync_write+0x27c/0x3a0
vfs_write+0x3f7/0x5d0
ksys_write+0xd9/0x180
__x64_sys_write+0x43/0x50
do_syscall_64+0x44/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
Fix it by using kasan_record_aux_stack_noalloc() to avoid the call to
alloc_pages().
Link: https://lkml.kernel.org/r/20220402142555.2699582-1-qiang1.zhang@intel.com
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When one tries to grow an existing memfd_secret with ftruncate, one gets
a panic [1]. For example, doing the following reliably induces the
panic:
fd = memfd_secret();
ftruncate(fd, 10);
ptr = mmap(NULL, 10, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
strcpy(ptr, "123456789");
munmap(ptr, 10);
ftruncate(fd, 20);
The basic reason for this is, when we grow with ftruncate, we call down
into simple_setattr, and then truncate_inode_pages_range, and eventually
we try to zero part of the memory. The normal truncation code does this
via the direct map (i.e., it calls page_address() and hands that to
memset()).
For memfd_secret though, we specifically don't map our pages via the
direct map (i.e. we call set_direct_map_invalid_noflush() on every
fault). So the address returned by page_address() isn't useful, and
when we try to memset() with it we panic.
This patch avoids the panic by implementing a custom setattr for
memfd_secret, which detects resizes specifically (setting the size for
the first time works just fine, since there are no existing pages to try
to zero), and rejects them with EINVAL.
One could argue growing should be supported, but I think that will
require a significantly more lengthy change. So, I propose a minimal
fix for the benefit of stable kernels, and then perhaps to extend
memfd_secret to support growing in a separate patch.
[1]:
BUG: unable to handle page fault for address: ffffa0a889277028
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD afa01067 P4D afa01067 PUD 83f909067 PMD 83f8bf067 PTE 800ffffef6d88060
Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
CPU: 0 PID: 281 Comm: repro Not tainted 5.17.0-dbg-DEV #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
RIP: 0010:memset_erms+0x9/0x10
Code: c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 f3 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 <f3> aa 4c 89 c8 c3 90 49 89 fa 40 0f b6 ce 48 b8 01 01 01 01 01 01
RSP: 0018:ffffb932c09afbf0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffffda63c4249dc0 RCX: 0000000000000fd8
RDX: 0000000000000fd8 RSI: 0000000000000000 RDI: ffffa0a889277028
RBP: ffffb932c09afc00 R08: 0000000000001000 R09: ffffa0a889277028
R10: 0000000000020023 R11: 0000000000000000 R12: ffffda63c4249dc0
R13: ffffa0a890d70d98 R14: 0000000000000028 R15: 0000000000000fd8
FS: 00007f7294899580(0000) GS:ffffa0af9bc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffa0a889277028 CR3: 0000000107ef6006 CR4: 0000000000370ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
? zero_user_segments+0x82/0x190
truncate_inode_partial_folio+0xd4/0x2a0
truncate_inode_pages_range+0x380/0x830
truncate_setsize+0x63/0x80
simple_setattr+0x37/0x60
notify_change+0x3d8/0x4d0
do_sys_ftruncate+0x162/0x1d0
__x64_sys_ftruncate+0x1c/0x20
do_syscall_64+0x44/0xa0
entry_SYSCALL_64_after_hwframe+0x44/0xae
Modules linked in: xhci_pci xhci_hcd virtio_net net_failover failover virtio_blk virtio_balloon uhci_hcd ohci_pci ohci_hcd evdev ehci_pci ehci_hcd 9pnet_virtio 9p netfs 9pnet
CR2: ffffa0a889277028
[lkp@intel.com: secretmem_iops can be static]
Signed-off-by: kernel test robot <lkp@intel.com>
[axelrasmussen@google.com: return EINVAL]
Link: https://lkml.kernel.org/r/20220324210909.1843814-1-axelrasmussen@google.com
Link: https://lkml.kernel.org/r/20220412193023.279320-1-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Chuck Lever reported fsx-based xfstests generic 075 091 112 127 failing
when 5.18-rc1 NFS server exports tmpfs: bisected to recent tmpfs change.
Whilst nfsd_splice_action() does contain some questionable handling of
repeated pages, and Chuck was able to work around there, history from
Mark Hemment makes clear that there might be similar dangers elsewhere:
it was not a good idea for me to pass ZERO_PAGE down to unknown actors.
Revert shmem_file_read_iter() to using ZERO_PAGE for holes only when
iter_is_iovec(); in other cases, use the more natural iov_iter_zero()
instead of copy_page_to_iter().
We would use iov_iter_zero() throughout, but the x86 clear_user() is not
nearly so well optimized as copy to user (dd of 1T sparse tmpfs file
takes 57 seconds rather than 44 seconds).
And now pagecache_init() does not need to SetPageUptodate(ZERO_PAGE(0)):
which had caused boot failure on arm noMMU STM32F7 and STM32H7 boards
Link: https://lkml.kernel.org/r/9a978571-8648-e830-5735-1f4748ce2e30@google.com
Fixes: 56a8c8eb1e ("tmpfs: do not allocate pages on read")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Patrice CHOTARD <patrice.chotard@foss.st.com>
Reported-by: Chuck Lever III <chuck.lever@oracle.com>
Tested-by: Chuck Lever III <chuck.lever@oracle.com>
Cc: Mark Hemment <markhemm@googlemail.com>
Cc: Patrice CHOTARD <patrice.chotard@foss.st.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Lukas Czerner <lczerner@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert the broadcom internal list M: and L: entries to R: as exploder
email addresses are neither maintainers nor mailing lists.
Reorder the entries as necessary.
Link: https://lkml.kernel.org/r/04eb301f5b3adbefdd78e76657eff0acb3e3d87f.camel@perches.com
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix:
drivers/i2c/busses/i2c-ismt.c: In function ‘ismt_hw_init’:
drivers/i2c/busses/i2c-ismt.c:770:2: error: case label does not reduce to an integer constant
case ISMT_SPGT_SPD_400K:
^~~~
drivers/i2c/busses/i2c-ismt.c:773:2: error: case label does not reduce to an integer constant
case ISMT_SPGT_SPD_1M:
^~~~
See https://lore.kernel.org/r/YkwQ6%2BtIH8GQpuct@zn.tnic for the gory
details as to why it triggers with older gccs only.
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Seth Heasley <seth.heasley@intel.com>
Signed-off-by: Wolfram Sang <wsa@kernel.org>
Sparse has warned us about wrong address space for user pointers:
i2c-dev.c:561:50: warning: incorrect type in initializer (different address spaces)
i2c-dev.c:561:50: expected unsigned char [usertype] *buf
i2c-dev.c:561:50: got void [noderef] __user *
Force cast the pointer to (__u8 *) that is used by I²C core code.
Note, this is an additional fix to the previously addressed similar issue
in the I2C_RDWR case in the same function.
Fixes: 3265a7e6b4 ("i2c: dev: Add __user annotation")
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Wolfram Sang <wsa@kernel.org>
If dev_set_name() fails, the dev_name() is null, check the return
value of dev_set_name() to avoid the null-ptr-deref.
Fixes: 1413ef638a ("i2c: dev: Fix the race between the release of i2c_dev and cdev")
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Wolfram Sang <wsa@kernel.org>
The GPI DMA engine driver can be compiled as a module, in which case the
likely probe deferral "error" shows up in the kernel log. Switch to
using dev_err_probe() to silence this warning and to ensure that
"devices_deferred" in debugfs carries this information.
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Reviewed-by: Vinod Koul <vkoul@kernel.org>
Signed-off-by: Wolfram Sang <wsa@kernel.org>
The i.MX8MP Mask Set Errata for Mask 1P33A, Rev. 2.0 has description of
errata ERR007805 as below. This errata is found on all MX8M{M,N,P,Q},
MX7{S,D}, MX6{UL{,L,Z},S{,LL,X},S,D,DL,Q,DP,QP} . MX7ULP, MX8Q, MX8X
are not affected. MX53 and older status is unknown, as the errata
first appears in MX6 errata sheets from 2016 and the latest errata
sheet for MX53 is from 2015. Older SoC errata sheets predate the
MX53 errata sheet. MX8ULP and MX9 status is unknown as the errata
sheet is not available yet.
"
ERR007805 I2C: When the I2C clock speed is configured for 400 kHz,
the SCL low period violates the I2C spec of 1.3 uS min
Description: When the I2C module is programmed to operate at the
maximum clock speed of 400 kHz (as defined by the I2C spec), the SCL
clock low period violates the I2C spec of 1.3 uS min. The user must
reduce the clock speed to obtain the SCL low time to meet the 1.3us
I2C minimum required. This behavior means the SoC is not compliant
to the I2C spec at 400kHz.
Workaround: To meet the clock low period requirement in fast speed
mode, SCL must be configured to 384KHz or less.
"
Implement the workaround by matching on the affected SoC specific
compatible strings and by limiting the maximum bus frequency in case
the SoC is affected.
Signed-off-by: Marek Vasut <marex@denx.de>
To: linux-i2c@vger.kernel.org
Acked-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: Wolfram Sang <wsa@kernel.org>
Wait for completion of write transfers before returning from the driver.
At first sight it may seem advantageous to leave write transfers queued
for the controller to carry out on its own time, but there's a couple of
issues with it:
* Driver doesn't check for FIFO space.
* The queued writes can complete while the driver is in its I2C read
transfer path which means it will get confused by the raising of
XEN (the 'transaction ended' signal). This can cause a spurious
ENODATA error due to premature reading of the MRXFIFO register.
Adding the wait fixes some unreliability issues with the driver. There's
some efficiency cost to it (especially with pasemi_smb_waitready doing
its polling), but that will be alleviated once the driver receives
interrupt support.
Fixes: beb58aa39e ("i2c: PA Semi SMBus driver")
Signed-off-by: Martin Povišer <povik+lin@cutebit.org>
Reviewed-by: Sven Peter <sven@svenpeter.dev>
Signed-off-by: Wolfram Sang <wsa@kernel.org>
The commit 92986f6b4c ("dm: use bio_clone_fast in alloc_io/alloc_tio")
removed bio_clone_fast() call from alloc_tio() when ci->io->tio is
available. In this case, ci->bio is not copied to ci->io->tio.clone.
This is fine since init_clone_info() sets same values to ci->bio and
ci->io->tio.clone.
However, when incoming bios have REQ_PREFLUSH flag, __send_empty_flush()
prepares a zero length bio on stack and set it to ci->bio. At this time,
ci->io->tio.clone still keeps non-zero length. When alloc_tio() chooses
this ci->io->tio.clone as the bio to map, it is passed to targets as
non-empty flush bio. It causes bio length check failure in dm-zoned and
unexpected operation such as dm_accept_partial_bio() call.
To avoid the non-empty flush bio, set zero length to ci->io->tio.clone
in __send_empty_flush().
Fixes: 92986f6b4c ("dm: use bio_clone_fast in alloc_io/alloc_tio")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>