mirror of
https://github.com/torvalds/linux.git
synced 2024-11-22 20:22:09 +00:00
fbe76a6557
45230 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
Pasha Tatashin
|
fbe76a6557 |
task_stack: uninline stack_not_used
Given that stack_not_used() is not performance critical function uninline it. Link: https://lkml.kernel.org/r/20240730150158.832783-4-pasha.tatashin@soleen.com Link: https://lkml.kernel.org/r/20240724203322.2765486-4-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Li Zhijian <lizhijian@fujitsu.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Pasha Tatashin
|
c4a6fce856 |
vmstat: kernel stack usage histogram
As part of the dynamic kernel stack project, we need to know the amount of data that can be saved by reducing the default kernel stack size [1]. Provide a kernel stack usage histogram to aid in optimizing kernel stack sizes and minimizing memory waste in large-scale environments. The histogram divides stack usage into power-of-two buckets and reports the results in /proc/vmstat. This information is especially valuable in environments with millions of machines, where even small optimizations can have a significant impact. The histogram data is presented in /proc/vmstat with entries like "kstack_1k", "kstack_2k", and so on, indicating the number of threads that exited with stack usage falling within each respective bucket. Example outputs: Intel: $ grep kstack /proc/vmstat kstack_1k 3 kstack_2k 188 kstack_4k 11391 kstack_8k 243 kstack_16k 0 ARM with 64K page_size: $ grep kstack /proc/vmstat kstack_1k 1 kstack_2k 340 kstack_4k 25212 kstack_8k 1659 kstack_16k 0 kstack_32k 0 kstack_64k 0 Note: once the dynamic kernel stack is implemented it will depend on the implementation the usability of this feature: On hardware that supports faults on kernel stacks, we will have other metrics that show the total number of pages allocated for stacks. On hardware where faults are not supported, we will most likely have some optimization where only some threads are extended, and for those, these metrics will still be very useful. [1] https://lwn.net/Articles/974367 Link: https://lkml.kernel.org/r/20240730150158.832783-3-pasha.tatashin@soleen.com Link: https://lkml.kernel.org/r/20240724203322.2765486-3-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Li Zhijian <lizhijian@fujitsu.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Zi Yan
|
2a28713a67 |
memory tiering: introduce folio_use_access_time() check
If memory tiering mode is on and a folio is not in the top tier memory, folio's cpupid field is repurposed to store page access time. Instead of an open coded check, use a function to encapsulate the check. Link: https://lkml.kernel.org/r/20240724130115.793641-3-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Danilo Krummrich
|
590b9d576c |
mm: kvmalloc: align kvrealloc() with krealloc()
Besides the obvious (and desired) difference between krealloc() and kvrealloc(), there is some inconsistency in their function signatures and behavior: - krealloc() frees the memory when the requested size is zero, whereas kvrealloc() simply returns a pointer to the existing allocation. - krealloc() behaves like kmalloc() if a NULL pointer is passed, whereas kvrealloc() does not accept a NULL pointer at all and, if passed, would fault instead. - krealloc() is self-contained, whereas kvrealloc() relies on the caller to provide the size of the previous allocation. Inconsistent behavior throughout allocation APIs is error prone, hence make kvrealloc() behave like krealloc(), which seems superior in all mentioned aspects. Besides that, implementing kvrealloc() by making use of krealloc() and vrealloc() provides oppertunities to grow (and shrink) allocations more efficiently. For instance, vrealloc() can be optimized to allocate and map additional pages to grow the allocation or unmap and free unused pages to shrink the allocation. [dakr@kernel.org: document concurrency restrictions] Link: https://lkml.kernel.org/r/20240725125442.4957-1-dakr@kernel.org [dakr@kernel.org: disable KASAN when switching to vmalloc] Link: https://lkml.kernel.org/r/20240730185049.6244-2-dakr@kernel.org [dakr@kernel.org: properly document __GFP_ZERO behavior] Link: https://lkml.kernel.org/r/20240730185049.6244-5-dakr@kernel.org Link: https://lkml.kernel.org/r/20240722163111.4766-3-dakr@kernel.org Signed-off-by: Danilo Krummrich <dakr@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Chandan Babu R <chandan.babu@oracle.com> Cc: Christian König <christian.koenig@amd.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <kees@kernel.org> Cc: Marc Zyngier <maz@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Oliver Upton <oliver.upton@linux.dev> Cc: Pekka Enberg <penberg@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Linus Torvalds
|
3e9bff3bbe |
vfs-6.11-rc6.fixes
-----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZsxg5QAKCRCRxhvAZXjc olSiAQDvFvim4YtMmUDagC3yWTBsf+o3lYdAIuzNE0NtSn4vpAEAl/HVhQCaEDjv mcE3jokEsbvyXLnzs78PrY0Heua2mQg= =AHAd -----END PGP SIGNATURE----- Merge tag 'vfs-6.11-rc6.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: "VFS: - Ensure that backing files uses file->f_ops->splice_write() for splice netfs: - Revert the removal of PG_private_2 from netfs_release_folio() as cephfs still relies on this - When AS_RELEASE_ALWAYS is set on a mapping the folio needs to always be invalidated during truncation - Fix losing untruncated data in a folio by making letting netfs_release_folio() return false if the folio is dirty - Fix trimming of streaming-write folios in netfs_inval_folio() - Reset iterator before retrying a short read - Fix interaction of streaming writes with zero-point tracker afs: - During truncation afs currently calls truncate_setsize() which sets i_size, expands the pagecache and truncates it. The first two operations aren't needed because they will have already been done. So call truncate_pagecache() instead and skip the redundant parts overlayfs: - Fix checking of the number of allowed lower layers so 500 layers can actually be used instead of just 499 - Add missing '\n' to pr_err() output - Pass string to ovl_parse_layer() and thus allow it to be used for Opt_lowerdir as well pidfd: - Revert blocking the creation of pidfds for kthread as apparently userspace relies on this. Specifically, it breaks systemd during shutdown romfs: - Fix romfs_read_folio() to use the correct offset with folio_zero_tail()" * tag 'vfs-6.11-rc6.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: netfs: Fix interaction of streaming writes with zero-point tracker netfs: Fix missing iterator reset on retry of short read netfs: Fix trimming of streaming-write folios in netfs_inval_folio() netfs: Fix netfs_release_folio() to say no if folio dirty afs: Fix post-setattr file edit to do truncation correctly mm: Fix missing folio invalidation calls during truncation ovl: ovl_parse_param_lowerdir: Add missed '\n' for pr_err ovl: fix wrong lowerdir number check for parameter Opt_lowerdir ovl: pass string to ovl_parse_layer() backing-file: convert to using fops->splice_write Revert "pidfd: prevent creation of pidfds for kthreads" romfs: fix romfs_read_folio() netfs, ceph: Partially revert "netfs: Replace PG_fscache by setting folio->private and marking dirty" |
||
Linus Torvalds
|
d2bafcf224 |
cgroup: Fixes for v6.11-rc4
Three patches addressing cpuset corner cases. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZskvjQ4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGTRnAQCVJj+pPLO76ofJC51p4TcITsDD37trYHPyxaCB zZ7XdAEA82NhGgy+kdlICrsiBYKK10jGDNGkXWicdCI8GmEe1Qo= =axit -----END PGP SIGNATURE----- Merge tag 'cgroup-for-6.11-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: "Three patches addressing cpuset corner cases" * tag 'cgroup-for-6.11-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup/cpuset: Eliminate unncessary sched domains rebuilds in hotplug cgroup/cpuset: Clear effective_xcpus on cpus_allowed clearing only if cpus.exclusive not set cgroup/cpuset: fix panic caused by partcmd_update |
||
Linus Torvalds
|
cb2c84b380 |
workqueue: Fixes for v6.11-rc4
Nothing too interesting. One patch to remove spurious warning and others to address static checker warnings. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZskq8g4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGfTVAP42MsAOyrlND+cH/zQpSc8OhGbm3v0gJFnPn4UE Y3B4kgD/W68n57MQ5uWh1vHHvsqjizbXfRez1dVJoGqa/q88GQs= =Uwdx -----END PGP SIGNATURE----- Merge tag 'wq-for-6.11-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fixes from Tejun Heo: "Nothing too interesting. One patch to remove spurious warning and others to address static checker warnings" * tag 'wq-for-6.11-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: Correct declaration of cpu_pwq in struct workqueue_struct workqueue: Fix spruious data race in __flush_work() workqueue: Remove incorrect "WARN_ON_ONCE(!list_empty(&worker->entry));" from dying worker workqueue: Fix UBSAN 'subtraction overflow' error in shift_and_mask() workqueue: doc: Fix function name, remove markers |
||
Christian Brauner
|
232590ea7f
|
Revert "pidfd: prevent creation of pidfds for kthreads"
This reverts commit
|
||
Linus Torvalds
|
b0da640826 |
printk fixup for 6.11-rc5
-----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAmbDBfYACgkQUqAMR0iA lPL3ohAArEJ46nPdGWXEZ+K78biXlz/F3IXT+FH95YgtpIk0Tha6Jc5xybGerf/N 91GzWGbFweEFIIHq9i/CeBnmUEYsMocDF2hlmPiCvaqvMl1J6EuXgERUaPWqaQTS fPZab7x8MitH64hFGWbMbvt8ZDJXyQaixtkQyA0AoRPMTpiQy0mFWbFIhtN9M+Cx dov2l4N9je8X46X7SWDdKNvVEXHPnpWpq5NeMr9FW7yM4Kun3Hdb3Ks58sHS2oLm EmPFQ6kNuxpHyXNvfjeE/JdXQZvK2gGOCNS4zykpGVYJJvhmfrNSwR7iGhm0z/Zw sFObF46fK2NTkD5UZ9jQK8+uTiOwpiZSka8v55LocLa7gg2e1G7owaRSIMKjeNYT GVmcdkgLqdtfKo3D3rM+auWXlP9o+ioqM52HCewWzMXd0HC2nLx28X/66oHbif9U qJSjDPTtvlVEfIcbLr0bRX9KrYeqwtXD74zxB+msbi3Z2C/O9CrFfnGaI0h6+8cb RwAptjiO8QdbKkL06CW5RjM5ulNqtPmRETziwA01gh5h6AE5oR1PHCf0DM12ulYK /gY/rMznZ6qK0G+BYQyRhMgZh5P5KPvL77a7kxknuj4va2s6c2EsnG8u5iYcYAdo YHWN6Jad1OPfQyHsqQ7IL+zlQzTPKmuy3PHQcZwBezUPWRY96kI= =2wc2 -----END PGP SIGNATURE----- Merge tag 'printk-for-6.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux Pull printk fix from Petr Mladek: - Do not block printk on non-panic CPUs when they are dumping backtraces * tag 'printk-for-6.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: printk/panic: Allow cpu backtraces to be written into ringbuffer during panic |
||
Linus Torvalds
|
c3f2d783a4 |
16 hotfixes. All except one are for MM. 10 of these are cc:stable and
the others pertain to post-6.10 issues. As usual with these merges, singletons and doubletons all over the place, no identifiable-by-me theme. Please see the lovingly curated changelogs to get the skinny. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZsFf8wAKCRDdBJ7gKXxA jvEUAP97y/sqKD8rQNc0R8fRGSPNPamwyok8RHwohb0JEHovlAD9HsQ9Ad57EpqR wBexMxJRFc7Dt73Tu6IkLQ1iNGqABAc= =8KNp -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2024-08-17-19-34' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "16 hotfixes. All except one are for MM. 10 of these are cc:stable and the others pertain to post-6.10 issues. As usual with these merges, singletons and doubletons all over the place, no identifiable-by-me theme. Please see the lovingly curated changelogs to get the skinny" * tag 'mm-hotfixes-stable-2024-08-17-19-34' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm/migrate: fix deadlock in migrate_pages_batch() on large folios alloc_tag: mark pages reserved during CMA activation as not tagged alloc_tag: introduce clear_page_tag_ref() helper function crash: fix riscv64 crash memory reserve dead loop selftests: memfd_secret: don't build memfd_secret test on unsupported arches mm: fix endless reclaim on machines with unaccepted memory selftests/mm: compaction_test: fix off by one in check_compaction() mm/numa: no task_numa_fault() call if PMD is changed mm/numa: no task_numa_fault() call if PTE is changed mm/vmalloc: fix page mapping if vm_area_alloc_pages() with high order fallback to order 0 mm/memory-failure: use raw_spinlock_t in struct memory_failure_cpu mm: don't account memmap per-node mm: add system wide stats items category mm: don't account memmap on failure mm/hugetlb: fix hugetlb vs. core-mm PT locking mseal: fix is_madv_discard() |
||
Linus Torvalds
|
810996a363 |
powerpc fixes for 6.11 #2
- Fix crashes on 85xx with some configs since the recent hugepd rework. - Fix boot warning with hugepages and CONFIG_DEBUG_VIRTUAL on some platforms. - Don't enable offline cores when changing SMT modes, to match existing userspace behaviour. Thanks to: Christophe Leroy, Dr. David Alan Gilbert, Guenter Roeck, Nysal Jan K.A, Shrikanth Hegde, Thomas Gleixner, Tyrel Datwyler. -----BEGIN PGP SIGNATURE----- iQJLBAABCAA1FiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmbBN48XHG1pY2hhZWxA ZWxsZXJtYW4uaWQuYXUACgkQUevqPMjhpYDFhA/7ByodEuDtTZRAhQxJbzTlEMMk OdEURo5MqJZo2P9A3G1KKQKUUy1cQwKLcOaCa7nSh3IXHswXEGZK/Do1lgUj8BAx BcaTlm6aAgMnxkEXIGMNBCGn54IxA7pQV7TUUdr+3CJU0udtYceej03beWZuQVvN DxdoHflNojU+h8AUWEm5KW6X/o8C+DI6rMAP5zW8Xvsbz/QmSSn1frAs+Dgnacyh niAToWbW4ibw0LJ8NBDIxIgqDXZHGUY9/KMSAn1WgpERcbY8FUD3PWw2FzJxjqKw h/sjDRpFhY7mImZtzTKez2OHMPiq+730OVEmgfoER/smknnIYi/tO4e2r+wA9YS7 IIpyl42sdTPV6ke1DDT5sUlWq4LjPLobB+2WKwgDkSOnTRjF1/9nf4AVdtwh2cuS Y/Sttz3YjtfeSPG3sWnn5HkMbBksMoSSO+Q9BqB2BQAIHWHPDZWwadGhSw1omV7/ poYoR3KbmomLL39qk49P0thmhhCDhF64j7XN4ESFUK7tFL1BHCZ2vXSI5vIi0CHZ z65pJxsid/0oz04abINAsrDOyZTIkPBTDawda4UEHfXpUOOM9iFPfQfcFnJYRCPk xiOYAhRj10l7eQeSXOcaP1TXraW+DCs4N5neCaZ0zI/4vwTcrFMn37bB7DVYLjkB 08vDj12ybMrz51mjCj4= =sZ+f -----END PGP SIGNATURE----- Merge tag 'powerpc-6.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: - Fix crashes on 85xx with some configs since the recent hugepd rework. - Fix boot warning with hugepages and CONFIG_DEBUG_VIRTUAL on some platforms. - Don't enable offline cores when changing SMT modes, to match existing userspace behaviour. Thanks to Christophe Leroy, Dr. David Alan Gilbert, Guenter Roeck, Nysal Jan K.A, Shrikanth Hegde, Thomas Gleixner, and Tyrel Datwyler. * tag 'powerpc-6.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/topology: Check if a core is online cpu/SMT: Enable SMT only if a core is online powerpc/mm: Fix boot warning with hugepages and CONFIG_DEBUG_VIRTUAL powerpc/mm: Fix size of allocated PGDIR soc: fsl: qbman: remove unused struct 'cgr_comp' |
||
Linus Torvalds
|
4a621e2910 |
A couple of fixes for tracing:
- Prevent a NULL pointer dereference in the error path of RTLA tool - Fix an infinite loop bug when reading from the ring buffer when closed. If there's a thread trying to read the ring buffer and it gets closed by another thread, the one reading will go into an infinite loop when the buffer is empty instead of exiting back to user space. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZr9fuRQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qqV8AQCoAmS7Mov+BLtL1am5HcGvqv60E9IL 1BlGQAsRYeLmMgD/UjUOXx3PfrQaKt7O479NT7NxOm6vPFA5e7W611M4KQw= =QGI+ -----END PGP SIGNATURE----- Merge tag 'trace-v6.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: "A couple of fixes for tracing: - Prevent a NULL pointer dereference in the error path of RTLA tool - Fix an infinite loop bug when reading from the ring buffer when closed. If there's a thread trying to read the ring buffer and it gets closed by another thread, the one reading will go into an infinite loop when the buffer is empty instead of exiting back to user space" * tag 'trace-v6.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: rtla/osnoise: Prevent NULL dereference in error handling tracing: Return from tracing_buffers_read() if the file has been closed |
||
Jinjie Ruan
|
edb907a613 |
crash: fix riscv64 crash memory reserve dead loop
On RISCV64 Qemu machine with 512MB memory, cmdline "crashkernel=500M,high" will cause system stall as below: Zone ranges: DMA32 [mem 0x0000000080000000-0x000000009fffffff] Normal empty Movable zone start for each node Early memory node ranges node 0: [mem 0x0000000080000000-0x000000008005ffff] node 0: [mem 0x0000000080060000-0x000000009fffffff] Initmem setup node 0 [mem 0x0000000080000000-0x000000009fffffff] (stall here) commit 5d99cadf1568 ("crash: fix x86_32 crash memory reserve dead loop bug") fix this on 32-bit architecture. However, the problem is not completely solved. If `CRASH_ADDR_LOW_MAX = CRASH_ADDR_HIGH_MAX` on 64-bit architecture, for example, when system memory is equal to CRASH_ADDR_LOW_MAX on RISCV64, the following infinite loop will also occur: -> reserve_crashkernel_generic() and high is true -> alloc at [CRASH_ADDR_LOW_MAX, CRASH_ADDR_HIGH_MAX] fail -> alloc at [0, CRASH_ADDR_LOW_MAX] fail and repeatedly (because CRASH_ADDR_LOW_MAX = CRASH_ADDR_HIGH_MAX). As Catalin suggested, do not remove the ",high" reservation fallback to ",low" logic which will change arm64's kdump behavior, but fix it by skipping the above situation similar to commit d2f32f23190b ("crash: fix x86_32 crash memory reserve dead loop"). After this patch, it print: cannot allocate crashkernel (size:0x1f400000) Link: https://lkml.kernel.org/r/20240812062017.2674441-1-ruanjinjie@huawei.com Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Suggested-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Dave Young <dyoung@redhat.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Linus Torvalds
|
e724918b37 |
hardening fixes for v6.11-rc4
- gcc-plugins: randstruct: Remove GCC 4.7 or newer requirement (Thorsten Blum) - kallsyms: Clean up interaction with LTO suffixes (Song Liu) - refcount: Report UAF for refcount_sub_and_test(0) when counter==0 (Petr Pavlu) - kunit/overflow: Avoid misallocation of driver name (Ivan Orlov) -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRSPkdeREjth1dHnSE2KwveOeQkuwUCZr5D6wAKCRA2KwveOeQk u5dXAQC9ddd3iHqDAWfbCLY41/5K3KByFspVqf8hw2sFK3Uq9wD/eWU0hWFIk1gq 1hUSb7vExo+oiahYPKIUMx5Zf69hHAk= =dmVd -----END PGP SIGNATURE----- Merge tag 'hardening-v6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardening fixes from Kees Cook: - gcc-plugins: randstruct: Remove GCC 4.7 or newer requirement (Thorsten Blum) - kallsyms: Clean up interaction with LTO suffixes (Song Liu) - refcount: Report UAF for refcount_sub_and_test(0) when counter==0 (Petr Pavlu) - kunit/overflow: Avoid misallocation of driver name (Ivan Orlov) * tag 'hardening-v6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: kallsyms: Match symbols exactly with CONFIG_LTO_CLANG kallsyms: Do not cleanup .llvm.<hash> suffix before sorting symbols kunit/overflow: Fix UB in overflow_allocation_test gcc-plugins: randstruct: Remove GCC 4.7 or newer requirement refcount: Report UAF for refcount_sub_and_test(0) when counter==0 |
||
Song Liu
|
fb6a421fb6 |
kallsyms: Match symbols exactly with CONFIG_LTO_CLANG
With CONFIG_LTO_CLANG=y, the compiler may add .llvm.<hash> suffix to
function names to avoid duplication. APIs like kallsyms_lookup_name()
and kallsyms_on_each_match_symbol() tries to match these symbol names
without the .llvm.<hash> suffix, e.g., match "c_stop" with symbol
c_stop.llvm.17132674095431275852. This turned out to be problematic
for use cases that require exact match, for example, livepatch.
Fix this by making the APIs to match symbols exactly.
Also cleanup kallsyms_selftests accordingly.
Signed-off-by: Song Liu <song@kernel.org>
Fixes:
|
||
Linus Torvalds
|
4ac0f08f44 |
vfs-6.11-rc4.fixes
-----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZrym4AAKCRCRxhvAZXjc oqT3AP9ydoUNavaZcRayH8r3ybvz9+aJGJ6Q7NznFVCk71vn0gD/buLzmq96Muns M5DWHbft2AFwK0Rz2nx8j5OXUeHwrQg= =HZBL -----END PGP SIGNATURE----- Merge tag 'vfs-6.11-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: "VFS: - Fix the name of file lease slab cache. When file leases were split out of file locks the name of the file lock slab cache was used for the file leases slab cache as well. - Fix a type in take_fd() helper. - Fix infinite directory iteration for stable offsets in tmpfs. - When the icache is pruned all reclaimable inodes are marked with I_FREEING and other processes that try to lookup such inodes will block. But some filesystems like ext4 can trigger lookups in their inode evict callback causing deadlocks. Ext4 does such lookups if the ea_inode feature is used whereby a separate inode may be used to store xattrs. Introduce I_LRU_ISOLATING which pins the inode while its pages are reclaimed. This avoids inode deletion during inode_lru_isolate() avoiding the deadlock and evict is made to wait until I_LRU_ISOLATING is done. netfs: - Fault in smaller chunks for non-large folio mappings for filesystems that haven't been converted to large folios yet. - Fix the CONFIG_NETFS_DEBUG config option. The config option was renamed a short while ago and that introduced two minor issues. First, it depended on CONFIG_NETFS whereas it wants to depend on CONFIG_NETFS_SUPPORT. The former doesn't exist, while the latter does. Second, the documentation for the config option wasn't fixed up. - Revert the removal of the PG_private_2 writeback flag as ceph is using it and fix how that flag is handled in netfs. - Fix DIO reads on 9p. A program watching a file on a 9p mount wouldn't see any changes in the size of the file being exported by the server if the file was changed directly in the source filesystem. Fix this by attempting to read the full size specified when a DIO read is requested. - Fix a NULL pointer dereference bug due to a data race where a cachefiles cookies was retired even though it was still in use. Check the cookie's n_accesses counter before discarding it. nsfs: - Fix ioctl declaration for NS_GET_MNTNS_ID from _IO() to _IOR() as the kernel is writing to userspace. pidfs: - Prevent the creation of pidfds for kthreads until we have a use-case for it and we know the semantics we want. It also confuses userspace why they can get pidfds for kthreads. squashfs: - Fix an unitialized value bug reported by KMSAN caused by a corrupted symbolic link size read from disk. Check that the symbolic link size is not larger than expected" * tag 'vfs-6.11-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: Squashfs: sanity check symbolic link size 9p: Fix DIO read through netfs vfs: Don't evict inode under the inode lru traversing context netfs: Fix handling of USE_PGPRIV2 and WRITE_TO_CACHE flags netfs, ceph: Revert "netfs: Remove deprecated use of PG_private_2 as a second writeback flag" file: fix typo in take_fd() comment pidfd: prevent creation of pidfds for kthreads netfs: clean up after renaming FSCACHE_DEBUG config libfs: fix infinite directory reads for offset dir nsfs: fix ioctl declaration fs/netfs/fscache_cookie: add missing "n_accesses" check filelock: fix name of file_lease slab cache netfs: Fault in smaller chunks for non-large folio mappings |
||
Kyle Huey
|
100bff2381 |
perf/bpf: Don't call bpf_overflow_handler() for tracing events
The regressing commit is new in 6.10. It assumed that anytime event->prog
is set bpf_overflow_handler() should be invoked to execute the attached bpf
program. This assumption is false for tracing events, and as a result the
regressing commit broke bpftrace by invoking the bpf handler with garbage
inputs on overflow.
Prior to the regression the overflow handlers formed a chain (of length 0,
1, or 2) and perf_event_set_bpf_handler() (the !tracing case) added
bpf_overflow_handler() to that chain, while perf_event_attach_bpf_prog()
(the tracing case) did not. Both set event->prog. The chain of overflow
handlers was replaced by a single overflow handler slot and a fixed call to
bpf_overflow_handler() when appropriate. This modifies the condition there
to check event->prog->type == BPF_PROG_TYPE_PERF_EVENT, restoring the
previous behavior and fixing bpftrace.
Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Reported-by: Joe Damato <jdamato@fastly.com>
Closes: https://lore.kernel.org/lkml/ZpFfocvyF3KHaSzF@LQ3V64L9R2/
Fixes:
|
||
Ryo Takakura
|
bcc954c6ca |
printk/panic: Allow cpu backtraces to be written into ringbuffer during panic
commit |
||
Yonghong Song
|
bed2eb964c |
bpf: Fix a kernel verifier crash in stacksafe()
Daniel Hodges reported a kernel verifier crash when playing with sched-ext.
Further investigation shows that the crash is due to invalid memory access
in stacksafe(). More specifically, it is the following code:
if (exact != NOT_EXACT &&
old->stack[spi].slot_type[i % BPF_REG_SIZE] !=
cur->stack[spi].slot_type[i % BPF_REG_SIZE])
return false;
The 'i' iterates old->allocated_stack.
If cur->allocated_stack < old->allocated_stack the out-of-bound
access will happen.
To fix the issue add 'i >= cur->allocated_stack' check such that if
the condition is true, stacksafe() should fail. Otherwise,
cur->stack[spi].slot_type[i % BPF_REG_SIZE] memory access is legal.
Fixes:
|
||
Nysal Jan K.A
|
6c17ea1f3e |
cpu/SMT: Enable SMT only if a core is online
If a core is offline then enabling SMT should not online CPUs of
this core. By enabling SMT, what is intended is either changing the SMT
value from "off" to "on" or setting the SMT level (threads per core) from a
lower to higher value.
On PowerPC the ppc64_cpu utility can be used, among other things, to
perform the following functions:
ppc64_cpu --cores-on # Get the number of online cores
ppc64_cpu --cores-on=X # Put exactly X cores online
ppc64_cpu --offline-cores=X[,Y,...] # Put specified cores offline
ppc64_cpu --smt={on|off|value} # Enable, disable or change SMT level
If the user has decided to offline certain cores, enabling SMT should
not online CPUs in those cores. This patch fixes the issue and changes
the behaviour as described, by introducing an arch specific function
topology_is_core_online(). It is currently implemented only for PowerPC.
Fixes:
|
||
Christian Brauner
|
3b5bbe798b
|
pidfd: prevent creation of pidfds for kthreads
It's currently possible to create pidfds for kthreads but it is unclear
what that is supposed to mean. Until we have use-cases for it and we
figured out what behavior we want block the creation of pidfds for
kthreads.
Link: https://lore.kernel.org/r/20240731-gleis-mehreinnahmen-6bbadd128383@brauner
Fixes:
|
||
Linus Torvalds
|
7270e931b5 |
Updates for time keeping:
- Fix a couple of issues in the NTP code where user supplied values are neither sanity checked nor clamped to the operating range. This results in integer overflows and eventualy NTP getting out of sync. According to the history the sanity checks had been removed in favor of clamping the values, but the clamping never worked correctly under all circumstances. The NTP people asked to not bring the sanity checks back as it might break existing applications. Make the clamping work correctly and add it where it's missing - If adjtimex() sets the clock it has to trigger the hrtimer subsystem so it can adjust and if the clock was set into the future expire timers if needed. The caller should provide a bitmask to tell hrtimers which clocks have been adjusted. adjtimex() uses not the proper constant and uses CLOCK_REALTIME instead, which is 0. So hrtimers adjusts only the clocks, but does not check for expired timers, which might make them expire really late. Use the proper bitmask constant instead. -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAma4wQ0THHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYoWNmEACMeq/vMoqbbhfgmTK2+XKfUarF5AX8 61uK/rY6ysO/Qz1P+3K4j+coxhuz2t0ekjIL6htgPE0yU5JR3/VjjUpGIbBLUZfa aY9Ciy0OHFyTaoduyLKyiO/O7GyI6j8vMMhhNyQDaK5Zm+pIin18FqW6udg79HYh bDkVtCWg27M1zFd9aqRAc1EX+uFfCrSUi+1oc+E3/knDrNFUVwKCznAeDQQZii6Y pGmt733o7RRkABSf5T1bNOEVpbMlZowcf7zF3J57otz/lZFuwjRtTdmuG4ha3grs B+4FLNRZFEIEFPW0We43gAW1jLNjIL8xgZ6CMUwkUYOGQ21wmMxTOUCwg6/YMa9Y vBceijrICOa1EsyO28XqgRkfIvhdoNsp+c5rAN4LcQd5T7F0SoQCn9A71LXpPXgK ulnWjAgpt+ovD2+OFX0Ul5ySY04TgPcNVeJfnZeYxpuShlPg0GX+z0RuMl9aLbc3 y11P0PDJiguZaoUZ8lUU2W6XA+eFEA2ZOqP+L6FZwIaDwutmXSjHR//ZkTcNg4/h rIbB8SFsq3BSMo3Ry2p/KMYWoZ1fF3Tm3Qp9/wpiAx1YSTJ6x8LGkHHq5c9qP5ba qJWi0vz8dgTGd2ta/xzglvPVWwT08rvrwACHCTcJp3Jq8uvJ27mQbTvZs6p3cFE6 RkEBGDvEIfADew== =EY09 -----END PGP SIGNATURE----- Merge tag 'timers-urgent-2024-08-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull time keeping fixes from Thomas Gleixner: - Fix a couple of issues in the NTP code where user supplied values are neither sanity checked nor clamped to the operating range. This results in integer overflows and eventualy NTP getting out of sync. According to the history the sanity checks had been removed in favor of clamping the values, but the clamping never worked correctly under all circumstances. The NTP people asked to not bring the sanity checks back as it might break existing applications. Make the clamping work correctly and add it where it's missing - If adjtimex() sets the clock it has to trigger the hrtimer subsystem so it can adjust and if the clock was set into the future expire timers if needed. The caller should provide a bitmask to tell hrtimers which clocks have been adjusted. adjtimex() uses not the proper constant and uses CLOCK_REALTIME instead, which is 0. So hrtimers adjusts only the clocks, but does not check for expired timers, which might make them expire really late. Use the proper bitmask constant instead. * tag 'timers-urgent-2024-08-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timekeeping: Fix bogus clock_was_set() invocation in do_adjtimex() ntp: Safeguard against time_constant overflow ntp: Clamp maxerror and esterror to operating range |
||
Linus Torvalds
|
56fe0a6a9f |
Three small fixes for interrupt core and drivers:
- The interrupt core fails to honor caller supplied affinity hints for non-managed interrupts and uses the system default affinity on startup instead. Set the missing flag in the descriptor to tell the core to use the provided affinity. - Fix a shift out of bounds error in the Xilinx driver - Handle switching to level trigger correctly in the RISCV APLIC driver. It failed to retrigger the interrupt which causes it to become stale. -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAma4vEUTHHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYoQpoEACcirhCU0x7jfGj7KWJtnx1dko1gG9G AN86+1lZaHa63vBysAvvEPFVhrbC9JI09SLFTNYrhFTWk9lZeTr7HC9ZgvH2U/Yp YrYci/5PMBZow7rHjJUcohGM25xFppskMwtUnp1udNsPbXQvY/cFkzi/p5xwfB7b S4P10UuZTLBiHYDylphIjIQpf2ltQiXDcdxLGeeYnMVdQ4W5sJVqj39YfZmq+Au3 E2IwDuA6SyPIMuEbs+rxKHNl30QmaGhU4CmzOE6A6bgcZ9u4AbvSf1+3maeOrOQf Erq3oMPhKemWXHdeTIZiufOjJZjph2qJfMNSzEoYnOO1edA+I9y5BkirngIwUOKX iAl3Oh5f6GwcNuFeVCAW6xr1jMnRDQ3SJUk89wxfgxtZjTVUTjbbtegm97XirhSf +QXXgVX9zpPYwbAVdwsCoSYi+Ne9XPj+ylixRKBzx6+4McnAdx3OllyfRhH7Hk53 BuXGmSdy9/n+093jyNzhdyQ/5U1lL2XrUXoNh79M/duBp6RI0jpet406Ui/Q96VB mXKXG/0imRd/WCWR9KDzKjjWdNcToRcsfta7ZzeekJtFIab6e2+G65lIPALB3rNp 1vNfFEWWTjpHpzN2pmmaRQBbwRBpLUrr89wc2KENuHs7DnQu75O1u5SZLPGwEMEI VuBxXQUKGxZkTA== =hR0M -----END PGP SIGNATURE----- Merge tag 'irq-urgent-2024-08-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Thomas Gleixner: "Three small fixes for interrupt core and drivers: - The interrupt core fails to honor caller supplied affinity hints for non-managed interrupts and uses the system default affinity on startup instead. Set the missing flag in the descriptor to tell the core to use the provided affinity. - Fix a shift out of bounds error in the Xilinx driver - Handle switching to level trigger correctly in the RISCV APLIC driver. It failed to retrigger the interrupt which causes it to become stale" * tag 'irq-urgent-2024-08-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip/riscv-aplic: Retrigger MSI interrupt on source configuration irqchip/xilinx: Fix shift out of bounds genirq/irqdesc: Honor caller provided affinity in alloc_desc() |
||
Linus Torvalds
|
0409cc53c4 |
dma-mapping fix for Linux 6.11
- avoid a deadlock with dma-debug and netconsole (Rik van Riel) -----BEGIN PGP SIGNATURE----- iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAma3CHILHGhjaEBsc3Qu ZGUACgkQD55TZVIEUYPMEg/+M/k+Z+KcFgUNBmsyiPckRHAEBwzC+zpHWcaX0uEJ xYy3F2JW/+ba2p8GzTAwb5gE1DZTmBp0GC9PFYaolRBhhoeRZnWXimO4OynxFf2b vMrqyFixBYG3GeX+pnLFsT5WB+ZoZVknF/Fvxl9RmJjUh8p4KMJw7CLflu4xqsc6 6lX+IlcV637vZOToFXm2h5todAgqoatz1VhyLekGOL5BEUuDw8QjydqpXd7XAG+T S0/rIga0fcmTjn+6+5RUzpcyaVAxy/KVzvHx731kjO/ZUswritxlSydZgtD0Tabj z8N/3ge+TGvekpffSZ6K1EmldCypuQu/WRDlwxDx5LQu4DP4vwkUURToggiQPHlQ 7YP0roOLLfc5zgjQsmpzj/DmymFw+E3bFz6DRw+9f8ftt6rB58ICCO+YXjL4W4aL wu5IrUsIwoc5W7nBkrlUQZbRTrTrvl62HbuO1pMIirZ4bntuJVYLyOTJ+n7RwYFl TukNu5WlkHnHvnMt96TGW5lVKBTDGz1aUYUju41USyLpYCZYsKYrHiEAdf0WFB8q WXprsL6JSSRZ+ukIvucHDdZlBptqaxrLtj3UeALPle05dq12ykG6KOix3FZGVAWA 0WD6SKUV7Z+Cs+WcCnW2zLNuq3NNTiSRCMSvPmSH7soxu3BLRUxPkwTTthgeGlFx DZs= =tNDn -----END PGP SIGNATURE----- Merge tag 'dma-mapping-6.11-2024-08-10' of git://git.infradead.org/users/hch/dma-mapping Pull dma-mapping fix from Christoph Hellwig: - avoid a deadlock with dma-debug and netconsole (Rik van Riel) * tag 'dma-mapping-6.11-2024-08-10' of git://git.infradead.org/users/hch/dma-mapping: dma-debug: avoid deadlock between dma debug vs printk and netconsole |
||
Steven Rostedt
|
d0949cd44a |
tracing: Return from tracing_buffers_read() if the file has been closed
When running the following:
# cd /sys/kernel/tracing/
# echo 1 > events/sched/sched_waking/enable
# echo 1 > events/sched/sched_switch/enable
# echo 0 > tracing_on
# dd if=per_cpu/cpu0/trace_pipe_raw of=/tmp/raw0.dat
The dd task would get stuck in an infinite loop in the kernel. What would
happen is the following:
When ring_buffer_read_page() returns -1 (no data) then a check is made to
see if the buffer is empty (as happens when the page is not full), it will
call wait_on_pipe() to wait until the ring buffer has data. When it is it
will try again to read data (unless O_NONBLOCK is set).
The issue happens when there's a reader and the file descriptor is closed.
The wait_on_pipe() will return when that is the case. But this loop will
continue to try again and wait_on_pipe() will again return immediately and
the loop will continue and never stop.
Simply check if the file was closed before looping and exit out if it is.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20240808235730.78bf63e5@rorschach.local.home
Fixes:
|
||
Linus Torvalds
|
146430a0c2 |
Probes fixes for v6.11-rc2:
- kprobes: Fix misusing str_has_prefix() parameter order to check symbol prefix correctly. - bpf: kprobe: remove unused declaring of bpf_kprobe_override. -----BEGIN PGP SIGNATURE----- iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAma2M6UbHG1hc2FtaS5o aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bGkIH/3ldvZswD2Fl+XA7tvvC 7DJbeYBiYXDo1AqcpbC9dgoQL4EBZOoyXBMMks2von/Qekrq1wU8wQQTvFEfpz9m 7RzYYy8tKZa6/RzHf+vfM8yDCMgka3C4NlFyVaohIBOXDKpIhx1cfvmXixPx1Q9S 9IEdqxWdMrA5FPZH6ks13s+yHqQoAvyN40cmDL9bVETHe4vH4oMABfBjppUzlRcz C9fLv7Aw3GTmkwX8mQYeHRG4sntUcqSjn2Ik1uvWizq2yYAIMe7RAbHXP/Wvl01h p6U8kSb/Q7nFIdF5cHJy/XMH/2wb/1MtqPzXeNrGGhHgnejDPS+0GRJzk1CDr9sc ur0= =/tVN -----END PGP SIGNATURE----- Merge tag 'probes-fixes-v6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull kprobe fixes from Masami Hiramatsu: - Fix misusing str_has_prefix() parameter order to check symbol prefix correctly - bpf: remove unused declaring of bpf_kprobe_override * tag 'probes-fixes-v6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: kprobes: Fix to check symbol prefixes correctly bpf: kprobe: remove unused declaring of bpf_kprobe_override |
||
Linus Torvalds
|
2124d84db2 |
module: make waiting for a concurrent module loader interruptible
The recursive aes-arm-bs module load situation reported by Russell King is getting fixed in the crypto layer, but this in the meantime fixes the "recursive load hangs forever" by just making the waiting for the first module load be interruptible. This should now match the old behavior before commit |
||
Linus Torvalds
|
9466b6ae6b |
tracing fixes for v6.11:
- Have reading of event format files test if the meta data still exists. When a event is freed, a flag (EVENT_FILE_FL_FREED) in the meta data is set to state that it is to prevent any new references to it from happening while waiting for existing references to close. When the last reference closes, the meta data is freed. But the "format" was missing a check to this flag (along with some other files) that allowed new references to happen, and a use-after-free bug to occur. - Have the trace event meta data use the refcount infrastructure instead of relying on its own atomic counters. - Have tracefs inodes use alloc_inode_sb() for allocation instead of using kmem_cache_alloc() directly. - Have eventfs_create_dir() return an ERR_PTR instead of NULL as the callers expect a real object or an ERR_PTR. - Have release_ei() use call_srcu() and not call_rcu() as all the protection is on SRCU and not RCU. - Fix ftrace_graph_ret_addr() to use the task passed in and not current. - Fix overflow bug in get_free_elt() where the counter can overflow the integer and cause an infinite loop. - Remove unused function ring_buffer_nr_pages() - Have tracefs freeing use the inode RCU infrastructure instead of creating its own. When the kernel had randomize structure fields enabled, the rcu field of the tracefs_inode was overlapping the rcu field of the inode structure, and corrupting it. Instead, use the destroy_inode() callback to do the initial cleanup of the code, and then have free_inode() free it. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZrTvXxQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qu39AP9ze6ELpShDrxbXhf0adbNqG2IXMepa MMLqfq8tU8E/vAEAuZXJ6rKXeGvKeONa06ocvWJ0dpb2cy/n4hmx+KtM5gI= =Pkh4 -----END PGP SIGNATURE----- Merge tag 'trace-v6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Have reading of event format files test if the metadata still exists. When a event is freed, a flag (EVENT_FILE_FL_FREED) in the metadata is set to state that it is to prevent any new references to it from happening while waiting for existing references to close. When the last reference closes, the metadata is freed. But the "format" was missing a check to this flag (along with some other files) that allowed new references to happen, and a use-after-free bug to occur. - Have the trace event meta data use the refcount infrastructure instead of relying on its own atomic counters. - Have tracefs inodes use alloc_inode_sb() for allocation instead of using kmem_cache_alloc() directly. - Have eventfs_create_dir() return an ERR_PTR instead of NULL as the callers expect a real object or an ERR_PTR. - Have release_ei() use call_srcu() and not call_rcu() as all the protection is on SRCU and not RCU. - Fix ftrace_graph_ret_addr() to use the task passed in and not current. - Fix overflow bug in get_free_elt() where the counter can overflow the integer and cause an infinite loop. - Remove unused function ring_buffer_nr_pages() - Have tracefs freeing use the inode RCU infrastructure instead of creating its own. When the kernel had randomize structure fields enabled, the rcu field of the tracefs_inode was overlapping the rcu field of the inode structure, and corrupting it. Instead, use the destroy_inode() callback to do the initial cleanup of the code, and then have free_inode() free it. * tag 'trace-v6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracefs: Use generic inode RCU for synchronizing freeing ring-buffer: Remove unused function ring_buffer_nr_pages() tracing: Fix overflow in get_free_elt() function_graph: Fix the ret_stack used by ftrace_graph_ret_addr() eventfs: Use SRCU for freeing eventfs_inodes eventfs: Don't return NULL in eventfs_create_dir() tracefs: Fix inode allocation tracing: Use refcount for trace_event_file reference counter tracing: Have format file honor EVENT_FILE_FL_FREED |
||
Linus Torvalds
|
b3f5620f76 |
bcachefs fixes for 6.11-rc3
Assorted little stuff: - lockdep fixup for lockdep_set_notrack_class() - we can now remove a device when using erasure coding without deadlocking, though we still hit other issues - the "allocator stuck" timeout is now configurable, and messages are ratelimited; default timeout has been increased from 10 seconds to 30 -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAma05JwACgkQE6szbY3K bnYE+Q//VJ6R/UxDoxjk8zgftRcdgwXod6U+/E0Kj3ZBKLYXkcGaWWmmGMkFafBp eL7Y3wtHSKiMsHYX9KEdFUZFLe1KI4c16RgNIXk9nwkF+3/+8pEDHKPFuoGHJH3O HComHGqwVg8Zx2jRNvEkvQ980iH7OBGhCjMFXhJ3xbMGLdw91TQQi49a+Q/vz7QT y3Cl1dgX5xBl7fqKefsYa+X6mpWi4/6t60vJvatI+bvDfznjI6jN3qGVLlQye7tC 6VbJAjHsPPyNMlWa99UaHqDdaM325zR2ES0bsfHd8Up4iAwO8OgjzYQxpYTgi51i 6DTiGEOV2S8gF+Rnprnbzsnau0hEvrtQY2Ub85TCIGbZJa8b+aDIlq9k8jF36O2E 2CUTleQ/E129RxXpkZGsVRpNmemdCi6rHAcluaFEgezX4FJH8BVOwQQq2Xz7rd7E 3ZP5iAWmX0IgOL0VOCP/ZXl/lEMwSk0VAED3jEbT7f2K7rU9nXDO2bIEx1wXDCm1 b32kvmUi2FBjqLHSqvAPEb52tvvZuliMUY7z9dEx+AX9AVC9kGE+amGexosKb/LY nWzey+D0cKHtgbkMFrCClkpg75Tnt9ISJbad53+5qhN8an/a71djdj8Zk0UQnQjv 6Amv4Ns1lDo3XGC1QtYkF5HqiWaupbUXAftptpS4Av4X1zZEQIc= =q1dD -----END PGP SIGNATURE----- Merge tag 'bcachefs-2024-08-08' of git://evilpiepirate.org/bcachefs Pull bcachefs fixes from Kent Overstreet: "Assorted little stuff: - lockdep fixup for lockdep_set_notrack_class() - we can now remove a device when using erasure coding without deadlocking, though we still hit other issues - the 'allocator stuck' timeout is now configurable, and messages are ratelimited. The default timeout has been increased from 10 seconds to 30" * tag 'bcachefs-2024-08-08' of git://evilpiepirate.org/bcachefs: bcachefs: Use bch2_wait_on_allocator() in btree node alloc path bcachefs: Make allocator stuck timeout configurable, ratelimit messages bcachefs: Add missing path_traverse() to btree_iter_next_node() bcachefs: ec should not allocate from ro devs bcachefs: Improved allocator debugging for ec bcachefs: Add missing bch2_trans_begin() call bcachefs: Add a comment for bucket helper types bcachefs: Don't rely on implicit unsigned -> signed integer conversion lockdep: Fix lockdep_set_notrack_class() for CONFIG_LOCK_STAT bcachefs: Fix double free of ca->buckets_nouse |
||
Linus Torvalds
|
cb5b81bc9a |
module: warn about excessively long module waits
Russell King reported that the arm cbc(aes) crypto module hangs when loaded, and Herbert Xu bisected it to commit |
||
Linus Torvalds
|
660e4b18a7 |
9 hotfixes. 5 are cc:stable, 4 either pertain to post-6.10 material or
aren't considered necessary for earlier kernels. 5 are MM and 4 are non-MM. No identifiable theme here - please see the individual changelogs. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZrQhyAAKCRDdBJ7gKXxA jvLLAP46sQ/HspAbx+5JoeKBTiX6XW4Hfd+MAk++EaTAyAhnxQD+Mfq7rPOIHm/G wiXPVvLO8FEx0lbq06rnXvdotaWFrQg= =mlE4 -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2024-08-07-18-32' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "Nine hotfixes. Five are cc:stable, the others either pertain to post-6.10 material or aren't considered necessary for earlier kernels. Five are MM and four are non-MM. No identifiable theme here - please see the individual changelogs" * tag 'mm-hotfixes-stable-2024-08-07-18-32' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: padata: Fix possible divide-by-0 panic in padata_mt_helper() mailmap: update entry for David Heidelberg memcg: protect concurrent access to mem_cgroup_idr mm: shmem: fix incorrect aligned index when checking conflicts mm: shmem: avoid allocating huge pages larger than MAX_PAGECACHE_ORDER for shmem mm: list_lru: fix UAF for memory cgroup kcov: properly check for softirq context MAINTAINERS: Update LTP members and web selftests: mm: add s390 to ARCH check |
||
Waiman Long
|
6d45e1c948 |
padata: Fix possible divide-by-0 panic in padata_mt_helper()
We are hit with a not easily reproducible divide-by-0 panic in padata.c at
bootup time.
[ 10.017908] Oops: divide error: 0000 1 PREEMPT SMP NOPTI
[ 10.017908] CPU: 26 PID: 2627 Comm: kworker/u1666:1 Not tainted 6.10.0-15.el10.x86_64 #1
[ 10.017908] Hardware name: Lenovo ThinkSystem SR950 [7X12CTO1WW]/[7X12CTO1WW], BIOS [PSE140J-2.30] 07/20/2021
[ 10.017908] Workqueue: events_unbound padata_mt_helper
[ 10.017908] RIP: 0010:padata_mt_helper+0x39/0xb0
:
[ 10.017963] Call Trace:
[ 10.017968] <TASK>
[ 10.018004] ? padata_mt_helper+0x39/0xb0
[ 10.018084] process_one_work+0x174/0x330
[ 10.018093] worker_thread+0x266/0x3a0
[ 10.018111] kthread+0xcf/0x100
[ 10.018124] ret_from_fork+0x31/0x50
[ 10.018138] ret_from_fork_asm+0x1a/0x30
[ 10.018147] </TASK>
Looking at the padata_mt_helper() function, the only way a divide-by-0
panic can happen is when ps->chunk_size is 0. The way that chunk_size is
initialized in padata_do_multithreaded(), chunk_size can be 0 when the
min_chunk in the passed-in padata_mt_job structure is 0.
Fix this divide-by-0 panic by making sure that chunk_size will be at least
1 no matter what the input parameters are.
Link: https://lkml.kernel.org/r/20240806174647.1050398-1-longman@redhat.com
Fixes:
|
||
Andrey Konovalov
|
7d4df2dad3 |
kcov: properly check for softirq context
When collecting coverage from softirqs, KCOV uses in_serving_softirq() to check whether the code is running in the softirq context. Unfortunately, in_serving_softirq() is > 0 even when the code is running in the hardirq or NMI context for hardirqs and NMIs that happened during a softirq. As a result, if a softirq handler contains a remote coverage collection section and a hardirq with another remote coverage collection section happens during handling the softirq, KCOV incorrectly detects a nested softirq coverate collection section and prints a WARNING, as reported by syzbot. This issue was exposed by commit |
||
Jianhui Zhou
|
58f7e4d7ba |
ring-buffer: Remove unused function ring_buffer_nr_pages()
Because ring_buffer_nr_pages() is not an inline function and user accesses buffer->buffers[cpu]->nr_pages directly, the function ring_buffer_nr_pages is removed. Signed-off-by: Jianhui Zhou <912460177@qq.com> Link: https://lore.kernel.org/tencent_F4A7E9AB337F44E0F4B858D07D19EF460708@qq.com Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Tze-nan Wu
|
bcf86c01ca |
tracing: Fix overflow in get_free_elt()
"tracing_map->next_elt" in get_free_elt() is at risk of overflowing.
Once it overflows, new elements can still be inserted into the tracing_map
even though the maximum number of elements (`max_elts`) has been reached.
Continuing to insert elements after the overflow could result in the
tracing_map containing "tracing_map->max_size" elements, leaving no empty
entries.
If any attempt is made to insert an element into a full tracing_map using
`__tracing_map_insert()`, it will cause an infinite loop with preemption
disabled, leading to a CPU hang problem.
Fix this by preventing any further increments to "tracing_map->next_elt"
once it reaches "tracing_map->max_elt".
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Fixes:
|
||
Petr Pavlu
|
604b72b325 |
function_graph: Fix the ret_stack used by ftrace_graph_ret_addr()
When ftrace_graph_ret_addr() is invoked to convert a found stack return
address to its original value, the function can end up producing the
following crash:
[ 95.442712] BUG: kernel NULL pointer dereference, address: 0000000000000028
[ 95.442720] #PF: supervisor read access in kernel mode
[ 95.442724] #PF: error_code(0x0000) - not-present page
[ 95.442727] PGD 0 P4D 0-
[ 95.442731] Oops: Oops: 0000 [#1] PREEMPT SMP PTI
[ 95.442736] CPU: 1 UID: 0 PID: 2214 Comm: insmod Kdump: loaded Tainted: G OE K 6.11.0-rc1-default #1 67c62a3b3720562f7e7db5f11c1fdb40b7a2857c
[ 95.442747] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE, [K]=LIVEPATCH
[ 95.442750] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
[ 95.442754] RIP: 0010:ftrace_graph_ret_addr+0x42/0xc0
[ 95.442766] Code: [...]
[ 95.442773] RSP: 0018:ffff979b80ff7718 EFLAGS: 00010006
[ 95.442776] RAX: ffffffff8ca99b10 RBX: ffff979b80ff7760 RCX: ffff979b80167dc0
[ 95.442780] RDX: ffffffff8ca99b10 RSI: ffff979b80ff7790 RDI: 0000000000000005
[ 95.442783] RBP: 0000000000000001 R08: 0000000000000005 R09: 0000000000000000
[ 95.442786] R10: 0000000000000005 R11: 0000000000000000 R12: ffffffff8e9491e0
[ 95.442790] R13: ffffffff8d6f70f0 R14: ffff979b80167da8 R15: ffff979b80167dc8
[ 95.442793] FS: 00007fbf83895740(0000) GS:ffff8a0afdd00000(0000) knlGS:0000000000000000
[ 95.442797] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 95.442800] CR2: 0000000000000028 CR3: 0000000005070002 CR4: 0000000000370ef0
[ 95.442806] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 95.442809] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 95.442816] Call Trace:
[ 95.442823] <TASK>
[ 95.442896] unwind_next_frame+0x20d/0x830
[ 95.442905] arch_stack_walk_reliable+0x94/0xe0
[ 95.442917] stack_trace_save_tsk_reliable+0x7d/0xe0
[ 95.442922] klp_check_and_switch_task+0x55/0x1a0
[ 95.442931] task_call_func+0xd3/0xe0
[ 95.442938] klp_try_switch_task.part.5+0x37/0x150
[ 95.442942] klp_try_complete_transition+0x79/0x2d0
[ 95.442947] klp_enable_patch+0x4db/0x890
[ 95.442960] do_one_initcall+0x41/0x2e0
[ 95.442968] do_init_module+0x60/0x220
[ 95.442975] load_module+0x1ebf/0x1fb0
[ 95.443004] init_module_from_file+0x88/0xc0
[ 95.443010] idempotent_init_module+0x190/0x240
[ 95.443015] __x64_sys_finit_module+0x5b/0xc0
[ 95.443019] do_syscall_64+0x74/0x160
[ 95.443232] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 95.443236] RIP: 0033:0x7fbf82f2c709
[ 95.443241] Code: [...]
[ 95.443247] RSP: 002b:00007fffd5ea3b88 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[ 95.443253] RAX: ffffffffffffffda RBX: 000056359c48e750 RCX: 00007fbf82f2c709
[ 95.443257] RDX: 0000000000000000 RSI: 000056356ed4efc5 RDI: 0000000000000003
[ 95.443260] RBP: 000056356ed4efc5 R08: 0000000000000000 R09: 00007fffd5ea3c10
[ 95.443263] R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000000
[ 95.443267] R13: 000056359c48e6f0 R14: 0000000000000000 R15: 0000000000000000
[ 95.443272] </TASK>
[ 95.443274] Modules linked in: [...]
[ 95.443385] Unloaded tainted modules: intel_uncore_frequency(E):1 isst_if_common(E):1 skx_edac(E):1
[ 95.443414] CR2: 0000000000000028
The bug can be reproduced with kselftests:
cd linux/tools/testing/selftests
make TARGETS='ftrace livepatch'
(cd ftrace; ./ftracetest test.d/ftrace/fgraph-filter.tc)
(cd livepatch; ./test-livepatch.sh)
The problem is that ftrace_graph_ret_addr() is supposed to operate on the
ret_stack of a selected task but wrongly accesses the ret_stack of the
current task. Specifically, the above NULL dereference occurs when
task->curr_ret_stack is non-zero, but current->ret_stack is NULL.
Correct ftrace_graph_ret_addr() to work with the right ret_stack.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reported-by: Miroslav Benes <mbenes@suse.cz>
Link: https://lore.kernel.org/20240803131211.17255-1-petr.pavlu@suse.com
Fixes:
|
||
Steven Rostedt
|
6e2fdceffd |
tracing: Use refcount for trace_event_file reference counter
Instead of using an atomic counter for the trace_event_file reference counter, use the refcount interface. It has various checks to make sure the reference counting is correct, and will warn if it detects an error (like refcount_inc() on '0'). Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20240726144208.687cce24@rorschach.local.home Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt
|
b156040869 |
tracing: Have format file honor EVENT_FILE_FL_FREED
When eventfs was introduced, special care had to be done to coordinate the
freeing of the file meta data with the files that are exposed to user
space. The file meta data would have a ref count that is set when the file
is created and would be decremented and freed after the last user that
opened the file closed it. When the file meta data was to be freed, it
would set a flag (EVENT_FILE_FL_FREED) to denote that the file is freed,
and any new references made (like new opens or reads) would fail as it is
marked freed. This allowed other meta data to be freed after this flag was
set (under the event_mutex).
All the files that were dynamically created in the events directory had a
pointer to the file meta data and would call event_release() when the last
reference to the user space file was closed. This would be the time that it
is safe to free the file meta data.
A shortcut was made for the "format" file. It's i_private would point to
the "call" entry directly and not point to the file's meta data. This is
because all format files are the same for the same "call", so it was
thought there was no reason to differentiate them. The other files
maintain state (like the "enable", "trigger", etc). But this meant if the
file were to disappear, the "format" file would be unaware of it.
This caused a race that could be trigger via the user_events test (that
would create dynamic events and free them), and running a loop that would
read the user_events format files:
In one console run:
# cd tools/testing/selftests/user_events
# while true; do ./ftrace_test; done
And in another console run:
# cd /sys/kernel/tracing/
# while true; do cat events/user_events/__test_event/format; done 2>/dev/null
With KASAN memory checking, it would trigger a use-after-free bug report
(which was a real bug). This was because the format file was not checking
the file's meta data flag "EVENT_FILE_FL_FREED", so it would access the
event that the file meta data pointed to after the event was freed.
After inspection, there are other locations that were found to not check
the EVENT_FILE_FL_FREED flag when accessing the trace_event_file. Add a
new helper function: event_file_file() that will make sure that the
event_mutex is held, and will return NULL if the trace_event_file has the
EVENT_FILE_FL_FREED flag set. Have the first reference of the struct file
pointer use event_file_file() and check for NULL. Later uses can still use
the event_file_data() helper function if the event_mutex is still held and
was not released since the event_file_file() call.
Link: https://lore.kernel.org/all/20240719204701.1605950-1-minipli@grsecurity.net/
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Ajay Kaher <ajay.kaher@broadcom.com>
Cc: Ilkka Naulapää <digirigawa@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Cc: Beau Belgrave <beaub@linux.microsoft.com>
Cc: Florian Fainelli <florian.fainelli@broadcom.com>
Cc: Alexey Makhalov <alexey.makhalov@broadcom.com>
Cc: Vasavi Sirnapalli <vasavi.sirnapalli@broadcom.com>
Link: https://lore.kernel.org/20240730110657.3b69d3c1@gandalf.local.home
Fixes:
|
||
Shay Drory
|
edbbaae42a |
genirq/irqdesc: Honor caller provided affinity in alloc_desc()
Currently, whenever a caller is providing an affinity hint for an
interrupt, the allocation code uses it to calculate the node and copies the
cpumask into irq_desc::affinity.
If the affinity for the interrupt is not marked 'managed' then the startup
of the interrupt ignores irq_desc::affinity and uses the system default
affinity mask.
Prevent this by setting the IRQD_AFFINITY_SET flag for the interrupt in the
allocator, which causes irq_setup_affinity() to use irq_desc::affinity on
interrupt startup if the mask contains an online CPU.
[ tglx: Massaged changelog ]
Fixes:
|
||
Kent Overstreet
|
ff9bf4b341 |
lockdep: Fix lockdep_set_notrack_class() for CONFIG_LOCK_STAT
We won't find a contended lock if it's not being tracked. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> |
||
Rik van Riel
|
bd44ca3de4 |
dma-debug: avoid deadlock between dma debug vs printk and netconsole
Currently the dma debugging code can end up indirectly calling printk under the radix_lock. This happens when a radix tree node allocation fails. This is a problem because the printk code, when used together with netconsole, can end up inside the dma debugging code while trying to transmit a message over netcons. This creates the possibility of either a circular deadlock on the same CPU, with that CPU trying to grab the radix_lock twice, or an ABBA deadlock between different CPUs, where one CPU grabs the console lock first and then waits for the radix_lock, while the other CPU is holding the radix_lock and is waiting for the console lock. The trace captured by lockdep is of the ABBA variant. -> #2 (&dma_entry_hash[i].lock){-.-.}-{2:2}: _raw_spin_lock_irqsave+0x5a/0x90 debug_dma_map_page+0x79/0x180 dma_map_page_attrs+0x1d2/0x2f0 bnxt_start_xmit+0x8c6/0x1540 netpoll_start_xmit+0x13f/0x180 netpoll_send_skb+0x20d/0x320 netpoll_send_udp+0x453/0x4a0 write_ext_msg+0x1b9/0x460 console_flush_all+0x2ff/0x5a0 console_unlock+0x55/0x180 vprintk_emit+0x2e3/0x3c0 devkmsg_emit+0x5a/0x80 devkmsg_write+0xfd/0x180 do_iter_readv_writev+0x164/0x1b0 vfs_writev+0xf9/0x2b0 do_writev+0x6d/0x110 do_syscall_64+0x80/0x150 entry_SYSCALL_64_after_hwframe+0x4b/0x53 -> #0 (console_owner){-.-.}-{0:0}: __lock_acquire+0x15d1/0x31a0 lock_acquire+0xe8/0x290 console_flush_all+0x2ea/0x5a0 console_unlock+0x55/0x180 vprintk_emit+0x2e3/0x3c0 _printk+0x59/0x80 warn_alloc+0x122/0x1b0 __alloc_pages_slowpath+0x1101/0x1120 __alloc_pages+0x1eb/0x2c0 alloc_slab_page+0x5f/0x150 new_slab+0x2dc/0x4e0 ___slab_alloc+0xdcb/0x1390 kmem_cache_alloc+0x23d/0x360 radix_tree_node_alloc+0x3c/0xf0 radix_tree_insert+0xf5/0x230 add_dma_entry+0xe9/0x360 dma_map_page_attrs+0x1d2/0x2f0 __bnxt_alloc_rx_frag+0x147/0x180 bnxt_alloc_rx_data+0x79/0x160 bnxt_rx_skb+0x29/0xc0 bnxt_rx_pkt+0xe22/0x1570 __bnxt_poll_work+0x101/0x390 bnxt_poll+0x7e/0x320 __napi_poll+0x29/0x160 net_rx_action+0x1e0/0x3e0 handle_softirqs+0x190/0x510 run_ksoftirqd+0x4e/0x90 smpboot_thread_fn+0x1a8/0x270 kthread+0x102/0x120 ret_from_fork+0x2f/0x40 ret_from_fork_asm+0x11/0x20 This bug is more likely than it seems, because when one CPU has run out of memory, chances are the other has too. The good news is, this bug is hidden behind the CONFIG_DMA_API_DEBUG, so not many users are likely to trigger it. Signed-off-by: Rik van Riel <riel@surriel.com> Reported-by: Konstantin Ovsepian <ovs@meta.com> Signed-off-by: Christoph Hellwig <hch@lst.de> |
||
Uros Bizjak
|
c4c8f369b6 |
workqueue: Correct declaration of cpu_pwq in struct workqueue_struct
cpu_pwq is used in various percpu functions that expect variable in __percpu address space. Correct the declaration of cpu_pwq to struct pool_workqueue __rcu * __percpu *cpu_pwq to declare the variable as __percpu pointer. The patch also fixes following sparse errors: workqueue.c:380:37: warning: duplicate [noderef] workqueue.c:380:37: error: multiple address spaces given: __rcu & __percpu workqueue.c:2271:15: error: incompatible types in comparison expression (different address spaces): workqueue.c:2271:15: struct pool_workqueue [noderef] __rcu * workqueue.c:2271:15: struct pool_workqueue [noderef] __percpu * and uncovers a couple of exisiting "incorrect type in assignment" warnings (from __rcu address space), which this patch does not address. Found by GCC's named address space checks. There were no changes in the resulting object files. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
Tejun Heo
|
8bc35475ef |
workqueue: Fix spruious data race in __flush_work()
When flushing a work item for cancellation, __flush_work() knows that it exclusively owns the work item through its PENDING bit. |
||
Lai Jiangshan
|
98cc1730c8 |
workqueue: Remove incorrect "WARN_ON_ONCE(!list_empty(&worker->entry));" from dying worker
The commit |
||
Will Deacon
|
38f7e14519 |
workqueue: Fix UBSAN 'subtraction overflow' error in shift_and_mask()
UBSAN reports the following 'subtraction overflow' error when booting
in a virtual machine on Android:
| Internal error: UBSAN: integer subtraction overflow: 00000000f2005515 [#1] PREEMPT SMP
| Modules linked in:
| CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.10.0-00006-g3cbe9e5abd46-dirty #4
| Hardware name: linux,dummy-virt (DT)
| pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
| pc : cancel_delayed_work+0x34/0x44
| lr : cancel_delayed_work+0x2c/0x44
| sp : ffff80008002ba60
| x29: ffff80008002ba60 x28: 0000000000000000 x27: 0000000000000000
| x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
| x23: 0000000000000000 x22: 0000000000000000 x21: ffff1f65014cd3c0
| x20: ffffc0e84c9d0da0 x19: ffffc0e84cab3558 x18: ffff800080009058
| x17: 00000000247ee1f8 x16: 00000000247ee1f8 x15: 00000000bdcb279d
| x14: 0000000000000001 x13: 0000000000000075 x12: 00000a0000000000
| x11: ffff1f6501499018 x10: 00984901651fffff x9 : ffff5e7cc35af000
| x8 : 0000000000000001 x7 : 3d4d455453595342 x6 : 000000004e514553
| x5 : ffff1f6501499265 x4 : ffff1f650ff60b10 x3 : 0000000000000620
| x2 : ffff80008002ba78 x1 : 0000000000000000 x0 : 0000000000000000
| Call trace:
| cancel_delayed_work+0x34/0x44
| deferred_probe_extend_timeout+0x20/0x70
| driver_register+0xa8/0x110
| __platform_driver_register+0x28/0x3c
| syscon_init+0x24/0x38
| do_one_initcall+0xe4/0x338
| do_initcall_level+0xac/0x178
| do_initcalls+0x5c/0xa0
| do_basic_setup+0x20/0x30
| kernel_init_freeable+0x8c/0xf8
| kernel_init+0x28/0x1b4
| ret_from_fork+0x10/0x20
| Code: f9000fbf 97fffa2f 39400268 37100048 (d42aa2a0)
| ---[ end trace 0000000000000000 ]---
| Kernel panic - not syncing: UBSAN: integer subtraction overflow: Fatal exception
This is due to shift_and_mask() using a signed immediate to construct
the mask and being called with a shift of 31 (WORK_OFFQ_POOL_SHIFT) so
that it ends up decrementing from INT_MIN.
Use an unsigned constant '1U' to generate the mask in shift_and_mask().
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Fixes:
|
||
Waiman Long
|
ff0ce721ec |
cgroup/cpuset: Eliminate unncessary sched domains rebuilds in hotplug
It was found that some hotplug operations may cause multiple rebuild_sched_domains_locked() calls. Some of those intermediate calls may use cpuset states not in the final correct form leading to incorrect sched domain setting. Fix this problem by using the existing force_rebuild flag to inhibit immediate rebuild_sched_domains_locked() calls if set and only doing one final call at the end. Also renaming the force_rebuild flag to force_sd_rebuild to make its meaning for clear. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
Waiman Long
|
311a1bdc44 |
cgroup/cpuset: Clear effective_xcpus on cpus_allowed clearing only if cpus.exclusive not set
Commit |
||
Chen Ridong
|
959ab6350a |
cgroup/cpuset: fix panic caused by partcmd_update
We find a bug as below:
BUG: unable to handle page fault for address: 00000003
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 3 PID: 358 Comm: bash Tainted: G W I 6.6.0-10893-g60d6
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/4
RIP: 0010:partition_sched_domains_locked+0x483/0x600
Code: 01 48 85 d2 74 0d 48 83 05 29 3f f8 03 01 f3 48 0f bc c2 89 c0 48 9
RSP: 0018:ffffc90000fdbc58 EFLAGS: 00000202
RAX: 0000000100000003 RBX: ffff888100b3dfa0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000002fe80
RBP: ffff888100b3dfb0 R08: 0000000000000001 R09: 0000000000000000
R10: ffffc90000fdbcb0 R11: 0000000000000004 R12: 0000000000000002
R13: ffff888100a92b48 R14: 0000000000000000 R15: 0000000000000000
FS: 00007f44a5425740(0000) GS:ffff888237d80000(0000) knlGS:0000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000100030973 CR3: 000000010722c000 CR4: 00000000000006e0
Call Trace:
<TASK>
? show_regs+0x8c/0xa0
? __die_body+0x23/0xa0
? __die+0x3a/0x50
? page_fault_oops+0x1d2/0x5c0
? partition_sched_domains_locked+0x483/0x600
? search_module_extables+0x2a/0xb0
? search_exception_tables+0x67/0x90
? kernelmode_fixup_or_oops+0x144/0x1b0
? __bad_area_nosemaphore+0x211/0x360
? up_read+0x3b/0x50
? bad_area_nosemaphore+0x1a/0x30
? exc_page_fault+0x890/0xd90
? __lock_acquire.constprop.0+0x24f/0x8d0
? __lock_acquire.constprop.0+0x24f/0x8d0
? asm_exc_page_fault+0x26/0x30
? partition_sched_domains_locked+0x483/0x600
? partition_sched_domains_locked+0xf0/0x600
rebuild_sched_domains_locked+0x806/0xdc0
update_partition_sd_lb+0x118/0x130
cpuset_write_resmask+0xffc/0x1420
cgroup_file_write+0xb2/0x290
kernfs_fop_write_iter+0x194/0x290
new_sync_write+0xeb/0x160
vfs_write+0x16f/0x1d0
ksys_write+0x81/0x180
__x64_sys_write+0x21/0x30
x64_sys_call+0x2f25/0x4630
do_syscall_64+0x44/0xb0
entry_SYSCALL_64_after_hwframe+0x78/0xe2
RIP: 0033:0x7f44a553c887
It can be reproduced with cammands:
cd /sys/fs/cgroup/
mkdir test
cd test/
echo +cpuset > ../cgroup.subtree_control
echo root > cpuset.cpus.partition
cat /sys/fs/cgroup/cpuset.cpus.effective
0-3
echo 0-3 > cpuset.cpus // taking away all cpus from root
This issue is caused by the incorrect rebuilding of scheduling domains.
In this scenario, test/cpuset.cpus.partition should be an invalid root
and should not trigger the rebuilding of scheduling domains. When calling
update_parent_effective_cpumask with partcmd_update, if newmask is not
null, it should recheck newmask whether there are cpus is available
for parect/cs that has tasks.
Fixes:
|
||
Thomas Gleixner
|
5916be8a53 |
timekeeping: Fix bogus clock_was_set() invocation in do_adjtimex()
The addition of the bases argument to clock_was_set() fixed up all call
sites correctly except for do_adjtimex(). This uses CLOCK_REALTIME
instead of CLOCK_SET_WALL as argument. CLOCK_REALTIME is 0.
As a result the effect of that clock_was_set() notification is incomplete
and might result in timers expiring late because the hrtimer code does
not re-evaluate the affected clock bases.
Use CLOCK_SET_WALL instead of CLOCK_REALTIME to tell the hrtimers code
which clock bases need to be re-evaluated.
Fixes:
|
||
Justin Stitt
|
06c03c8edc |
ntp: Safeguard against time_constant overflow
Using syzkaller with the recently reintroduced signed integer overflow sanitizer produces this UBSAN report: UBSAN: signed-integer-overflow in ../kernel/time/ntp.c:738:18 9223372036854775806 + 4 cannot be represented in type 'long' Call Trace: handle_overflow+0x171/0x1b0 __do_adjtimex+0x1236/0x1440 do_adjtimex+0x2be/0x740 The user supplied time_constant value is incremented by four and then clamped to the operating range. Before commit |