Commit Graph

41263 Commits

Author SHA1 Message Date
Mel Gorman
ef6a22b70f sched/numa: apply the scan delay to every new vma
Pach series "sched/numa: Enhance vma scanning", v3.

The patchset proposes one of the enhancements to numa vma scanning
suggested by Mel.  This is continuation of [3].

Reposting the rebased patchset to akpm mm-unstable tree (March 1) 

Existing mechanism of scan period involves, scan period derived from
per-thread stats.  Process Adaptive autoNUMA [1] proposed to gather NUMA
fault stats at per-process level to capture aplication behaviour better.

During that course of discussion, Mel proposed several ideas to enhance
current numa balancing.  One of the suggestion was below

Track what threads access a VMA.  The suggestion was to use an unsigned
long pid_mask and use the lower bits to tag approximately what threads
access a VMA.  Skip VMAs that did not trap a fault.  This would be
approximate because of PID collisions but would reduce scanning of areas
the thread is not interested in.  The above suggestion intends not to
penalize threads that has no interest in the vma, thus reduce scanning
overhead.

V3 changes are mostly based on PeterZ comments (details below in changes)

Summary of patchset:

Current patchset implements:

1. Delay the vma scanning logic for newly created VMA's so that
   additional overhead of scanning is not incurred for short lived tasks
   (implementation by Mel)

2. Store the information of tasks accessing VMA in 2 windows.  It is
   regularly cleared in (4*sysctl_numa_balancing_scan_delay) interval. 
   The above time is derived from experimenting (Suggested by PeterZ) to
   balance between frequent clearing vs obsolete access data

3. hash_32 used to encode task index accessing VMA information

4. VMA's acess information is used to skip scanning for the tasks
   which had not accessed VMA

Changes since V2:
patch1: 
 - Renaming of structure, macro to function,
 - Add explanation to heuristics
 - Adding more details from result (PeterZ)
 Patch2:
 - Usage of test and set bit (PeterZ)
 - Move storing access PID info to numa_migrate_prep()
 - Add a note on fainess among tasks allowed to scan
   (PeterZ)
 Patch3:
 - Maintain two windows of access PID information
  (PeterZ supported implementation and Gave idea to extend
   to N if needed)
 Patch4:
 - Apply hash_32 function to track VMA accessing PIDs (PeterZ)

Changes since RFC V1:
 - Include Mel's vma scan delay patch
 - Change the accessing pid store logic (Thanks Mel)
 - Fencing structure / code to NUMA_BALANCING (David, Mel)
 - Adding clearing access PID logic (Mel)
 - Descriptive change log ( Mike Rapoport)

Things to ponder over:
==========================================

- Improvement to clearing accessing PIDs logic (discussed in-detail in
  patch3 itself (Done in this patchset by implementing 2 window history)

- Current scan period is not changed in the patchset, so we do see
  frequent tries to scan.  Relaxing scan period dynamically could improve
  results further.

[1] sched/numa: Process Adaptive autoNUMA 
 Link: https://lore.kernel.org/lkml/20220128052851.17162-1-bharata@amd.com/T/

[2] RFC V1 Link: 
  https://lore.kernel.org/all/cover.1673610485.git.raghavendra.kt@amd.com/

[3] V2 Link:
  https://lore.kernel.org/lkml/cover.1675159422.git.raghavendra.kt@amd.com/


Results:
Summary: Huge autonuma cost reduction seen in mmtest. Kernbench improvement 
is more than 5% and huge system time (80%+) improvement from mmtest autonuma.
(dbench had huge std deviation to post)

kernbench
===========
                      6.2.0-mmunstable-base  6.2.0-mmunstable-patched
Amean     user-256    22002.51 (   0.00%)    22649.95 *  -2.94%*
Amean     syst-256    10162.78 (   0.00%)     8214.13 *  19.17%*
Amean     elsp-256      160.74 (   0.00%)      156.92 *   2.38%*

Duration User       66017.43    67959.84
Duration System     30503.15    24657.03
Duration Elapsed      504.61      493.12

                      6.2.0-mmunstable-base  6.2.0-mmunstable-patched
Ops NUMA alloc hit                1738835089.00  1738780310.00
Ops NUMA alloc local              1738834448.00  1738779711.00
Ops NUMA base-page range updates      477310.00      392566.00
Ops NUMA PTE updates                  477310.00      392566.00
Ops NUMA hint faults                   96817.00       87555.00
Ops NUMA hint local faults %           10150.00        2192.00
Ops NUMA hint local percent               10.48           2.50
Ops NUMA pages migrated                86660.00       85363.00
Ops AutoNUMA cost                        489.07         442.14

autonumabench
===============
                      6.2.0-mmunstable-base  6.2.0-mmunstable-patched
Amean     syst-NUMA01                  399.50 (   0.00%)       52.05 *  86.97%*
Amean     syst-NUMA01_THREADLOCAL        0.21 (   0.00%)        0.22 *  -5.41%*
Amean     syst-NUMA02                    0.80 (   0.00%)        0.78 *   2.68%*
Amean     syst-NUMA02_SMT                0.65 (   0.00%)        0.68 *  -3.95%*
Amean     elsp-NUMA01                  313.26 (   0.00%)      313.11 *   0.05%*
Amean     elsp-NUMA01_THREADLOCAL        1.06 (   0.00%)        1.08 *  -1.76%*
Amean     elsp-NUMA02                    3.19 (   0.00%)        3.24 *  -1.52%*
Amean     elsp-NUMA02_SMT                3.72 (   0.00%)        3.61 *   2.92%*

Duration User      396433.47   324835.96
Duration System      2808.70      376.66
Duration Elapsed     2258.61     2258.12

                      6.2.0-mmunstable-base  6.2.0-mmunstable-patched
Ops NUMA alloc hit                  59921806.00    49623489.00
Ops NUMA alloc miss                        0.00           0.00
Ops NUMA interleave hit                    0.00           0.00
Ops NUMA alloc local                59920880.00    49622594.00
Ops NUMA base-page range updates   152259275.00       50075.00
Ops NUMA PTE updates               152259275.00       50075.00
Ops NUMA PMD updates                       0.00           0.00
Ops NUMA hint faults               154660352.00       39014.00
Ops NUMA hint local faults %       138550501.00       23139.00
Ops NUMA hint local percent               89.58          59.31
Ops NUMA pages migrated              8179067.00       14147.00
Ops AutoNUMA cost                     774522.98         195.69


This patch (of 4):

Currently whenever a new task is created we wait for
sysctl_numa_balancing_scan_delay to avoid unnessary scanning overhead. 
Extend the same logic to new or very short-lived VMAs.

[raghavendra.kt@amd.com: add initialization in vm_area_dup())]
Link: https://lkml.kernel.org/r/cover.1677672277.git.raghavendra.kt@amd.com
Link: https://lkml.kernel.org/r/7a6fbba87c8b51e67efd3e74285bb4cb311a16ca.1677672277.git.raghavendra.kt@amd.com
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
Cc: Bharata B Rao <bharata@amd.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Disha Talreja <dishaa.talreja@amd.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 20:03:03 -07:00
Suren Baghdasaryan
c7f8f31c00 mm: separate vma->lock from vm_area_struct
vma->lock being part of the vm_area_struct causes performance regression
during page faults because during contention its count and owner fields
are constantly updated and having other parts of vm_area_struct used
during page fault handling next to them causes constant cache line
bouncing.  Fix that by moving the lock outside of the vm_area_struct.

All attempts to keep vma->lock inside vm_area_struct in a separate cache
line still produce performance regression especially on NUMA machines. 
Smallest regression was achieved when lock is placed in the fourth cache
line but that bloats vm_area_struct to 256 bytes.

Considering performance and memory impact, separate lock looks like the
best option.  It increases memory footprint of each VMA but that can be
optimized later if the new size causes issues.  Note that after this
change vma_init() does not allocate or initialize vma->lock anymore.  A
number of drivers allocate a pseudo VMA on the stack but they never use
the VMA's lock, therefore it does not need to be allocated.  The future
drivers which might need the VMA lock should use
vm_area_alloc()/vm_area_free() to allocate the VMA.

Link: https://lkml.kernel.org/r/20230227173632.3292573-34-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 20:03:02 -07:00
Suren Baghdasaryan
0d2ebf9c3f mm/mmap: free vm_area_struct without call_rcu in exit_mmap
call_rcu() can take a long time when callback offloading is enabled.  Its
use in the vm_area_free can cause regressions in the exit path when
multiple VMAs are being freed.

Because exit_mmap() is called only after the last mm user drops its
refcount, the page fault handlers can't be racing with it.  Any other
possible user like oom-reaper or process_mrelease are already synchronized
using mmap_lock.  Therefore exit_mmap() can free VMAs directly, without
the use of call_rcu().

Expose __vm_area_free() and use it from exit_mmap() to avoid possible
call_rcu() floods and performance regressions caused by it.

Link: https://lkml.kernel.org/r/20230227173632.3292573-33-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 20:03:02 -07:00
Suren Baghdasaryan
f2e13784c1 kernel/fork: assert no VMA readers during its destruction
Assert there are no holders of VMA lock for reading when it is about to be
destroyed.

Link: https://lkml.kernel.org/r/20230227173632.3292573-21-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 20:02:59 -07:00
Suren Baghdasaryan
5e31275cc9 mm: add per-VMA lock and helper functions to control it
Introduce per-VMA locking.  The lock implementation relies on a per-vma
and per-mm sequence counters to note exclusive locking:

  - read lock - (implemented by vma_start_read) requires the vma
    (vm_lock_seq) and mm (mm_lock_seq) sequence counters to differ.
    If they match then there must be a vma exclusive lock held somewhere.
  - read unlock - (implemented by vma_end_read) is a trivial vma->lock
    unlock.
  - write lock - (vma_start_write) requires the mmap_lock to be held
    exclusively and the current mm counter is assigned to the vma counter.
    This will allow multiple vmas to be locked under a single mmap_lock
    write lock (e.g. during vma merging). The vma counter is modified
    under exclusive vma lock.
  - write unlock - (vma_end_write_all) is a batch release of all vma
    locks held. It doesn't pair with a specific vma_start_write! It is
    done before exclusive mmap_lock is released by incrementing mm
    sequence counter (mm_lock_seq).
  - write downgrade - if the mmap_lock is downgraded to the read lock, all
    vma write locks are released as well (effectivelly same as write
    unlock).

Link: https://lkml.kernel.org/r/20230227173632.3292573-13-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 20:02:57 -07:00
Michel Lespinasse
20cce633f4 mm: rcu safe VMA freeing
This prepares for page faults handling under VMA lock, looking up VMAs
under protection of an rcu read lock, instead of the usual mmap read lock.

Link: https://lkml.kernel.org/r/20230227173632.3292573-11-surenb@google.com
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 20:02:57 -07:00
Kirill A. Shutemov
23baf831a3 mm, treewide: redefine MAX_ORDER sanely
MAX_ORDER currently defined as number of orders page allocator supports:
user can ask buddy allocator for page order between 0 and MAX_ORDER-1.

This definition is counter-intuitive and lead to number of bugs all over
the kernel.

Change the definition of MAX_ORDER to be inclusive: the range of orders
user can ask from buddy allocator is 0..MAX_ORDER now.

[kirill@shutemov.name: fix min() warning]
  Link: https://lkml.kernel.org/r/20230315153800.32wib3n5rickolvh@box
[akpm@linux-foundation.org: fix another min_t warning]
[kirill@shutemov.name: fixups per Zi Yan]
  Link: https://lkml.kernel.org/r/20230316232144.b7ic4cif4kjiabws@box.shutemov.name
[akpm@linux-foundation.org: fix underlining in docs]
  Link: https://lore.kernel.org/oe-kbuild-all/202303191025.VRCTk6mP-lkp@intel.com/
Link: https://lkml.kernel.org/r/20230315113133.11326-11-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 19:42:46 -07:00
Kirill A. Shutemov
934487e98f perf/core: fix MAX_ORDER usage in rb_alloc_aux_page()
MAX_ORDER is not inclusive: the maximum allocation order buddy allocator
can deliver is MAX_ORDER-1.

Fix MAX_ORDER usage in rb_alloc_aux_page().

Link: https://lkml.kernel.org/r/20230315113133.11326-7-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 19:42:45 -07:00
Nicholas Piggin
2655421ae6 lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling scheme
On big systems, the mm refcount can become highly contented when doing a
lot of context switching with threaded applications.  user<->idle switch
is one of the important cases.  Abandoning lazy tlb entirely slows this
switching down quite a bit in the common uncontended case, so that is not
viable.

Implement a scheme where lazy tlb mm references do not contribute to the
refcount, instead they get explicitly removed when the refcount reaches
zero.

The final mmdrop() sends IPIs to all CPUs in the mm_cpumask and they
switch away from this mm to init_mm if it was being used as the lazy tlb
mm.  Enabling the shoot lazies option therefore requires that the arch
ensures that mm_cpumask contains all CPUs that could possibly be using mm.
A DEBUG_VM option IPIs every CPU in the system after this to ensure there
are no references remaining before the mm is freed.

Shootdown IPIs cost could be an issue, but they have not been observed to
be a serious problem with this scheme, because short-lived processes tend
not to migrate CPUs much, therefore they don't get much chance to leave
lazy tlb mm references on remote CPUs.  There are a lot of options to
reduce them if necessary, described in comments.

The near-worst-case can be benchmarked with will-it-scale:

  context_switch1_threads -t $(($(nproc) / 2))

This will create nproc threads (nproc / 2 switching pairs) all sharing the
same mm that spread over all CPUs so each CPU does thread->idle->thread
switching.

[ Rik came up with basically the same idea a few years ago, so credit
  to him for that. ]

Link: https://lore.kernel.org/linux-mm/20230118080011.2258375-1-npiggin@gmail.com/
Link: https://lore.kernel.org/all/20180728215357.3249-11-riel@surriel.com/
Link: https://lkml.kernel.org/r/20230203071837.1136453-5-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-28 16:20:08 -07:00
Nicholas Piggin
aa464ba9a1 lazy tlb: introduce lazy tlb mm refcount helper functions
Add explicit _lazy_tlb annotated functions for lazy tlb mm refcounting. 
This makes the lazy tlb mm references more obvious, and allows the
refcounting scheme to be modified in later changes.  There is no
functional change with this patch.

Link: https://lkml.kernel.org/r/20230203071837.1136453-3-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-28 16:20:08 -07:00
Nicholas Piggin
6cad87b0d2 kthread: simplify kthread_use_mm refcounting
Patch series "shoot lazy tlbs (lazy tlb refcount scalability
improvement)", v7.

This series improves scalability of context switching between user and
kernel threads on large systems with a threaded process spread across a
lot of CPUs.

Discussion of v6 here:
https://lore.kernel.org/linux-mm/20230118080011.2258375-1-npiggin@gmail.com/


This patch (of 5):

Remove the special case avoiding refcounting when the mm to be used is the
same as the kernel thread's active (lazy tlb) mm.  kthread_use_mm() should
not be such a performance critical path that this matters much.  This
simplifies a later change to lazy tlb mm refcounting.

Link: https://lkml.kernel.org/r/20230203071837.1136453-1-npiggin@gmail.com
Link: https://lkml.kernel.org/r/20230203071837.1136453-2-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-28 16:20:07 -07:00
Linus Torvalds
18940c888c - Fix a corner case where vruntime of a task is not being sanitized
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmQgQeEACgkQEsHwGGHe
 VUpJ+w//c01JpzBXQvGGlNSTTuUzSxLAQz0n8lQmixpYHOUgVL3CQlOF+OfnkYPp
 mz8m3nDh1FB9o70Sd2J4/OjY2Gh1qENrBLmj509LTnZE9hADRI0T5Mn93Jo7m7HY
 QoXrYWEwJYwsLzJ66mR8A0xts5jWgkJsAWKXF9gfxf/ieeycdJ1GdOdzC1tp4Nfe
 /4SEjSbUhx/bsBbAdJ38Z/iQGT0HuyQLOBGBuBcFE0JnP/aYEanAQsxxP2LObeVw
 Za7ATxdJ9I1TErVfRsG0GDSiVKCYzSG2GME5TXibgPJ2g1+m0I7gZgpGO9Q8Wzo4
 7y0X+vqsykY/Me3xEDBVaeCiHmFTambkxOR2xVJ2TISN8b390yePy4vgY1QQDidd
 eNh9S2x1dsKp8i4NdYyeW7xwaTfIDp9Yp4XNP8cw02VzW0FSCnsmCzwGHIXsF4K/
 Sib+bhKUo+Qmck5nJlV6R5Xr9cvGgyPpBvD8/XqqwF5lHJ7xg4qkPwPKjoKL1HRj
 YT1t+l0kzcg/onyDAuPe1mIRFf7Q8x5G8zkUGMG401h2tazv19rjK4+V1UemhBqA
 h5Cf1BBy6+6kn4DDb/zD+0IgpDFKJDaClxNwfPzaplAoMC/8+0nxxnu7f32eT59k
 /JtMisERhr6lG0Q+TfURhB3yyCiBjBR2EKcKzHz+KARtxs7a4T0=
 =Z8oJ
 -----END PGP SIGNATURE-----

Merge tag 'sched_urgent_for_v6.3_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fix from Borislav Petkov:

 - Fix a corner case where vruntime of a task is not being sanitized

* tag 'sched_urgent_for_v6.3_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Sanitize vruntime of entity being migrated
2023-03-26 09:18:30 -07:00
Linus Torvalds
f6cdaeb08b - Do the delayed RCU wakeup for kthreads in the proper order so that
former doesn't get ignored
 
 - A noinstr warning fix
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmQgPhsACgkQEsHwGGHe
 VUqecA//VN/9pvCpJbf0S92lDXjbjuAna4+rYak6mjxke2lYHeXNsCjPpBtdfMnK
 Mvr1HykKVBGIwlIPsKvwzt1FYN08rhdPnb/XvwM1ZliFlXP/HhN7K3AWgld559sZ
 hY9gKfijof5D7VhqjRS7ZA2qo2by72ekXnuVQT0cmwZiHPQLx8M4ySHuMUZD3Lmz
 GQyvITuDyBX3BiIGVtqpOhpugpEjAoEBE0MPwXrMEe9Glk4i1z86Xd1Cr4ksCFZL
 gIdDSnlUuLuXn1OfLA35fXzoaStB07aTMEbRl+iD1KUopdRzpqj9j0rqYINkW0Ar
 W/BzyMNw2itEbGrF+kjjCpolwmJvMcJUuvEZKO/gNTv2qYZW5BQqQrHYILdmezyz
 HvGndyT996D3uoUIMIGqJf+41cwqPEjGyMLs3GfYnZMdVnZZnG/KflWQCiGUR9RR
 LfxaryNZfT8MfQP6uslxOWubMupfsen7Hk8oljUpT2GzUAsWxTjqYWkFx6bMNMkV
 Kx304at7R9jD81qC3Rdkqu0F5Z17YZWubd1oJEhi8HeMq8uxFxPb983SkXLY0w7Y
 4Ss/MhJwt30e9ltGCMwgF83uOnndXwzFJG4TR9TqsO0TcdUE6XJPbPj/K8Wsi/u7
 fvnCbBLsDaEJDEAikWJsSOaMjUwnajyaYomy+9VCR536/DlIq5M=
 =c5tH
 -----END PGP SIGNATURE-----

Merge tag 'core_urgent_for_v6.3_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull core fixes from Borislav Petkov:

 - Do the delayed RCU wakeup for kthreads in the proper order so that
   former doesn't get ignored

 - A noinstr warning fix

* tag 'core_urgent_for_v6.3_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  entry/rcu: Check TIF_RESCHED _after_ delayed RCU wake-up
  entry: Fix noinstr warning in __enter_from_user_mode()
2023-03-26 09:06:20 -07:00
Linus Torvalds
f768b35a23 Fixes for 6.3-rc3:
* Fix a race in the percpu counters summation code where the summation
    failed to add in the values for any CPUs that were dying but not yet
    dead.  This fixes some minor discrepancies and incorrect assertions
    when running generic/650.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZBdAbgAKCRBKO3ySh0YR
 pkltAQCs4QO5LjYReqjUxd4cSsLtNnNon09qswRsl2GuRyI36AEAxI9QMq4Q6D9V
 ZasNbiTCkV3KPKfmp6gf1mQNLk1lGQ0=
 =Bz3q
 -----END PGP SIGNATURE-----

Merge tag 'xfs-6.3-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs percpu counter fixes from Darrick Wong:
 "We discovered a filesystem summary counter corruption problem that was
  traced to cpu hot-remove racing with the call to percpu_counter_sum
  that sets the free block count in the superblock when writing it to
  disk. The root cause is that percpu_counter_sum doesn't cull from
  dying cpus and hence misses those counter values if the cpu shutdown
  hooks have not yet run to merge the values.

  I'm hoping this is a fairly painless fix to the problem, since the
  dying cpu mask should generally be empty. It's been in for-next for a
  week without any complaints from the bots.

   - Fix a race in the percpu counters summation code where the
     summation failed to add in the values for any CPUs that were dying
     but not yet dead. This fixes some minor discrepancies and incorrect
     assertions when running generic/650"

* tag 'xfs-6.3-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  pcpcntr: remove percpu_counter_sum_all()
  fork: remove use of percpu_counter_sum_all
  pcpcntrs: fix dying cpu summation race
  cpumask: introduce for_each_cpu_or
2023-03-25 12:57:34 -07:00
Linus Torvalds
65aca32efd 21 hotfixes, 8 of which are cc:stable. 11 are for MM, the remainder are
for other subsystems.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZB48xAAKCRDdBJ7gKXxA
 js2rAP4zvcMn90vBJhWNElsA7pBgDYD66QCK6JBDHGe3J1qdeQEA8D606pjMBWkL
 ly7NifwCjOtFhfDRgEHOXu8g8g1k1QM=
 =Cswg
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2023-03-24-17-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "21 hotfixes, 8 of which are cc:stable. 11 are for MM, the remainder
  are for other subsystems"

* tag 'mm-hotfixes-stable-2023-03-24-17-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (21 commits)
  mm: mmap: remove newline at the end of the trace
  mailmap: add entries for Richard Leitner
  kcsan: avoid passing -g for test
  kfence: avoid passing -g for test
  mm: kfence: fix using kfence_metadata without initialization in show_object()
  lib: dhry: fix unstable smp_processor_id(_) usage
  mailmap: add entry for Enric Balletbo i Serra
  mailmap: map Sai Prakash Ranjan's old address to his current one
  mailmap: map Rajendra Nayak's old address to his current one
  Revert "kasan: drop skip_kasan_poison variable in free_pages_prepare"
  mailmap: add entry for Tobias Klauser
  kasan, powerpc: don't rename memintrinsics if compiler adds prefixes
  mm/ksm: fix race with VMA iteration and mm_struct teardown
  kselftest: vm: fix unused variable warning
  mm: fix error handling for map_deny_write_exec
  mm: deduplicate error handling for map_deny_write_exec
  checksyscalls: ignore fstat to silence build warning on LoongArch
  nilfs2: fix kernel-infoleak in nilfs_ioctl_wrap_copy()
  test_maple_tree: add more testing for mas_empty_area()
  maple_tree: fix mas_skip_node() end slot detection
  ...
2023-03-24 18:06:11 -07:00
Linus Torvalds
608f1b1366 Including fixes from bpf, wifi and bluetooth.
Current release - regressions:
 
  - wifi: mt76: mt7915: add back 160MHz channel width support for MT7915
 
  - libbpf: revert poisoning of strlcpy, it broke uClibc-ng
 
 Current release - new code bugs:
 
  - bpf: improve the coverage of the "allow reads from uninit stack"
    feature to fix verification complexity problems
 
  - eth: am65-cpts: reset PPS genf adj settings on enable
 
 Previous releases - regressions:
 
  - wifi: mac80211: serialize ieee80211_handle_wake_tx_queue()
 
  - wifi: mt76: do not run mt76_unregister_device() on unregistered hw,
    fix null-deref
 
  - Bluetooth: btqcomsmd: fix command timeout after setting BD address
 
  - eth: igb: revert rtnl_lock() that causes a deadlock
 
  - dsa: mscc: ocelot: fix device specific statistics
 
 Previous releases - always broken:
 
  - xsk: add missing overflow check in xdp_umem_reg()
 
  - wifi: mac80211:
    - fix QoS on mesh interfaces
    - fix mesh path discovery based on unicast packets
 
  - Bluetooth:
    - ISO: fix timestamped HCI ISO data packet parsing
    - remove "Power-on" check from Mesh feature
 
  - usbnet: more fixes to drivers trusting packet length
 
  - wifi: iwlwifi: mvm: fix mvmtxq->stopped handling
 
  - Bluetooth: btintel: iterate only bluetooth device ACPI entries
 
  - eth: iavf: fix inverted Rx hash condition leading to disabled hash
 
  - eth: igc: fix the validation logic for taprio's gate list
 
  - dsa: tag_brcm: legacy: fix daisy-chained switches
 
 Misc:
 
  - bpf: adjust insufficient default bpf_jit_limit to account for
    growth of BPF use over the last 5 years
 
  - xdp: bpf_xdp_metadata() use EOPNOTSUPP as unique errno indicating
    no driver support
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmQc4vkACgkQMUZtbf5S
 IruG/w//XixBtdFMHE0/fcGv77jTovlJNiDYeaa+KtyjvIseieYwOKW5F31r3xvl
 Mf/YhNEjAc++V8Zna/1UM5i/WOj1PJdHgSC+wMUGUXjMF+MfzL57nM83CllOpUB5
 Z9YtUqGfolf2Vtx03wnV14qawmVnJWYKHn3AU11cueE5dUu6KNyBTCefQ7uzgcJN
 zMtHAxw96MRQIDxSfKvZsePk4FnQ4qoSOLkslji5iikcMnKePaqZaxQla2oTcEIR
 zue9V+ILmi62Y8mPcdT4ePpZQsjB39bpemh+9EL6l03/cjsjqmuiCw/d1+6g9kuy
 ZD5LgZzUOb6xalhSseiwJL+vj8x2gQhshEfoHQvgp7fzr6agta6sisRX611wtmJl
 hv4k2PMRqFrMv2S+8m8XC177bXIaGbiWh4vBFOWjf4u0lG55cGlzclbXWWQ80njy
 C5cE4V7qPRk8Cl/+uT10CLNQx6JmaX8kcddtFrYpu0PZHKx1WfUYKIpgkiiMPRKT
 njLkDQbFRa8Y3p7UX0wU1TbeuMzzLz+aTBrFEN864IJmbnUnWimeluQzD60WbkSx
 6dciqq11LtvYDsR1HZ1pb7IoHYuDsDrO2Rx4zuqsB/SyfrGdRKJoKOnYvsk+AdCL
 N/e4wivie8s6b+G3yL6p+IdlpEaVo2ZiLINp7JSW8jhW1hRcZUI=
 =XBLi
 -----END PGP SIGNATURE-----

Merge tag 'net-6.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from bpf, wifi and bluetooth.

  Current release - regressions:

   - wifi: mt76: mt7915: add back 160MHz channel width support for
     MT7915

   - libbpf: revert poisoning of strlcpy, it broke uClibc-ng

  Current release - new code bugs:

   - bpf: improve the coverage of the "allow reads from uninit stack"
     feature to fix verification complexity problems

   - eth: am65-cpts: reset PPS genf adj settings on enable

  Previous releases - regressions:

   - wifi: mac80211: serialize ieee80211_handle_wake_tx_queue()

   - wifi: mt76: do not run mt76_unregister_device() on unregistered hw,
     fix null-deref

   - Bluetooth: btqcomsmd: fix command timeout after setting BD address

   - eth: igb: revert rtnl_lock() that causes a deadlock

   - dsa: mscc: ocelot: fix device specific statistics

  Previous releases - always broken:

   - xsk: add missing overflow check in xdp_umem_reg()

   - wifi: mac80211:
      - fix QoS on mesh interfaces
      - fix mesh path discovery based on unicast packets

   - Bluetooth:
      - ISO: fix timestamped HCI ISO data packet parsing
      - remove "Power-on" check from Mesh feature

   - usbnet: more fixes to drivers trusting packet length

   - wifi: iwlwifi: mvm: fix mvmtxq->stopped handling

   - Bluetooth: btintel: iterate only bluetooth device ACPI entries

   - eth: iavf: fix inverted Rx hash condition leading to disabled hash

   - eth: igc: fix the validation logic for taprio's gate list

   - dsa: tag_brcm: legacy: fix daisy-chained switches

  Misc:

   - bpf: adjust insufficient default bpf_jit_limit to account for
     growth of BPF use over the last 5 years

   - xdp: bpf_xdp_metadata() use EOPNOTSUPP as unique errno indicating
     no driver support"

* tag 'net-6.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (84 commits)
  Bluetooth: HCI: Fix global-out-of-bounds
  Bluetooth: mgmt: Fix MGMT add advmon with RSSI command
  Bluetooth: btsdio: fix use after free bug in btsdio_remove due to unfinished work
  Bluetooth: L2CAP: Fix responding with wrong PDU type
  Bluetooth: btqcomsmd: Fix command timeout after setting BD address
  Bluetooth: btinel: Check ACPI handle for NULL before accessing
  net: mdio: thunder: Add missing fwnode_handle_put()
  net: dsa: mt7530: move setting ssc_delta to PHY_INTERFACE_MODE_TRGMII case
  net: dsa: mt7530: move lowering TRGMII driving to mt7530_setup()
  net: dsa: mt7530: move enabling disabling core clock to mt7530_pll_setup()
  net: asix: fix modprobe "sysfs: cannot create duplicate filename"
  gve: Cache link_speed value from device
  tools: ynl: Fix genlmsg header encoding formats
  net: enetc: fix aggregate RMON counters not showing the ranges
  Bluetooth: Remove "Power-on" check from Mesh feature
  Bluetooth: Fix race condition in hci_cmd_sync_clear
  Bluetooth: btintel: Iterate only bluetooth device ACPI entries
  Bluetooth: ISO: fix timestamped HCI ISO data packet parsing
  Bluetooth: btusb: Remove detection of ISO packets over bulk
  Bluetooth: hci_core: Detect if an ACL packet is in fact an ISO packet
  ...
2023-03-24 08:48:12 -07:00
Marco Elver
5eb39cde1e kcsan: avoid passing -g for test
Nathan reported that when building with GNU as and a version of clang that
defaults to DWARF5, the assembler will complain with:

  Error: non-constant .uleb128 is not supported

This is because `-g` defaults to the compiler debug info default. If the
assembler does not support some of the directives used, the above errors
occur. To fix, remove the explicit passing of `-g`.

All the test wants is that stack traces print valid function names, and
debug info is not required for that. (I currently cannot recall why I
added the explicit `-g`.)

Link: https://lkml.kernel.org/r/20230316224705.709984-2-elver@google.com
Fixes: 1fe84fd4a4 ("kcsan: Add test suite")
Signed-off-by: Marco Elver <elver@google.com>
Reported-by: Nathan Chancellor <nathan@kernel.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-23 17:18:35 -07:00
Jakub Kicinski
1b4ae19e43 bpf-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZBzSGQAKCRDbK58LschI
 g+dhAP95enbrlwaQ+9aoqrU+GqCq+uo4SkaqnUtq6GSvRNiVBQD8C6iZxrAjyXnm
 1wRr3JN/HszPBzgjl3HvDc9y69I/PAI=
 =8JwR
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2023-03-23

We've added 8 non-merge commits during the last 13 day(s) which contain
a total of 21 files changed, 238 insertions(+), 161 deletions(-).

The main changes are:

1) Fix verification issues in some BPF programs due to their stack usage
   patterns, from Eduard Zingerman.

2) Fix to add missing overflow checks in xdp_umem_reg and return an error
   in such case, from Kal Conley.

3) Fix and undo poisoning of strlcpy in libbpf given it broke builds for
   libcs which provided the former like uClibc-ng, from Jesus Sanchez-Palencia.

4) Fix insufficient bpf_jit_limit default to avoid users running into hard
   to debug seccomp BPF errors, from Daniel Borkmann.

5) Fix driver return code when they don't support a bpf_xdp_metadata kfunc
   to make it unambiguous from other errors, from Jesper Dangaard Brouer.

6) Two BPF selftest fixes to address compilation errors from recent changes
   in kernel structures, from Alexei Starovoitov.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  xdp: bpf_xdp_metadata use EOPNOTSUPP for no driver support
  bpf: Adjust insufficient default bpf_jit_limit
  xsk: Add missing overflow check in xdp_umem_reg
  selftests/bpf: Fix progs/test_deny_namespace.c issues.
  selftests/bpf: Fix progs/find_vma_fail1.c build error.
  libbpf: Revert poisoning of strlcpy
  selftests/bpf: Tests for uninitialized stack reads
  bpf: Allow reads from uninit stack
====================

Link: https://lore.kernel.org/r/20230323225221.6082-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-23 16:03:33 -07:00
Daniel Borkmann
10ec8ca8ec bpf: Adjust insufficient default bpf_jit_limit
We've seen recent AWS EKS (Kubernetes) user reports like the following:

  After upgrading EKS nodes from v20230203 to v20230217 on our 1.24 EKS
  clusters after a few days a number of the nodes have containers stuck
  in ContainerCreating state or liveness/readiness probes reporting the
  following error:

    Readiness probe errored: rpc error: code = Unknown desc = failed to
    exec in container: failed to start exec "4a11039f730203ffc003b7[...]":
    OCI runtime exec failed: exec failed: unable to start container process:
    unable to init seccomp: error loading seccomp filter into kernel:
    error loading seccomp filter: errno 524: unknown

  However, we had not been seeing this issue on previous AMIs and it only
  started to occur on v20230217 (following the upgrade from kernel 5.4 to
  5.10) with no other changes to the underlying cluster or workloads.

  We tried the suggestions from that issue (sysctl net.core.bpf_jit_limit=452534528)
  which helped to immediately allow containers to be created and probes to
  execute but after approximately a day the issue returned and the value
  returned by cat /proc/vmallocinfo | grep bpf_jit | awk '{s+=$2} END {print s}'
  was steadily increasing.

I tested bpf tree to observe bpf_jit_charge_modmem, bpf_jit_uncharge_modmem
their sizes passed in as well as bpf_jit_current under tcpdump BPF filter,
seccomp BPF and native (e)BPF programs, and the behavior all looks sane
and expected, that is nothing "leaking" from an upstream perspective.

The bpf_jit_limit knob was originally added in order to avoid a situation
where unprivileged applications loading BPF programs (e.g. seccomp BPF
policies) consuming all the module memory space via BPF JIT such that loading
of kernel modules would be prevented. The default limit was defined back in
2018 and while good enough back then, we are generally seeing far more BPF
consumers today.

Adjust the limit for the BPF JIT pool from originally 1/4 to now 1/2 of the
module memory space to better reflect today's needs and avoid more users
running into potentially hard to debug issues.

Fixes: fdadd04931 ("bpf: fix bpf_jit_limit knob for PAGE_SIZE >= 64K")
Reported-by: Stephen Haynes <sh@synk.net>
Reported-by: Lefteris Alexakis <lefteris.alexakis@kpn.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://github.com/awslabs/amazon-eks-ami/issues/1179
Link: https://github.com/awslabs/amazon-eks-ami/issues/1219
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230320143725.8394-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-21 12:43:05 -07:00
Frederic Weisbecker
b416514054 entry/rcu: Check TIF_RESCHED _after_ delayed RCU wake-up
RCU sometimes needs to perform a delayed wake up for specific kthreads
handling offloaded callbacks (RCU_NOCB).  These wakeups are performed
by timers and upon entry to idle (also to guest and to user on nohz_full).

However the delayed wake-up on kernel exit is actually performed after
the thread flags are fetched towards the fast path check for work to
do on exit to user. As a result, and if there is no other pending work
to do upon that kernel exit, the current task will resume to userspace
with TIF_RESCHED set and the pending wake up ignored.

Fix this with fetching the thread flags _after_ the delayed RCU-nocb
kthread wake-up.

Fixes: 47b8ff194c ("entry: Explicitly flush pending rcuog wakeup before last rescheduling point")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230315194349.10798-3-joel@joelfernandes.org
2023-03-21 15:13:15 +01:00
Vincent Guittot
a53ce18cac sched/fair: Sanitize vruntime of entity being migrated
Commit 829c1651e9 ("sched/fair: sanitize vruntime of entity being placed")
fixes an overflowing bug, but ignore a case that se->exec_start is reset
after a migration.

For fixing this case, we delay the reset of se->exec_start after
placing the entity which se->exec_start to detect long sleeping task.

In order to take into account a possible divergence between the clock_task
of 2 rqs, we increase the threshold to around 104 days.

Fixes: 829c1651e9 ("sched/fair: sanitize vruntime of entity being placed")
Originally-by: Zhang Qiao <zhangqiao22@huawei.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Qiao <zhangqiao22@huawei.com>
Link: https://lore.kernel.org/r/20230317160810.107988-1-vincent.guittot@linaro.org
2023-03-21 14:43:04 +01:00
Josh Poimboeuf
f87d28673b entry: Fix noinstr warning in __enter_from_user_mode()
__enter_from_user_mode() is triggering noinstr warnings with
CONFIG_DEBUG_PREEMPT due to its call of preempt_count_add() via
ct_state().

The preemption disable isn't needed as interrupts are already disabled.
And the context_tracking_enabled() check in ct_state() also isn't needed
as that's already being done by the CT_WARN_ON().

Just use __ct_state() instead.

Fixes the following warnings:

  vmlinux.o: warning: objtool: enter_from_user_mode+0xba: call to preempt_count_add() leaves .noinstr.text section
  vmlinux.o: warning: objtool: syscall_enter_from_user_mode+0xf9: call to preempt_count_add() leaves .noinstr.text section
  vmlinux.o: warning: objtool: syscall_enter_from_user_mode_prepare+0xc7: call to preempt_count_add() leaves .noinstr.text section
  vmlinux.o: warning: objtool: irqentry_enter_from_user_mode+0xba: call to preempt_count_add() leaves .noinstr.text section

Fixes: 171476775d ("context_tracking: Convert state to atomic_t")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/d8955fa6d68dc955dda19baf13ae014ae27926f5.1677369694.git.jpoimboe@kernel.org
2023-03-21 11:53:16 +01:00
Linus Torvalds
eaba52d63b Tracing fixes for 6.3:
- Fix setting affinity of hwlat threads in containers
   Using sched_set_affinity() has unwanted side effects when being
   called within a container. Use set_cpus_allowed_ptr() instead.
 
 - Fix per cpu thread management of the hwlat tracer
   * Do not start per_cpu threads if one is already running for the CPU.
   * When starting per_cpu threads, do not clear the kthread variable
     as it may already be set to running per cpu threads
 
 - Fix return value for test_gen_kprobe_cmd()
   On error the return value was overwritten by being set to
   the result of the call from kprobe_event_delete(), which would
   likely succeed, and thus have the function return success.
 
 - Fix splice() reads from the trace file that was broken by
   36e2c7421f ("fs: don't allow splice read/write without explicit ops")
 
 - Remove obsolete and confusing comment in ring_buffer.c
   The original design of the ring buffer used struct page flags
   for tricks to optimize, which was shortly removed due to them
   being tricks. But a comment for those tricks remained.
 
 - Set local functions and variables to static
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZBdIkBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qti2AP49s1GM8teFgDF/CO3oa45BoYq1lMJO
 1Z+x+mS/bdUAZgEAw+3KGo8oZDuvWu/nr04JPeoy0GL1/JnbQ6JNCCjzhQc=
 =Yk66
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix setting affinity of hwlat threads in containers

   Using sched_set_affinity() has unwanted side effects when being
   called within a container. Use set_cpus_allowed_ptr() instead

 - Fix per cpu thread management of the hwlat tracer:
    - Do not start per_cpu threads if one is already running for the CPU
    - When starting per_cpu threads, do not clear the kthread variable
      as it may already be set to running per cpu threads

 - Fix return value for test_gen_kprobe_cmd()

   On error the return value was overwritten by being set to the result
   of the call from kprobe_event_delete(), which would likely succeed,
   and thus have the function return success

 - Fix splice() reads from the trace file that was broken by commit
   36e2c7421f ("fs: don't allow splice read/write without explicit
   ops")

 - Remove obsolete and confusing comment in ring_buffer.c

   The original design of the ring buffer used struct page flags for
   tricks to optimize, which was shortly removed due to them being
   tricks. But a comment for those tricks remained

 - Set local functions and variables to static

* tag 'trace-v6.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/hwlat: Replace sched_setaffinity with set_cpus_allowed_ptr
  ring-buffer: remove obsolete comment for free_buffer_page()
  tracing: Make splice_read available again
  ftrace: Set direct_ops storage-class-specifier to static
  trace/hwlat: Do not start per-cpu thread if it is already running
  trace/hwlat: Do not wipe the contents of per-cpu thread data
  tracing/osnoise: set several trace_osnoise.c variables storage-class-specifier to static
  tracing: Fix wrong return in kprobe_event_gen_test.c
2023-03-19 10:46:02 -07:00
Costa Shulyupin
71c7a30442 tracing/hwlat: Replace sched_setaffinity with set_cpus_allowed_ptr
There is a problem with the behavior of hwlat in a container,
resulting in incorrect output. A warning message is generated:
"cpumask changed while in round-robin mode, switching to mode none",
and the tracing_cpumask is ignored. This issue arises because
the kernel thread, hwlatd, is not a part of the container, and
the function sched_setaffinity is unable to locate it using its PID.
Additionally, the task_struct of hwlatd is already known.
Ultimately, the function set_cpus_allowed_ptr achieves
the same outcome as sched_setaffinity, but employs task_struct
instead of PID.

Test case:

  # cd /sys/kernel/tracing
  # echo 0 > tracing_on
  # echo round-robin > hwlat_detector/mode
  # echo hwlat > current_tracer
  # unshare --fork --pid bash -c 'echo 1 > tracing_on'
  # dmesg -c

Actual behavior:

[573502.809060] hwlat_detector: cpumask changed while in round-robin mode, switching to mode none

Link: https://lore.kernel.org/linux-trace-kernel/20230316144535.1004952-1-costa.shul@redhat.com

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: 0330f7aa8e ("tracing: Have hwlat trace migrate across tracing_cpumask CPUs")
Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-19 13:23:22 -04:00
Vlastimil Babka
a98151ad53 ring-buffer: remove obsolete comment for free_buffer_page()
The comment refers to mm/slob.c which is being removed. It comes from
commit ed56829cb3 ("ring_buffer: reset buffer page when freeing") and
according to Steven the borrowed code was a page mapcount and mapping
reset, which was later removed by commit e4c2ce82ca ("ring_buffer:
allocate buffer page pointer"). Thus the comment is not accurate anyway,
remove it.

Link: https://lore.kernel.org/linux-trace-kernel/20230315142446.27040-1-vbabka@suse.cz

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Reported-by: Mike Rapoport <mike.rapoport@gmail.com>
Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Fixes: e4c2ce82ca ("ring_buffer: allocate buffer page pointer")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-19 13:23:22 -04:00
Sung-hun Kim
e400be674a tracing: Make splice_read available again
Since the commit 36e2c7421f ("fs: don't allow splice read/write
without explicit ops") is applied to the kernel, splice() and
sendfile() calls on the trace file (/sys/kernel/debug/tracing
/trace) return EINVAL.

This patch restores these system calls by initializing splice_read
in file_operations of the trace file. This patch only enables such
functionalities for the read case.

Link: https://lore.kernel.org/linux-trace-kernel/20230314013707.28814-1-sfoon.kim@samsung.com

Cc: stable@vger.kernel.org
Fixes: 36e2c7421f ("fs: don't allow splice read/write without explicit ops")
Signed-off-by: Sung-hun Kim <sfoon.kim@samsung.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-19 13:16:21 -04:00
Dave Chinner
7ba85fba47 fork: remove use of percpu_counter_sum_all
This effectively reverts the change made in commit f689054aac
("percpu_counter: add percpu_counter_sum_all interface") as the
race condition percpu_counter_sum_all() was invented to avoid is
now handled directly in percpu_counter_sum() and nobody needs to
care about summing racing with cpu unplug anymore.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-03-19 10:02:04 -07:00
Linus Torvalds
80102f2ee7 - Check whether sibling events have been deactivated before adding them
to groups
 
 - Update the proper event time tracking variable depending on the
   event type
 
 - Fix a memory overwrite issue due to using the wrong function argument
   when outputting perf events
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmQXBiQACgkQEsHwGGHe
 VUrTHg//dTd+7Oo1vao32lZCkVW0bBvx3KzoyAcvEPh7Yu3Cw+tZkfI8lWY8dgU3
 lgecN1pUa9IiZNULFbulXqRumcq3HIRKCp/RejEmR3W30KCwxnEAwGdakkCPEcMS
 2vXl4SZn7rU8avKJBd/ZUUS5lDWz6YHPNbLX3iamFH+7oN56Vf2LFJuuO0WtZSgr
 Sqv25oaV1ZZjeEAI5KrPM04hcldj4BXGPoIPvP30/c3z6aPnwt6v5H2gw5IUJNHl
 Qm0ycgW7An6iNR4bFpm/NP9cSAI7FgmAB3VuhQAExl6NroUxE8R+HEO04PsJAOgd
 BpNB1eXIjrTgw8r+TsDq4Z9VY8fgMtHbwutAp348XwF6//5F/WOzojwSGAAX7LCx
 5+fK4eA4gLkQJmuyI5/3fNCC0EnUTsK/YCC5eFyas8KFERTmmq8hafCBQO4W9VPr
 8A7TD2X6fWNWvS/UF1RqVv/gCa0ub7ifz+eOx42/vK2PXn4vsjMv0JtwnARj8PzZ
 Ymo/3ArmBWhuP+94FdVP4AduBuEdHvWO7EZG5buRdOo//sb2MwX05zufowLvR9YQ
 E+VkZxf0RFPaDQdPnoS/SwFE206ii2Z5MtgsQX7XpoBo06R5AzZQlhlMl/StFvN0
 65ut1dWwqrVG0uSN0GOzTIZl5x5YZASm5I49ItZsnZyoMPTwDOc=
 =l7PF
 -----END PGP SIGNATURE-----

Merge tag 'perf_urgent_for_v6.3_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fixes from Borislav Petkov:

 - Check whether sibling events have been deactivated before adding them
   to groups

 - Update the proper event time tracking variable depending on the event
   type

 - Fix a memory overwrite issue due to using the wrong function argument
   when outputting perf events

* tag 'perf_urgent_for_v6.3_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Fix check before add_event_to_groups() in perf_group_detach()
  perf: fix perf_event_context->time
  perf/core: Fix perf_output_begin parameter is incorrectly invoked in perf_event_bpf_output
2023-03-19 09:47:55 -07:00
Tom Rix
8732565549 ftrace: Set direct_ops storage-class-specifier to static
smatch reports this warning
kernel/trace/ftrace.c:2594:19: warning:
  symbol 'direct_ops' was not declared. Should it be static?

The variable direct_ops is only used in ftrace.c, so it should be static

Link: https://lore.kernel.org/linux-trace-kernel/20230311135113.711824-1-trix@redhat.com

Signed-off-by: Tom Rix <trix@redhat.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-19 12:20:49 -04:00
Tero Kristo
08697bca9b trace/hwlat: Do not start per-cpu thread if it is already running
The hwlatd tracer will end up starting multiple per-cpu threads with
the following script:

    #!/bin/sh
    cd /sys/kernel/debug/tracing
    echo 0 > tracing_on
    echo hwlat > current_tracer
    echo per-cpu > hwlat_detector/mode
    echo 100000 > hwlat_detector/width
    echo 200000 > hwlat_detector/window
    echo 1 > tracing_on

To fix the issue, check if the hwlatd thread for the cpu is already
running, before starting a new one. Along with the previous patch, this
avoids running multiple instances of the same CPU thread on the system.

Link: https://lore.kernel.org/all/20230302113654.2984709-1-tero.kristo@linux.intel.com/
Link: https://lkml.kernel.org/r/20230310100451.3948583-3-tero.kristo@linux.intel.com

Cc: stable@vger.kernel.org
Fixes: f46b16520a ("trace/hwlat: Implement the per-cpu mode")
Signed-off-by: Tero Kristo <tero.kristo@linux.intel.com>
Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-19 12:20:49 -04:00
Tero Kristo
4c42f5f0d1 trace/hwlat: Do not wipe the contents of per-cpu thread data
Do not wipe the contents of the per-cpu kthread data when starting the
tracer, as this will completely forget about already running instances
and can later start new additional per-cpu threads.

Link: https://lore.kernel.org/all/20230302113654.2984709-1-tero.kristo@linux.intel.com/
Link: https://lkml.kernel.org/r/20230310100451.3948583-2-tero.kristo@linux.intel.com

Cc: stable@vger.kernel.org
Fixes: f46b16520a ("trace/hwlat: Implement the per-cpu mode")
Signed-off-by: Tero Kristo <tero.kristo@linux.intel.com>
Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-19 12:20:48 -04:00
Tom Rix
7a025e066e tracing/osnoise: set several trace_osnoise.c variables storage-class-specifier to static
smatch reports several similar warnings
kernel/trace/trace_osnoise.c:220:1: warning:
  symbol '__pcpu_scope_per_cpu_osnoise_var' was not declared. Should it be static?
kernel/trace/trace_osnoise.c:243:1: warning:
  symbol '__pcpu_scope_per_cpu_timerlat_var' was not declared. Should it be static?
kernel/trace/trace_osnoise.c:335:14: warning:
  symbol 'interface_lock' was not declared. Should it be static?
kernel/trace/trace_osnoise.c:2242:5: warning:
  symbol 'timerlat_min_period' was not declared. Should it be static?
kernel/trace/trace_osnoise.c:2243:5: warning:
  symbol 'timerlat_max_period' was not declared. Should it be static?

These variables are only used in trace_osnoise.c, so it should be static

Link: https://lore.kernel.org/linux-trace-kernel/20230309150414.4036764-1-trix@redhat.com

Signed-off-by: Tom Rix <trix@redhat.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-19 12:20:48 -04:00
Anton Gusev
bc4f359b3b tracing: Fix wrong return in kprobe_event_gen_test.c
Overwriting the error code with the deletion result may cause the
function to return 0 despite encountering an error. Commit b111545d26
("tracing: Remove the useless value assignment in
test_create_synth_event()") solves a similar issue by
returning the original error code, so this patch does the same.

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Link: https://lore.kernel.org/linux-trace-kernel/20230131075818.5322-1-aagusev@ispras.ru

Signed-off-by: Anton Gusev <aagusev@ispras.ru>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-19 12:20:48 -04:00
Budimir Markovic
fd0815f632 perf: Fix check before add_event_to_groups() in perf_group_detach()
Events should only be added to a groups rb tree if they have not been
removed from their context by list_del_event(). Since remove_on_exec
made it possible to call list_del_event() on individual events before
they are detached from their group, perf_group_detach() should check each
sibling's attach_state before calling add_event_to_groups() on it.

Fixes: 2e498d0a74 ("perf: Add support for event removal on exec")
Signed-off-by: Budimir Markovic <markovicbudimir@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/ZBFzvQV9tEqoHEtH@gentoo
2023-03-15 21:49:47 +01:00
Song Liu
baf1b12a67 perf: fix perf_event_context->time
Time readers rely on perf_event_context->[time|timestamp|timeoffset] to get
accurate time_enabled and time_running for an event. The difference between
ctx->timestamp and ctx->time is the among of time when the context is not
enabled. __update_context_time(ctx, false) is used to increase timestamp,
but not time. Therefore, it should only be called in ctx_sched_in() when
EVENT_TIME was not enabled.

Fixes: 09f5e7dc7a ("perf: Fix perf_event_read_local() time")
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lkml.kernel.org/r/20230313171608.298734-1-song@kernel.org
2023-03-15 21:49:46 +01:00
Yang Jihong
eb81a2ed4f perf/core: Fix perf_output_begin parameter is incorrectly invoked in perf_event_bpf_output
syzkaller reportes a KASAN issue with stack-out-of-bounds.
The call trace is as follows:
  dump_stack+0x9c/0xd3
  print_address_description.constprop.0+0x19/0x170
  __kasan_report.cold+0x6c/0x84
  kasan_report+0x3a/0x50
  __perf_event_header__init_id+0x34/0x290
  perf_event_header__init_id+0x48/0x60
  perf_output_begin+0x4a4/0x560
  perf_event_bpf_output+0x161/0x1e0
  perf_iterate_sb_cpu+0x29e/0x340
  perf_iterate_sb+0x4c/0xc0
  perf_event_bpf_event+0x194/0x2c0
  __bpf_prog_put.constprop.0+0x55/0xf0
  __cls_bpf_delete_prog+0xea/0x120 [cls_bpf]
  cls_bpf_delete_prog_work+0x1c/0x30 [cls_bpf]
  process_one_work+0x3c2/0x730
  worker_thread+0x93/0x650
  kthread+0x1b8/0x210
  ret_from_fork+0x1f/0x30

commit 267fb27352 ("perf: Reduce stack usage of perf_output_begin()")
use on-stack struct perf_sample_data of the caller function.

However, perf_event_bpf_output uses incorrect parameter to convert
small-sized data (struct perf_bpf_event) into large-sized data
(struct perf_sample_data), which causes memory overwriting occurs in
__perf_event_header__init_id.

Fixes: 267fb27352 ("perf: Reduce stack usage of perf_output_begin()")
Signed-off-by: Yang Jihong <yangjihong1@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230314044735.56551-1-yangjihong1@huawei.com
2023-03-15 21:49:46 +01:00
Linus Torvalds
6015b1aca1 sched_getaffinity: don't assume 'cpumask_size()' is fully initialized
The getaffinity() system call uses 'cpumask_size()' to decide how big
the CPU mask is - so far so good.  It is indeed the allocation size of a
cpumask.

But the code also assumes that the whole allocation is initialized
without actually doing so itself.  That's wrong, because we might have
fixed-size allocations (making copying and clearing more efficient), but
not all of it is then necessarily used if 'nr_cpu_ids' is smaller.

Having checked other users of 'cpumask_size()', they all seem to be ok,
either using it purely for the allocation size, or explicitly zeroing
the cpumask before using the size in bytes to copy it.

See for example the ublk_ctrl_get_queue_affinity() function that uses
the proper 'zalloc_cpumask_var()' to make sure that the whole mask is
cleared, whether the storage is on the stack or if it was an external
allocation.

Fix this by just zeroing the allocation before using it.  Do the same
for the compat version of sched_getaffinity(), which had the same logic.

Also, for consistency, make sched_getaffinity() use 'cpumask_bits()' to
access the bits.  For a cpumask_var_t, it ends up being a pointer to the
same data either way, but it's just a good idea to treat it like you
would a 'cpumask_t'.  The compat case already did that.

Reported-by: Ryan Roberts <ryan.roberts@arm.com>
Link: https://lore.kernel.org/lkml/7d026744-6bd6-6827-0471-b5e8eae0be3f@arm.com/
Cc: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-03-14 19:32:38 -07:00
Linus Torvalds
29db00c252 Tracing fixes for v6.3
- Do not allow histogram values to have modifies.
   Can cause a NULL pointer dereference if they do.
 
 - Warn if hist_field_name() is passed a NULL.
   Prevent the NULL pointer dereference mentioned above.
 
 - Fix invalid address look up race in lookup_rec()
 
 - Define ftrace_stub_graph conditionally to prevent linker errors
 
 - Always check if RCU is watching at all tracepoint locations
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZBDuTBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qsboAP4yfrFYvIIKM5EkzkEiPI+V2hdlA12x
 bt839jO5AWCmhAEAiY8FmKatpBJQKsiGqSOab8aHOMnhGFZwltCHAPa9PAI=
 =vtA2
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Do not allow histogram values to have modifies. They can cause a NULL
   pointer dereference if they do.

 - Warn if hist_field_name() is passed a NULL. Prevent the NULL pointer
   dereference mentioned above.

 - Fix invalid address look up race in lookup_rec()

 - Define ftrace_stub_graph conditionally to prevent linker errors

 - Always check if RCU is watching at all tracepoint locations

* tag 'trace-v6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Make tracepoint lockdep check actually test something
  ftrace,kcfi: Define ftrace_stub_graph conditionally
  ftrace: Fix invalid address access in lookup_rec() when index is 0
  tracing: Check field value in hist_field_name()
  tracing: Do not let histogram values have some modifiers
2023-03-14 17:07:54 -07:00
Alexei Starovoitov
a33a6eaa19 Merge branch 'bpf: Allow reads from uninit stack'
Merge commit bf9bec4cb3 ("Merge branch 'bpf: Allow reads from uninit stack'")
from bpf-next to bpf tree to address verification issues in some programs
due to stack usage.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-13 13:21:22 -07:00
Linus Torvalds
f5eded1f5f kernel.fork.v6.3-rc2
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZA2l2QAKCRCRxhvAZXjc
 okR1AP9UjPVvVTU3DRp7Giqyv1rdv/iaCVRtEDQmhzDflksioQEAyJXTt+3YOTNl
 sSocNYBhVBsijelICeq7hZrmVP9CrgM=
 =C0cC
 -----END PGP SIGNATURE-----

Merge tag 'kernel.fork.v6.3-rc2' of gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux

Pull clone3 fix from Christian Brauner:
 "A simple fix for the clone3() system call.

  The CLONE_NEWTIME allows the creation of time namespaces. The flag
  reuses a bit from the CSIGNAL bits that are used in the legacy clone()
  system call to set the signal that gets sent to the parent after the
  child exits.

  The clone3() system call doesn't rely on CSIGNAL anymore as it uses a
  dedicated .exit_signal field in struct clone_args. So we blocked all
  CSIGNAL bits in clone3_args_valid(). When CLONE_NEWTIME was introduced
  and reused a CSIGNAL bit we forgot to adapt clone3_args_valid()
  causing CLONE_NEWTIME with clone3() to be rejected. Fix this"

* tag 'kernel.fork.v6.3-rc2' of gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux:
  selftests/clone3: test clone3 with CLONE_NEWTIME
  fork: allow CLONE_NEWTIME in clone3 flags
2023-03-12 09:04:28 -07:00
Linus Torvalds
3b11717f95 vfs.misc.v6.3-rc2
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZA2yXAAKCRCRxhvAZXjc
 opK7AP0fqkk75P1bRZL36iNOgCV0RDiSN/Ynk/oMYpsOyBndlAD7BKCEZFF2OKzP
 aeJrY0F+guwL67X+18X+yiLZrk2rag4=
 =2Wa/
 -----END PGP SIGNATURE-----

Merge tag 'vfs.misc.v6.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping

Pull vfs fixes from Christian Brauner:

 - When allocating pages for a watch queue failed, we didn't return an
   error causing userspace to proceed even though all subsequent
   notifcations would be lost. Make sure to return an error.

 - Fix a misformed tree entry for the idmapping maintainers entry.

 - When setting file leases from an idmapped mount via
   generic_setlease() we need to take the idmapping into account
   otherwise taking a lease would fail from an idmapped mount.

 - Remove two redundant assignments, one in splice code and the other in
   locks code, that static checkers complained about.

* tag 'vfs.misc.v6.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping:
  filelocks: use mount idmapping for setlease permission check
  fs/locks: Remove redundant assignment to cmd
  splice: Remove redundant assignment to ret
  MAINTAINERS: repair a malformed T: entry in IDMAPPED MOUNTS
  watch_queue: fix IOC_WATCH_QUEUE_SET_SIZE alloc error paths
2023-03-12 09:00:54 -07:00
Chen Zhongjin
ee92fa4433 ftrace: Fix invalid address access in lookup_rec() when index is 0
KASAN reported follow problem:

 BUG: KASAN: use-after-free in lookup_rec
 Read of size 8 at addr ffff000199270ff0 by task modprobe
 CPU: 2 Comm: modprobe
 Call trace:
  kasan_report
  __asan_load8
  lookup_rec
  ftrace_location
  arch_check_ftrace_location
  check_kprobe_address_safe
  register_kprobe

When checking pg->records[pg->index - 1].ip in lookup_rec(), it can get a
pg which is newly added to ftrace_pages_start in ftrace_process_locs().
Before the first pg->index++, index is 0 and accessing pg->records[-1].ip
will cause this problem.

Don't check the ip when pg->index is 0.

Link: https://lore.kernel.org/linux-trace-kernel/20230309080230.36064-1-chenzhongjin@huawei.com

Cc: stable@vger.kernel.org
Fixes: 9644302e33 ("ftrace: Speed up search by skipping pages by address")
Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-09 22:17:06 -05:00
Steven Rostedt (Google)
9f116f76fa tracing: Check field value in hist_field_name()
The function hist_field_name() cannot handle being passed a NULL field
parameter. It should never be NULL, but due to a previous bug, NULL was
passed to the function and the kernel crashed due to a NULL dereference.
Mark Rutland reported this to me on IRC.

The bug was fixed, but to prevent future bugs from crashing the kernel,
check the field and add a WARN_ON() if it is NULL.

Link: https://lkml.kernel.org/r/20230302020810.762384440@goodmis.org

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reported-by: Mark Rutland <mark.rutland@arm.com>
Fixes: c6afad49d1 ("tracing: Add hist trigger 'sym' and 'sym-offset' modifiers")
Tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-09 22:17:06 -05:00
Steven Rostedt (Google)
e0213434fe tracing: Do not let histogram values have some modifiers
Histogram values can not be strings, stacktraces, graphs, symbols,
syscalls, or grouped in buckets or log. Give an error if a value is set to
do so.

Note, the histogram code was not prepared to handle these modifiers for
histograms and caused a bug.

Mark Rutland reported:

 # echo 'p:copy_to_user __arch_copy_to_user n=$arg2' >> /sys/kernel/tracing/kprobe_events
 # echo 'hist:keys=n:vals=hitcount.buckets=8:sort=hitcount' > /sys/kernel/tracing/events/kprobes/copy_to_user/trigger
 # cat /sys/kernel/tracing/events/kprobes/copy_to_user/hist
[  143.694628] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[  143.695190] Mem abort info:
[  143.695362]   ESR = 0x0000000096000004
[  143.695604]   EC = 0x25: DABT (current EL), IL = 32 bits
[  143.695889]   SET = 0, FnV = 0
[  143.696077]   EA = 0, S1PTW = 0
[  143.696302]   FSC = 0x04: level 0 translation fault
[  143.702381] Data abort info:
[  143.702614]   ISV = 0, ISS = 0x00000004
[  143.702832]   CM = 0, WnR = 0
[  143.703087] user pgtable: 4k pages, 48-bit VAs, pgdp=00000000448f9000
[  143.703407] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000
[  143.704137] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
[  143.704714] Modules linked in:
[  143.705273] CPU: 0 PID: 133 Comm: cat Not tainted 6.2.0-00003-g6fc512c10a7c #3
[  143.706138] Hardware name: linux,dummy-virt (DT)
[  143.706723] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  143.707120] pc : hist_field_name.part.0+0x14/0x140
[  143.707504] lr : hist_field_name.part.0+0x104/0x140
[  143.707774] sp : ffff800008333a30
[  143.707952] x29: ffff800008333a30 x28: 0000000000000001 x27: 0000000000400cc0
[  143.708429] x26: ffffd7a653b20260 x25: 0000000000000000 x24: ffff10d303ee5800
[  143.708776] x23: ffffd7a6539b27b0 x22: ffff10d303fb8c00 x21: 0000000000000001
[  143.709127] x20: ffff10d303ec2000 x19: 0000000000000000 x18: 0000000000000000
[  143.709478] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[  143.709824] x14: 0000000000000000 x13: 203a6f666e692072 x12: 6567676972742023
[  143.710179] x11: 0a230a6d6172676f x10: 000000000000002c x9 : ffffd7a6521e018c
[  143.710584] x8 : 000000000000002c x7 : 7f7f7f7f7f7f7f7f x6 : 000000000000002c
[  143.710915] x5 : ffff10d303b0103e x4 : ffffd7a653b20261 x3 : 000000000000003d
[  143.711239] x2 : 0000000000020001 x1 : 0000000000000001 x0 : 0000000000000000
[  143.711746] Call trace:
[  143.712115]  hist_field_name.part.0+0x14/0x140
[  143.712642]  hist_field_name.part.0+0x104/0x140
[  143.712925]  hist_field_print+0x28/0x140
[  143.713125]  event_hist_trigger_print+0x174/0x4d0
[  143.713348]  hist_show+0xf8/0x980
[  143.713521]  seq_read_iter+0x1bc/0x4b0
[  143.713711]  seq_read+0x8c/0xc4
[  143.713876]  vfs_read+0xc8/0x2a4
[  143.714043]  ksys_read+0x70/0xfc
[  143.714218]  __arm64_sys_read+0x24/0x30
[  143.714400]  invoke_syscall+0x50/0x120
[  143.714587]  el0_svc_common.constprop.0+0x4c/0x100
[  143.714807]  do_el0_svc+0x44/0xd0
[  143.714970]  el0_svc+0x2c/0x84
[  143.715134]  el0t_64_sync_handler+0xbc/0x140
[  143.715334]  el0t_64_sync+0x190/0x194
[  143.715742] Code: a9bd7bfd 910003fd a90153f3 aa0003f3 (f9400000)
[  143.716510] ---[ end trace 0000000000000000 ]---
Segmentation fault

Link: https://lkml.kernel.org/r/20230302020810.559462599@goodmis.org

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: c6afad49d1 ("tracing: Add hist trigger 'sym' and 'sym-offset' modifiers")
Reported-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-09 22:17:06 -05:00
Linus Torvalds
44889ba56c Networking fixes for 6.3-rc2, including fixes from netfilter, bpf
Current release - regressions:
 
   - core: avoid skb end_offset change in __skb_unclone_keeptruesize()
 
   - sched:
     - act_connmark: handle errno on tcf_idr_check_alloc
     - flower: fix fl_change() error recovery path
 
   - ieee802154: prevent user from crashing the host
 
 Current release - new code bugs:
 
   - eth: bnxt_en: fix the double free during device removal
 
   - tools: ynl:
     - fix enum-as-flags in the generic CLI
     - fully inherit attrs in subsets
     - re-license uniformly under GPL-2.0 or BSD-3-clause
 
 Previous releases - regressions:
 
   - core: use indirect calls helpers for sk_exit_memory_pressure()
 
   - tls:
     - fix return value for async crypto
     - avoid hanging tasks on the tx_lock
 
   - eth: ice: copy last block omitted in ice_get_module_eeprom()
 
 Previous releases - always broken:
 
   - core: avoid double iput when sock_alloc_file fails
 
   - af_unix: fix struct pid leaks in OOB support
 
   - tls:
     - fix possible race condition
     - fix device-offloaded sendpage straddling records
 
   - bpf:
     - sockmap: fix an infinite loop error
     - test_run: fix &xdp_frame misplacement for LIVE_FRAMES
     - fix resolving BTF_KIND_VAR after ARRAY, STRUCT, UNION, PTR
 
   - netfilter: tproxy: fix deadlock due to missing BH disable
 
   - phylib: get rid of unnecessary locking
 
   - eth: bgmac: fix *initial* chip reset to support BCM5358
 
   - eth: nfp: fix csum for ipsec offload
 
   - eth: mtk_eth_soc: fix RX data corruption issue
 
 Misc:
 
   - usb: qmi_wwan: add telit 0x1080 composition
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmQJzQISHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOky5YP/04Dbsbeqpk0Q94axmjoaS0J/4rW49js
 RaA7v8ci7sL1omW8k5tILPXniAouN4YHNOCW1KbLBMR5O7lyn9qM1RteHOIpOmte
 TLzAw+6Wl7CyGiiqirv2GU96Wd/jZoZpPXFZz/gXP59GnkChSHzQcpexmz0nrmxI
 eCRSs+qm+re3wmDKTYm5C+g+420PNXu9JItPnTNf+nTkTBxpmOEMyry03I0taXKS
 wceQHB2q5E0sSWXDfkxG/pmUuYTj3AdRSQ+vo+FLevSs/LWeThs2I6pT5sn8XS76
 1S8Lh6FytfBhyalFmRtrpqIJYyGae5MwEXQ29ddfmF4OFFLedx3IH0+JFQxTE9So
 i4gaXmM5SUI7c5vhib097xUISoLxKqqXQVQQSQ1MPZRfXtVubbA2gCv+vh6fXVoj
 zQYatZOLM7KT9q4Pw8A+9bJPof/FV+ObC67pbGQbJJgBoy+oOixDuP+x5DYT384L
 /5XS+23OZiFe7bvQoE/0SQMeRk3lF2XkS5l9gSbdSnGQPiaOqKhDgkoCmdkn1jvg
 qtkBS6+tRRoOBNsjC4r4eFXBVOQ1+myyjZetBnEOaSp22FaTJFQh9qX3AMFIHbUy
 m0jDi9OJZSWHICd6KNWPm3JK43cMjiyZbGftYqOHhuY5HN30vQN6sl7DXIJ0rIcE
 myHMfizwqmGT
 =hSXM
 -----END PGP SIGNATURE-----

Merge tag 'net-6.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Including fixes from netfilter and bpf.

  Current release - regressions:

   - core: avoid skb end_offset change in __skb_unclone_keeptruesize()

   - sched:
      - act_connmark: handle errno on tcf_idr_check_alloc
      - flower: fix fl_change() error recovery path

   - ieee802154: prevent user from crashing the host

  Current release - new code bugs:

   - eth: bnxt_en: fix the double free during device removal

   - tools: ynl:
      - fix enum-as-flags in the generic CLI
      - fully inherit attrs in subsets
      - re-license uniformly under GPL-2.0 or BSD-3-clause

  Previous releases - regressions:

   - core: use indirect calls helpers for sk_exit_memory_pressure()

   - tls:
      - fix return value for async crypto
      - avoid hanging tasks on the tx_lock

   - eth: ice: copy last block omitted in ice_get_module_eeprom()

  Previous releases - always broken:

   - core: avoid double iput when sock_alloc_file fails

   - af_unix: fix struct pid leaks in OOB support

   - tls:
      - fix possible race condition
      - fix device-offloaded sendpage straddling records

   - bpf:
      - sockmap: fix an infinite loop error
      - test_run: fix &xdp_frame misplacement for LIVE_FRAMES
      - fix resolving BTF_KIND_VAR after ARRAY, STRUCT, UNION, PTR

   - netfilter: tproxy: fix deadlock due to missing BH disable

   - phylib: get rid of unnecessary locking

   - eth: bgmac: fix *initial* chip reset to support BCM5358

   - eth: nfp: fix csum for ipsec offload

   - eth: mtk_eth_soc: fix RX data corruption issue

  Misc:

   - usb: qmi_wwan: add telit 0x1080 composition"

* tag 'net-6.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (64 commits)
  tools: ynl: fix enum-as-flags in the generic CLI
  tools: ynl: move the enum classes to shared code
  net: avoid double iput when sock_alloc_file fails
  af_unix: fix struct pid leaks in OOB support
  eth: fealnx: bring back this old driver
  net: dsa: mt7530: permit port 5 to work without port 6 on MT7621 SoC
  net: microchip: sparx5: fix deletion of existing DSCP mappings
  octeontx2-af: Unlock contexts in the queue context cache in case of fault detection
  net/smc: fix fallback failed while sendmsg with fastopen
  ynl: re-license uniformly under GPL-2.0 OR BSD-3-Clause
  mailmap: update entries for Stephen Hemminger
  mailmap: add entry for Maxim Mikityanskiy
  nfc: change order inside nfc_se_io error path
  ethernet: ice: avoid gcc-9 integer overflow warning
  ice: don't ignore return codes in VSI related code
  ice: Fix DSCP PFC TLV creation
  net: usb: qmi_wwan: add Telit 0x1080 composition
  net: usb: cdc_mbim: avoid altsetting toggling for Telit FE990
  netfilter: conntrack: adopt safer max chain length
  net: tls: fix device-offloaded sendpage straddling records
  ...
2023-03-09 10:56:58 -08:00
Tobias Klauser
a402f1e353
fork: allow CLONE_NEWTIME in clone3 flags
Currently, calling clone3() with CLONE_NEWTIME in clone_args->flags
fails with -EINVAL. This is because CLONE_NEWTIME intersects with
CSIGNAL. However, CSIGNAL was deprecated when clone3 was introduced in
commit 7f192e3cd3 ("fork: add clone3"), allowing re-use of that part
of clone flags.

Fix this by explicitly allowing CLONE_NEWTIME in clone3_args_valid. This
is also in line with the respective check in check_unshare_flags which
allow CLONE_NEWTIME for unshare().

Fixes: 769071ac9f ("ns: Introduce Time Namespace")
Cc: Andrey Vagin <avagin@openvz.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-03-08 12:31:34 +01:00
David Disseldorp
03e1d60e17
watch_queue: fix IOC_WATCH_QUEUE_SET_SIZE alloc error paths
The watch_queue_set_size() allocation error paths return the ret value
set via the prior pipe_resize_ring() call, which will always be zero.

As a result, IOC_WATCH_QUEUE_SET_SIZE callers such as "keyctl watch"
fail to detect kernel wqueue->notes allocation failures and proceed to
KEYCTL_WATCH_KEY, with any notifications subsequently lost.

Fixes: c73be61ced ("pipe: Add general notification queue support")
Signed-off-by: David Disseldorp <ddiss@suse.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-03-08 11:44:45 +01:00
Jakub Kicinski
757b56a6c7 bpf-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZAZZ1wAKCRDbK58LschI
 g4fcAQDYVsICeBDmhdBdZs7Kb91/s6SrU6B0jy4zs0gOIBBOhgD7B3jt3dMTD2tp
 rPLHlv6uUoYS7mbZsrZi/XjVw8UmewM=
 =VUnr
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2023-03-06

We've added 8 non-merge commits during the last 7 day(s) which contain
a total of 9 files changed, 64 insertions(+), 18 deletions(-).

The main changes are:

1) Fix BTF resolver for DATASEC sections when a VAR points at a modifier,
   that is, keep resolving such instances instead of bailing out,
   from Lorenz Bauer.

2) Fix BPF test framework with regards to xdp_frame info misplacement
   in the "live packet" code, from Alexander Lobakin.

3) Fix an infinite loop in BPF sockmap code for TCP/UDP/AF_UNIX,
   from Liu Jian.

4) Fix a build error for riscv BPF JIT under PERF_EVENTS=n,
   from Randy Dunlap.

5) Several BPF doc fixes with either broken links or external instead
   of internal doc links, from Bagas Sanjaya.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests/bpf: check that modifier resolves after pointer
  btf: fix resolving BTF_KIND_VAR after ARRAY, STRUCT, UNION, PTR
  bpf, test_run: fix &xdp_frame misplacement for LIVE_FRAMES
  bpf, doc: Link to submitting-patches.rst for general patch submission info
  bpf, doc: Do not link to docs.kernel.org for kselftest link
  bpf, sockmap: Fix an infinite loop error when len is 0 in tcp_bpf_recvmsg_parser()
  riscv, bpf: Fix patch_text implicit declaration
  bpf, docs: Fix link to BTF doc
====================

Link: https://lore.kernel.org/r/20230306215944.11981-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-06 20:28:00 -08:00
Lorenz Bauer
9b459804ff btf: fix resolving BTF_KIND_VAR after ARRAY, STRUCT, UNION, PTR
btf_datasec_resolve contains a bug that causes the following BTF
to fail loading:

    [1] DATASEC a size=2 vlen=2
        type_id=4 offset=0 size=1
        type_id=7 offset=1 size=1
    [2] INT (anon) size=1 bits_offset=0 nr_bits=8 encoding=(none)
    [3] PTR (anon) type_id=2
    [4] VAR a type_id=3 linkage=0
    [5] INT (anon) size=1 bits_offset=0 nr_bits=8 encoding=(none)
    [6] TYPEDEF td type_id=5
    [7] VAR b type_id=6 linkage=0

This error message is printed during btf_check_all_types:

    [1] DATASEC a size=2 vlen=2
        type_id=7 offset=1 size=1 Invalid type

By tracing btf_*_resolve we can pinpoint the problem:

    btf_datasec_resolve(depth: 1, type_id: 1, mode: RESOLVE_TBD) = 0
        btf_var_resolve(depth: 2, type_id: 4, mode: RESOLVE_TBD) = 0
            btf_ptr_resolve(depth: 3, type_id: 3, mode: RESOLVE_PTR) = 0
        btf_var_resolve(depth: 2, type_id: 4, mode: RESOLVE_PTR) = 0
    btf_datasec_resolve(depth: 1, type_id: 1, mode: RESOLVE_PTR) = -22

The last invocation of btf_datasec_resolve should invoke btf_var_resolve
by means of env_stack_push, instead it returns EINVAL. The reason is that
env_stack_push is never executed for the second VAR.

    if (!env_type_is_resolve_sink(env, var_type) &&
        !env_type_is_resolved(env, var_type_id)) {
        env_stack_set_next_member(env, i + 1);
        return env_stack_push(env, var_type, var_type_id);
    }

env_type_is_resolve_sink() changes its behaviour based on resolve_mode.
For RESOLVE_PTR, we can simplify the if condition to the following:

    (btf_type_is_modifier() || btf_type_is_ptr) && !env_type_is_resolved()

Since we're dealing with a VAR the clause evaluates to false. This is
not sufficient to trigger the bug however. The log output and EINVAL
are only generated if btf_type_id_size() fails.

    if (!btf_type_id_size(btf, &type_id, &type_size)) {
        btf_verifier_log_vsi(env, v->t, vsi, "Invalid type");
        return -EINVAL;
    }

Most types are sized, so for example a VAR referring to an INT is not a
problem. The bug is only triggered if a VAR points at a modifier. Since
we skipped btf_var_resolve that modifier was also never resolved, which
means that btf_resolved_type_id returns 0 aka VOID for the modifier.
This in turn causes btf_type_id_size to return NULL, triggering EINVAL.

To summarise, the following conditions are necessary:

- VAR pointing at PTR, STRUCT, UNION or ARRAY
- Followed by a VAR pointing at TYPEDEF, VOLATILE, CONST, RESTRICT or
  TYPE_TAG

The fix is to reset resolve_mode to RESOLVE_TBD before attempting to
resolve a VAR from a DATASEC.

Fixes: 1dc9285184 ("bpf: kernel side support for BTF Var and DataSec")
Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
Link: https://lore.kernel.org/r/20230306112138.155352-2-lmb@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-03-06 11:44:13 -08:00
Linus Torvalds
4e9c542c7a A set of updates for the interrupt susbsystem:
- Prevent possible NULL pointer derefences in irq_data_get_affinity_mask()
     and irq_domain_create_hierarchy().
 
   - Take the per device MSI lock before invoking code which relies
     on it being hold.
 
   - Make sure that MSI descriptors are unreferenced before freeing
     them. This was overlooked when the platform MSI code was converted to
     use core infrastructure and results in a fals positive warning.
 
   - Remove dead code in the MSI subsystem.
 
   - Clarify the documentation for pci_msix_free_irq().
 
   - More kobj_type constification.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmQEVToTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYodc9EACg5HBOGsh5OV8pwnuEDqfThK/dOZ5k
 LJ8xQjGAx29JeNWu4gkkHaSDEGjZhwLiZlB6qUeH+LPQ1NgmAlLL3T2NxEOOWa6y
 z/xQv+1Ceu6XxazpCSFRWR/6w4Nyup92jhlsUIkmmsWkVvKH/pV6Uo+3ta0WagWg
 heb3vqts6J0AOJaMepF8azYGbwAPSIElNLI1UtiEuQYEKU55N8jLK20VJTL6lzJ2
 FyRg/0ghNWDAaBdnv4cZCQ/MzoG5UkoU3f2cqhdSce5mqnq2fKRfgBjzllNgaRgA
 zxOxIR88QaKTMHIr+WKD1dyWxDQlotFbBOkmVW39XAa13rn42s4GIeW88VCjJGww
 RAm52SbC48cCIyNQlh4A6Vhb4vjPx2DndWbWnnVj5fWlUevdAPRxSlm4BjfxFxe+
 LbuZCRRL1jjlC0fXmhVXTTxeE1/K7jarAZwRV7Nxhr3g0gT+Zv1jyaaW9rWuHq5U
 3pS+xBl89LA/VYp9tv6jDfJlocmRwgrFbGX4UlfikqtObdTFqcH0FtmqisE61fZS
 n0194BMWNDfPSibSpDohf/CDPoHZ6pNxeuqkVDiisUJHPpIYOt8+lH+8//DgBL7a
 oi9zS0JazPIn2VM6NB4f/WXOYmS9GZq5+loiYEWb52AYtodKUKmOoWG0SUyy6XFr
 E7yJzemsUwrJVg==
 =jWot
 -----END PGP SIGNATURE-----

Merge tag 'irq-urgent-2023-03-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq updates from Thomas Gleixner:
 "A set of updates for the interrupt susbsystem:

   - Prevent possible NULL pointer derefences in
     irq_data_get_affinity_mask() and irq_domain_create_hierarchy()

   - Take the per device MSI lock before invoking code which relies on
     it being hold

   - Make sure that MSI descriptors are unreferenced before freeing
     them. This was overlooked when the platform MSI code was converted
     to use core infrastructure and results in a fals positive warning

   - Remove dead code in the MSI subsystem

   - Clarify the documentation for pci_msix_free_irq()

   - More kobj_type constification"

* tag 'irq-urgent-2023-03-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq/msi, platform-msi: Ensure that MSI descriptors are unreferenced
  genirq/msi: Drop dead domain name assignment
  irqdomain: Add missing NULL pointer check in irq_domain_create_hierarchy()
  genirq/irqdesc: Make kobj_type structures constant
  PCI/MSI: Clarify usage of pci_msix_free_irq()
  genirq/msi: Take the per-device MSI lock before validating the control structure
  genirq/ipi: Fix NULL pointer deref in irq_data_get_affinity_mask()
2023-03-05 11:19:16 -08:00