Commit Graph

45887 Commits

Author SHA1 Message Date
Huang Ying
b4afe4183e resource: fix region_intersects() vs add_memory_driver_managed()
On a system with CXL memory, the resource tree (/proc/iomem) related to
CXL memory may look like something as follows.

490000000-50fffffff : CXL Window 0
  490000000-50fffffff : region0
    490000000-50fffffff : dax0.0
      490000000-50fffffff : System RAM (kmem)

Because drivers/dax/kmem.c calls add_memory_driver_managed() during
onlining CXL memory, which makes "System RAM (kmem)" a descendant of "CXL
Window X".  This confuses region_intersects(), which expects all "System
RAM" resources to be at the top level of iomem_resource.  This can lead to
bugs.

For example, when the following command line is executed to write some
memory in CXL memory range via /dev/mem,

 $ dd if=data of=/dev/mem bs=$((1 << 10)) seek=$((0x490000000 >> 10)) count=1
 dd: error writing '/dev/mem': Bad address
 1+0 records in
 0+0 records out
 0 bytes copied, 0.0283507 s, 0.0 kB/s

the command fails as expected.  However, the error code is wrong.  It
should be "Operation not permitted" instead of "Bad address".  More
seriously, the /dev/mem permission checking in devmem_is_allowed() passes
incorrectly.  Although the accessing is prevented later because ioremap()
isn't allowed to map system RAM, it is a potential security issue.  During
command executing, the following warning is reported in the kernel log for
calling ioremap() on system RAM.

 ioremap on RAM at 0x0000000490000000 - 0x0000000490000fff
 WARNING: CPU: 2 PID: 416 at arch/x86/mm/ioremap.c:216 __ioremap_caller.constprop.0+0x131/0x35d
 Call Trace:
  memremap+0xcb/0x184
  xlate_dev_mem_ptr+0x25/0x2f
  write_mem+0x94/0xfb
  vfs_write+0x128/0x26d
  ksys_write+0xac/0xfe
  do_syscall_64+0x9a/0xfd
  entry_SYSCALL_64_after_hwframe+0x4b/0x53

The details of command execution process are as follows.  In the above
resource tree, "System RAM" is a descendant of "CXL Window 0" instead of a
top level resource.  So, region_intersects() will report no System RAM
resources in the CXL memory region incorrectly, because it only checks the
top level resources.  Consequently, devmem_is_allowed() will return 1
(allow access via /dev/mem) for CXL memory region incorrectly. 
Fortunately, ioremap() doesn't allow to map System RAM and reject the
access.

So, region_intersects() needs to be fixed to work correctly with the
resource tree with "System RAM" not at top level as above.  To fix it, if
we found a unmatched resource in the top level, we will continue to search
matched resources in its descendant resources.  So, we will not miss any
matched resources in resource tree anymore.

In the new implementation, an example resource tree

|------------- "CXL Window 0" ------------|
|-- "System RAM" --|

will behave similar as the following fake resource tree for
region_intersects(, IORESOURCE_SYSTEM_RAM, ),

|-- "System RAM" --||-- "CXL Window 0a" --|

Where "CXL Window 0a" is part of the original "CXL Window 0" that
isn't covered by "System RAM".

Link: https://lkml.kernel.org/r/20240906030713.204292-2-ying.huang@intel.com
Fixes: c221c0b030 ("device-dax: "Hotplug" persistent memory for use like normal RAM")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Alison Schofield <alison.schofield@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-17 00:58:04 -07:00
Linus Torvalds
c903327d32 printk changes for 6.12
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAmbi598ACgkQUqAMR0iA
 lPL3lw//WaRDKJ1Cb/bKAn3nRpjdqiNBI//K1gRJp0LgLE7qEudE25t4j3F9tvvP
 pc9AB81g1Au8Br6iOd+NiGXXW5KWJHaZ3rUAdeo6co4NQCbrY6qTA78ItZSQImBH
 A9fhWWr1TGRX8L/N/gR2eYBnpbDGIbRahUOQraUpBn4kEPyR47KEx7Njjo48GcmR
 Ye8dIYwUOWEgQeIuIxIAwNf6KyNjo5tQpgve+M8HGwy8mZqP9XV6UjXUACVwQNx6
 +CK+IGM+94tCq5KalOaJ5BtsXGKlabHIs7y9QpLS45M2QoHIlDIvpaxzLf0FTsPI
 CppqedAGN2jU0NyjfbFk1c+SNQfDZEAZVyF6vKFelP7t2jzAx301RyB2S+Cm7Hh+
 PajFty41UT0/y17V4sZawfMqpFyp7Wr6RKQYYKMBRdSQQkToh/dmebBvqPAHW9cJ
 LInQQf+XdzbonKa+CTmT/Tg+eM2R124FWeMVnEMdtyXpKUV9qdKWfngtzyRMQiYI
 q54ZwKd3VJ9kRIfb7Fp0TBr2NErdnEQE5hh9QhI8SAWENskw5+GmYaQit734U9wA
 SU7t9rir7NS4Rc1jHP9SQ9oWWI9HT4hthRGkLh2Knx0O2c6AwOuEI4wkjzSWI3GX
 /eeofnbZiUpi7fESf9qmTGtQZ4/9ogQ7fNaroWCSfQzq3+wl+2o=
 =28sV
 -----END PGP SIGNATURE-----

Merge tag 'printk-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux

Pull printk updates from Petr Mladek:
 "This is the "last" part of the support for the new nbcon consoles.
  Where "nbcon" stays for "No Big console lock CONsoles" aka not under
  the console_lock.

  New callbacks are added to struct console:

   - write_thread() for flushing nbcon consoles in task context.

   - write_atomic() for flushing nbcon consoles in atomic context,
     including NMI.

   - con->device_lock() and device_unlock() for taking the driver
     specific lock, for example, port->lock.

  New printk-specific kthreads are created:

   - per-console kthreads which get responsible for flushing normal
     priority messages on nbcon consoles.

   - thread which gets responsible for flushing normal priority messages
     on all consoles when CONFIG_RT enabled.

  The new callbacks are called under a special per-console lock which
  has already been added back in v6.7. It allows to distinguish three
  severities: normal, emergency, and panic. A context with a higher
  priority could take over the ownership when it is safe even in the
  middle of handling a record. The panic context could do it even when
  it is not safe. But it is allowed only for the final desperate flush
  before entering the infinite loop.

  The new lock helps to flush the messages directly in emergency and
  panic contexts. But it is not enough in all situations:

   - console_lock() is still need for synchronization against boot
     consoles.

   - con->device_lock() is need for synchronization against other
     operations on the same HW, e.g. serial port speed setting,
     non-printk related read/write.

  The dependency on con->device_lock() is mutual. Any code taking the
  driver specific lock has to acquire the related nbcon console context
  as well. For example, see the new uart_port_lock() API. It provides
  the necessary synchronization against emergency and panic contexts
  where the messages are flushed only under the new per-console lock.

  Maybe surprisingly, a quite tricky part is the decision how to flush
  the consoles in various situations. It has to take into account:

   - message priority:    normal, emergency, panic

   - scheduling context:  task, atomic, deferred_legacy

   - registered consoles: boot, legacy, nbcon

   - threads are running: early boot, suspend, shutdown, panic

   - caller:              printk(), pr_flush(), printk_flush_in_panic(),
                          console_unlock(), console_start(), ...

  The primary decision is made in printk_get_console_flush_type(). It
  creates a hint what the caller should do:

   - flush nbcon consoles directly or via the kthread

   - call the legacy loop (console_unlock()) directly or via irq_work

  The existing behavior is preserved for the legacy consoles. The only
  exception is that they are not longer flushed directly from printk()
  in panic() before CPUs are stopped. But this blocking happens only
  when at least one nbcon console is registered. The motivation is to
  increase a chance to produce the crash dump. They legacy consoles
  might create a deadlock in compare with nbcon consoles. The nbcon
  console should allow to see the messages even when the crash dump
  fails.

  There are three possible ways how nbcon consoles are flushed:

   - The per-nbcon-console kthread is responsible for flushing messages
     added with the normal priority. This is the default mode.

   - The legacy loop, aka console_unlock(), is used when there is still
     a boot console registered. There is no easy way how to match an
     early console driver with a nbcon console driver. And the
     console_lock() provides the only reliable serialization at the
     moment.

     The legacy loop uses either con->write_atomic() or
     con->write_thread() callbacks depending on whether it is allowed to
     schedule. The atomic variant has to be used from printk().

   - In other situations, the messages are flushed directly using
     write_atomic() which can be called in any context, including NMI.
     It is primary needed during early boot or shutdown, in emergency
     situations, and panic.

  The emergency priority is used by a code called within
  nbcon_cpu_emergency_enter()/exit(). At the moment, it is used in four
  situations: WARN(), Oops, lockdep, and RCU stall reports.

  Finally, there is no nbcon console at the moment. It means that the
  changes should _not_ modify the existing behavior. The only exception
  is CONFIG_RT which would force offloading the legacy loop, for normal
  priority context, into the dedicated kthread"

* tag 'printk-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: (54 commits)
  printk: Avoid false positive lockdep report for legacy printing
  printk: nbcon: Assign nice -20 for printing threads
  printk: Implement legacy printer kthread for PREEMPT_RT
  tty: sysfs: Add nbcon support for 'active'
  proc: Add nbcon support for /proc/consoles
  proc: consoles: Add notation to c_start/c_stop
  printk: nbcon: Show replay message on takeover
  printk: Provide helper for message prepending
  printk: nbcon: Rely on kthreads for normal operation
  printk: nbcon: Use thread callback if in task context for legacy
  printk: nbcon: Relocate nbcon_atomic_emit_one()
  printk: nbcon: Introduce printer kthreads
  printk: nbcon: Init @nbcon_seq to highest possible
  printk: nbcon: Add context to usable() and emit()
  printk: Flush console on unregister_console()
  printk: Fail pr_flush() if before SYSTEM_SCHEDULING
  printk: nbcon: Add function for printers to reacquire ownership
  printk: nbcon: Use raw_cpu_ptr() instead of open coding
  printk: Use the BITS_PER_LONG macro
  lockdep: Mark emergency sections in lockdep splats
  ...
2024-09-17 08:52:28 +02:00
Linus Torvalds
9ea925c806 Updates for timers and timekeeping:
- Core:
 
 	- Overhaul of posix-timers in preparation of removing the
 	  workaround for periodic timers which have signal delivery
 	  ignored.
 
         - Remove the historical extra jiffie in msleep()
 
 	  msleep() adds an extra jiffie to the timeout value to ensure
 	  minimal sleep time. The timer wheel ensures minimal sleep
 	  time since the large rewrite to a non-cascading wheel, but the
 	  extra jiffie in msleep() remained unnoticed. Remove it.
 
         - Make the timer slack handling correct for realtime tasks.
 
 	  The procfs interface is inconsistent and does neither reflect
 	  reality nor conforms to the man page. Show the correct 0 slack
 	  for real time tasks and enforce it at the core level instead of
 	  having inconsistent individual checks in various timer setup
 	  functions.
 
         - The usual set of updates and enhancements all over the place.
 
   - Drivers:
 
         - Allow the ACPI PM timer to be turned off during suspend
 
 	- No new drivers
 
 	- The usual updates and enhancements in various drivers
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmbn7jQTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYobqnD/9COlU0nwsulABI/aNIrsh6iYvnCC9v
 14CcNta7Qn+157Wfw9BWOyHdNhR1/fPCXE8jJ71zTyIOeW27HV2JyTtxTwe9ZcdK
 ViHAaj7YcIjcVUEC3StCoRCPnvLslEw4qJA5AOQuDyMivdQn+YVa2c0baJxKaXZt
 xk4HZdMj4NAS0jRKnoZSwtKW/+Oz6rR4GAWrZo+Zs1/8ur3HfqnQfi8lJ1hJtLLW
 V7XDCVRvamVi6Ah3ocYPPp/1P6yeQDA1ge9aMddqaza5STWISXRtSnFMUmYP3rbS
 FaL8TyL+ilfny8pkGB2WlG6nLuSbtvogtdEh1gG1k1RmZt44kAtk8ba/KiWFPBSb
 zK9cjojRMBS71f9G4kmb5F4rnXoLsg1YbD1Nzhz3wq2Cs1Z90dc2QwMren0zoQ1x
 Fn56ueRyAiagBlnrSaKyso/2RvqJTNoSdi3RkpjYeAph0UoDCqvTvKjGAf1mWiw1
 T/1lUWSVqWHnzZbM7XXzzajIN9bl6A7bbqlcAJ2O9vZIDt7273DG+bQym9Vh6Why
 0LTGGERHxzKBsG7WRg+2Gmvv6S18UPKRo8tLtlA758rHlFuPTZCShWrIriwSNl1K
 Hxon+d4BparSnm1h9W/NHPKJA574UbWRCBjdk58IkAj8DxZZY4ORD9SMP+ggkV7G
 F6p9cgoDNP9KFg==
 =jE0N
 -----END PGP SIGNATURE-----

Merge tag 'timers-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer updates from Thomas Gleixner:
 "Core:

   - Overhaul of posix-timers in preparation of removing the workaround
     for periodic timers which have signal delivery ignored.

   - Remove the historical extra jiffie in msleep()

     msleep() adds an extra jiffie to the timeout value to ensure
     minimal sleep time. The timer wheel ensures minimal sleep time
     since the large rewrite to a non-cascading wheel, but the extra
     jiffie in msleep() remained unnoticed. Remove it.

   - Make the timer slack handling correct for realtime tasks.

     The procfs interface is inconsistent and does neither reflect
     reality nor conforms to the man page. Show the correct 0 slack for
     real time tasks and enforce it at the core level instead of having
     inconsistent individual checks in various timer setup functions.

   - The usual set of updates and enhancements all over the place.

  Drivers:

   - Allow the ACPI PM timer to be turned off during suspend

   - No new drivers

   - The usual updates and enhancements in various drivers"

* tag 'timers-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits)
  ntp: Make sure RTC is synchronized when time goes backwards
  treewide: Fix wrong singular form of jiffies in comments
  cpu: Use already existing usleep_range()
  timers: Rename next_expiry_recalc() to be unique
  platform/x86:intel/pmc: Fix comment for the pmc_core_acpi_pm_timer_suspend_resume function
  clocksource/drivers/jcore: Use request_percpu_irq()
  clocksource/drivers/cadence-ttc: Add missing clk_disable_unprepare in ttc_setup_clockevent
  clocksource/drivers/asm9260: Add missing clk_disable_unprepare in asm9260_timer_init
  clocksource/drivers/qcom: Add missing iounmap() on errors in msm_dt_timer_init()
  clocksource/drivers/ingenic: Use devm_clk_get_enabled() helpers
  platform/x86:intel/pmc: Enable the ACPI PM Timer to be turned off when suspended
  clocksource: acpi_pm: Add external callback for suspend/resume
  clocksource/drivers/arm_arch_timer: Using for_each_available_child_of_node_scoped()
  dt-bindings: timer: rockchip: Add rk3576 compatible
  timers: Annotate possible non critical data race of next_expiry
  timers: Remove historical extra jiffie for timeout in msleep()
  hrtimer: Use and report correct timerslack values for realtime tasks
  hrtimer: Annotate hrtimer_cpu_base_.*_expiry() for sparse.
  timers: Add sparse annotation for timer_sync_wait_running().
  signal: Replace BUG_ON()s
  ...
2024-09-17 07:25:37 +02:00
Linus Torvalds
cb69d86550 Updates for the interrupt subsystem:
- Core:
 	- Remove a global lock in the affinity setting code
 
 	  The lock protects a cpumask for intermediate results and the lock
 	  causes a bottleneck on simultaneous start of multiple virtual
 	  machines. Replace the lock and the static cpumask with a per CPU
 	  cpumask which is nicely serialized by raw spinlock held when
 	  executing this code.
 
 	- Provide support for giving a suffix to interrupt domain names.
 
 	  That's required to support devices with subfunctions so that the
 	  domain names are distinct even if they originate from the same
 	  device node.
 
 	- The usual set of cleanups and enhancements all over the place
 
   - Drivers:
 
 	- Support for longarch AVEC interrupt chip
 
 	- Refurbishment of the Armada driver so it can be extended for new
           variants.
 
 	- The usual set of cleanups and enhancements all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmbn5p8THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoRFtD/43eB3h5usY2OPW0JmDqrE6qnzsvjPZ
 1H52BcmMcOuI6yCfTnbi/fBB52mwSEGq9Dmt1GXradyq9/CJDIqZ1ajI1rA2jzW2
 YdbeTDpKm1rS2ddzfp2LT2BryrNt+7etrRO7qHn4EKSuOcNuV2f58WPbIIqasvaK
 uPbUDVDPrvXxLNcjoab6SqaKrEoAaHSyKpd0MvDd80wHrtcSC/QouW7JDSUXv699
 RwvLebN1OF6mQ2J8Z3DLeCQpcbAs+UT8UvID7kYUJi1g71J/ZY+xpMLoX/gHiDNr
 isBtsuEAiZeNaFpksc7A6Jgu5ljZf2/aLCqbPLlHaduHFNmo94x9KUbIF2cpEMN+
 rsf5Ff7AVh1otz3cUwLLsm+cFLWRRoZdLuncn7rrgB4Yg0gll7qzyLO6YGvQHr8U
 Ocj1RXtvvWsMk4XzhgCt1AH/42cO6go+bhA4HspeYykNpsIldIUl1MeFbO8sWiDJ
 kybuwiwHp3oaMLjEK4Lpq65u7Ll8Lju2zRde65YUJN2nbNmJFORrOLmeC1qsr6ri
 dpend6n2qD9UD1oAt32ej/uXnG160nm7UKescyxiZNeTm1+ez8GW31hY128ifTY3
 4R3urGS38p3gazXBsfw6eqkeKx0kEoDNoQqrO5gBvb8kowYTvoZtkwMGAN9OADwj
 w6vvU0i+NIyVMA==
 =JlJ2
 -----END PGP SIGNATURE-----

Merge tag 'irq-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq updates from Thomas Gleixner:
 "Core:

   - Remove a global lock in the affinity setting code

     The lock protects a cpumask for intermediate results and the lock
     causes a bottleneck on simultaneous start of multiple virtual
     machines. Replace the lock and the static cpumask with a per CPU
     cpumask which is nicely serialized by raw spinlock held when
     executing this code.

   - Provide support for giving a suffix to interrupt domain names.

     That's required to support devices with subfunctions so that the
     domain names are distinct even if they originate from the same
     device node.

   - The usual set of cleanups and enhancements all over the place

  Drivers:

   - Support for longarch AVEC interrupt chip

   - Refurbishment of the Armada driver so it can be extended for new
     variants.

   - The usual set of cleanups and enhancements all over the place"

* tag 'irq-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (73 commits)
  genirq: Use cpumask_intersects()
  genirq/cpuhotplug: Use cpumask_intersects()
  irqchip/apple-aic: Only access system registers on SoCs which provide them
  irqchip/apple-aic: Add a new "Global fast IPIs only" feature level
  irqchip/apple-aic: Skip unnecessary enabling of use_fast_ipi
  dt-bindings: apple,aic: Document A7-A11 compatibles
  irqdomain: Use IS_ERR_OR_NULL() in irq_domain_trim_hierarchy()
  genirq/msi: Use kmemdup_array() instead of kmemdup()
  genirq/proc: Change the return value for set affinity permission error
  genirq/proc: Use irq_move_pending() in show_irq_affinity()
  genirq/proc: Correctly set file permissions for affinity control files
  genirq: Get rid of global lock in irq_do_set_affinity()
  genirq: Fix typo in struct comment
  irqchip/loongarch-avec: Add AVEC irqchip support
  irqchip/loongson-pch-msi: Prepare get_pch_msi_handle() for AVECINTC
  irqchip/loongson-eiointc: Rename CPUHP_AP_IRQ_LOONGARCH_STARTING
  LoongArch: Architectural preparation for AVEC irqchip
  LoongArch: Move irqchip function prototypes to irq-loongson.h
  irqchip/loongson-pch-msi: Switch to MSI parent domains
  softirq: Remove unused 'action' parameter from action callback
  ...
2024-09-17 07:09:17 +02:00
Linus Torvalds
a64405b78b Updates for the clocksource watchdog:
- Make the uncertainty margin handling more robust to prevent false
     positives
 
   - Clarify comments
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmbn6xETHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoafvEAC30/0ePYUhwnJuhm1DboYLWWVGJlSI
 WpdwmL1YgBUTAjJC0wuYA+Fc8XV9grvxY86bPAsU2Lep85znCSIK/YEO4tsN5Tnq
 +aWGR8Q+2j80MxGCP2mFtG1P0K35P4dopRJgvGmE/xtqKxJ3IPryhfjegHVREDoe
 PNQt59lW1WZhydLXB60S9FCAJeLFRWAoc2tOIr9T2CnsRkc8aXdWnpEU40bN+nFH
 rX09ynPPOlxo5f6z5YWBOuv7LgjvNjmJP3qf8d4Sp4stj7+kI3ipZh6mlolek4l+
 sMlbBAelQk+eiPYPIbthZvMvhM3J+YFXy4nh1LPYSRqWIMsgtobxwNUvWV0CqZaQ
 ZEImJqh+QJA27RAD13Uiv3N68prgW7yj/65b3EbjiK9tka2+67JVWTnlxuuyYu+S
 JxetHfxLKqohGhKAgAeO11efcf4HBh82afqDvMlWvTwDxyqMVLXM2NeT2iOgeYZk
 eH85R7ophcVdQAHZlYagsYA1oQMyT+UD40kiZzrX6WVHqX5ohVpnV/X8cJUVcUGa
 phOdX03ARzTSYdbE+J61bfAMCS2Tme34LkBMgZmooVfKwmnJOCGcskljtLOssm7N
 pTU7dI5dJLt6P+HZ6dbxf0xd0z6mNLXbVDlibEfR1ISfhLOHyH+2ywM0OJFDI0sP
 Ydt3U0n8sNFiXw==
 =Y38F
 -----END PGP SIGNATURE-----

Merge tag 'timers-clocksource-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull clocksource watchdog updates from Thomas Gleixner:

 - Make the uncertainty margin handling more robust to prevent false
   positives

 - Clarify comments

* tag 'timers-clocksource-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  clocksource: Set cs_watchdog_read() checks based on .uncertainty_margin
  clocksource: Fix comments on WATCHDOG_THRESHOLD & WATCHDOG_MAX_SKEW
  clocksource: Improve comments for watchdog skew bounds
2024-09-17 07:05:08 +02:00
Linus Torvalds
97e17c08a4 A set of updates for CPU hotplug:
- Prepare the core for supporting parallel hotplug on loongarch
 
   - A small set of cleanups and enhancements
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmbn6qETHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoZutD/94s3G8D3xrgQ6DwMHVtMtIAbzLtlBt
 SeKpIiSCnSy8bnQ+sAqOw4VjmbpB0dOlcJRii701D6hY+48TEgsL3dLn1/ws4ECc
 /5PapLihQgIquiAqk9iQH2BOrsFVqOGp4jbU95+ppBQtiDIB+3KeQPxxws57xb7E
 EUXzgCMTSsqlHCt40UCbsn7atbj0AfkV12uPKsNZT7WxPjxGK3OLuttMA6a+4xHm
 nBxxy/Vp9ll3J+uRGQobLFgZiIEiUsHI/+pGwltYxXC7jdN3joGqD3LuwypqLuly
 Ir8yXP+NhpOeNMn3iSVE8sm39bp8Sm1UslrbXvlQHGuP1JsUnIZzyevdneUp5Je7
 zDKHzfn04Ls3uK1XiuQGUTvLYuiHPQ/UHP8ZeWFlkapFFDtl3fu2FU9r+LlkwKZK
 /0zQF6R5eBaGl3F1YKn7nPcfNf1jTLQlYq+eZT2DnSSeOb7ammjxVGgIMzWRWidG
 ZFNZhkjusRi3MH4aYLF8mQl7nyepy4+XQF4K0PusQ8B/NQxYRoI66mFsKhtufn5e
 7T9vpYTazmhazl9SO1wQ9NNXYub+bjVj3fyRl5WSsTdS5d9pz9yqgC+xBIAJXTaq
 9kN+NlP/nJ6HAzTgO074znUYR/tjlki22hNHVa6JyEh0/h0AVG53CFH5/hSONJ3P
 jvnhjxM/X/kLwg==
 =0FVL
 -----END PGP SIGNATURE-----

Merge tag 'smp-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull CPU hotplug updates from Thomas Gleixner:

 - Prepare the core for supporting parallel hotplug on loongarch

 - A small set of cleanups and enhancements

* tag 'smp-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  smp: Mark smp_prepare_boot_cpu() __init
  cpu: Fix W=1 build kernel-doc warning
  cpu/hotplug: Provide weak fallback for arch_cpuhp_init_parallel_bringup()
  cpu/hotplug: Make HOTPLUG_PARALLEL independent of HOTPLUG_SMT
2024-09-17 06:56:31 +02:00
Linus Torvalds
dc644fba3c audit/stable-6.12 PR 20240911
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmbiGCwUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNi8w//dHA2tkspz6tZsNLzwUD/IgZBr/uc
 HgOszt1zstPdEPS8kS4kP99db2dTU9gzAv7vCLoKZr5KLborRRYw9PHAXflygkNc
 IBkgfgzVfQJ5ftusRHOe8PUyhTjgu/Li20pLwfiAAPnhWGpK1LioBuXqy9Al2KOl
 36UKCLSd+YJ6fOPbZG/+eys5H40ev/eEmWk6EkC1ZOJhaJePBnlB29yf8l8rkdyj
 8xDArtbJEGrE40w+wDAXOV2F/DXICY6jXUfiWKMePVM8W0ZoUDfNggfB74qKqPsK
 FiKXns0P48TMBIdD1tNzOMC+9QmfP6ZTY8n7bR/2usy+hW6pkBxihcfyj5k0PpGa
 i3+7x3Q43B09oDtDH77In1+7RgOlvZh6ZtJYXhB1MnOzMVyzuLHvrlEMob34d/vU
 qZ1Ol5dqhhdtGq0WkaIkvCiAU9qpx0XpCggEKdNo72Ekg9x4YNHDhhMOo7dLiR05
 jql4Wje5Ur/T64rUIux7Yldjx3k8zQ2JAMQQRK3xzLU0CWhz62ihF5hEQpIVaC70
 rfiUtoRDpogIU/yGLTddYfG2UB/8OPmw+j9hv0+jMCLAIqYRKegboNGIh+YdtLnE
 kKuFGMlfeNCyKJcG63E+CJ/lCplic7p2gdgJFqPGxdA/WtSLTotMw3ZkhwBJobVR
 7xNRDcIwMkHg5b0=
 =XN1y
 -----END PGP SIGNATURE-----

Merge tag 'audit-pr-20240911' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit

Pull audit updates from Paul Moore:

 - Fix some remaining problems with PID/TGID reporting

   When most users think about PIDs, what they are really thinking about
   is the TGID. This commit shifts the audit PID logging and filtering
   to use the TGID value which should provide a more meaningful audit
   stream and filtering experience for users.

 - Migrate to the str_enabled_disabled() helper

   Evidently we have helper functions that help ensure if we mistype
   "enabled" or "disabled" it is now caught at compile time. I guess
   we're fancy now.

* tag 'audit-pr-20240911' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: Make use of str_enabled_disabled() helper
  audit: use task_tgid_nr() instead of task_pid_nr()
2024-09-16 16:52:37 +02:00
Linus Torvalds
8f72c31f45 vfs-6.12.misc
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZuQEGwAKCRCRxhvAZXjc
 ojIuAQC433+hBkvjvmQ7H0r5rgZSjUuCTG3bSmdU7RJmPHUHhwEA85v/NGq53f+W
 IhandK6t+Cf0JYpFZ3N0bT88hDYVhQQ=
 =9zGL
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.12.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
 "This contains the usual pile of misc updates:

  Features:

   - Add F_CREATED_QUERY fcntl() that allows userspace to query whether
     a file was actually created. Often userspace wants to know whether
     an O_CREATE request did actually create a file without using
     O_EXCL. The current logic is that to first attempts to open the
     file without O_CREAT | O_EXCL and if ENOENT is returned userspace
     tries again with both flags. If that succeeds all is well. If it
     now reports EEXIST it retries.

     That works fairly well but some corner cases make this more
     involved. If this operates on a dangling symlink the first openat()
     without O_CREAT | O_EXCL will return ENOENT but the second openat()
     with O_CREAT | O_EXCL will fail with EEXIST.

     The reason is that openat() without O_CREAT | O_EXCL follows the
     symlink while O_CREAT | O_EXCL doesn't for security reasons. So
     it's not something we can really change unless we add an explicit
     opt-in via O_FOLLOW which seems really ugly.

     All available workarounds are really nasty (fanotify, bpf lsm etc)
     so add a simple fcntl().

   - Try an opportunistic lookup for O_CREAT. Today, when opening a file
     we'll typically do a fast lookup, but if O_CREAT is set, the kernel
     always takes the exclusive inode lock. This was likely done with
     the expectation that O_CREAT means that we always expect to do the
     create, but that's often not the case. Many programs set O_CREAT
     even in scenarios where the file already exists (see related
     F_CREATED_QUERY patch motivation above).

     The series contained in the pr rearranges the pathwalk-for-open
     code to also attempt a fast_lookup in certain O_CREAT cases. If a
     positive dentry is found, the inode_lock can be avoided altogether
     and it can stay in rcuwalk mode for the last step_into.

   - Expose the 64 bit mount id via name_to_handle_at()

     Now that we provide a unique 64-bit mount ID interface in statx(2),
     we can now provide a race-free way for name_to_handle_at(2) to
     provide a file handle and corresponding mount without needing to
     worry about racing with /proc/mountinfo parsing or having to open a
     file just to do statx(2).

     While this is not necessary if you are using AT_EMPTY_PATH and
     don't care about an extra statx(2) call, users that pass full paths
     into name_to_handle_at(2) need to know which mount the file handle
     comes from (to make sure they don't try to open_by_handle_at a file
     handle from a different filesystem) and switching to AT_EMPTY_PATH
     would require allocating a file for every name_to_handle_at(2) call

   - Add a per dentry expire timeout to autofs

     There are two fairly well known automounter map formats, the autofs
     format and the amd format (more or less System V and Berkley).

     Some time ago Linux autofs added an amd map format parser that
     implemented a fair amount of the amd functionality. This was done
     within the autofs infrastructure and some functionality wasn't
     implemented because it either didn't make sense or required extra
     kernel changes. The idea was to restrict changes to be within the
     existing autofs functionality as much as possible and leave changes
     with a wider scope to be considered later.

     One of these changes is implementing the amd options:
      1) "unmount", expire this mount according to a timeout (same as
         the current autofs default).
      2) "nounmount", don't expire this mount (same as setting the
         autofs timeout to 0 except only for this specific mount) .
      3) "utimeout=<seconds>", expire this mount using the specified
         timeout (again same as setting the autofs timeout but only for
         this mount)

     To implement these options per-dentry expire timeouts need to be
     implemented for autofs indirect mounts. This is because all map
     keys (mounts) for autofs indirect mounts use an expire timeout
     stored in the autofs mount super block info. structure and all
     indirect mounts use the same expire timeout.

  Fixes:

   - Fix missing fput for FSCONFIG_SET_FD in autofs

   - Use param->file for FSCONFIG_SET_FD in coda

   - Delete the 'fs/netfs' proc subtreee when netfs module exits

   - Make sure that struct uid_gid_map fits into a single cacheline

   - Don't flush in-flight wb switches for superblocks without cgroup
     writeback

   - Correcting the idmapping mount example in the idmapping
     documentation

   - Fix a race between evice_inodes() and find_inode() and iput()

   - Refine the show_inode_state() macro definition in writeback code

   - Prevent dump_mapping() from accessing invalid dentry.d_name.name

   - Show actual source for debugfs in /proc/mounts

   - Annotate data-race of busy_poll_usecs in eventpoll

   - Don't WARN for racy path_noexec check in exec code

   - Handle OOM on mnt_warn_timestamp_expiry()

   - Fix some spelling in the iomap design documentation

   - Fix typo in procfs comment

   - Fix typo in fs/namespace.c comment

  Cleanups:

   - Add the VFS git tree to the MAINTAINERS file

   - Move FMODE_UNSIGNED_OFFSET to fop_flags freeing up another f_mode
     bit in struct file bringing us to 5 free f_mode bits

   - Remove the __I_DIO_WAKEUP bit from i_state flags as we can simplify
     the wait mechanism

   - Remove the unused path_put_init() helper

   - Replace a __u32 with u32 for s_fsnotify_mask as __u32 is uapi
     specific

   - Replace the unsigned long i_state member with a u32 i_state member
     in struct inode freeing up 4 bytes in struct inode. Instead of
     using the bit based wait apis we're now using the var event apis
     and using the individual bytes of the i_state member to wait on
     state changes

   - Explain how per-syscall AT_* flags should be allocated

   - Use in_group_or_capable() helper to simplify the posix acl mode
     update code

   - Switch to LIST_HEAD() in fsync_buffers_list() to simplify the code

   - Removed comment about d_rcu_to_refcount() as that function doesn't
     exist anymore

   - Add kernel documentation for lookup_fast()

   - Don't re-zero evenpoll fields

   - Remove outdated comment after close_fd()

   - Fix imprecise wording in comment about the pipe filesystem

   - Drop GFP_NOFAIL mode from alloc_page_buffers

   - Missing blank line warnings and struct declaration improved in
     file_table

   - Annotate struct poll_list with __counted_by()

   - Remove the unused read parameter in percpu-rwsem

   - Remove linux/prefetch.h include from direct-io code

   - Use kmemdup_array instead of kmemdup for multiple allocation in
     mnt_idmapping code

   - Remove unused mnt_cursor_del() declaration

  Performance tweaks:

   - Dodge smp_mb in break_lease and break_deleg in the common case

   - Only read fops once in fops_{get,put}()

   - Use RCU in ilookup()

   - Elide smp_mb in iversion handling in the common case

   - Drop one lock trip in evict()"

* tag 'vfs-6.12.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (58 commits)
  uidgid: make sure we fit into one cacheline
  proc: Fix typo in the comment
  fs/pipe: Correct imprecise wording in comment
  fhandle: expose u64 mount id to name_to_handle_at(2)
  uapi: explain how per-syscall AT_* flags should be allocated
  fs: drop GFP_NOFAIL mode from alloc_page_buffers
  writeback: Refine the show_inode_state() macro definition
  fs/inode: Prevent dump_mapping() accessing invalid dentry.d_name.name
  mnt_idmapping: Use kmemdup_array instead of kmemdup for multiple allocation
  netfs: Delete subtree of 'fs/netfs' when netfs module exits
  fs: use LIST_HEAD() to simplify code
  inode: make i_state a u32
  inode: port __I_LRU_ISOLATING to var event
  vfs: fix race between evice_inodes() and find_inode()&iput()
  inode: port __I_NEW to var event
  inode: port __I_SYNC to var event
  fs: reorder i_state bits
  fs: add i_state helpers
  MAINTAINERS: add the VFS git tree
  fs: s/__u32/u32/ for s_fsnotify_mask
  ...
2024-09-16 08:35:09 +02:00
Linus Torvalds
02824a5fd1 Power management updates for 6.12-rc1
- Remove LATENCY_MULTIPLIER from cpufreq (Qais Yousef).
 
  - Add support for Granite Rapids and Sierra Forest in OOB mode to the
    intel_pstate cpufreq driver (Srinivas Pandruvada).
 
  - Add basic support for CPU capacity scaling on x86 and make the
    intel_pstate driver set asymmetric CPU capacity on hybrid systems
    without SMT (Rafael Wysocki).
 
  - Add missing MODULE_DESCRIPTION() macros to the powerpc cpufreq
    driver (Jeff Johnson).
 
  - Several OF related cleanups in cpufreq drivers (Rob Herring).
 
  - Enable COMPILE_TEST for ARM drivers (Rob Herrring).
 
  - Introduce quirks for syscon failures and use socinfo to get revision
    for TI cpufreq driver (Dhruva Gole, Nishanth Menon).
 
  - Minor cleanups in amd-pstate driver (Anastasia Belova, Dhananjay
    Ugwekar).
 
  - Minor cleanups for loongson, cpufreq-dt and powernv cpufreq drivers
    (Danila Tikhonov, Huacai Chen, and Liu Jing).
 
  - Make amd-pstate validate return of any attempt to update EPP limits,
    which fixes the masking hardware problems (Mario Limonciello).
 
  - Move the calculation of the AMD boost numerator outside of amd-pstate,
    correcting acpi-cpufreq on systems with preferred cores (Mario
    Limonciello).
 
  - Harden preferred core detection in amd-pstate to avoid potential
    false positives (Mario Limonciello).
 
  - Add extra unit test coverage for mode state machine (Mario
    Limonciello).
 
  - Fix an "Uninitialized variables" issue in amd-pstste (Qianqiang Liu).
 
  - Add Granite Rapids Xeon support to intel_idle (Artem Bityutskiy).
 
  - Disable promotion to C1E on Jasper Lake and Elkhart Lake in
    intel_idle (Kai-Heng Feng).
 
  - Use scoped device node handling to fix missing of_node_put() and
    simplify walking OF children in the riscv-sbi cpuidle driver (Krzysztof
    Kozlowski).
 
  - Remove dead code from cpuidle_enter_state() (Dhruva Gole).
 
  - Change an error pointer to NULL to fix error handling in the
    intel_rapl power capping driver (Dan Carpenter).
 
  - Fix off by one in get_rpi() in the intel_rapl power capping
    driver (Dan Carpenter).
 
  - Add support for ArrowLake-U to the intel_rapl power capping
    driver (Sumeet Pawnikar).
 
  - Fix the energy-pkg event for AMD CPUs in the intel_rapl power capping
    driver (Dhananjay Ugwekar).
 
  - Add support for AMD family 1Ah processors to the intel_rapl power
    capping driver (Dhananjay Ugwekar).
 
  - Remove unused stub for saveable_highmem_page() and remove deprecated
    macros from power management documentation (Andy Shevchenko).
 
  - Use ysfs_emit() and sysfs_emit_at() in "show" functions in the PM
    sysfs interface (Xueqin Luo).
 
  - Update the maintainers information for the operating-points-v2-ti-cpu DT
    binding (Dhruva Gole).
 
  - Drop unnecessary of_match_ptr() from ti-opp-supply (Rob Herring).
 
  - Add missing MODULE_DESCRIPTION() macros to devfreq governors (Jeff
    Johnson).
 
  - Use devm_clk_get_enabled() in the exynos-bus devfreq driver (Anand
    Moon).
 
  - Use of_property_present() instead of of_get_property() in the imx-bus
    devfreq driver (Rob Herring).
 
  - Update directory handling and installation process in the pm-graph
    Makefile and add .gitignore to ignore sleepgraph.py artifacts to
    pm-graph (Amit Vadhavana, Yo-Jung Lin).
 
  - Make cpupower display residency value in idle-info (Aboorva
    Devarajan).
 
  - Add missing powercap_set_enabled() stub function to cpupower (John
    B. Wyatt IV).
 
  - Add SWIG support to cpupower (John B. Wyatt IV).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmbjKEQSHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRx8g8P/1RqL6NuCxH4eobwZigeyBS6/sLHPmKo
 wqHcerZsU7EH8DOlmBU0SH1Br2WBQAbaP8d1ukT5qkGBrZ+IM/A2ipZct0yAHH2D
 aBKwg7V3LvXo2mPuLve0knpM6W7zibPHJJlcjh8DmGQJabhWO7jr+p/0eS4JE2ek
 iE5FCXTxhvbcNJ9yWSt7+3HHmvj74P81As7txysLSzhWSZDcqXb0XJRgVJnWDt+x
 OyTAMEEAY2BuqmijHzqxxHcA1fxOBK/pa9yfPdKP7ePynLnpP7xd9A5oLbXQ4BL9
 PHqpD06ZBdSMQzKkyCODypZt8PL+FcEALE4u9chV/nzVwp7TrtDneXWA7RA0GXgq
 mp9hm51GmdptRayePR3s4TCA6a2BUw3Ue4fgs6XF/bexNpc3nx0wtP8HEevcuy8q
 Z7XQkpqW942vOohfoN42JwTjfDJhYTwSH3dcIY8UghHtzwZ5YKV1M4f97kNR7V2i
 QLJvaGJ5yTTcaHndkpc4EKknPyLRaWPh8h/yVmMRBcAaGBWaImul3a5NI07f0wLM
 LTenlpEcls7WSu9n3uvFXvT7nSS2CBV0huTbg449X4T2J0T6EooYsVuHNsFMNFLy
 Xm3lUtdm5QjAXFf+azOCO+26XQt8wObC0ttZtCC2j1b8D+9Riuwh5QHLr99rRTzn
 7Ic4U5Lkimzx
 =JM+K
 -----END PGP SIGNATURE-----

Merge tag 'pm-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "By the number of new lines of code, the most visible change here is
  the addition of hybrid CPU capacity scaling support to the
  intel_pstate driver. Next are the amd-pstate driver changes related to
  the calculation of the AMD boost numerator and preferred core
  detection.

  As far as new hardware support is concerned, the intel_idle driver
  will now handle Granite Rapids Xeon processors natively, the
  intel_rapl power capping driver will recognize family 1Ah of AMD
  processors and Intel ArrowLake-U chipos, and intel_pstate will handle
  Granite Rapids and Sierra Forest chips in the out-of-band (OOB) mode.

  Apart from the above, there is a usual collection of assorted fixes
  and code cleanups in many places and there are tooling updates.

  Specifics:

   - Remove LATENCY_MULTIPLIER from cpufreq (Qais Yousef)

   - Add support for Granite Rapids and Sierra Forest in OOB mode to the
     intel_pstate cpufreq driver (Srinivas Pandruvada)

   - Add basic support for CPU capacity scaling on x86 and make the
     intel_pstate driver set asymmetric CPU capacity on hybrid systems
     without SMT (Rafael Wysocki)

   - Add missing MODULE_DESCRIPTION() macros to the powerpc cpufreq
     driver (Jeff Johnson)

   - Several OF related cleanups in cpufreq drivers (Rob Herring)

   - Enable COMPILE_TEST for ARM drivers (Rob Herrring)

   - Introduce quirks for syscon failures and use socinfo to get
     revision for TI cpufreq driver (Dhruva Gole, Nishanth Menon)

   - Minor cleanups in amd-pstate driver (Anastasia Belova, Dhananjay
     Ugwekar)

   - Minor cleanups for loongson, cpufreq-dt and powernv cpufreq drivers
     (Danila Tikhonov, Huacai Chen, and Liu Jing)

   - Make amd-pstate validate return of any attempt to update EPP
     limits, which fixes the masking hardware problems (Mario
     Limonciello)

   - Move the calculation of the AMD boost numerator outside of
     amd-pstate, correcting acpi-cpufreq on systems with preferred cores
     (Mario Limonciello)

   - Harden preferred core detection in amd-pstate to avoid potential
     false positives (Mario Limonciello)

   - Add extra unit test coverage for mode state machine (Mario
     Limonciello)

   - Fix an "Uninitialized variables" issue in amd-pstste (Qianqiang
     Liu)

   - Add Granite Rapids Xeon support to intel_idle (Artem Bityutskiy)

   - Disable promotion to C1E on Jasper Lake and Elkhart Lake in
     intel_idle (Kai-Heng Feng)

   - Use scoped device node handling to fix missing of_node_put() and
     simplify walking OF children in the riscv-sbi cpuidle driver
     (Krzysztof Kozlowski)

   - Remove dead code from cpuidle_enter_state() (Dhruva Gole)

   - Change an error pointer to NULL to fix error handling in the
     intel_rapl power capping driver (Dan Carpenter)

   - Fix off by one in get_rpi() in the intel_rapl power capping driver
     (Dan Carpenter)

   - Add support for ArrowLake-U to the intel_rapl power capping driver
     (Sumeet Pawnikar)

   - Fix the energy-pkg event for AMD CPUs in the intel_rapl power
     capping driver (Dhananjay Ugwekar)

   - Add support for AMD family 1Ah processors to the intel_rapl power
     capping driver (Dhananjay Ugwekar)

   - Remove unused stub for saveable_highmem_page() and remove
     deprecated macros from power management documentation (Andy
     Shevchenko)

   - Use ysfs_emit() and sysfs_emit_at() in "show" functions in the PM
     sysfs interface (Xueqin Luo)

   - Update the maintainers information for the
     operating-points-v2-ti-cpu DT binding (Dhruva Gole)

   - Drop unnecessary of_match_ptr() from ti-opp-supply (Rob Herring)

   - Add missing MODULE_DESCRIPTION() macros to devfreq governors (Jeff
     Johnson)

   - Use devm_clk_get_enabled() in the exynos-bus devfreq driver (Anand
     Moon)

   - Use of_property_present() instead of of_get_property() in the
     imx-bus devfreq driver (Rob Herring)

   - Update directory handling and installation process in the pm-graph
     Makefile and add .gitignore to ignore sleepgraph.py artifacts to
     pm-graph (Amit Vadhavana, Yo-Jung Lin)

   - Make cpupower display residency value in idle-info (Aboorva
     Devarajan)

   - Add missing powercap_set_enabled() stub function to cpupower (John
     B. Wyatt IV)

   - Add SWIG support to cpupower (John B. Wyatt IV)"

* tag 'pm-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (62 commits)
  cpufreq/amd-pstate-ut: Fix an "Uninitialized variables" issue
  cpufreq/amd-pstate-ut: Add test case for mode switches
  cpufreq/amd-pstate: Export symbols for changing modes
  amd-pstate: Add missing documentation for `amd_pstate_prefcore_ranking`
  cpufreq: amd-pstate: Add documentation for `amd_pstate_hw_prefcore`
  cpufreq: amd-pstate: Optimize amd_pstate_update_limits()
  cpufreq: amd-pstate: Merge amd_pstate_highest_perf_set() into amd_get_boost_ratio_numerator()
  x86/amd: Detect preferred cores in amd_get_boost_ratio_numerator()
  x86/amd: Move amd_get_highest_perf() out of amd-pstate
  ACPI: CPPC: Adjust debug messages in amd_set_max_freq_ratio() to warn
  ACPI: CPPC: Drop check for non zero perf ratio
  x86/amd: Rename amd_get_highest_perf() to amd_get_boost_ratio_numerator()
  ACPI: CPPC: Adjust return code for inline functions in !CONFIG_ACPI_CPPC_LIB
  x86/amd: Move amd_get_highest_perf() from amd.c to cppc.c
  PM: hibernate: Remove unused stub for saveable_highmem_page()
  pm:cpupower: Add error warning when SWIG is not installed
  MAINTAINERS: Add Maintainers for SWIG Python bindings
  pm:cpupower: Include test_raw_pylibcpupower.py
  pm:cpupower: Add SWIG bindings files for libcpupower
  pm:cpupower: Add missing powercap_set_enabled() stub function
  ...
2024-09-16 07:47:50 +02:00
Linus Torvalds
114143a595 arm64 updates for 6.12
ACPI:
 * Enable PMCG erratum workaround for HiSilicon HIP10 and 11 platforms.
 * Ensure arm64-specific IORT header is covered by MAINTAINERS.
 
 CPU Errata:
 * Enable workaround for hardware access/dirty issue on Ampere-1A cores.
 
 Memory management:
 * Define PHYSMEM_END to fix a crash in the amdgpu driver.
 * Avoid tripping over invalid kernel mappings on the kexec() path.
 * Userspace support for the Permission Overlay Extension (POE) using
   protection keys.
 
 Perf and PMUs:
 * Add support for the "fixed instruction counter" extension in the CPU
   PMU architecture.
 * Extend and fix the event encodings for Apple's M1 CPU PMU.
 * Allow LSM hooks to decide on SPE permissions for physical profiling.
 * Add support for the CMN S3 and NI-700 PMUs.
 
 Confidential Computing:
 * Add support for booting an arm64 kernel as a protected guest under
   Android's "Protected KVM" (pKVM) hypervisor.
 
 Selftests:
 * Fix vector length issues in the SVE/SME sigreturn tests
 * Fix build warning in the ptrace tests.
 
 Timers:
 * Add support for PR_{G,S}ET_TSC so that 'rr' can deal with
   non-determinism arising from the architected counter.
 
 Miscellaneous:
 * Rework our IPI-based CPU stopping code to try NMIs if regular IPIs
   don't succeed.
 * Minor fixes and cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmbkVNEQHHdpbGxAa2Vy
 bmVsLm9yZwAKCRC3rHDchMFjNKeIB/9YtbN7JMgsXktM94GP03r3tlFF36Y1S51S
 +zdDZclAVZCTCZN+PaFeAZ/+ah2EQYrY6rtDoHUSEMQdF9kH+ycuIPDTwaJ4Qkam
 QKXMpAgtY/4yf2rX4lhDF8rEvkhLDsu7oGDhqUZQsA33GrMBHfgA3oqpYwlVjvGq
 gkm7olTo9LdWAxkPpnjGrjB6Mv5Dq8dJRhW+0Q5AntI5zx3RdYGJZA9GUSzyYCCt
 FIYOtMmWPkQ0kKxIVxOxAOm/ubhfyCs2sjSfkaa3vtvtt+Yjye1Xd81rFciIbPgP
 QlK/Mes2kBZmjhkeus8guLI5Vi7tx3DQMkNqLXkHAAzOoC4oConE
 =6osL
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Will Deacon:
 "The highlights are support for Arm's "Permission Overlay Extension"
  using memory protection keys, support for running as a protected guest
  on Android as well as perf support for a bunch of new interconnect
  PMUs.

  Summary:

  ACPI:
   - Enable PMCG erratum workaround for HiSilicon HIP10 and 11
     platforms.
   - Ensure arm64-specific IORT header is covered by MAINTAINERS.

  CPU Errata:
   - Enable workaround for hardware access/dirty issue on Ampere-1A
     cores.

  Memory management:
   - Define PHYSMEM_END to fix a crash in the amdgpu driver.
   - Avoid tripping over invalid kernel mappings on the kexec() path.
   - Userspace support for the Permission Overlay Extension (POE) using
     protection keys.

  Perf and PMUs:
   - Add support for the "fixed instruction counter" extension in the
     CPU PMU architecture.
   - Extend and fix the event encodings for Apple's M1 CPU PMU.
   - Allow LSM hooks to decide on SPE permissions for physical
     profiling.
   - Add support for the CMN S3 and NI-700 PMUs.

  Confidential Computing:
   - Add support for booting an arm64 kernel as a protected guest under
     Android's "Protected KVM" (pKVM) hypervisor.

  Selftests:
   - Fix vector length issues in the SVE/SME sigreturn tests
   - Fix build warning in the ptrace tests.

  Timers:
   - Add support for PR_{G,S}ET_TSC so that 'rr' can deal with
     non-determinism arising from the architected counter.

  Miscellaneous:
   - Rework our IPI-based CPU stopping code to try NMIs if regular IPIs
     don't succeed.
   - Minor fixes and cleanups"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (94 commits)
  perf: arm-ni: Fix an NULL vs IS_ERR() bug
  arm64: hibernate: Fix warning for cast from restricted gfp_t
  arm64: esr: Define ESR_ELx_EC_* constants as UL
  arm64: pkeys: remove redundant WARN
  perf: arm_pmuv3: Use BR_RETIRED for HW branch event if enabled
  MAINTAINERS: List Arm interconnect PMUs as supported
  perf: Add driver for Arm NI-700 interconnect PMU
  dt-bindings/perf: Add Arm NI-700 PMU
  perf/arm-cmn: Improve format attr printing
  perf/arm-cmn: Clean up unnecessary NUMA_NO_NODE check
  arm64/mm: use lm_alias() with addresses passed to memblock_free()
  mm: arm64: document why pte is not advanced in contpte_ptep_set_access_flags()
  arm64: Expose the end of the linear map in PHYSMEM_END
  arm64: trans_pgd: mark PTEs entries as valid to avoid dead kexec()
  arm64/mm: Delete __init region from memblock.reserved
  perf/arm-cmn: Support CMN S3
  dt-bindings: perf: arm-cmn: Add CMN S3
  perf/arm-cmn: Refactor DTC PMU register access
  perf/arm-cmn: Make cycle counts less surprising
  perf/arm-cmn: Improve build-time assertion
  ...
2024-09-16 06:55:07 +02:00
Linus Torvalds
85ffc6e4ed This update includes the following changes:
API:
 
 - Make self-test asynchronous.
 
 Algorithms:
 
 - Remove MPI functions added for SM3.
 - Add allocation error checks to remaining MPI functions (introduced for SM3).
 - Set default Jitter RNG OSR to 3.
 
 Drivers:
 
 - Add hwrng driver for Rockchip RK3568 SoC.
 - Allow disabling SR-IOV VFs through sysfs in qat.
 - Fix device reset bugs in hisilicon.
 - Fix authenc key parsing by using generic helper in octeontx*.
 
 Others:
 
 - Fix xor benchmarking on parisc.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEn51F/lCuNhUwmDeSxycdCkmxi6cFAmbnq/wACgkQxycdCkmx
 i6cyXw//cBgngKOuCv7tLqMPeSLC39jDJEKkP9tS9ZilYyxzg1b9cbnDLlKNk4Yq
 4A6rRqqh8PD3/yJT58pGXaU5Is5sVMQRqqgwFutXrkD+hsMLk2nlgzsWYhg6aUsY
 /THTfmKTwEgfc3qDLZq6xGOShmMdO6NiOGsH3MeEWhbgfrDuJlOkHXd7QncNa7q8
 NEP7kI3vBc0xFcOxzbjy8tSGYEmPft1LECXAKsgOycWj9Q0SkzPocmo59iSbM21b
 HfV0p3hgAEa5VgKv0Rc5/6PevAqJqOSjGNfRBSPZ97o7dop8sl/z/cOWiy8dM7wO
 xhd9g7XXtmML6UO2MpJPMJzsLgMsjmUTWO2UyEpIyst6RVfJlniOL/jGzWmZ/P2+
 vw/F/mX8k60Zv1du46PC3p6eBeH4Yx/2fEPvPTJus+DQHS9GchXtAKtMToyyUHc2
 6TAy0nOihVQK2Q3QuQ1B/ghQS4tkdOenOIYHSCf9a9nJamub+PqP8jWDw0Y2RcY6
 jSs+tk6hwHJaKnj/T/Mr0gVPX9L8KHCYBtZD7Qbr0NhoXOT6w47m6bbew/dzTN+0
 pmFsgz32fNm8vb8R8D0kZDF63s6uz6CN+P9Dx6Tab4X+87HxNdeaBPS/Le9tYgOC
 0MmE5oIquooqedpM5tW55yuyOHhLPGLQS2SDiA+Ke+WYbAC8SQc=
 =rG1X
 -----END PGP SIGNATURE-----

Merge tag 'v6.12-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6

Pull crypto update from Herbert Xu"
 "API:
   - Make self-test asynchronous

  Algorithms:
   - Remove MPI functions added for SM3
   - Add allocation error checks to remaining MPI functions (introduced
     for SM3)
   - Set default Jitter RNG OSR to 3

  Drivers:
   - Add hwrng driver for Rockchip RK3568 SoC
   - Allow disabling SR-IOV VFs through sysfs in qat
   - Fix device reset bugs in hisilicon
   - Fix authenc key parsing by using generic helper in octeontx*

  Others:
   - Fix xor benchmarking on parisc"

* tag 'v6.12-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (96 commits)
  crypto: n2 - Set err to EINVAL if snprintf fails for hmac
  crypto: camm/qi - Use ERR_CAST() to return error-valued pointer
  crypto: mips/crc32 - Clean up useless assignment operations
  crypto: qcom-rng - rename *_of_data to *_match_data
  crypto: qcom-rng - fix support for ACPI-based systems
  dt-bindings: crypto: qcom,prng: document support for SA8255p
  crypto: aegis128 - Fix indentation issue in crypto_aegis128_process_crypt()
  crypto: octeontx* - Select CRYPTO_AUTHENC
  crypto: testmgr - Hide ENOENT errors
  crypto: qat - Remove trailing space after \n newline
  crypto: hisilicon/sec - Remove trailing space after \n newline
  crypto: algboss - Pass instance creation error up
  crypto: api - Fix generic algorithm self-test races
  crypto: hisilicon/qm - inject error before stopping queue
  crypto: hisilicon/hpre - mask cluster timeout error
  crypto: hisilicon/qm - reset device before enabling it
  crypto: hisilicon/trng - modifying the order of header files
  crypto: hisilicon - add a lock for the qp send operation
  crypto: hisilicon - fix missed error branch
  crypto: ccp - do not request interrupt on cmd completion when irqs disabled
  ...
2024-09-16 06:28:28 +02:00
Linus Torvalds
9410645520 Networking changes for 6.12.
The zero-copy changes are relatively significant, but regression risk
 should be contained. The feature needs to be used to cause trouble.
 The new code did trigger a PowerPC64 bug with GCC 14:
 
   https://lore.kernel.org/netdev/20240913125302.0a06b4c7@canb.auug.org.au/
 
 a fix for which Michael will bring via his tree:
 
   https://lore.kernel.org/all/87jzffq9ge.fsf@mail.lhotse/
 
 Unideal, not sure if you'll be willing to pull without that fix but
 since we caught this recently I figured we'll defer to you during
 the MW instead of trying to fix it cross-tree.
 
 Also it feels like we got an order of magnitude more semi-automated
 "refactoring" chaff than usual, I wonder if it's just us.
 
 Core & protocols
 ----------------
 
  - Support Device Memory TCP, ability to zero-copy receive TCP payloads
    to a DMABUF region of memory while packet headers land separately
    in normal kernel buffers, and TCP processes then as usual.
 
  - The ability to read the PTP PHC (Physical Hardware Clock) alongside
    MONOTONIC_RAW timestamps with PTP_SYS_OFFSET_EXTENDED. Previously
    only CLOCK_REALTIME was supported.
 
  - Allow matching on all bits of IP DSCP for routing decisions.
    Previously we only supported on matching TOS bits in IPv4 which
    is a narrower interpretation of the same header field.
 
  - Increase the range of weights used for multi-path routing from
    8 bits to 16 bits.
 
  - Add support for IPv6 PIO p flag in the Prefix Information Option
    per draft-ietf-6man-pio-pflag.
 
  - IPv6 IOAM6 support for new tunsrc encap mode for better performance.
 
  - Detect destinations which blackhole MPTCP traffic and avoid initiating
    MPTCP connections to them for a certain period of time, 1h by default.
 
  - Improve IPsec control path performance by removing the inexact
    policies list.
 
  - AF_VSOCK: add support for SIOCOUTQ ioctl.
 
  - Add enum for reasons TCP reset was sent for easier tracing.
 
  - Add SMC ringbufs usage statistics.
 
 Drivers
 -------
 
  - Handle netconsole setup failures more gracefully, don't fail loading,
    retain the specified target as disabled.
 
  - Extend bonding's IPsec offload pass thru capabilities (ESN, stats).
 
 Filtering
 ---------
 
  - Add TCP_BPF_SOCK_OPS_CB_FLAGS to bpf_*sockopt() to address the case
    when long-lived sockets miss a chance to set additional callbacks
    if a sockops program was not attached early in their lifetime.
 
  - Support using BPF skb helpers in tracepoints.
 
  - Conntrack Netlink: support CTA_FILTER for flush.
 
  - Improve SCTP support in nfnetlink_queue.
 
  - Improve performance of large nftables flush transactions.
 
 Things we sprinkled into general kernel code
 --------------------------------------------
 
  - selftests: support setting an "interpreter" for script files;
    make it easy to run as separate cases tests where one "interpreter"
    is fed various test descriptions (in our case packet sequences).
 
 Driver API
 ----------
 
  - Extend core and ethtool APIs to support many PHYs connected to a single
    interface (PHY topologies).
 
  - Extend cable diagnostics to specify whether Time Domain Reflectometry
    (TDR) or Active Link Cable Diagnostic (ALCD) was used.
 
  - Add library for implementing MAC-PHY Ethernet drivers for SPI devices
    compatible with Open Alliance 10BASE-T1x MAC-PHY Serial Interface (TC6)
    standard.
 
  - Add helpers to the PHY framework, for PHYs following the Open Alliance
    standards:
    - 1000BaseT1 link settings
    - cable test and diagnostics
 
  - Support listing / dumping all allocated RSS contexts.
 
  - Add configuration for frequency Embedded SYNC in DPLL, which magically
    embeds sync pulses into Ethernet signaling.
 
 Device drivers
 --------------
 
  - Ethernet high-speed NICs:
    - Broadcom (bnxt):
      - use better FW APIs for queue reset
      - support QOS and TPID settings for the SR-IOV VLAN
      - support dynamic MSI-X allocation
    - Intel (100G, ice, idpf):
      - ice: support PCIe subfunctions
      - iavf: add support for TC U32 filters on VFs
      - ice: support Embedded SYNC in DPLL
    - nVidia/Mellanox (mlx5):
      - support HW managed steering tables
      - support PCIe PTM cross timestamping
    - AMD/Pensando:
      - ionic: use page_pool to increase Rx performance
    - Cisco (enic):
      - report per-queue statistics
 
  - Ethernet virtual:
    - Microsoft vNIC:
      - mana: support configuring ring length
      - netvsc: enable more channels on systems with many CPUs
    - IBM veth:
      - optimize polling to improve TCP_RR performance
      - optimize performance of Tx handling
    - VirtIO net:
      - synchronize the operstate with the admin state to allow a lower
        virtio-net to propagate the link status to an upper device like
        macvlan
 
  - Ethernet NICs consumer, and embedded:
    - Add driver for Realtek automotive PCIe devices (RTL9054, RTL9068,
      RTL9072, RTL9075, RTL9068, RTL9071)
    - Add driver for Microchip LAN8650/1 10BASE-T1S MAC-PHY.
    - Microchip:
      - lan743x: use phylink - support WOL, EEE, pause, link settings
      - add Wake-on-LAN support for KSZ87xx family
      - add KSZ8895/KSZ8864 switch support
      - factor out FDMA code and use it in sparx5 and lan966x
        (including DCB support in both)
    - Synopsys (stmmac):
      - support frame preemption (configured using TC and ethtool)
      - support Loongson DWMAC (GMAC v3.73)
      - support RockChips RK3576 DWMAC
    - TI:
      - am65-cpsw: add multi queue RX support
      - icssg-prueth: HSR offload support
    - Cadence (macb):
      - enable software (hrtimer based) IRQ coalescing by default
    - Xilinx (axinet):
      - expose HW statistics
      - improve multicast filtering
      - relax Rx checksum offload constraints
    - MediaTek:
      - mt7530: add EN7581 support
    - Aspeed (ftgmac100):
      - report link speed and duplex
    - Intel:
      - igc: add mqprio offload
      - igc: report EEE configuration
    - RealTek (r8169):
      - add support for RTL8126A rev.b
    - Vitesse (vsc73xx):
      - implement FDB add/del/dump operations
    - Freescale (fs_enet):
      - use phylink
 
  - Ethernet PHYs:
    - vitesse: implement downshift and MDI-X in vsc73xx PHYs
    - microchip: support LAN887x, supporting IEEE 802.3bw (100BASE-T1)
      and IEEE 802.3bp (1000BASE-T1) specifications
    - add Applied Micro QT2025 PHY driver (in Rust)
    - add Motorcomm yt8821 2.5G Ethernet PHY driver
 
  - CAN:
    - add driver for Rockchip RK3568 CAN-FD controller
    - flexcan: add wakeup support for imx95
    - kvaser_usb: set hardware timestamp on transmitted packets
 
  - WiFi:
    - mac80211/cfg80211:
      - EHT rate support in AQL airtime fairness
      - handle DFS (radar detection) per link in Multi-Link Operation
    - RealTek (rtw89):
      - support RTL8852BT and 8852BE-VT (WiFi 6)
      - support hardware rfkill
      - support HW encryption in unicast management frames
      - support Wake-on-WLAN with supported network detection
    - RealTek (rtw89):
      - improve Rx performance by using USB frame aggregation
      - support USB 3 with RTL8822CU/RTL8822BU
    - Intel (iwlwifi/mvm):
      - offload RLC/SMPS functionality to firmware
    - Marvell (mwifiex):
      - add host based MLME to enable WPA3
 
  - Bluetooth:
    - add support for Amlogic HCI UART protocol
    - add support for ISO data/packets to Intel and NXP drivers
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIyBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmbnFW0ACgkQMUZtbf5S
 IrvA8A/4yxw9SFLFZVFn2c1kRssssSUENAljnP29MaINjr74BT2B324e5V5xiCK/
 yT+hr9M+mlFDZVlZYAxo7Z64X6EwmjXewaH+2/tIsZf9LFySnkNq3sCxCuZWQNtE
 WjVdT/t+7rS8sGQefSggchXrSqZg1Rw/oCI3cKjQl8jB/CvDs7n1ivjtNz409jHy
 MKvcvf4cfG/olN0SnXh8kHHmz4d1rnPOi2OmC/dNAU8ErcDgC1t7PmMAzTfJWzND
 Akyxe4BvMkoKjL+kzIdpaf6EoLjUENPqu9/KKseP37HtYZmE4M0ENJOJnr7FVWwP
 GHymKwyp+VyI3RLNPIWrMJyCOwyUg4n4N44tGDn5bC3fYi1qK7U14pTP1vSZfXsK
 K8D6kpkVNllTLvf2z+FbweHu6CSh87vgdt1p7aNKpkEO0jISJBDFxLAen1buayKt
 9VYXclcM7ZdjDd6w/53woieYizNeV10L5917htJCh/BbQ+XM0IjDR9wiJuj3aZ1s
 BrmsTK/7VuKxJ4LQKFkWnqnB02/GUHDbGVQoQCUBF7uaSPcPv4FWW6ibqIUz8zq5
 HyGFOIL1Lc/J4s7D3mvAEhs6AKcVd9eU29TIcgLAUFyAYvSq7Y50ZeFtZrCysv2y
 Uy43qagPl4jKcFlHCriD2b/vFHttppL1ijLs2bvydMQkhY9eoQ==
 =ZEaS
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "The zero-copy changes are relatively significant, but regression risk
  should be contained. The feature needs to be used to cause trouble.

  Also it feels like we got an order of magnitude more semi-automated
  "refactoring" chaff than usual, I wonder if it's just us.

  Core & protocols:

   - Support Device Memory TCP, ability to zero-copy receive TCP
     payloads to a DMABUF region of memory while packet headers land
     separately in normal kernel buffers, and TCP processes then as
     usual.

   - The ability to read the PTP PHC (Physical Hardware Clock) alongside
     MONOTONIC_RAW timestamps with PTP_SYS_OFFSET_EXTENDED. Previously
     only CLOCK_REALTIME was supported.

   - Allow matching on all bits of IP DSCP for routing decisions.
     Previously we only supported on matching TOS bits in IPv4 which is
     a narrower interpretation of the same header field.

   - Increase the range of weights used for multi-path routing from
     8 bits to 16 bits.

   - Add support for IPv6 PIO p flag in the Prefix Information Option
     per draft-ietf-6man-pio-pflag.

   - IPv6 IOAM6 support for new tunsrc encap mode for better
     performance.

   - Detect destinations which blackhole MPTCP traffic and avoid
     initiating MPTCP connections to them for a certain period of time,
     1h by default.

   - Improve IPsec control path performance by removing the inexact
     policies list.

   - AF_VSOCK: add support for SIOCOUTQ ioctl.

   - Add enum for reasons TCP reset was sent for easier tracing.

   - Add SMC ringbufs usage statistics.

  Drivers:

   - Handle netconsole setup failures more gracefully, don't fail
     loading, retain the specified target as disabled.

   - Extend bonding's IPsec offload pass thru capabilities (ESN, stats).

  Filtering:

   - Add TCP_BPF_SOCK_OPS_CB_FLAGS to bpf_*sockopt() to address the case
     when long-lived sockets miss a chance to set additional callbacks
     if a sockops program was not attached early in their lifetime.

   - Support using BPF skb helpers in tracepoints.

   - Conntrack Netlink: support CTA_FILTER for flush.

   - Improve SCTP support in nfnetlink_queue.

   - Improve performance of large nftables flush transactions.

  Things we sprinkled into general kernel code:

   - selftests: support setting an "interpreter" for script files; make
     it easy to run as separate cases tests where one "interpreter" is
     fed various test descriptions (in our case packet sequences).

  Driver API:

   - Extend core and ethtool APIs to support many PHYs connected to a
     single interface (PHY topologies).

   - Extend cable diagnostics to specify whether Time Domain
     Reflectometry (TDR) or Active Link Cable Diagnostic (ALCD) was
     used.

   - Add library for implementing MAC-PHY Ethernet drivers for SPI
     devices compatible with Open Alliance 10BASE-T1x MAC-PHY Serial
     Interface (TC6) standard.

   - Add helpers to the PHY framework, for PHYs following the Open
     Alliance standards:
       - 1000BaseT1 link settings
       - cable test and diagnostics

   - Support listing / dumping all allocated RSS contexts.

   - Add configuration for frequency Embedded SYNC in DPLL, which
     magically embeds sync pulses into Ethernet signaling.

  Device drivers:

   - Ethernet high-speed NICs:
      - Broadcom (bnxt):
         - use better FW APIs for queue reset
         - support QOS and TPID settings for the SR-IOV VLAN
         - support dynamic MSI-X allocation
      - Intel (100G, ice, idpf):
         - ice: support PCIe subfunctions
         - iavf: add support for TC U32 filters on VFs
         - ice: support Embedded SYNC in DPLL
      - nVidia/Mellanox (mlx5):
         - support HW managed steering tables
         - support PCIe PTM cross timestamping
      - AMD/Pensando:
         - ionic: use page_pool to increase Rx performance
      - Cisco (enic):
         - report per-queue statistics

   - Ethernet virtual:
      - Microsoft vNIC:
         - mana: support configuring ring length
         - netvsc: enable more channels on systems with many CPUs
      - IBM veth:
         - optimize polling to improve TCP_RR performance
         - optimize performance of Tx handling
      - VirtIO net:
         - synchronize the operstate with the admin state to allow a
           lower virtio-net to propagate the link status to an upper
           device like macvlan

   - Ethernet NICs consumer, and embedded:
      - Add driver for Realtek automotive PCIe devices (RTL9054,
        RTL9068, RTL9072, RTL9075, RTL9068, RTL9071)
      - Add driver for Microchip LAN8650/1 10BASE-T1S MAC-PHY.
      - Microchip:
         - lan743x: use phylink - support WOL, EEE, pause, link settings
         - add Wake-on-LAN support for KSZ87xx family
         - add KSZ8895/KSZ8864 switch support
         - factor out FDMA code and use it in sparx5 and lan966x
           (including DCB support in both)
      - Synopsys (stmmac):
         - support frame preemption (configured using TC and ethtool)
         - support Loongson DWMAC (GMAC v3.73)
         - support RockChips RK3576 DWMAC
      - TI:
         - am65-cpsw: add multi queue RX support
         - icssg-prueth: HSR offload support
      - Cadence (macb):
         - enable software (hrtimer based) IRQ coalescing by default
      - Xilinx (axinet):
         - expose HW statistics
         - improve multicast filtering
         - relax Rx checksum offload constraints
      - MediaTek:
         - mt7530: add EN7581 support
      - Aspeed (ftgmac100):
         - report link speed and duplex
      - Intel:
         - igc: add mqprio offload
         - igc: report EEE configuration
      - RealTek (r8169):
         - add support for RTL8126A rev.b
      - Vitesse (vsc73xx):
         - implement FDB add/del/dump operations
      - Freescale (fs_enet):
         - use phylink

   - Ethernet PHYs:
      - vitesse: implement downshift and MDI-X in vsc73xx PHYs
      - microchip: support LAN887x, supporting IEEE 802.3bw (100BASE-T1)
        and IEEE 802.3bp (1000BASE-T1) specifications
      - add Applied Micro QT2025 PHY driver (in Rust)
      - add Motorcomm yt8821 2.5G Ethernet PHY driver

   - CAN:
      - add driver for Rockchip RK3568 CAN-FD controller
      - flexcan: add wakeup support for imx95
      - kvaser_usb: set hardware timestamp on transmitted packets

   - WiFi:
      - mac80211/cfg80211:
         - EHT rate support in AQL airtime fairness
         - handle DFS (radar detection) per link in Multi-Link Operation
      - RealTek (rtw89):
         - support RTL8852BT and 8852BE-VT (WiFi 6)
         - support hardware rfkill
         - support HW encryption in unicast management frames
         - support Wake-on-WLAN with supported network detection
      - RealTek (rtw89):
         - improve Rx performance by using USB frame aggregation
         - support USB 3 with RTL8822CU/RTL8822BU
      - Intel (iwlwifi/mvm):
         - offload RLC/SMPS functionality to firmware
      - Marvell (mwifiex):
         - add host based MLME to enable WPA3

   - Bluetooth:
      - add support for Amlogic HCI UART protocol
      - add support for ISO data/packets to Intel and NXP drivers"

* tag 'net-next-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1303 commits)
  net/mlx5: HWS, check the correct variable in hws_send_ring_alloc_sq()
  netfilter: nft_socket: Fix a NULL vs IS_ERR() bug in nft_socket_cgroup_subtree_level()
  ice: Fix a NULL vs IS_ERR() check in probe()
  ice: Fix a couple NULL vs IS_ERR() bugs
  net: ethernet: fs_enet: Make the per clock optional
  net: ti: icssg-prueth: Add multicast filtering support in HSR mode
  net: ti: icssg-prueth: Enable HSR Tx duplication, Tx Tag and Rx Tag offload
  net: ti: icssg-prueth: Add support for HSR frame forward offload
  net: ti: icssg-prueth: Stop hardcoding def_inc
  net: ti: icss-iep: Move icss_iep structure
  net: ibm: emac: get rid of wol_irq
  net: ibm: emac: remove all waiting code
  net: ibm: emac: replace of_get_property
  net: ibm: emac: use netdev's phydev directly
  net: ibm: emac: use devm for register_netdev
  net: ibm: emac: remove mii_bus with devm
  net: ibm: emac: use devm for of_iomap
  net: ibm: emac: manage emac_irq with devm
  net: ibm: emac: use devm for alloc_etherdev
  octeontx2-af: debugfs: Add Channel info to RPM map
  ...
2024-09-16 06:02:27 +02:00
Hou Tao
986deb297d bpf: Call the missed kfree() when there is no special field in btf
Call the missed kfree() in btf_parse_struct_metas() when there is no
special field in btf, otherwise will get the following kmemleak report:

unreferenced object 0xffff888101033620 (size 8):
  comm "test_progs", pid 604, jiffies 4295127011
  ......
  backtrace (crc e77dc444):
    [<00000000186f90f3>] kmemleak_alloc+0x4b/0x80
    [<00000000ac8e9c4d>] __kmalloc_cache_noprof+0x2a1/0x310
    [<00000000d99d68d6>] btf_new_fd+0x72d/0xe90
    [<00000000f010b7f8>] __sys_bpf+0xec3/0x2410
    [<00000000e077ed6f>] __x64_sys_bpf+0x1f/0x30
    [<00000000a12f9e55>] x64_sys_call+0x199/0x9f0
    [<00000000f3029ea6>] do_syscall_64+0x3b/0xc0
    [<000000005640913a>] entry_SYSCALL_64_after_hwframe+0x4b/0x53

Fixes: 7a851ecb18 ("bpf: Search for kptrs in prog BTF structs")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20240912012845.3458483-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-13 16:51:08 -07:00
Hou Tao
87e9675a0d bpf: Call the missed btf_record_free() when map creation fails
When security_bpf_map_create() in map_create() fails, map_create() will
call btf_put() and ->map_free() callback to free the map. It doesn't
free the btf_record of map value, so add the missed btf_record_free()
when map creation fails.

However btf_record_free() needs to be called after ->map_free() just
like bpf_map_free_deferred() did, because ->map_free() may use the
btf_record to free the special fields in preallocated map value. So
factor out bpf_map_free() helper to free the map, btf_record, and btf
orderly and use the helper in both map_create() and
bpf_map_free_deferred().

Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20240912012845.3458483-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-13 16:51:08 -07:00
Daniel Borkmann
4b3786a6c5 bpf: Zero former ARG_PTR_TO_{LONG,INT} args in case of error
For all non-tracing helpers which formerly had ARG_PTR_TO_{LONG,INT} as input
arguments, zero the value for the case of an error as otherwise it could leak
memory. For tracing, it is not needed given CAP_PERFMON can already read all
kernel memory anyway hence bpf_get_func_arg() and bpf_get_func_ret() is skipped
in here.

Also, the MTU helpers mtu_len pointer value is being written but also read.
Technically, the MEM_UNINIT should not be there in order to always force init.
Removing MEM_UNINIT needs more verifier rework though: MEM_UNINIT right now
implies two things actually: i) write into memory, ii) memory does not have
to be initialized. If we lift MEM_UNINIT, it then becomes: i) read into memory,
ii) memory must be initialized. This means that for bpf_*_check_mtu() we're
readding the issue we're trying to fix, that is, it would then be able to
write back into things like .rodata BPF maps. Follow-up work will rework the
MEM_UNINIT semantics such that the intent can be better expressed. For now
just clear the *mtu_len on error path which can be lifted later again.

Fixes: 8a67f2de9b ("bpf: expose bpf_strtol and bpf_strtoul to all program types")
Fixes: d7a4cb9b67 ("bpf: Introduce bpf_strtol and bpf_strtoul helpers")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/e5edd241-59e7-5e39-0ee5-a51e31b6840a@iogearbox.net
Link: https://lore.kernel.org/r/20240913191754.13290-5-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-13 13:17:56 -07:00
Daniel Borkmann
18752d73c1 bpf: Improve check_raw_mode_ok test for MEM_UNINIT-tagged types
When checking malformed helper function signatures, also take other argument
types into account aside from just ARG_PTR_TO_UNINIT_MEM.

This concerns (formerly) ARG_PTR_TO_{INT,LONG} given uninitialized memory can
be passed there, too.

The func proto sanity check goes back to commit 435faee1aa ("bpf, verifier:
add ARG_PTR_TO_RAW_STACK type"), and its purpose was to detect wrong func protos
which had more than just one MEM_UNINIT-tagged type as arguments.

The reason more than one is currently not supported is as we mark stack slots with
STACK_MISC in check_helper_call() in case of raw mode based on meta.access_size to
allow uninitialized stack memory to be passed to helpers when they just write into
the buffer.

Probing for base type as well as MEM_UNINIT tagging ensures that other types do not
get missed (as it used to be the case for ARG_PTR_TO_{INT,LONG}).

Fixes: 57c3bb725a ("bpf: Introduce ARG_PTR_TO_{INT,LONG} arg types")
Reported-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20240913191754.13290-4-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-13 13:17:56 -07:00
Daniel Borkmann
32556ce93b bpf: Fix helper writes to read-only maps
Lonial found an issue that despite user- and BPF-side frozen BPF map
(like in case of .rodata), it was still possible to write into it from
a BPF program side through specific helpers having ARG_PTR_TO_{LONG,INT}
as arguments.

In check_func_arg() when the argument is as mentioned, the meta->raw_mode
is never set. Later, check_helper_mem_access(), under the case of
PTR_TO_MAP_VALUE as register base type, it assumes BPF_READ for the
subsequent call to check_map_access_type() and given the BPF map is
read-only it succeeds.

The helpers really need to be annotated as ARG_PTR_TO_{LONG,INT} | MEM_UNINIT
when results are written into them as opposed to read out of them. The
latter indicates that it's okay to pass a pointer to uninitialized memory
as the memory is written to anyway.

However, ARG_PTR_TO_{LONG,INT} is a special case of ARG_PTR_TO_FIXED_SIZE_MEM
just with additional alignment requirement. So it is better to just get
rid of the ARG_PTR_TO_{LONG,INT} special cases altogether and reuse the
fixed size memory types. For this, add MEM_ALIGNED to additionally ensure
alignment given these helpers write directly into the args via *<ptr> = val.
The .arg*_size has been initialized reflecting the actual sizeof(*<ptr>).

MEM_ALIGNED can only be used in combination with MEM_FIXED_SIZE annotated
argument types, since in !MEM_FIXED_SIZE cases the verifier does not know
the buffer size a priori and therefore cannot blindly write *<ptr> = val.

Fixes: 57c3bb725a ("bpf: Introduce ARG_PTR_TO_{INT,LONG} arg types")
Reported-by: Lonial Con <kongln9170@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20240913191754.13290-3-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-13 13:17:55 -07:00
Daniel Borkmann
7d71f59e02 bpf: Remove truncation test in bpf_strtol and bpf_strtoul helpers
Both bpf_strtol() and bpf_strtoul() helpers passed a temporary "long long"
respectively "unsigned long long" to __bpf_strtoll() / __bpf_strtoull().

Later, the result was checked for truncation via _res != ({unsigned,} long)_res
as the destination buffer for the BPF helpers was of type {unsigned,} long
which is 32bit on 32bit architectures.

Given the latter was a bug in the helper signatures where the destination buffer
got adjusted to {s,u}64, the truncation check can now be removed.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240913191754.13290-2-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-13 13:17:55 -07:00
Daniel Borkmann
cfe69c50b0 bpf: Fix bpf_strtol and bpf_strtoul helpers for 32bit
The bpf_strtol() and bpf_strtoul() helpers are currently broken on 32bit:

The argument type ARG_PTR_TO_LONG is BPF-side "long", not kernel-side "long"
and therefore always considered fixed 64bit no matter if 64 or 32bit underlying
architecture.

This contract breaks in case of the two mentioned helpers since their BPF_CALL
definition for the helpers was added with {unsigned,}long *res. Meaning, the
transition from BPF-side "long" (BPF program) to kernel-side "long" (BPF helper)
breaks here.

Both helpers call __bpf_strtoll() with "long long" correctly, but later assigning
the result into 32-bit "*(long *)" on 32bit architectures. From a BPF program
point of view, this means upper bits will be seen as uninitialised.

Therefore, fix both BPF_CALL signatures to {s,u}64 types to fix this situation.

Now, changing also uapi/bpf.h helper documentation which generates bpf_helper_defs.h
for BPF programs is tricky: Changing signatures there to __{s,u}64 would trigger
compiler warnings (incompatible pointer types passing 'long *' to parameter of type
'__s64 *' (aka 'long long *')) for existing BPF programs.

Leaving the signatures as-is would be fine as from BPF program point of view it is
still BPF-side "long" and thus equivalent to __{s,u}64 on 64 or 32bit underlying
architectures.

Note that bpf_strtol() and bpf_strtoul() are the only helpers with this issue.

Fixes: d7a4cb9b67 ("bpf: Introduce bpf_strtol and bpf_strtoul helpers")
Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/481fcec8-c12c-9abb-8ecb-76c71c009959@iogearbox.net
Link: https://lore.kernel.org/r/20240913191754.13290-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-13 13:17:55 -07:00
Yonghong Song
7dd34d7b7d bpf: Fix a sdiv overflow issue
Zac Ecob reported a problem where a bpf program may cause kernel crash due
to the following error:
  Oops: divide error: 0000 [#1] PREEMPT SMP KASAN PTI

The failure is due to the below signed divide:
  LLONG_MIN/-1 where LLONG_MIN equals to -9,223,372,036,854,775,808.
LLONG_MIN/-1 is supposed to give a positive number 9,223,372,036,854,775,808,
but it is impossible since for 64-bit system, the maximum positive
number is 9,223,372,036,854,775,807. On x86_64, LLONG_MIN/-1 will
cause a kernel exception. On arm64, the result for LLONG_MIN/-1 is
LLONG_MIN.

Further investigation found all the following sdiv/smod cases may trigger
an exception when bpf program is running on x86_64 platform:
  - LLONG_MIN/-1 for 64bit operation
  - INT_MIN/-1 for 32bit operation
  - LLONG_MIN%-1 for 64bit operation
  - INT_MIN%-1 for 32bit operation
where -1 can be an immediate or in a register.

On arm64, there are no exceptions:
  - LLONG_MIN/-1 = LLONG_MIN
  - INT_MIN/-1 = INT_MIN
  - LLONG_MIN%-1 = 0
  - INT_MIN%-1 = 0
where -1 can be an immediate or in a register.

Insn patching is needed to handle the above cases and the patched codes
produced results aligned with above arm64 result. The below are pseudo
codes to handle sdiv/smod exceptions including both divisor -1 and divisor 0
and the divisor is stored in a register.

sdiv:
      tmp = rX
      tmp += 1 /* [-1, 0] -> [0, 1]
      if tmp >(unsigned) 1 goto L2
      if tmp == 0 goto L1
      rY = 0
  L1:
      rY = -rY;
      goto L3
  L2:
      rY /= rX
  L3:

smod:
      tmp = rX
      tmp += 1 /* [-1, 0] -> [0, 1]
      if tmp >(unsigned) 1 goto L1
      if tmp == 1 (is64 ? goto L2 : goto L3)
      rY = 0;
      goto L2
  L1:
      rY %= rX
  L2:
      goto L4  // only when !is64
  L3:
      wY = wY  // only when !is64
  L4:

  [1] https://lore.kernel.org/bpf/tPJLTEh7S_DxFEqAI2Ji5MBSoZVg7_G-Py2iaZpAaWtM961fFTWtsnlzwvTbzBzaUzwQAoNATXKUlt0LZOFgnDcIyKCswAnAGdUF3LBrhGQ=@protonmail.com/

Reported-by: Zac Ecob <zacecob@protonmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240913150326.1187788-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-13 13:07:44 -07:00
Vincent Donnefort
b319cea805 module: Refine kmemleak scanned areas
commit ac3b432839 ("module: replace module_layout with module_memory")
introduced a set of memory regions for the module layout sharing the
same attributes. However, it didn't update the kmemleak scanned areas
which intended to limit kmemleak scan to sections containing writable
data. This means sections such as .text and .rodata are scanned by
kmemleak.

Refine the scanned areas for modules by limiting it to MOD_TEXT and
MOD_INIT_TEXT mod_mem regions.

CC: Song Liu <song@kernel.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2024-09-13 09:55:17 -07:00
Chunhui Li
ce47f7cbbc module: abort module loading when sysfs setup suffer errors
When insmod a kernel module, if fails in add_notes_attrs or
add_sysfs_attrs such as memory allocation fail, mod_sysfs_setup
will still return success, but we can't access user interface
on android device.

Patch for make mod_sysfs_setup can check the error of
add_notes_attrs and add_sysfs_attrs

[mcgrof: the section stuff comes from linux history.git [0]]
Fixes: 3f7b0672086b ("Module section offsets in /sys/module") [0]
Fixes: 6d76013381 ("Add /sys/module/name/notes")
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202409010016.3XIFSmRA-lkp@intel.com/
Closes: https://lore.kernel.org/oe-kbuild-all/202409072018.qfEzZbO7-lkp@intel.com/
Link: https://git.kernel.org/pub/scm/linux/kernel/git/history/history.git/commit/?id=3f7b0672086b97b2d7f322bdc289cbfa203f10ef [0]
Signed-off-by: Xion Wang <xion.wang@mediatek.com>
Signed-off-by: Chunhui Li <chunhui.li@mediatek.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2024-09-13 09:55:17 -07:00
Jakub Kicinski
3b7dc7000e bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZuH9UQAKCRDbK58LschI
 g0/zAP99WOcCBp1M/jSTUOba230+eiol7l5RirDEA6wu7TqY2QEAuvMG0KfCCpTI
 I0WqStrK1QMbhwKPodJC1k+17jArKgw=
 =jfMU
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2024-09-11

We've added 12 non-merge commits during the last 16 day(s) which contain
a total of 20 files changed, 228 insertions(+), 30 deletions(-).

There's a minor merge conflict in drivers/net/netkit.c:
  00d066a4d4 ("netdev_features: convert NETIF_F_LLTX to dev->lltx")
  d966087948 ("netkit: Disable netpoll support")

The main changes are:

1) Enable bpf_dynptr_from_skb for tp_btf such that this can be used
   to easily parse skbs in BPF programs attached to tracepoints,
   from Philo Lu.

2) Add a cond_resched() point in BPF's sock_hash_free() as there have
   been several syzbot soft lockup reports recently, from Eric Dumazet.

3) Fix xsk_buff_can_alloc() to account for queue_empty_descs which
   got noticed when zero copy ice driver started to use it,
   from Maciej Fijalkowski.

4) Move the xdp:xdp_cpumap_kthread tracepoint before cpumap pushes skbs
   up via netif_receive_skb_list() to better measure latencies,
   from Daniel Xu.

5) Follow-up to disable netpoll support from netkit, from Daniel Borkmann.

6) Improve xsk selftests to not assume a fixed MAX_SKB_FRAGS of 17 but
   instead gather the actual value via /proc/sys/net/core/max_skb_frags,
   also from Maciej Fijalkowski.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
  sock_map: Add a cond_resched() in sock_hash_free()
  selftests/bpf: Expand skb dynptr selftests for tp_btf
  bpf: Allow bpf_dynptr_from_skb() for tp_btf
  tcp: Use skb__nullable in trace_tcp_send_reset
  selftests/bpf: Add test for __nullable suffix in tp_btf
  bpf: Support __nullable argument suffix for tp_btf
  bpf, cpumap: Move xdp:xdp_cpumap_kthread tracepoint before rcv
  selftests/xsk: Read current MAX_SKB_FRAGS from sysctl knob
  xsk: Bump xsk_queue::queue_empty_descs in xp_can_alloc()
  tcp_bpf: Remove an unused parameter for bpf_tcp_ingress()
  bpf, sockmap: Correct spelling skmsg.c
  netkit: Disable netpoll support

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================

Link: https://patch.msgid.link/20240911211525.13834-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-12 20:22:44 -07:00
Al Viro
37d3dd663f bpf: convert bpf_token_create() to CLASS(fd, ...)
Keep file reference through the entire thing, don't bother with grabbing
struct path reference and while we are at it, don't confuse the hell out
of readers by random mix of path.dentry->d_sb and path.mnt->mnt_sb uses -
these two are equal, so just put one of those into a local variable and
use that.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2024-09-12 18:58:02 -07:00
Linus Torvalds
5da028864f workqueue: Fixes for v6.11-rc7
Contains the fix for a NULL worker->pool deref bug which can be triggered
 when a worker is created and then destroyed immediately.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZuM5ew4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGU5RAQCJ13myAx5ZhznE2fkCv8IrMP1y8BhO5eoPI6+o
 0QPgWgD/TMu7hMMZkz0vVHn0euNpwTWB0lOsz1299ukC1wO/tAw=
 =nJ2F
 -----END PGP SIGNATURE-----

Merge tag 'wq-for-6.11-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq

Pull workqueue fix from Tejun Heo:
 "A fix for a NULL worker->pool deref bug which can be triggered when a
  worker is created and then destroyed immediately"

* tag 'wq-for-6.11-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Clear worker->pool in the worker thread context
2024-09-12 13:11:10 -07:00
Christoph Hellwig
a5fb217f13 dma-mapping: reflow dma_supported
dma_supported has become too much spaghetti for my taste.  Reflow it to
remove the duplicate use_dma_iommu condition and make the main path more
obvious.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Leon Romanovsky <leon@kernel.org>
2024-09-12 16:28:00 +02:00
Christian Brauner
2077006d47
uidgid: make sure we fit into one cacheline
When I expanded uidgid mappings I intended for a struct uid_gid_map to
fit into a single cacheline on x86 as they tend to be pretty
performance sensitive (idmapped mounts etc). But a 4 byte hole was added
that brought it over 64 bytes. Fix that by adding the static extent
array and the extent counter into a substruct. C's type punning for
unions guarantees that we can access ->nr_extents even if the last
written to member wasn't within the same object. This is also what we
rely on in struct_group() and friends. This of course relies on
non-strict aliasing which we don't do.

99) If the member used to read the contents of a union object is not the
    same as the member last used to store a value in the object, the
    appropriate part of the object representation of the value is
    reinterpreted as an object representation in the new type as
    described in 6.2.6 (a process sometimes called "type punning").

Link: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2310.pdf
Link: https://lore.kernel.org/r/20240910-work-uid_gid_map-v1-1-e6bc761363ed@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-12 12:16:09 +02:00
Leon Romanovsky
f45cfab28f dma-mapping: reliably inform about DMA support for IOMMU
If the DMA IOMMU path is going to be used, the appropriate check should
return that DMA is supported.

Fixes: b5c58b2fdc ("dma-mapping: direct calls for dma-iommu")
Closes: https://lore.kernel.org/all/181e06ff-35a3-434f-b505-672f430bd1cb@notapiano
Reported-by: Nícolas F. R. A. Prado <nfraprado@collabora.com> #KernelCI
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Nícolas F. R. A. Prado <nfraprado@collabora.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-09-12 09:19:35 +02:00
Tejun Heo
902d67a2d4 sched: Move update_other_load_avgs() to kernel/sched/pelt.c
96fd6c65ef ("sched: Factor out update_other_load_avgs() from
__update_blocked_others()") added update_other_load_avgs() in
kernel/sched/syscalls.c right above effective_cpu_util(). This location
didn't fit that well in the first place, and with 5d871a6399 ("sched/fair:
Move effective_cpu_util() and effective_cpu_util() in fair.c") moving
effective_cpu_util() to kernel/sched/fair.c, it looks even more out of
place.

Relocate the function to kernel/sched/pelt.c where all its callees are.

No functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
2024-09-11 20:00:21 -10:00
Lai Jiangshan
73613840a8 workqueue: Clear worker->pool in the worker thread context
Marc Hartmayer reported:
        [   23.133876] Unable to handle kernel pointer dereference in virtual kernel address space
        [   23.133950] Failing address: 0000000000000000 TEID: 0000000000000483
        [   23.133954] Fault in home space mode while using kernel ASCE.
        [   23.133957] AS:000000001b8f0007 R3:0000000056cf4007 S:0000000056cf3800 P:000000000000003d
        [   23.134207] Oops: 0004 ilc:2 [#1] SMP
	(snip)
        [   23.134516] Call Trace:
        [   23.134520]  [<0000024e326caf28>] worker_thread+0x48/0x430
        [   23.134525] ([<0000024e326caf18>] worker_thread+0x38/0x430)
        [   23.134528]  [<0000024e326d3a3e>] kthread+0x11e/0x130
        [   23.134533]  [<0000024e3264b0dc>] __ret_from_fork+0x3c/0x60
        [   23.134536]  [<0000024e333fb37a>] ret_from_fork+0xa/0x38
        [   23.134552] Last Breaking-Event-Address:
        [   23.134553]  [<0000024e333f4c04>] mutex_unlock+0x24/0x30
        [   23.134562] Kernel panic - not syncing: Fatal exception: panic_on_oops

With debuging and analysis, worker_thread() accesses to the nullified
worker->pool when the newly created worker is destroyed before being
waken-up, in which case worker_thread() can see the result detach_worker()
reseting worker->pool to NULL at the begining.

Move the code "worker->pool = NULL;" out from detach_worker() to fix the
problem.

worker->pool had been designed to be constant for regular workers and
changeable for rescuer. To share attaching/detaching code for regular
and rescuer workers and to avoid worker->pool being accessed inadvertently
when the worker has been detached, worker->pool is reset to NULL when
detached no matter the worker is rescuer or not.

To maintain worker->pool being reset after detached, move the code
"worker->pool = NULL;" in the worker thread context after detached.

It is either be in the regular worker thread context after PF_WQ_WORKER
is cleared or in rescuer worker thread context with wq_pool_attach_mutex
held. So it is safe to do so.

Cc: Marc Hartmayer <mhartmay@linux.ibm.com>
Link: https://lore.kernel.org/lkml/87wmjj971b.fsf@linux.ibm.com/
Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Fixes: f4b7b53c94 ("workqueue: Detach workers directly in idle_cull_fn()")
Cc: stable@vger.kernel.org # v6.11+
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-11 19:58:20 -10:00
Yonghong Song
376bd59e2a bpf: Use fake pt_regs when doing bpf syscall tracepoint tracing
Salvatore Benedetto reported an issue that when doing syscall tracepoint
tracing the kernel stack is empty. For example, using the following
command line
  bpftrace -e 'tracepoint:syscalls:sys_enter_read { print("Kernel Stack\n"); print(kstack()); }'
  bpftrace -e 'tracepoint:syscalls:sys_exit_read { print("Kernel Stack\n"); print(kstack()); }'
the output for both commands is
===
  Kernel Stack
===

Further analysis shows that pt_regs used for bpf syscall tracepoint
tracing is from the one constructed during user->kernel transition.
The call stack looks like
  perf_syscall_enter+0x88/0x7c0
  trace_sys_enter+0x41/0x80
  syscall_trace_enter+0x100/0x160
  do_syscall_64+0x38/0xf0
  entry_SYSCALL_64_after_hwframe+0x76/0x7e

The ip address stored in pt_regs is from user space hence no kernel
stack is printed.

To fix the issue, kernel address from pt_regs is required.
In kernel repo, there are already a few cases like this. For example,
in kernel/trace/bpf_trace.c, several perf_fetch_caller_regs(fake_regs_ptr)
instances are used to supply ip address or use ip address to construct
call stack.

Instead of allocate fake_regs in the stack which may consume
a lot of bytes, the function perf_trace_buf_alloc() in
perf_syscall_{enter, exit}() is leveraged to create fake_regs,
which will be passed to perf_call_bpf_{enter,exit}().

For the above bpftrace script, I got the following output with this patch:
for tracepoint:syscalls:sys_enter_read
===
  Kernel Stack

        syscall_trace_enter+407
        syscall_trace_enter+407
        do_syscall_64+74
        entry_SYSCALL_64_after_hwframe+75
===
and for tracepoint:syscalls:sys_exit_read
===
Kernel Stack

        syscall_exit_work+185
        syscall_exit_work+185
        syscall_exit_to_user_mode+305
        do_syscall_64+118
        entry_SYSCALL_64_after_hwframe+75
===

Reported-by: Salvatore Benedetto <salvabenedetto@meta.com>
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240910214037.3663272-1-yonghong.song@linux.dev
2024-09-11 13:27:27 -07:00
Tao Chen
1d244784be bpf: Check percpu map value size first
Percpu map is often used, but the map value size limit often ignored,
like issue: https://github.com/iovisor/bcc/issues/2519. Actually,
percpu map value size is bound by PCPU_MIN_UNIT_SIZE, so we
can check the value size whether it exceeds PCPU_MIN_UNIT_SIZE first,
like percpu map of local_storage. Maybe the error message seems clearer
compared with "cannot allocate memory".

Signed-off-by: Jinke Han <jinkehan@didiglobal.com>
Signed-off-by: Tao Chen <chen.dylane@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240910144111.1464912-2-chen.dylane@gmail.com
2024-09-11 13:22:37 -07:00
Tejun Heo
0b1777f0fa Merge branch 'tip/sched/core' into sched_ext/for-6.12
Pull in tip/sched/core to resolve two merge conflicts:

- 96fd6c65ef ("sched: Factor out update_other_load_avgs() from __update_blocked_others()")
  5d871a6399 ("sched/fair: Move effective_cpu_util() and effective_cpu_util() in fair.c")

  A simple context conflict. The former added __update_blocked_others() in
  the same #ifdef CONFIG_SMP block that effective_cpu_util() and
  sched_cpu_util() are in and the latter moved those functions to fair.c.
  This makes __update_blocked_others() more out of place. Will follow up
  with a patch to relocate.

- 96fd6c65ef ("sched: Factor out update_other_load_avgs() from __update_blocked_others()")
  84d265281d ("sched/pelt: Use rq_clock_task() for hw_pressure")

  The former factored out the body of __update_blocked_others() into
  update_other_load_avgs(). The latter changed how update_hw_load_avg() is
  called in the body. Resolved by applying the change to
  update_other_load_avgs() instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-11 08:43:26 -10:00
Linus Torvalds
914413e3ee printk fixup for 6.11
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAmbhR6wACgkQUqAMR0iA
 lPKYYw/+OiLOgRhGbLY+6s4axgonblz+UGkH1omgxf1BmyAiJjKeI0XDQMwN1AyN
 MJSk6EDVm2U+VRhyu+fYu5NAzxzXu7hAuHeCcm5XdAIOBi4AApwvbcafFg5l7G1r
 WMtljSZ+PvtVl2J2S8wmnkUFcIBwkWtD5YzqCMT5QDtibzVz0p7qt+PxUfqmI9FV
 benzuSg5JtgMyN7F/GezxagNaLXT2P/8fXIyds+1ZGShyhDR1AMysCrblyM2R/cm
 e40jcxoiTHpKiX8wqTHnyxgJeL7HyhFRjjwpgWIcccpzTCLkOLToZcvdV+/u7xuS
 aTIBmbO/euitHsGfFUvRPRJcd1WvKwVjwHebRQhc8fRwwbjEpVt23/A3c7sWBPwb
 eAU4QEccthpvDVXs/4P5W8LKmOOTfZrU67cHaWi+AWy0jlr3PMGt9QnHZ4FA8axA
 GS08XgKYJ5NGhrkT2j25ERI7SMTJNnggKzx7QDlk8pfVBZJOE7v3J9GpAwF+ON8u
 dxepvrfBoM7q6la31OzvT5t0Tm+wII/bipDwl0hjSfcnYhjzxC2n/nefd3I+J4xC
 hIBXke3xtbI44PHwyjk29GrA3ZIJmWx8m9BOR224am6lEOVFmc1QDOveMrTEmWWu
 as69++4AKqMj2IPpXrXw+ZqDCptQzTKkM2cjb/F9sFdVei4eziQ=
 =y2YO
 -----END PGP SIGNATURE-----

Merge tag 'printk-for-6.11-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux

Pull printk fix from Petr Mladek:

 - Fix build of serial_core as a module

* tag 'printk-for-6.11-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
  printk: Export match_devname_and_update_preferred_console()
2024-09-11 11:13:20 -07:00
Baoquan He
b4722b8593 kernel/workqueue.c: fix DEFINE_PER_CPU_SHARED_ALIGNED expansion
Make tags always produces below annoying warnings:

ctags: Warning: kernel/workqueue.c:470: null expansion of name pattern "\1"
ctags: Warning: kernel/workqueue.c:474: null expansion of name pattern "\1"
ctags: Warning: kernel/workqueue.c:478: null expansion of name pattern "\1"

In commit 25528213fe ("tags: Fix DEFINE_PER_CPU expansions"), codes in
places have been adjusted including cpu_worker_pools definition. I noticed
in commit 4cb1ef6460 ("workqueue: Implement BH workqueues to eventually
replace tasklets"), cpu_worker_pools definition was unfolded back. Not
sure if it was intentionally done or ignored carelessly.

Makes change to mute them specifically.

Signed-off-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-11 08:10:44 -10:00
Rafael J. Wysocki
0a06811d66 Merge branches 'pm-sleep', 'pm-opp' and 'pm-tools'
Merge updates related to system sleep, operating performance points
(OPP) updates, and PM tooling updates for 6.12-rc1:

 - Remove unused stub for saveable_highmem_page() and remove deprecated
   macros from power management documentation (Andy Shevchenko).

 - Use ysfs_emit() and sysfs_emit_at() in "show" functions in the PM
   sysfs interface (Xueqin Luo).

 - Update the maintainers information for the operating-points-v2-ti-cpu DT
   binding (Dhruva Gole).

 - Drop unnecessary of_match_ptr() from ti-opp-supply (Rob Herring).

 - Update directory handling and installation process in the pm-graph
   Makefile and add .gitignore to ignore sleepgraph.py artifacts to
   pm-graph (Amit Vadhavana, Yo-Jung Lin).

 - Make cpupower display residency value in idle-info (Aboorva
   Devarajan).

 - Add missing powercap_set_enabled() stub function to cpupower (John
   B. Wyatt IV).

 - Add SWIG support to cpupower (John B. Wyatt IV).

* pm-sleep:
  PM: hibernate: Remove unused stub for saveable_highmem_page()
  Documentation: PM: Discourage use of deprecated macros
  PM: sleep: Use sysfs_emit() and sysfs_emit_at() in "show" functions
  PM: hibernate: Use sysfs_emit() and sysfs_emit_at() in "show" functions

* pm-opp:
  dt-bindings: opp: operating-points-v2-ti-cpu: Update maintainers
  opp: ti: Drop unnecessary of_match_ptr()

* pm-tools:
  pm:cpupower: Add error warning when SWIG is not installed
  MAINTAINERS: Add Maintainers for SWIG Python bindings
  pm:cpupower: Include test_raw_pylibcpupower.py
  pm:cpupower: Add SWIG bindings files for libcpupower
  pm:cpupower: Add missing powercap_set_enabled() stub function
  pm-graph: Update directory handling and installation process in Makefile
  pm-graph: Make git ignore sleepgraph.py artifacts
  tools/cpupower: display residency value in idle-info
2024-09-11 19:02:23 +02:00
Andrii Nakryiko
d4dd9775ec bpf: wire up sleepable bpf_get_stack() and bpf_get_task_stack() helpers
Add sleepable implementations of bpf_get_stack() and
bpf_get_task_stack() helpers and allow them to be used from sleepable
BPF program (e.g., sleepable uprobes).

Note, the stack trace IPs capturing itself is not sleepable (that would
need to be a separate project), only build ID fetching is sleepable and
thus more reliable, as it will wait for data to be paged in, if
necessary. For that we make use of sleepable build_id_parse()
implementation.

Now that build ID related internals in kernel/bpf/stackmap.c can be used
both in sleepable and non-sleepable contexts, we need to add additional
rcu_read_lock()/rcu_read_unlock() protection around fetching
perf_callchain_entry, but with the refactoring in previous commit it's
now pretty straightforward. We make sure to do rcu_read_unlock (in
sleepable mode only) right before stack_map_get_build_id_offset() call
which can sleep. By that time we don't have any more use of
perf_callchain_entry.

Note, bpf_get_task_stack() will fail for user mode if task != current.
And for kernel mode build ID are irrelevant. So in that sense adding
sleepable bpf_get_task_stack() implementation is a no-op. It feel right
to wire this up for symmetry and completeness, but I'm open to just
dropping it until we support `user && crosstask` condition.

Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240829174232.3133883-10-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-11 09:58:31 -07:00
Andrii Nakryiko
4f4c4fc015 bpf: decouple stack_map_get_build_id_offset() from perf_callchain_entry
Change stack_map_get_build_id_offset() which is used to convert stack
trace IP addresses into build ID+offset pairs. Right now this function
accepts an array of u64s as an input, and uses array of
struct bpf_stack_build_id as an output.

This is problematic because u64 array is coming from
perf_callchain_entry, which is (non-sleepable) RCU protected, so once we
allows sleepable build ID fetching, this all breaks down.

But its actually pretty easy to make stack_map_get_build_id_offset()
works with array of struct bpf_stack_build_id as both input and output.
Which is what this patch is doing, eliminating the dependency on
perf_callchain_entry. We require caller to fill out
bpf_stack_build_id.ip fields (all other can be left uninitialized), and
update in place as we do build ID resolution.

We make sure to READ_ONCE() and cache locally current IP value as we
used it in a few places to find matching VMA and so on. Given this data
is directly accessible and modifiable by user's BPF code, we should make
sure to have a consistent view of it.

Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240829174232.3133883-9-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-11 09:58:31 -07:00
Andrii Nakryiko
45b8fc3096 lib/buildid: rename build_id_parse() into build_id_parse_nofault()
Make it clear that build_id_parse() assumes that it can take no page
fault by renaming it and current few users to build_id_parse_nofault().

Also add build_id_parse() stub which for now falls back to non-sleepable
implementation, but will be changed in subsequent patches to take
advantage of sleepable context. PROCMAP_QUERY ioctl() on
/proc/<pid>/maps file is using build_id_parse() and will automatically
take advantage of more reliable sleepable context implementation.

Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240829174232.3133883-6-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-11 09:58:30 -07:00
Philo Lu
8aeaed21be bpf: Support __nullable argument suffix for tp_btf
Pointers passed to tp_btf were trusted to be valid, but some tracepoints
do take NULL pointer as input, such as trace_tcp_send_reset(). Then the
invalid memory access cannot be detected by verifier.

This patch fix it by add a suffix "__nullable" to the unreliable
argument. The suffix is shown in btf, and PTR_MAYBE_NULL will be added
to nullable arguments. Then users must check the pointer before use it.

A problem here is that we use "btf_trace_##call" to search func_proto.
As it is a typedef, argument names as well as the suffix are not
recorded. To solve this, I use bpf_raw_event_map to find
"__bpf_trace##template" from "btf_trace_##call", and then we can see the
suffix.

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240911033719.91468-2-lulie@linux.alibaba.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-09-11 08:56:37 -07:00
Daniel Xu
23dc986732 bpf, cpumap: Move xdp:xdp_cpumap_kthread tracepoint before rcv
cpumap takes RX processing out of softirq and onto a separate kthread.
Since the kthread needs to be scheduled in order to run (versus softirq
which does not), we can theoretically experience extra latency if the
system is under load and the scheduler is being unfair to us.

Moving the tracepoint to before passing the skb list up the stack allows
users to more accurately measure enqueue/dequeue latency introduced by
cpumap via xdp:xdp_cpumap_enqueue and xdp:xdp_cpumap_kthread tracepoints.

f9419f7bd7 ("bpf: cpumap add tracepoints") which added the tracepoints
states that the intent behind them was for general observability and for
a feedback loop to see if the queues are being overwhelmed. This change
does not mess with either of those use cases but rather adds a third
one.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://lore.kernel.org/bpf/47615d5b5e302e4bd30220473779e98b492d47cd.1725585718.git.dxu@dxuuu.xyz
2024-09-11 16:32:11 +02:00
Christian Loehle
bc9057da1a sched/cpufreq: Use NSEC_PER_MSEC for deadline task
Convert the sugov deadline task attributes to use the available
definitions to make them more readable.
No functional change.

Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lore.kernel.org/r/20240813144348.1180344-5-christian.loehle@arm.com
2024-09-11 11:25:22 +02:00
Petr Mladek
2c83ded8ae Merge branch 'for-6.11-fixup' into for-linus 2024-09-11 09:30:22 +02:00
Simona Vetter
b615b9c36c Linux 6.11-rc7
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmbeHCQeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGwfwH/ijnVvDWt0L1mpkE
 oIPmKV+2018CA5ww/Hh+ncToWn/aCmrczHc1SEUOk/SbZnGyXJj/6KiNEK6XpJyu
 Hb90y53D5B9jkEq8WPbSy5RtqCU598gYPeBxkczjj431jer9EsZVzqsKxGRzdAud
 2+Ft/qLiVL8AP5P8IPuU7G8CU6OE0fUL5PyuzMGDtstL3R6lPpG+kf/VrJGV1mp7
 DVZO8hKwIi5Vu+ciaTJv+9PMHzXRnMhLIGabtGIzM8nhmrQx/Kv/PMjiEl/OBkmk
 6PzafEkxVtBKDNK2Qhp+QMTQJATuPccZI8Kn6peZhqoNWYHBqx7d88Q/2iiAGj0z
 skPW5Gs=
 =orf8
 -----END PGP SIGNATURE-----

Merge v6.11-rc7 into drm-next

Thomas needs 5a498d4d06 ("drm/fbdev-dma: Only install deferred I/O
if necessary") in drm-misc, so start the backmerge cascade.

Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch>
2024-09-11 09:18:15 +02:00
Tejun Heo
513ed0c7cc sched_ext: Don't trigger ops.quiescent/runnable() on migrations
A task moving across CPUs should not trigger quiescent/runnable task state
events as the task is staying runnable the whole time and just stopping and
then starting on different CPUs. Suppress quiescent/runnable task state
events if task_on_rq_migrating().

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: David Vernet <void@manifault.com>
Cc: Daniel Hodges <hodges.daniel.scott@gmail.com>
Cc: Changwoo Min <multics69@gmail.com>
Cc: Andrea Righi <andrea.righi@linux.dev>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-10 10:45:20 -10:00
Tejun Heo
750a40d816 sched_ext: Synchronize bypass state changes with rq lock
While the BPF scheduler is being unloaded, the following warning messages
trigger sometimes:

 NOHZ tick-stop error: local softirq work is pending, handler #80!!!

This is caused by the CPU entering idle while there are pending softirqs.
The main culprit is the bypassing state assertion not being synchronized
with rq operations. As the BPF scheduler cannot be trusted in the disable
path, the first step is entering the bypass mode where the BPF scheduler is
ignored and scheduling becomes global FIFO.

This is implemented by turning scx_ops_bypassing() true. However, the
transition isn't synchronized against anything and it's possible for enqueue
and dispatch paths to have different ideas on whether bypass mode is on.

Make each rq track its own bypass state with SCX_RQ_BYPASSING which is
modified while rq is locked.

This removes most of the NOHZ tick-stop messages but not completely. I
believe the stragglers are from the sched core bug where pick_task_scx() can
be called without preceding balance_scx(). Once that bug is fixed, we should
verify that all occurrences of this error message are gone too.

v2: scx_enabled() test moved inside the for_each_possible_cpu() loop so that
    the per-cpu states are always synchronized with the global state.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: David Vernet <void@manifault.com>
2024-09-10 10:43:32 -10:00
Michal Koutný
af000ce852 cgroup: Do not report unavailable v1 controllers in /proc/cgroups
This is a followup to CONFIG-urability of cpuset and memory controllers
for v1 hierarchies. Make the output in /proc/cgroups reflect that
!CONFIG_CPUSETS_V1 is like !CONFIG_CPUSETS and
!CONFIG_MEMCG_V1 is like !CONFIG_MEMCG.

The intended effect is that hiding the unavailable controllers will hint
users not to try mounting them on v1.

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-10 10:04:28 -10:00
Michal Koutný
3c41382e92 cgroup: Disallow mounting v1 hierarchies without controller implementation
The configs that disable some v1 controllers would still allow mounting
them but with no controller-specific files. (Making such hierarchies
equivalent to named v1 hierarchies.) To achieve behavior consistent with
actual out-compilation of a whole controller, the mounts should treat
respective controllers as non-existent.

Wrap implementation into a helper function, leverage legacy_files to
detect compiled out controllers. The effect is that mounts on v1 would
fail and produce a message like:
  [ 1543.999081] cgroup: Unknown subsys name 'memory'

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-10 10:03:43 -10:00
Michal Koutný
659f90f863 cgroup/cpuset: Expose cpuset filesystem with cpuset v1 only
The cpuset filesystem is a legacy interface to cpuset controller with
(pre-)v1 features. It makes little sense to co-mount it on systems
without cpuset v1, so do not build it when cpuset v1 is not built
neither.

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-10 10:02:12 -10:00
Andy Shevchenko
c3565a35d9 PM: hibernate: Remove unused stub for saveable_highmem_page()
When saveable_highmem_page() is unused, it prevents kernel builds
with clang, `make W=1` and CONFIG_WERROR=y:

kernel/power/snapshot.c:1369:21: error: unused function 'saveable_highmem_page' [-Werror,-Wunused-function]
 1369 | static inline void *saveable_highmem_page(struct zone *z, unsigned long p)
      |                     ^~~~~~~~~~~~~~~~~~~~~

Fix this by removing unused stub.

See also commit 6863f5643d ("kbuild: allow Clang to find unused static
inline functions for W=1 build").

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://patch.msgid.link/20240905184848.318978-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-09-10 20:11:40 +02:00
Linus Torvalds
8d8d276ba2 More tracing fixes for 6.11:
- Move declaration of interface_lock outside of CONFIG_TIMERLAT_TRACER
 
   The fix to some locking races moved the declaration of the
   interface_lock up in the file, but also moved it into the
   CONFIG_TIMERLAT_TRACER #ifdef block, breaking the build when
   that wasn't set. Move it further up and out of that #ifdef block.
 
 - Remove unused function run_tracer_selftest() stub
 
   When CONFIG_FTRACE_STARTUP_TEST is not set the stub function
   run_tracer_selftest() is not used and clang is warning about it.
   Remove the function stub as it is not needed.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZt9WIRQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qj2PAPsHsAHxF4oPhXi9UmGHH+l0NcWm87U2
 B5JE+73M+RaDQgD/WpdGJaQRudUwic0wu+aHXzMFae3DVd/WUjWbGnlo5gI=
 =pS08
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.11-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Move declaration of interface_lock outside of CONFIG_TIMERLAT_TRACER

   The fix to some locking races moved the declaration of the
   interface_lock up in the file, but also moved it into the
   CONFIG_TIMERLAT_TRACER #ifdef block, breaking the build when that
   wasn't set. Move it further up and out of that #ifdef block.

 - Remove unused function run_tracer_selftest() stub

   When CONFIG_FTRACE_STARTUP_TEST is not set the stub function
   run_tracer_selftest() is not used and clang is warning about it.
   Remove the function stub as it is not needed.

* tag 'trace-v6.11-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Drop unused helper function to fix the build
  tracing/osnoise: Fix build when timerlat is not enabled
2024-09-10 09:05:20 -07:00
Benjamin ROBIN
35b603f8a7 ntp: Make sure RTC is synchronized when time goes backwards
sync_hw_clock() is normally called every 11 minutes when time is
synchronized. This issue is that this periodic timer uses the REALTIME
clock, so when time moves backwards (the NTP server jumps into the past),
the timer expires late.

If the timer expires late, which can be days later, the RTC will no longer
be updated, which is an issue if the device is abruptly powered OFF during
this period. When the device will restart (when powered ON), it will have
the date prior to the ADJ_SETOFFSET call.

A normal NTP server should not jump in the past like that, but it is
possible... Another way of reproducing this issue is to use phc2sys to
synchronize the REALTIME clock with, for example, an IRIG timecode with
the source always starting at the same date (not synchronized).

Also, if the time jump in the future by less than 11 minutes, the RTC may
not be updated immediately (minor issue). Consider the following scenario:
 - Time is synchronized, and sync_hw_clock() was just called (the timer
   expires in 11 minutes).
 - A time jump is realized in the future by a couple of minutes.
 - The time is synchronized again.
 - Users may expect that RTC to be updated as soon as possible, and not
   after 11 minutes (for the same reason, if a power loss occurs in this
   period).

Cancel periodic timer on any time jump (ADJ_SETOFFSET) greater than or
equal to 1s. The timer will be relaunched at the end of do_adjtimex() if
NTP is still considered synced. Otherwise the timer will be relaunched
later when NTP is synced. This way, when the time is synchronized again,
the RTC is updated after less than 2 seconds.

Signed-off-by: Benjamin ROBIN <dev@benjarobin.fr>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240908140836.203911-1-dev@benjarobin.fr
2024-09-10 13:50:40 +02:00
Thomas Gleixner
2f7eedca6c Merge branch 'linus' into timers/core
To update with the latest fixes.
2024-09-10 13:49:53 +02:00
Waiman Long
d00b83d416 locking/rwsem: Move is_rwsem_reader_owned() and rwsem_owner() under CONFIG_DEBUG_RWSEMS
Both is_rwsem_reader_owned() and rwsem_owner() are currently only used when
CONFIG_DEBUG_RWSEMS is defined. This causes a compilation error with clang
when `make W=1` and CONFIG_WERROR=y:

kernel/locking/rwsem.c:187:20: error: unused function 'is_rwsem_reader_owned' [-Werror,-Wunused-function]
  187 | static inline bool is_rwsem_reader_owned(struct rw_semaphore *sem)
      |                    ^~~~~~~~~~~~~~~~~~~~~
kernel/locking/rwsem.c:271:35: error: unused function 'rwsem_owner' [-Werror,-Wunused-function]
  271 | static inline struct task_struct *rwsem_owner(struct rw_semaphore *sem)
      |                                   ^~~~~~~~~~~

Fix this by moving these two functions under the CONFIG_DEBUG_RWSEMS define.

Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Tested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20240909182905.161156-1-longman@redhat.com
2024-09-10 12:02:33 +02:00
Peter Zijlstra
1d7f856c2c jump_label: Fix static_key_slow_dec() yet again
While commit 83ab38ef0a ("jump_label: Fix concurrency issues in
static_key_slow_dec()") fixed one problem, it created yet another,
notably the following is now possible:

  slow_dec
    if (try_dec) // dec_not_one-ish, false
    // enabled == 1
                                slow_inc
                                  if (inc_not_disabled) // inc_not_zero-ish
                                  // enabled == 2
                                    return

    guard((mutex)(&jump_label_mutex);
    if (atomic_cmpxchg(1,0)==1) // false, we're 2

                                slow_dec
                                  if (try-dec) // dec_not_one, true
                                  // enabled == 1
                                    return
    else
      try_dec() // dec_not_one, false
      WARN

Use dec_and_test instead of cmpxchg(), like it was prior to
83ab38ef0a. Add a few WARNs for the paranoid.

Fixes: 83ab38ef0a ("jump_label: Fix concurrency issues in static_key_slow_dec()")
Reported-by: "Darrick J. Wong" <djwong@kernel.org>
Tested-by: Klara Modin <klarasmodin@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-09-10 11:57:27 +02:00
Kan Liang
a48a36b316 perf: Add PERF_EV_CAP_READ_SCOPE
Usually, an event can be read from any CPU of the scope. It doesn't need
to be read from the advertised CPU.

Add a new event cap, PERF_EV_CAP_READ_SCOPE. An event of a PMU with
scope can be read from any active CPU in the scope.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240802151643.1691631-3-kan.liang@linux.intel.com
2024-09-10 11:44:13 +02:00
Kan Liang
4ba4f1afb6 perf: Generic hotplug support for a PMU with a scope
The perf subsystem assumes that the counters of a PMU are per-CPU. So
the user space tool reads a counter from each CPU in the system wide
mode. However, many PMUs don't have a per-CPU counter. The counter is
effective for a scope, e.g., a die or a socket. To address this, a
cpumask is exposed by the kernel driver to restrict to one CPU to stand
for a specific scope. In case the given CPU is removed,
the hotplug support has to be implemented for each such driver.

The codes to support the cpumask and hotplug are very similar.
- Expose a cpumask into sysfs
- Pickup another CPU in the same scope if the given CPU is removed.
- Invoke the perf_pmu_migrate_context() to migrate to a new CPU.
- In event init, always set the CPU in the cpumask to event->cpu

Similar duplicated codes are implemented for each such PMU driver. It
would be good to introduce a generic infrastructure to avoid such
duplication.

5 popular scopes are implemented here, core, die, cluster, pkg, and
the system-wide. The scope can be set when a PMU is registered. If so, a
"cpumask" is automatically exposed for the PMU.

The "cpumask" is from the perf_online_<scope>_mask, which is to track
the active CPU for each scope. They are set when the first CPU of the
scope is online via the generic perf hotplug support. When a
corresponding CPU is removed, the perf_online_<scope>_mask is updated
accordingly and the PMU will be moved to a new CPU from the same scope
if possible.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240802151643.1691631-2-kan.liang@linux.intel.com
2024-09-10 11:44:12 +02:00
Huang Shijie
2cab4bd024 sched/debug: Fix the runnable tasks output
The current runnable tasks output looks like:

  runnable tasks:
   S            task   PID         tree-key  switches  prio     wait-time             sum-exec        sum-sleep
  -------------------------------------------------------------------------------------------------------------
   Ikworker/R-rcu_g     4         0.129049 E         0.620179           0.750000         0.002920         2   100         0.000000         0.002920         0.000000         0.000000 0 0 /
   Ikworker/R-sync_     5         0.125328 E         0.624147           0.750000         0.001840         2   100         0.000000         0.001840         0.000000         0.000000 0 0 /
   Ikworker/R-slub_     6         0.120835 E         0.628680           0.750000         0.001800         2   100         0.000000         0.001800         0.000000         0.000000 0 0 /
   Ikworker/R-netns     7         0.114294 E         0.634701           0.750000         0.002400         2   100         0.000000         0.002400         0.000000         0.000000 0 0 /
   I    kworker/0:1     9       508.781746 E       511.754666           3.000000       151.575240       224   120         0.000000       151.575240         0.000000         0.000000 0 0 /

Which is messy. Remove the duplicate printing of sum_exec_runtime and
tidy up the layout to make it look like:

  runnable tasks:
   S            task   PID       vruntime   eligible    deadline             slice          sum-exec      switches  prio         wait-time        sum-sleep       sum-block  node   group-id  group-path
  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
   I     kworker/0:3  1698       295.001459   E         297.977619           3.000000        38.862920         9     120         0.000000         0.000000         0.000000   0      0        /
   I     kworker/0:4  1702       278.026303   E         281.026303           3.000000         9.918760         3     120         0.000000         0.000000         0.000000   0      0        /
   S  NetworkManager  2646         0.377936   E           2.598104           3.000000        98.535880       314     120         0.000000         0.000000         0.000000   0      0        /system.slice/NetworkManager.service
   S       virtqemud  2689         0.541016   E           2.440104           3.000000        50.967960        80     120         0.000000         0.000000         0.000000   0      0        /system.slice/virtqemud.service
   S   gsd-smartcard  3058        73.604144   E          76.475904           3.000000        74.033320        88     120         0.000000         0.000000         0.000000   0      0        /user.slice/user-42.slice/session-c1.scope

Reviewed-by: Christoph Lameter (Ampere) <cl@linux.com>
Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20240906053019.7874-1-shijie@os.amperecomputing.com
2024-09-10 09:51:15 +02:00
Peter Zijlstra
c662e2b1e8 sched: Fix sched_delayed vs sched_core
Completely analogous to commit dfa0a574cb ("sched/uclamg: Handle
delayed dequeue"), avoid double dequeue for the sched_core entries.

Fixes: 152e11f6df ("sched/fair: Implement delayed dequeue")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-09-10 09:51:15 +02:00
Dietmar Eggemann
729288bc68 kernel/sched: Fix util_est accounting for DELAY_DEQUEUE
Remove delayed tasks from util_est even they are runnable.

Exclude delayed task which are (a) migrating between rq's or (b) in a
SAVE/RESTORE dequeue/enqueue.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/c49ef5fe-a909-43f1-b02f-a765ab9cedbf@arm.com
2024-09-10 09:51:15 +02:00
Chen Yu
6b9ccbc033 kthread: Fix task state in kthread worker if being frozen
When analyzing a kernel waring message, Peter pointed out that there is a race
condition when the kworker is being frozen and falls into try_to_freeze() with
TASK_INTERRUPTIBLE, which could trigger a might_sleep() warning in try_to_freeze().
Although the root cause is not related to freeze()[1], it is still worthy to fix
this issue ahead.

One possible race scenario:

        CPU 0                                           CPU 1
        -----                                           -----

        // kthread_worker_fn
        set_current_state(TASK_INTERRUPTIBLE);
                                                       suspend_freeze_processes()
                                                         freeze_processes
                                                           static_branch_inc(&freezer_active);
                                                         freeze_kernel_threads
                                                           pm_nosig_freezing = true;
        if (work) { //false
          __set_current_state(TASK_RUNNING);

        } else if (!freezing(current)) //false, been frozen

                      freezing():
                      if (static_branch_unlikely(&freezer_active))
                        if (pm_nosig_freezing)
                          return true;
          schedule()
	}

        // state is still TASK_INTERRUPTIBLE
        try_to_freeze()
          might_sleep() <--- warning

Fix this by explicitly set the TASK_RUNNING before entering
try_to_freeze().

Fixes: b56c0d8937 ("kthread: implement kthread_worker")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/lkml/Zs2ZoAcUsZMX2B%2FI@chenyu5-mobl2/ [1]
2024-09-10 09:51:14 +02:00
Chen Yu
84d265281d sched/pelt: Use rq_clock_task() for hw_pressure
commit 97450eb909 ("sched/pelt: Remove shift of thermal clock")
removed the decay_shift for hw_pressure. This commit uses the
sched_clock_task() in sched_tick() while it replaces the
sched_clock_task() with rq_clock_pelt() in __update_blocked_others().
This could bring inconsistence. One possible scenario I can think of
is in ___update_load_sum():

  u64 delta = now - sa->last_update_time

'now' could be calculated by rq_clock_pelt() from
__update_blocked_others(), and last_update_time was calculated by
rq_clock_task() previously from sched_tick(). Usually the former
chases after the latter, it cause a very large 'delta' and brings
unexpected behavior.

Fixes: 97450eb909 ("sched/pelt: Remove shift of thermal clock")
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Hongyan Xia <hongyan.xia2@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20240827112607.181206-1-yu.c.chen@intel.com
2024-09-10 09:51:14 +02:00
Vincent Guittot
5d871a6399 sched/fair: Move effective_cpu_util() and effective_cpu_util() in fair.c
Move effective_cpu_util() and sched_cpu_util() functions in fair.c file
with others utilization related functions.

No functional change.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20240904092417.20660-1-vincent.guittot@linaro.org
2024-09-10 09:51:14 +02:00
Peter Zijlstra
3dcac251b0 sched/core: Introduce SM_IDLE and an idle re-entry fast-path in __schedule()
Since commit b2a02fc43a ("smp: Optimize
send_call_function_single_ipi()") an idle CPU in TIF_POLLING_NRFLAG mode
can be pulled out of idle by setting TIF_NEED_RESCHED flag to service an
IPI without actually sending an interrupt. Even in cases where the IPI
handler does not queue a task on the idle CPU, do_idle() will call
__schedule() since need_resched() returns true in these cases.

Introduce and use SM_IDLE to identify call to __schedule() from
schedule_idle() and shorten the idle re-entry time by skipping
pick_next_task() when nr_running is 0 and the previous task is the idle
task.

With the SM_IDLE fast-path, the time taken to complete a fixed set of
IPIs using ipistorm improves noticeably. Following are the numbers
from a dual socket Intel Ice Lake Xeon server (2 x 32C/64T) and
3rd Generation AMD EPYC system (2 x 64C/128T) (boost on, C2 disabled)
running ipistorm between CPU8 and CPU16:

cmdline: insmod ipistorm.ko numipi=100000 single=1 offset=8 cpulist=8 wait=1

   ==================================================================
   Test          : ipistorm (modified)
   Units         : Normalized runtime
   Interpretation: Lower is better
   Statistic     : AMean
   ======================= Intel Ice Lake Xeon ======================
   kernel:				time [pct imp]
   tip:sched/core			1.00 [baseline]
   tip:sched/core + SM_IDLE		0.80 [20.51%]
   ==================== 3rd Generation AMD EPYC =====================
   kernel:				time [pct imp]
   tip:sched/core			1.00 [baseline]
   tip:sched/core + SM_IDLE		0.90 [10.17%]
   ==================================================================

[ kprateek: Commit message, SM_RTLOCK_WAIT fix ]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Not-yet-signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20240809092240.6921-1-kprateek.nayak@amd.com
2024-09-10 09:51:14 +02:00
Sean Anderson
038eb433dc dma-mapping: add tracing for dma-mapping API calls
When debugging drivers, it can often be useful to trace when memory gets
(un)mapped for DMA (and can be accessed by the device). Add some
tracepoints for this purpose.

Use u64 instead of phys_addr_t and dma_addr_t (and similarly %llx instead
of %pa) because libtraceevent can't handle typedefs in all cases.

Signed-off-by: Sean Anderson <sean.anderson@linux.dev>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-09-10 07:47:19 +03:00
Jinjie Ruan
546f02823d user_namespace: use kmemdup_array() instead of kmemdup() for multiple allocation
Let the kmemdup_array() take care about multiplication and possible
overflows.

Link: https://lkml.kernel.org/r/20240828072340.1249310-1-ruanjinjie@huawei.com
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Li zeming <zeming@nfschina.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09 16:47:42 -07:00
Tejun Heo
4c30f5ce4f sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()
Once a task is put into a DSQ, the allowed operations are fairly limited.
Tasks in the built-in local and global DSQs are executed automatically and,
ignoring dequeue, there is only one way a task in a user DSQ can be
manipulated - scx_bpf_consume() moves the first task to the dispatching
local DSQ. This inflexibility sometimes gets in the way and is an area where
multiple feature requests have been made.

Implement scx_bpf_dispatch[_vtime]_from_dsq(), which can be called during
DSQ iteration and can move the task to any DSQ - local DSQs, global DSQ and
user DSQs. The kfuncs can be called from ops.dispatch() and any BPF context
which dosen't hold a rq lock including BPF timers and SYSCALL programs.

This is an expansion of an earlier patch which only allowed moving into the
dispatching local DSQ:

  http://lkml.kernel.org/r/Zn4Cw4FDTmvXnhaf@slm.duckdns.org

v2: Remove @slice and @vtime from scx_bpf_dispatch_from_dsq[_vtime]() as
    they push scx_bpf_dispatch_from_dsq_vtime() over the kfunc argument
    count limit and often won't be needed anyway. Instead provide
    scx_bpf_dispatch_from_dsq_set_{slice|vtime}() kfuncs which can be called
    only when needed and override the specified parameter for the subsequent
    dispatch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Daniel Hodges <hodges.daniel.scott@gmail.com>
Cc: David Vernet <void@manifault.com>
Cc: Changwoo Min <multics69@gmail.com>
Cc: Andrea Righi <andrea.righi@linux.dev>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-09-09 13:42:47 -10:00
Tejun Heo
6462dd53a2 sched_ext: Compact struct bpf_iter_scx_dsq_kern
struct scx_iter_scx_dsq is defined as 6 u64's and scx_dsq_iter_kern was
using 5 of them. We want to add two more u64 fields but it's better if we do
so while staying within scx_iter_scx_dsq to maintain binary compatibility.

The way scx_iter_scx_dsq_kern is laid out is rather inefficient - the node
field takes up three u64's but only one bit of the last u64 is used. Turn
the bool into u32 flags and only use the lower 16 bits freeing up 48 bits -
16 bits for flags, 32 bits for a u32 - for use by struct
bpf_iter_scx_dsq_kern.

This allows moving the dsq_seq and flags fields of bpf_iter_scx_dsq_kern
into the cursor field reducing the struct size by a full u64.

No behavior changes intended.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-09 13:42:47 -10:00
Tejun Heo
cf3e94430d sched_ext: Replace consume_local_task() with move_local_task_to_local_dsq()
- Rename move_task_to_local_dsq() to move_remote_task_to_local_dsq().

- Rename consume_local_task() to move_local_task_to_local_dsq() and remove
  task_unlink_from_dsq() and source DSQ unlocking from it.

This is to make the migration code easier to reuse.

No functional changes intended.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
2024-09-09 13:42:47 -10:00
Tejun Heo
d434210e13 sched_ext: Move consume_local_task() upward
So that the local case comes first and two CONFIG_SMP blocks can be merged.

No functional changes intended.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
2024-09-09 13:42:47 -10:00
Tejun Heo
6557133ecd sched_ext: Move sanity check and dsq_mod_nr() into task_unlink_from_dsq()
All task_unlink_from_dsq() users are doing dsq_mod_nr(dsq, -1). Move it into
task_unlink_from_dsq(). Also move sanity check into it.

No functional changes intended.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
2024-09-09 13:42:47 -10:00
Tejun Heo
1389f49098 sched_ext: Reorder args for consume_local/remote_task()
Reorder args for consistency in the order of:

  current_rq, p, src_[rq|dsq], dst_[rq|dsq].

No functional changes intended.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-09 13:42:47 -10:00
Tejun Heo
18f856991d sched_ext: Restructure dispatch_to_local_dsq()
Now that there's nothing left after the big if block, flip the if condition
and unindent the body.

No functional changes intended.

v2: Add BUG() to clarify control can't reach the end of
    dispatch_to_local_dsq() in UP kernels per David.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
2024-09-09 13:42:47 -10:00
Tejun Heo
0aab26309e sched_ext: Fix processs_ddsp_deferred_locals() by unifying DTL_INVALID handling
With the preceding update, the only return value which makes meaningful
difference is DTL_INVALID, for which one caller, finish_dispatch(), falls
back to the global DSQ and the other, process_ddsp_deferred_locals(),
doesn't do anything.

It should always fallback to the global DSQ. Move the global DSQ fallback
into dispatch_to_local_dsq() and remove the return value.

v2: Patch title and description updated to reflect the behavior fix for
    process_ddsp_deferred_locals().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
2024-09-09 13:42:47 -10:00
Tejun Heo
e683949a4b sched_ext: Make find_dsq_for_dispatch() handle SCX_DSQ_LOCAL_ON
find_dsq_for_dispatch() handles all DSQ IDs except SCX_DSQ_LOCAL_ON.
Instead, each caller is hanlding SCX_DSQ_LOCAL_ON before calling it. Move
SCX_DSQ_LOCAL_ON lookup into find_dsq_for_dispatch() to remove duplicate
code in direct_dispatch() and dispatch_to_local_dsq().

No functional changes intended.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
2024-09-09 13:42:47 -10:00
Tejun Heo
4d3ca89bdd sched_ext: Refactor consume_remote_task()
The tricky p->scx.holding_cpu handling was split across
consume_remote_task() body and move_task_to_local_dsq(). Refactor such that:

- All the tricky part is now in the new unlink_dsq_and_lock_src_rq() with
  consolidated documentation.

- move_task_to_local_dsq() now implements straightforward task migration
  making it easier to use in other places.

- dispatch_to_local_dsq() is another user move_task_to_local_dsq(). The
  usage is updated accordingly. This makes the local and remote cases more
  symmetric.

No functional changes intended.

v2: s/task_rq/src_rq/ for consistency.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
2024-09-09 13:42:47 -10:00
Tejun Heo
fdaedba2f9 sched_ext: Rename scx_kfunc_set_sleepable to unlocked and relocate
Sleepables don't need to be in its own kfunc set as each is tagged with
KF_SLEEPABLE. Rename to scx_kfunc_set_unlocked indicating that rq lock is
not held and relocate right above the any set. This will be used to add
kfuncs that are allowed to be called from SYSCALL but not TRACING.

No functional changes intended.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
2024-09-09 13:42:47 -10:00
Kinsey Ho
0e40cf2a8b cgroup: clarify css sibling linkage is protected by cgroup_mutex or RCU
Patch series "Improve mem_cgroup_iter()", v4.

Incremental cgroup iteration is being used again [1]. This patchset
improves the reliability of mem_cgroup_iter(). It also improves
simplicity and code readability.

[1] https://lore.kernel.org/20240514202641.2821494-1-hannes@cmpxchg.org/


This patch (of 5):

Explicitly document that css sibling/descendant linkage is protected by
cgroup_mutex or RCU.  Also, document in css_next_descendant_pre() and
similar functions that it isn't necessary to hold a ref on @pos.

The following changes in this patchset rely on this clarification for
simplification in memcg iteration code.

Link: https://lkml.kernel.org/r/20240905003058.1859929-1-kinseyho@google.com
Link: https://lkml.kernel.org/r/20240905003058.1859929-2-kinseyho@google.com
Suggested-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Kinsey Ho <kinseyho@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: T.J. Mercier <tjmercier@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09 16:39:15 -07:00
Sven Schnelle
08e28de116 uprobes: use vm_special_mapping close() functionality
The following KASAN splat was shown:

[   44.505448] ==================================================================                                                                      20:37:27 [3421/145075]
[   44.505455] BUG: KASAN: slab-use-after-free in special_mapping_close+0x9c/0xc8
[   44.505471] Read of size 8 at addr 00000000868dac48 by task sh/1384
[   44.505479]
[   44.505486] CPU: 51 UID: 0 PID: 1384 Comm: sh Not tainted 6.11.0-rc6-next-20240902-dirty #1496
[   44.505503] Hardware name: IBM 3931 A01 704 (z/VM 7.3.0)
[   44.505508] Call Trace:
[   44.505511]  [<000b0324d2f78080>] dump_stack_lvl+0xd0/0x108
[   44.505521]  [<000b0324d2f5435c>] print_address_description.constprop.0+0x34/0x2e0
[   44.505529]  [<000b0324d2f5464c>] print_report+0x44/0x138
[   44.505536]  [<000b0324d1383192>] kasan_report+0xc2/0x140
[   44.505543]  [<000b0324d2f52904>] special_mapping_close+0x9c/0xc8
[   44.505550]  [<000b0324d12c7978>] remove_vma+0x78/0x120
[   44.505557]  [<000b0324d128a2c6>] exit_mmap+0x326/0x750
[   44.505563]  [<000b0324d0ba655a>] __mmput+0x9a/0x370
[   44.505570]  [<000b0324d0bbfbe0>] exit_mm+0x240/0x340
[   44.505575]  [<000b0324d0bc0228>] do_exit+0x548/0xd70
[   44.505580]  [<000b0324d0bc1102>] do_group_exit+0x132/0x390
[   44.505586]  [<000b0324d0bc13b6>] __s390x_sys_exit_group+0x56/0x60
[   44.505592]  [<000b0324d0adcbd6>] do_syscall+0x2f6/0x430
[   44.505599]  [<000b0324d2f78434>] __do_syscall+0xa4/0x170
[   44.505606]  [<000b0324d2f9454c>] system_call+0x74/0x98
[   44.505614]
[   44.505616] Allocated by task 1384:
[   44.505621]  kasan_save_stack+0x40/0x70
[   44.505630]  kasan_save_track+0x28/0x40
[   44.505636]  __kasan_kmalloc+0xa0/0xc0
[   44.505642]  __create_xol_area+0xfa/0x410
[   44.505648]  get_xol_area+0xb0/0xf0
[   44.505652]  uprobe_notify_resume+0x27a/0x470
[   44.505657]  irqentry_exit_to_user_mode+0x15e/0x1d0
[   44.505664]  pgm_check_handler+0x122/0x170
[   44.505670]
[   44.505672] Freed by task 1384:
[   44.505676]  kasan_save_stack+0x40/0x70
[   44.505682]  kasan_save_track+0x28/0x40
[   44.505687]  kasan_save_free_info+0x4a/0x70
[   44.505693]  __kasan_slab_free+0x5a/0x70
[   44.505698]  kfree+0xe8/0x3f0
[   44.505704]  __mmput+0x20/0x370
[   44.505709]  exit_mm+0x240/0x340
[   44.505713]  do_exit+0x548/0xd70
[   44.505718]  do_group_exit+0x132/0x390
[   44.505722]  __s390x_sys_exit_group+0x56/0x60
[   44.505727]  do_syscall+0x2f6/0x430
[   44.505732]  __do_syscall+0xa4/0x170
[   44.505738]  system_call+0x74/0x98

The problem is that uprobe_clear_state() kfree's struct xol_area, which
contains struct vm_special_mapping *xol_mapping. This one is passed to
_install_special_mapping() in xol_add_vma().
__mput reads:

static inline void __mmput(struct mm_struct *mm)
{
        VM_BUG_ON(atomic_read(&mm->mm_users));

        uprobe_clear_state(mm);
        exit_aio(mm);
        ksm_exit(mm);
        khugepaged_exit(mm); /* must run before exit_mmap */
        exit_mmap(mm);
        ...
}

So uprobe_clear_state() in the beginning free's the memory area
containing the vm_special_mapping data, but exit_mmap() uses this
address later via vma->vm_private_data (which was set in
_install_special_mapping().

Fix this by moving uprobe_clear_state() to uprobes.c and use it as
close() callback.

[usama.anjum@collabora.com: remove unneeded condition]
  Link: https://lkml.kernel.org/r/20240906101825.177490-1-usama.anjum@collabora.com
Link: https://lkml.kernel.org/r/20240903073629.2442754-1-svens@linux.ibm.com
Fixes: 223febc6e5 ("mm: add optional close() to struct vm_special_mapping")
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09 16:39:14 -07:00
Tejun Heo
3ac352797c sched_ext: Add missing static to scx_dump_data
scx_dump_data is only used inside ext.c but doesn't have static. Add it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202409070218.RB5WsQ07-lkp@intel.com/
2024-09-09 13:34:33 -10:00
Maxim Mikityanskiy
bee109b7b3 bpf: Fix error message on kfunc arg type mismatch
When "arg#%d expected pointer to ctx, but got %s" error is printed, both
template parts actually point to the type of the argument, therefore, it
will also say "but got PTR", regardless of what was the actual register
type.

Fix the message to print the register type in the second part of the
template, change the existing test to adapt to the new format, and add a
new test to test the case when arg is a pointer to context, but reg is a
scalar.

Fixes: 00b85860fe ("bpf: Rewrite kfunc argument handling")
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/bpf/20240909133909.1315460-1-maxim@isovalent.com
2024-09-09 15:58:17 -07:00
Andy Shevchenko
4e378158e5 tracing: Drop unused helper function to fix the build
A helper function defined but not used. This, in particular,
prevents kernel builds with clang, `make W=1` and CONFIG_WERROR=y:

kernel/trace/trace.c:2229:19: error: unused function 'run_tracer_selftest' [-Werror,-Wunused-function]
 2229 | static inline int run_tracer_selftest(struct tracer *type)
      |                   ^~~~~~~~~~~~~~~~~~~

Fix this by dropping unused functions.

See also commit 6863f5643d ("kbuild: allow Clang to find unused static
inline functions for W=1 build").

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Bill Wendling <morbo@google.com>
Cc: Justin Stitt <justinstitt@google.com>
Link: https://lore.kernel.org/20240909105314.928302-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-09-09 16:04:25 -04:00
Steven Rostedt
af17814334 tracing/osnoise: Fix build when timerlat is not enabled
To fix some critical section races, the interface_lock was added to a few
locations. One of those locations was above where the interface_lock was
declared, so the declaration was moved up before that usage.
Unfortunately, where it was placed was inside a CONFIG_TIMERLAT_TRACER
ifdef block. As the interface_lock is used outside that config, this broke
the build when CONFIG_OSNOISE_TRACER was enabled but
CONFIG_TIMERLAT_TRACER was not.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Helena Anna" <helena.anna.dubel@intel.com>
Cc: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com>
Cc: Tomas Glozar <tglozar@redhat.com>
Link: https://lore.kernel.org/20240909103231.23a289e2@gandalf.local.home
Fixes: e6a53481da ("tracing/timerlat: Only clear timer if a kthread exists")
Reported-by: "Bityutskiy, Artem" <artem.bityutskiy@intel.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-09-09 16:03:57 -04:00
Yu Liao
3e5b2e81f1 printk: Export match_devname_and_update_preferred_console()
When building serial_base as a module, modpost fails with the following
error message:

  ERROR: modpost: "match_devname_and_update_preferred_console"
  [drivers/tty/serial/serial_base.ko] undefined!

Export the symbol to allow using it from modules.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202409071312.qlwtTOS1-lkp@intel.com/
Fixes: 12c91cec31 ("serial: core: Add serial_base_match_and_update_preferred_console()")
Signed-off-by: Yu Liao <liaoyu15@huawei.com>
Link: https://lore.kernel.org/r/20240909075652.747370-1-liaoyu15@huawei.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-09-09 17:35:06 +02:00
Anna-Maria Behnsen
bd7c8ff9fe treewide: Fix wrong singular form of jiffies in comments
There are several comments all over the place, which uses a wrong singular
form of jiffies.

Replace 'jiffie' by 'jiffy'. No functional change.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k
Link: https://lore.kernel.org/all/20240904-devel-anna-maria-b4-timers-flseep-v1-3-e98760256370@linutronix.de
2024-09-08 20:47:40 +02:00
Anna-Maria Behnsen
662a1bfb90 cpu: Use already existing usleep_range()
usleep_range() is a wrapper arount usleep_range_state() which hands in
TASK_UNTINTERRUPTIBLE as state argument.

Use already exising wrapper usleep_range(). No functional change.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20240904-devel-anna-maria-b4-timers-flseep-v1-2-e98760256370@linutronix.de
2024-09-08 20:47:40 +02:00
Anna-Maria Behnsen
fe90c5ba88 timers: Rename next_expiry_recalc() to be unique
next_expiry_recalc is the name of a function as well as the name of a
struct member of struct timer_base. This might lead to confusion.

Rename next_expiry_recalc() to timer_recalc_next_expiry(). No functional
change.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20240904-devel-anna-maria-b4-timers-flseep-v1-1-e98760256370@linutronix.de
2024-09-08 20:47:40 +02:00
Neeraj Upadhyay
355debb83b Merge branches 'context_tracking.15.08.24a', 'csd.lock.15.08.24a', 'nocb.09.09.24a', 'rcutorture.14.08.24a', 'rcustall.09.09.24a', 'srcu.12.08.24a', 'rcu.tasks.14.08.24a', 'rcu_scaling_tests.15.08.24a', 'fixes.12.08.24a' and 'misc.11.08.24a' into next.09.09.24a 2024-09-09 00:09:47 +05:30
Paul E. McKenney
1ecd9d68eb rcu: Defer printing stall-warning backtrace when holding rcu_node lock
The rcu_dump_cpu_stacks() holds the leaf rcu_node structure's ->lock
when dumping the stakcks of any CPUs stalling the current grace period.
This lock is held to prevent confusion that would otherwise occur when
the stalled CPU reported its quiescent state (and then went on to do
unrelated things) just as the backtrace NMI was heading towards it.

This has worked well, but on larger systems has recently been observed
to cause severe lock contention resulting in CSD-lock stalls and other
general unhappiness.

This commit therefore does printk_deferred_enter() before acquiring
the lock and printk_deferred_exit() after releasing it, thus deferring
the overhead of actually outputting the stack trace out of that lock's
critical section.

Reported-by: Rik van Riel <riel@surriel.com>
Suggested-by: Rik van Riel <riel@surriel.com>
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-09-09 00:06:44 +05:30
Frederic Weisbecker
7562eed272 rcu/nocb: Remove superfluous memory barrier after bypass enqueue
Pre-GP accesses performed by the update side must be ordered against
post-GP accesses performed by the readers. This is ensured by the
bypass or nocb locking on enqueue time, followed by the fully ordered
rnp locking initiated while callbacks are accelerated, and then
propagated throughout the whole GP lifecyle associated with the
callbacks.

Therefore the explicit barrier advertizing ordering between bypass
enqueue and rcuo wakeup is superfluous. If anything, it would even only
order the first bypass callback enqueue against the rcuo wakeup and
ignore all the subsequent ones.

Remove the needless barrier.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-09-09 00:05:26 +05:30
Frederic Weisbecker
1b022b8763 rcu/nocb: Conditionally wake up rcuo if not already waiting on GP
A callback enqueuer currently wakes up the rcuo kthread if it is adding
the first non-done callback of a CPU, whether the kthread is waiting on
a grace period or not (unless the CPU is offline).

This looks like a desired behaviour because then the rcuo kthread
doesn't wait for the end of the current grace period to handle the
callback. It is accelerated right away and assigned to the next grace
period. The GP kthread is notified about that fact and iterates with
the upcoming GP without sleeping in-between.

However this best-case scenario is contradicted by a few details,
depending on the situation:

1) If the callback is a non-bypass one queued with IRQs enabled, the
   wake up only occurs if no other pending callbacks are on the list.
   Therefore the theoretical "optimization" actually applies on rare
   occasions.

2) If the callback is a non-bypass one queued with IRQs disabled, the
   situation is similar with even more uncertainty due to the deferred
   wake up.

3) If the callback is lazy, a few jiffies don't make any difference.

4) If the callback is bypass, the wake up timer is programmed 2 jiffies
   ahead by rcuo in case the regular pending queue has been handled
   in the meantime. The rare storm of callbacks can otherwise wait for
   the currently elapsing grace period to be flushed and handled.

For all those reasons, the optimization is only theoretical and
occasional. Therefore it is reasonable that callbacks enqueuers only
wake up the rcuo kthread when it is not already waiting on a grace
period to complete.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-09-09 00:05:26 +05:30
Frederic Weisbecker
9139f93209 rcu/nocb: Fix RT throttling hrtimer armed from offline CPU
After a CPU is marked offline and until it reaches its final trip to
idle, rcuo has several opportunities to be woken up, either because
a callback has been queued in the meantime or because
rcutree_report_cpu_dead() has issued the final deferred NOCB wake up.

If RCU-boosting is enabled, RCU kthreads are set to SCHED_FIFO policy.
And if RT-bandwidth is enabled, the related hrtimer might be armed.
However this then happens after hrtimers have been migrated at the
CPUHP_AP_HRTIMERS_DYING stage, which is broken as reported by the
following warning:

 Call trace:
  enqueue_hrtimer+0x7c/0xf8
  hrtimer_start_range_ns+0x2b8/0x300
  enqueue_task_rt+0x298/0x3f0
  enqueue_task+0x94/0x188
  ttwu_do_activate+0xb4/0x27c
  try_to_wake_up+0x2d8/0x79c
  wake_up_process+0x18/0x28
  __wake_nocb_gp+0x80/0x1a0
  do_nocb_deferred_wakeup_common+0x3c/0xcc
  rcu_report_dead+0x68/0x1ac
  cpuhp_report_idle_dead+0x48/0x9c
  do_idle+0x288/0x294
  cpu_startup_entry+0x34/0x3c
  secondary_start_kernel+0x138/0x158

Fix this with waking up rcuo using an IPI if necessary. Since the
existing API to deal with this situation only handles swait queue, rcuo
is only woken up from offline CPUs if it's not already waiting on a
grace period. In the worst case some callbacks will just wait for a
grace period to complete before being assigned to a subsequent one.

Reported-by: "Cheng-Jui Wang (王正睿)" <Cheng-Jui.Wang@mediatek.com>
Fixes: 5c0930ccaa ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-09-09 00:05:26 +05:30
Frederic Weisbecker
1fcb932c8b rcu/nocb: Simplify (de-)offloading state machine
Now that the (de-)offloading process can only apply to offline CPUs,
there is no more concurrency between rcu_core and nocb kthreads. Also
the mutation now happens on empty queues.

Therefore the state machine can be reduced to a single bit called
SEGCBLIST_OFFLOADED. Simplify the transition as follows:

* Upon offloading: queue the rdp to be added to the rcuog list and
  wait for the rcuog kthread to set the SEGCBLIST_OFFLOADED bit. Unpark
  rcuo kthread.

* Upon de-offloading: Park rcuo kthread. Queue the rdp to be removed
  from the rcuog list and wait for the rcuog kthread to clear the
  SEGCBLIST_OFFLOADED bit.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-09-09 00:03:55 +05:30
Linus Torvalds
e20398877b - Fix perf's AUX buffer serialization
- Prevent uninitialized struct members in perf's uprobes handling
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmbdaUMACgkQEsHwGGHe
 VUoo5hAAkDYx/gqFiU4Zqr4EXu6mfG5qFRnSE5PMsgGYDt1gE+dY6Xugs5vYa7uh
 AzzqcFLw46ZbrOjXv359WBxljYMQCnFI9SbP/1pAYqtUs1X1q3bMl6iuYbHU8DkB
 NHaSCmcyxPBLANezxka554pg0Yqsb/ME4tnxomVH65GosgfG4dxCOpGB8S1jB9Wt
 g8TeXn+pEYwn50wFOTA2MTy+OtwcJZxl1cPRLhJGywY20znJrU0OAFTySdZeAfjm
 3ekMau9coXErmETsiTj5+B6ornWfCvGgYMFpZxj4lkWppJEoxEovzOauUSgkxEjZ
 qM056212tqfTYHVC6SO70mKkRcGQBD3FEQFi7+Ugv9GVIhzML5UN9z0eIKCNvcvU
 dWTCaFPPG1/WwlsKXKaaCJkvt6f+rGuL2zdyZczeiiKlcyvuABSZv/9DscxmhQUh
 5n2ZfigNXTnjUj0c2LxjBuXFmHrdbLnz5IGVr/9Ux0euXSBWJR+1HNoGpWTSHFWy
 aHioF3rgPHMvV0YVzpzb5Arz+ldUEV+ymHwtWOGuxGAtyk7SydpkbKEqZ1AYXyUX
 FEeRP/ryYw8FxTOvsNvpB85X24YDG/LrUgJdX7fbYeZjlm6Nd8IpU8LKdyLTmhmg
 YuIENCa+U6RQZd1dsRW4SqdOuackRyjH4pcQqZsg5i4nNczH+Z4=
 =Alrr
 -----END PGP SIGNATURE-----

Merge tag 'perf_urgent_for_v6.11_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fixes from Borislav Petkov:

 - Fix perf's AUX buffer serialization

 - Prevent uninitialized struct members in perf's uprobes handling

* tag 'perf_urgent_for_v6.11_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/aux: Fix AUX buffer serialization
  uprobes: Use kzalloc to allocate xol area
2024-09-08 10:20:44 -07:00
Costa Shulyupin
a6fe30d1e3 genirq: Use cpumask_intersects()
Replace `cpumask_any_and(a, b) >= nr_cpu_ids` and `cpumask_any_and(a, b) <
nr_cpu_ids` with the more readable `!cpumask_intersects(a, b)` and
`cpumask_intersects(a, b)`

Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240906170142.1135207-1-costa.shul@redhat.com
2024-09-08 16:06:51 +02:00
Tejun Heo
02e65e1c12 sched_ext: Add missing static to scx_has_op[]
scx_has_op[] is only used inside ext.c but doesn't have static. Add it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202409062337.m7qqI88I-lkp@intel.com/
2024-09-06 08:18:55 -10:00
Tejun Heo
da330f5e4c sched_ext: Temporarily work around pick_task_scx() being called without balance_scx()
pick_task_scx() must be preceded by balance_scx() but there currently is a
bug where fair could say yes on balance() but no on pick_task(), which then
ends up calling pick_task_scx() without preceding balance_scx(). Work around
by dropping WARN_ON_ONCE() and ignoring cases which don't make sense.

This isn't great and can theoretically lead to stalls. However, for
switch_all cases, this happens only while a BPF scheduler is being loaded or
unloaded, and, for partial cases, fair will likely keep triggering this CPU.

This will be reverted once the fair behavior is fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
2024-09-06 08:17:09 -10:00
Thomas Gleixner
fe513c2ef0 static_call: Replace pointless WARN_ON() in static_call_module_notify()
static_call_module_notify() triggers a WARN_ON(), when memory allocation
fails in __static_call_add_module().

That's not really justified, because the failure case must be correctly
handled by the well known call chain and the error code is passed
through to the initiating userspace application.

A memory allocation fail is not a fatal problem, but the WARN_ON() takes
the machine out when panic_on_warn is set.

Replace it with a pr_warn().

Fixes: 9183c3f9ed ("static_call: Add inline static call infrastructure")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/8734mf7pmb.ffs@tglx
2024-09-06 16:29:22 +02:00
Thomas Gleixner
4b30051c48 static_call: Handle module init failure correctly in static_call_del_module()
Module insertion invokes static_call_add_module() to initialize the static
calls in a module. static_call_add_module() invokes __static_call_init(),
which allocates a struct static_call_mod to either encapsulate the built-in
static call sites of the associated key into it so further modules can be
added or to append the module to the module chain.

If that allocation fails the function returns with an error code and the
module core invokes static_call_del_module() to clean up eventually added
static_call_mod entries.

This works correctly, when all keys used by the module were converted over
to a module chain before the failure. If not then static_call_del_module()
causes a #GP as it blindly assumes that key::mods points to a valid struct
static_call_mod.

The problem is that key::mods is not a individual struct member of struct
static_call_key, it's part of a union to save space:

        union {
                /* bit 0: 0 = mods, 1 = sites */
                unsigned long type;
                struct static_call_mod *mods;
                struct static_call_site *sites;
	};

key::sites is a pointer to the list of built-in usage sites of the static
call. The type of the pointer is differentiated by bit 0. A mods pointer
has the bit clear, the sites pointer has the bit set.

As static_call_del_module() blidly assumes that the pointer is a valid
static_call_mod type, it fails to check for this failure case and
dereferences the pointer to the list of built-in call sites, which is
obviously bogus.

Cure it by checking whether the key has a sites or a mods pointer.

If it's a sites pointer then the key is not to be touched. As the sites are
walked in the same order as in __static_call_init() the site walk can be
terminated because all subsequent sites have not been touched by the init
code due to the error exit.

If it was converted before the allocation fail, then the inner loop which
searches for a module match will find nothing.

A fail in the second allocation in __static_call_init() is harmless and
does not require special treatment. The first allocation succeeded and
converted the key to a module chain. That first entry has mod::mod == NULL
and mod::next == NULL, so the inner loop of static_call_del_module() will
neither find a module match nor a module chain. The next site in the walk
was either already converted, but can't match the module, or it will exit
the outer loop because it has a static_call_site pointer and not a
static_call_mod pointer.

Fixes: 9183c3f9ed ("static_call: Add inline static call infrastructure")
Closes: https://lore.kernel.org/all/20230915082126.4187913-1-ruanjinjie@huawei.com
Reported-by: Jinjie Ruan <ruanjinjie@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Jinjie Ruan <ruanjinjie@huawei.com>
Link: https://lore.kernel.org/r/87zfon6b0s.ffs@tglx
2024-09-06 16:29:21 +02:00
Costa Shulyupin
87b5a153b8 genirq/cpuhotplug: Use cpumask_intersects()
Replace `cpumask_any_and(a, b) >= nr_cpu_ids`
with the more readable `!cpumask_intersects(a, b)`.

[ tglx: Massaged change log ]

Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/all/20240904134823.777623-2-costa.shul@redhat.com
2024-09-06 16:28:39 +02:00