Commit Graph

45226 Commits

Author SHA1 Message Date
Thomas Gleixner
e989424899 genirq/msi: Remove platform MSI leftovers
No more users!

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Shivamurthy Shastri <shivamurthy.shastri@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20240623142235.943295676@linutronix.de
2024-07-18 20:31:21 +02:00
Thomas Gleixner
f944ffcbc2 watchdog/perf: properly initialize the turbo mode timestamp and rearm counter
For systems on which the performance counter can expire early due to turbo
modes the watchdog handler has a safety net in place which validates that
since the last watchdog event there has at least 4/5th of the watchdog
period elapsed.

This works reliably only after the first watchdog event because the per
CPU variable which holds the timestamp of the last event is never
initialized.

So a first spurious event will validate against a timestamp of 0 which
results in a delta which is likely to be way over the 4/5 threshold of the
period.  As this might happen before the first watchdog hrtimer event
increments the watchdog counter, this can lead to false positives.

Fix this by initializing the timestamp before enabling the hardware event.
Reset the rearm counter as well, as that might be non zero after the
watchdog was disabled and reenabled.

Link: https://lkml.kernel.org/r/87frsfu15a.ffs@tglx
Fixes: 7edaeb6841 ("kernel/watchdog: Prevent false positives with turbo modes")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-17 21:11:34 -07:00
Vlastimil Babka
53dabce265 mm, page_alloc: put should_fail_alloc_page() back behing CONFIG_FAIL_PAGE_ALLOC
This mostly reverts commit af3b854492 ("mm/page_alloc.c: allow error
injection").  The commit made should_fail_alloc_page() a noinline function
that's always called from the page allocation hotpath, even if it's empty
because CONFIG_FAIL_PAGE_ALLOC is not enabled, and there is no option to
disable it and prevent the associated function call overhead.

As with the preceding patch "mm, slab: put should_failslab back behind
CONFIG_SHOULD_FAILSLAB" and for the same reasons, put the
should_fail_alloc_page() back behind the config option.  When enabled, the
ALLOW_ERROR_INJECTION and BTF_ID records are preserved so it's not a
complete revert.

Link: https://lkml.kernel.org/r/20240711-b4-fault-injection-reverts-v1-2-9e2651945d68@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@fomichev.me>
Cc: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-17 21:05:18 -07:00
Vlastimil Babka
a7526fe8b9 mm, slab: put should_failslab() back behind CONFIG_SHOULD_FAILSLAB
Patch series "revert unconditional slab and page allocator fault injection
calls".

These two patches largely revert commits that added function call overhead
into slab and page allocation hotpaths and that cannot be currently
disabled even though related CONFIG_ options do exist.

A much more involved solution that can keep the callsites always existing
but hidden behind a static key if unused, is possible [1] and can be
pursued by anyone who believes it's necessary.  Meanwhile the fact the
should_failslab() error injection is already not functional on kernels
built with current gcc without anyone noticing [2], and lukewarm response
to [1] suggests the need is not there.  I believe it will be more fair to
have the state after this series as a baseline for possible further
optimisation, instead of the unconditional overhead.

For example a possible compromise for anyone who's fine with an empty
function call overhead but not the full CONFIG_FAILSLAB /
CONFIG_FAIL_PAGE_ALLOC overhead is to reuse patch 1 from [1] but insert a
static key check only inside should_failslab() and
should_fail_alloc_page() before performing the more expensive checks.

[1] https://lore.kernel.org/all/20240620-fault-injection-statickeys-v2-0-e23947d3d84b@suse.cz/#t
[2] https://github.com/bpftrace/bpftrace/issues/3258


This patch (of 2):

This mostly reverts commit 4f6923fbb3 ("mm: make should_failslab always
available for fault injection").  The commit made should_failslab() a
noinline function that's always called from the slab allocation hotpath,
even if it's empty because CONFIG_SHOULD_FAILSLAB is not enabled, and
there is no option to disable that call.  This is visible in profiles and
the function call overhead can be noticeable especially with cpu
mitigations.

Meanwhile the bpftrace program example in the commit silently does not
work without CONFIG_SHOULD_FAILSLAB anyway with a recent gcc, because the
empty function gets a .constprop clone that is actually being called
(uselessly) from the slab hotpath, while the error injection is hooked to
the original function that's not being called at all [1].

Thus put the whole should_failslab() function back behind
CONFIG_SHOULD_FAILSLAB.  It's not a complete revert of 4f6923fbb3 - the
int return type that returns -ENOMEM on failure is preserved, as well
ALLOW_ERROR_INJECTION annotation.  The BTF_ID() record that was meanwhile
added is also guarded by CONFIG_SHOULD_FAILSLAB.

[1] https://github.com/bpftrace/bpftrace/issues/3258

Link: https://lkml.kernel.org/r/20240711-b4-fault-injection-reverts-v1-0-9e2651945d68@suse.cz
Link: https://lkml.kernel.org/r/20240711-b4-fault-injection-reverts-v1-1-9e2651945d68@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@fomichev.me>
Cc: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-17 21:05:18 -07:00
Linus Torvalds
51835949dd Networking changes for 6.11. Not much excitement - a handful of large
patchsets (devmem among them) did not make it in time.
 
 Core & protocols
 ----------------
 
  - Use local_lock in addition to local_bh_disable() to protect per-CPU
    resources in networking, a step closer for local_bh_disable() not
    to act as a big lock on PREEMPT_RT.
 
  - Use flex array for netdevice priv area, ensure its cache alignment.
 
  - Add a sysctl knob to allow user to specify a default rto_min at socket
    init time. Bit of a big hammer but multiple companies were
    independently carrying such patch downstream so clearly it's useful.
 
  - Support scheduling transmission of packets based on CLOCK_TAI.
 
  - Un-pin TCP TIMEWAIT timer to avoid it firing on CPUs later cordoned off
    using cpusets.
 
  - Support multiple L2TPv3 UDP tunnels using the same 5-tuple address.
 
  - Allow configuration of multipath hash seed, to both allow synchronizing
    hashing of two routers, and preventing partial accidental sync.
 
  - Improve TCP compliance with RFC 9293 for simultaneous connect().
 
  - Support sending NAT keepalives in IPsec ESP in UDP states. Userspace
    IKE daemon had to do this before, but the kernel can better keep
    track of it.
 
  - Support sending supervision HSR frames with MAC addresses stored in
    ProxyNodeTable when RedBox (i.e. HSR-SAN) is enabled.
 
  - Introduce IPPROTO_SMC for selecting SMC when socket is created.
 
  - Allow UDP GSO transmit from devices with no checksum offload.
 
  - openvswitch: add packet sampling via psample, separating the sampled
    traffic from "upcall" packets sent to user space for forwarding.
 
  - nf_tables: shrink memory consumption for transaction objects.
 
 Things we sprinkled into general kernel code
 --------------------------------------------
 
  - Power Sequencing subsystem (used by Qualcomm Bluetooth driver
    for QCA6390).
 
  - Add IRQ information in sysfs for auxiliary bus.
 
  - Introduce guard definition for local_lock.
 
  - Add aligned flavor of __cacheline_group_{begin, end}() markings for
    grouping fields in structures.
 
 BPF
 ---
 
  - Notify user space (via epoll) when a struct_ops object is getting
    detached/unregistered.
 
  - Add new kfuncs for a generic, open-coded bits iterator.
 
  - Enable BPF programs to declare arrays of kptr, bpf_rb_root, and
    bpf_list_head.
 
  - Support resilient split BTF which cuts down on duplication and makes
    BTF as compact as possible WRT BTF from modules.
 
  - Add support for dumping kfunc prototypes from BTF which enables both
    detecting as well as dumping compilable prototypes for kfuncs.
 
  - riscv64 BPF JIT improvements in particular to add 12-argument support
    for BPF trampolines and to utilize bpf_prog_pack for the latter.
 
  - Add the capability to offload the netfilter flowtable in XDP layer
    through kfuncs.
 
 Driver API
 ----------
 
  - Allow users to configure IRQ tresholds between which automatic IRQ
    moderation can choose.
 
  - Expand Power Sourcing (PoE) status with power, class and failure
    reason. Support setting power limits.
 
  - Track additional RSS contexts in the core, make sure configuration
    changes don't break them.
 
  - Support IPsec crypto offload for IPv6 ESP and IPv4 UDP-encapsulated ESP
    data paths.
 
  - Support updating firmware on SFP modules.
 
 Tests and tooling
 -----------------
 
  - mptcp: use net/lib.sh to manage netns.
 
  - TCP-AO and TCP-MD5: replace debug prints used by tests with
    tracepoints.
 
  - openvswitch: make test self-contained (don't depend on OvS CLI tools).
 
 Drivers
 -------
 
  - Ethernet high-speed NICs:
    - Broadcom (bnxt):
      - increase the max total outstanding PTP TX packets to 4
      - add timestamping statistics support
      - implement netdev_queue_mgmt_ops
      - support new RSS context API
    - Intel (100G, ice, idpf):
      - implement FEC statistics and dumping signal quality indicators
      - support E825C products (with 56Gbps PHYs)
    - nVidia/Mellanox:
      - support HW-GRO
      - mlx4/mlx5: support per-queue statistics via netlink
      - obey the max number of EQs setting in sub-functions
    - AMD/Solarflare:
      - support new RSS context API
    - AMD/Pensando:
      - ionic: rework fix for doorbell miss to lower overhead
        and skip it on new HW
    - Wangxun:
      - txgbe: support Flow Director perfect filters
 
  - Ethernet NICs consumer, embedded and virtual:
    - Add driver for Tehuti Networks TN40xx chips
    - Add driver for Meta's internal NIC chips
    - Add driver for Ethernet MAC on Airoha EN7581 SoCs
    - Add driver for Renesas Ethernet-TSN devices
    - Google cloud vNIC:
      - flow steering support
    - Microsoft vNIC:
      - support page sizes other than 4KB on ARM64
    - vmware vNIC:
      - support latency measurement (update to version 9)
    - VirtIO net:
      - support for Byte Queue Limits
      - support configuring thresholds for automatic IRQ moderation
      - support for AF_XDP Rx zero-copy
    - Synopsys (stmmac):
      - support for STM32MP13 SoC
      - let platforms select the right PCS implementation
    - TI:
      - icssg-prueth: add multicast filtering support
      - icssg-prueth: enable PTP timestamping and PPS
    - Renesas:
      - ravb: improve Rx performance 30-400% by using page pool,
        theaded NAPI and timer-based IRQ coalescing
      - ravb: add MII support for R-Car V4M
    - Cadence (macb):
      - macb: add ARP support to Wake-On-LAN
    - Cortina:
      - use phylib for RX and TX pause configuration
 
  - Ethernet switches:
    - nVidia/Mellanox:
      - support configuration of multipath hash seed
      - report more accurate max MTU
      - use page_pool to improve Rx performance
    - MediaTek:
      - mt7530: add support for bridge port isolation
    - Qualcomm:
      - qca8k: add support for bridge port isolation
    - Microchip:
      - lan9371/2: add 100BaseTX PHY support
    - NXP:
      - vsc73xx: implement VLAN operations
 
  - Ethernet PHYs:
    - aquantia: enable support for aqr115c
    - aquantia: add support for PHY LEDs
    - realtek: add support for rtl8224 2.5Gbps PHY
    - xpcs: add memory-mapped device support
    - add BroadR-Reach link mode and support in Broadcom's PHY driver
 
  - CAN:
    - add document for ISO 15765-2 protocol support
    - mcp251xfd: workaround for erratum DS80000789E, use timestamps
      to catch when device returns incorrect FIFO status
 
  - WiFi:
    - mac80211/cfg80211:
      - parse Transmit Power Envelope (TPE) data in mac80211 instead of
        in drivers
      - improvements for 6 GHz regulatory flexibility
      - multi-link improvements
      - support multiple radios per wiphy
      - remove DEAUTH_NEED_MGD_TX_PREP flag
    - Intel (iwlwifi):
      - bump FW API to 91 for BZ/SC devices
      - report 64-bit radiotap timestamp
      - enable P2P low latency by default
      - handle Transmit Power Envelope (TPE) advertised by AP
      - remove support for older FW for new devices
      - fast resume (keeping the device configured)
      - mvm: re-enable Multi-Link Operation (MLO)
      - aggregation (A-MSDU) optimizations
    - MediaTek (mt76):
      - mt7925 Multi-Link Operation (MLO) support
    - Qualcomm (ath10k):
      - LED support for various chipsets
    - Qualcomm (ath12k):
      - remove unsupported Tx monitor handling
      - support channel 2 in 6 GHz band
      - support Spatial Multiplexing Power Save (SMPS) in 6 GHz band
      - supprt multiple BSSID (MBSSID) and Enhanced Multi-BSSID
        Advertisements (EMA)
      - support dynamic VLAN
      - add panic handler for resetting the firmware state
      - DebugFS support for datapath statistics
      - WCN7850: support for Wake on WLAN
    - Microchip (wilc1000):
      - read MAC address during probe to make it visible to user space
      - suspend/resume improvements
    - TI (wl18xx):
      - support newer firmware versions
    - RealTek (rtw89):
      - preparation for RTL8852BE-VT support
      - Wake on WLAN support for WiFi 6 chips
      - 36-bit PCI DMA support
    - RealTek (rtlwifi):
      - RTL8192DU support
    - Broadcom (brcmfmac):
      - Management Frame Protection support (to enable WPA3)
 
  - Bluetooth:
    - qualcomm: use the power sequencer for QCA6390
    - btusb: mediatek: add ISO data transmission functions
    - hci_bcm4377: add BCM4388 support
    - btintel: add support for BlazarU core
    - btintel: add support for Whale Peak2
    - btnxpuart: add support for AW693 A1 chipset
    - btnxpuart: add support for IW615 chipset
    - btusb: add Realtek RTL8852BE support ID 0x13d3:0x3591
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmaWjBwACgkQMUZtbf5S
 IrvuSRAAkJuEzTRqgURBCe4eNEQde6mJJig7l2CKHwCbFiHZpRkFHf8qKbcGWbL6
 uLW33SWnKtJVDhxVKWHLq635XW7BAa80YhqGw21GDi+mIEhWXZglHj3xbXNxsMfE
 4eg/kG4BkfYWFmHaXOwVWV/mr7nXf6j7WmXNeXEi32ufE1j0OL+YlQenKnMj8yP2
 j9JmYa2Chwppng1SblHmcjmGkdNVwFhStKeCG+2K7v06wdDH/QYBlbgUv9gw/cxp
 NlW//wgiaeX40U4O3kDwt9C+LDoh+0VrDDeVdQ+IsScLtY3PhAzEoKolFYTq2HSr
 I1JpoaHNnyNsJq3DZrACQ5WlH4yDn6C2EUB6dxNnFaI9F1ZPsi+7MTl6Sei1AklD
 TuQTj/lxOACBwW2Q77NU72uoxiIUauesGPHcnrAFuoCIEhZF0mso7k59BvrXhsOP
 QwcLbQdc1YHNkqv/Vc7NBY+ruMsYB+5Ubbhhj2p27dp/CWFIwxI29fze4dn2uhO6
 ejHN3mbqwPdSzg12YJtM6Iq61Cnwo2eVSvhTxl+ZVSZtI4nu2arzR+y7QTYmNrXP
 6tkgVN9UsWeLl2xJ8wyyqL5mcvNHP2rPXWZ2X56iTaa26m+UlleeQ7YRaYtQAAr0
 Ec/vlDMX64SwHhd+qwE99DXGQf2g+KklHKSLsnajJUVrWFTlRI0=
 =opz8
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Not much excitement - a handful of large patchsets (devmem among them)
  did not make it in time.

  Core & protocols:

   - Use local_lock in addition to local_bh_disable() to protect per-CPU
     resources in networking, a step closer for local_bh_disable() not
     to act as a big lock on PREEMPT_RT

   - Use flex array for netdevice priv area, ensure its cache alignment

   - Add a sysctl knob to allow user to specify a default rto_min at
     socket init time. Bit of a big hammer but multiple companies were
     independently carrying such patch downstream so clearly it's useful

   - Support scheduling transmission of packets based on CLOCK_TAI

   - Un-pin TCP TIMEWAIT timer to avoid it firing on CPUs later cordoned
     off using cpusets

   - Support multiple L2TPv3 UDP tunnels using the same 5-tuple address

   - Allow configuration of multipath hash seed, to both allow
     synchronizing hashing of two routers, and preventing partial
     accidental sync

   - Improve TCP compliance with RFC 9293 for simultaneous connect()

   - Support sending NAT keepalives in IPsec ESP in UDP states.
     Userspace IKE daemon had to do this before, but the kernel can
     better keep track of it

   - Support sending supervision HSR frames with MAC addresses stored in
     ProxyNodeTable when RedBox (i.e. HSR-SAN) is enabled

   - Introduce IPPROTO_SMC for selecting SMC when socket is created

   - Allow UDP GSO transmit from devices with no checksum offload

   - openvswitch: add packet sampling via psample, separating the
     sampled traffic from "upcall" packets sent to user space for
     forwarding

   - nf_tables: shrink memory consumption for transaction objects

  Things we sprinkled into general kernel code:

   - Power Sequencing subsystem (used by Qualcomm Bluetooth driver for
     QCA6390)           [ Already merged separately - Linus ]

   - Add IRQ information in sysfs for auxiliary bus

   - Introduce guard definition for local_lock

   - Add aligned flavor of __cacheline_group_{begin, end}() markings for
     grouping fields in structures

  BPF:

   - Notify user space (via epoll) when a struct_ops object is getting
     detached/unregistered

   - Add new kfuncs for a generic, open-coded bits iterator

   - Enable BPF programs to declare arrays of kptr, bpf_rb_root, and
     bpf_list_head

   - Support resilient split BTF which cuts down on duplication and
     makes BTF as compact as possible WRT BTF from modules

   - Add support for dumping kfunc prototypes from BTF which enables
     both detecting as well as dumping compilable prototypes for kfuncs

   - riscv64 BPF JIT improvements in particular to add 12-argument
     support for BPF trampolines and to utilize bpf_prog_pack for the
     latter

   - Add the capability to offload the netfilter flowtable in XDP layer
     through kfuncs

  Driver API:

   - Allow users to configure IRQ tresholds between which automatic IRQ
     moderation can choose

   - Expand Power Sourcing (PoE) status with power, class and failure
     reason. Support setting power limits

   - Track additional RSS contexts in the core, make sure configuration
     changes don't break them

   - Support IPsec crypto offload for IPv6 ESP and IPv4 UDP-encapsulated
     ESP data paths

   - Support updating firmware on SFP modules

  Tests and tooling:

   - mptcp: use net/lib.sh to manage netns

   - TCP-AO and TCP-MD5: replace debug prints used by tests with
     tracepoints

   - openvswitch: make test self-contained (don't depend on OvS CLI
     tools)

  Drivers:

   - Ethernet high-speed NICs:
      - Broadcom (bnxt):
         - increase the max total outstanding PTP TX packets to 4
         - add timestamping statistics support
         - implement netdev_queue_mgmt_ops
         - support new RSS context API
      - Intel (100G, ice, idpf):
         - implement FEC statistics and dumping signal quality indicators
         - support E825C products (with 56Gbps PHYs)
      - nVidia/Mellanox:
         - support HW-GRO
         - mlx4/mlx5: support per-queue statistics via netlink
         - obey the max number of EQs setting in sub-functions
      - AMD/Solarflare:
         - support new RSS context API
      - AMD/Pensando:
         - ionic: rework fix for doorbell miss to lower overhead and
           skip it on new HW
      - Wangxun:
         - txgbe: support Flow Director perfect filters

   - Ethernet NICs consumer, embedded and virtual:
      - Add driver for Tehuti Networks TN40xx chips
      - Add driver for Meta's internal NIC chips
      - Add driver for Ethernet MAC on Airoha EN7581 SoCs
      - Add driver for Renesas Ethernet-TSN devices
      - Google cloud vNIC:
         - flow steering support
      - Microsoft vNIC:
         - support page sizes other than 4KB on ARM64
      - vmware vNIC:
         - support latency measurement (update to version 9)
      - VirtIO net:
         - support for Byte Queue Limits
         - support configuring thresholds for automatic IRQ moderation
         - support for AF_XDP Rx zero-copy
      - Synopsys (stmmac):
         - support for STM32MP13 SoC
         - let platforms select the right PCS implementation
      - TI:
         - icssg-prueth: add multicast filtering support
         - icssg-prueth: enable PTP timestamping and PPS
      - Renesas:
         - ravb: improve Rx performance 30-400% by using page pool,
           theaded NAPI and timer-based IRQ coalescing
         - ravb: add MII support for R-Car V4M
      - Cadence (macb):
         - macb: add ARP support to Wake-On-LAN
      - Cortina:
         - use phylib for RX and TX pause configuration

   - Ethernet switches:
      - nVidia/Mellanox:
         - support configuration of multipath hash seed
         - report more accurate max MTU
         - use page_pool to improve Rx performance
      - MediaTek:
         - mt7530: add support for bridge port isolation
      - Qualcomm:
         - qca8k: add support for bridge port isolation
      - Microchip:
         - lan9371/2: add 100BaseTX PHY support
      - NXP:
         - vsc73xx: implement VLAN operations

   - Ethernet PHYs:
      - aquantia: enable support for aqr115c
      - aquantia: add support for PHY LEDs
      - realtek: add support for rtl8224 2.5Gbps PHY
      - xpcs: add memory-mapped device support
      - add BroadR-Reach link mode and support in Broadcom's PHY driver

   - CAN:
      - add document for ISO 15765-2 protocol support
      - mcp251xfd: workaround for erratum DS80000789E, use timestamps to
        catch when device returns incorrect FIFO status

   - WiFi:
      - mac80211/cfg80211:
         - parse Transmit Power Envelope (TPE) data in mac80211 instead
           of in drivers
         - improvements for 6 GHz regulatory flexibility
         - multi-link improvements
         - support multiple radios per wiphy
         - remove DEAUTH_NEED_MGD_TX_PREP flag
      - Intel (iwlwifi):
         - bump FW API to 91 for BZ/SC devices
         - report 64-bit radiotap timestamp
         - enable P2P low latency by default
         - handle Transmit Power Envelope (TPE) advertised by AP
         - remove support for older FW for new devices
         - fast resume (keeping the device configured)
         - mvm: re-enable Multi-Link Operation (MLO)
         - aggregation (A-MSDU) optimizations
      - MediaTek (mt76):
         - mt7925 Multi-Link Operation (MLO) support
      - Qualcomm (ath10k):
         - LED support for various chipsets
      - Qualcomm (ath12k):
         - remove unsupported Tx monitor handling
         - support channel 2 in 6 GHz band
         - support Spatial Multiplexing Power Save (SMPS) in 6 GHz band
         - supprt multiple BSSID (MBSSID) and Enhanced Multi-BSSID
           Advertisements (EMA)
         - support dynamic VLAN
         - add panic handler for resetting the firmware state
         - DebugFS support for datapath statistics
         - WCN7850: support for Wake on WLAN
      - Microchip (wilc1000):
         - read MAC address during probe to make it visible to user space
         - suspend/resume improvements
      - TI (wl18xx):
         - support newer firmware versions
      - RealTek (rtw89):
         - preparation for RTL8852BE-VT support
         - Wake on WLAN support for WiFi 6 chips
         - 36-bit PCI DMA support
      - RealTek (rtlwifi):
         - RTL8192DU support
      - Broadcom (brcmfmac):
         - Management Frame Protection support (to enable WPA3)

   - Bluetooth:
      - qualcomm: use the power sequencer for QCA6390
      - btusb: mediatek: add ISO data transmission functions
      - hci_bcm4377: add BCM4388 support
      - btintel: add support for BlazarU core
      - btintel: add support for Whale Peak2
      - btnxpuart: add support for AW693 A1 chipset
      - btnxpuart: add support for IW615 chipset
      - btusb: add Realtek RTL8852BE support ID 0x13d3:0x3591"

* tag 'net-next-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1589 commits)
  eth: fbnic: Fix spelling mistake "tiggerring" -> "triggering"
  tcp: Replace strncpy() with strscpy()
  wifi: ath12k: fix build vs old compiler
  tcp: Don't access uninit tcp_rsk(req)->ao_keyid in tcp_create_openreq_child().
  eth: fbnic: Write the TCAM tables used for RSS control and Rx to host
  eth: fbnic: Add L2 address programming
  eth: fbnic: Add basic Rx handling
  eth: fbnic: Add basic Tx handling
  eth: fbnic: Add link detection
  eth: fbnic: Add initial messaging to notify FW of our presence
  eth: fbnic: Implement Rx queue alloc/start/stop/free
  eth: fbnic: Implement Tx queue alloc/start/stop/free
  eth: fbnic: Allocate a netdevice and napi vectors with queues
  eth: fbnic: Add FW communication mechanism
  eth: fbnic: Add message parsing for FW messages
  eth: fbnic: Add register init to set PCIe/Ethernet device config
  eth: fbnic: Allocate core device specific structures and devlink interface
  eth: fbnic: Add scaffolding for Meta's NIC driver
  PCI: Add Meta Platforms vendor ID
  net/sched: cls_flower: propagate tca[TCA_OPTIONS] to NL_REQ_ATTR_CHECK
  ...
2024-07-16 19:28:34 -07:00
Linus Torvalds
f8d22a3195 linux_kselftest-kunit-6.11-rc1
This KUnit next update for Linux 6.11-rc1 consists of:
 
 -- adds vm_mmap() allocation resource manager
 -- converts usercopy kselftest to KUnit
 -- disables usercopy testing on !CONFIG_MMU
 -- adds MODULE_DESCRIPTION() to core, list, and usercopy tests
 -- adds tests for assertion formatting functions - assert.c
 -- introduces KUNIT_ASSERT_MEMEQ and KUNIT_ASSERT_MEMNEQ macros
 -- fixes KUNIT_ASSERT_STRNEQ comments to make it clear that it is
    an assertion
 -- renames KUNIT_ASSERT_FAILURE to KUNIT_FAIL_AND_ABORT
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEPZKym/RZuOCGeA/kCwJExA0NQxwFAmaWpCYACgkQCwJExA0N
 QxwdPQ/9G26Q+xhbieosvXHu/04ZWTcuUP/cFRv56jLH9bKm25YbW8WZzKM/imE5
 So35IT6SIYlwxn9fYyriPz372h3ZC522cu8tIVrUh5Uo3O5LbzQqdrxos9a+RuCg
 u6lenSksAjJRZ3S3IKDJ1ErxLnPYKyjjZFwDmV1+0Xxy30SwzFEbQqj9lY2Q4iGs
 KWBm0lrFPipbHdBqZcPB/mxIDyF6rhe+oeuOPU8uag6ncNN31xMpDanU8O6XEAz9
 QoAiDICANbVKTRKG5xXgmsJtyLF8GON4e49kEYtCLdnESPc39hQtf3cTHeYI22HC
 7OWhhOySifNIukFj1hVtxnN3ZfjtBGmbCwe5rXZFvMovE3YwAplKK61GoOaI9UV0
 qPk5GGrAb/xEh2HZ9tgf8+CsqmnPQLGnVt2h3u3c28u4YzbkinqVj20KYsye39zz
 KzJsO2yDJH4LlIJjc8XWof1cyyo0TIJQVOwJqAieOPePnfs4zabmVOus8y1Cj07V
 iAvQTPPoZ165zA1cl0iSMolKkXeAgf2FjlEGbODrktKKX6Ag/PKVp3e6PW28zJbp
 0p1V1IDQQAlEhbcRAZb+5y1voh+hcy++KyPwpj7lAVkmHd7RoK/mDL3W+oLdOTrB
 aXWs4JOlkmtUaz3EpAQZuvhYWVW7DexR9rU1SF44UAVzSdZSndw=
 =nnFR
 -----END PGP SIGNATURE-----

Merge tag 'linux_kselftest-kunit-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest

Pull KUnit updates from Shuah Khan:

 - add vm_mmap() allocation resource manager

 - convert usercopy kselftest to KUnit

 - disable usercopy testing on !CONFIG_MMU

 - add MODULE_DESCRIPTION() to core, list, and usercopy tests

 - add tests for assertion formatting functions - assert.c

 - introduce KUNIT_ASSERT_MEMEQ and KUNIT_ASSERT_MEMNEQ macros

 - fix KUNIT_ASSERT_STRNEQ comments to make it clear that it is an
   assertion

 - rename KUNIT_ASSERT_FAILURE to KUNIT_FAIL_AND_ABORT

* tag 'linux_kselftest-kunit-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
  kunit: Introduce KUNIT_ASSERT_MEMEQ and KUNIT_ASSERT_MEMNEQ macros
  kunit: Rename KUNIT_ASSERT_FAILURE to KUNIT_FAIL_AND_ABORT for readability
  kunit: Fix the comment of KUNIT_ASSERT_STRNEQ as assertion
  kunit: executor: Simplify string allocation handling
  kunit/usercopy: Add missing MODULE_DESCRIPTION()
  kunit/usercopy: Disable testing on !CONFIG_MMU
  usercopy: Convert test_user_copy to KUnit test
  kunit: test: Add vm_mmap() allocation resource manager
  list: test: add the missing MODULE_DESCRIPTION() macro
  kunit: add missing MODULE_DESCRIPTION() macros to core modules
  list: test: remove unused struct 'klist_test_struct'
  kunit: Cover 'assert.c' with tests
2024-07-16 17:42:14 -07:00
Linus Torvalds
576a997c63 Performance events changes for v6.11:
- Intel PT support enhancements & fixes
  - Fix leaked SIGTRAP events
  - Improve and fix the Intel uncore driver
  - Add support for Intel HBM and CXL uncore counters
  - Add Intel Lake and Arrow Lake support
  - AMD uncore driver fixes
  - Make SIGTRAP and __perf_pending_irq() work on RT
  - Micro-optimizations
  - Misc cleanups and fixes
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmaWjncRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iZyg//TSafjCK4N9fyXrPdPqf8L7ntX5uYf0rd
 uVZpEo/+VGvuFhznHnZIV2DLetvuwYZcUWszCqQMYfokGGi6WI1/k4MeZkSpN5QE
 p5mFk6gW3cmpHT9bECg7mKQH+w7Qna/b6mnA0HYTFxPGmQKdQDl1/S+ZsgWedxpC
 4V3re7/FzenFVS45DwSMPi9s7uZzZhVhTSgb4XLy+0Da4S0iRULItBa8HT8HmqE5
 v5aQlw3mmwKPUWvyPMi3Sw6RRWK3C+n5ZxWswSYoLSM3dsp1ZD+YYqtOv2GqAx8v
 JoL0SOnGnNCfxGHh0kz5D2hztDvq61Enotih2gz7HxvdWh2DasNp4yS1USGQhu5h
 VJnKNA0TfOUaYqWFVj0EgRVhDX79lMwSHTkR1DZd4vM2GDigHeRPh0zGSn2w/koV
 oCRxFfBoktHBnX0Te1NE2BhojbuKp25vTGK6GriVcHt/RNpuz6hTxsjdJzHCAlVX
 M349l0EpUJafvfaIN9zF22uw22J8P9y9JYqI6ebkUIKiuoT9LuafVYhQupSE9H4u
 IqlozPCTNw6eAQcUo03gkl3n+SY/DZH6eU2ycKgEp3r7TDGYbJPwxY1BgOHbwi4U
 lySM07leso2accSVAz7GDMI3ejj6Sx64asWS1FSwbajDflouaIK2jtey+1IOdXfv
 hHY65tomV8U=
 =gguT
 -----END PGP SIGNATURE-----

Merge tag 'perf-core-2024-07-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull performance events updates from Ingo Molnar:

 - Intel PT support enhancements & fixes

 - Fix leaked SIGTRAP events

 - Improve and fix the Intel uncore driver

 - Add support for Intel HBM and CXL uncore counters

 - Add Intel Lake and Arrow Lake support

 - AMD uncore driver fixes

 - Make SIGTRAP and __perf_pending_irq() work on RT

 - Micro-optimizations

 - Misc cleanups and fixes

* tag 'perf-core-2024-07-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
  perf/x86/intel: Add a distinct name for Granite Rapids
  perf/x86/intel/ds: Fix non 0 retire latency on Raptorlake
  perf/x86/intel: Hide Topdown metrics events if the feature is not enumerated
  perf/x86/intel/uncore: Fix the bits of the CHA extended umask for SPR
  perf: Split __perf_pending_irq() out of perf_pending_irq()
  perf: Don't disable preemption in perf_pending_task().
  perf: Move swevent_htable::recursion into task_struct.
  perf: Shrink the size of the recursion counter.
  perf: Enqueue SIGTRAP always via task_work.
  task_work: Add TWA_NMI_CURRENT as an additional notify mode.
  perf: Move irq_work_queue() where the event is prepared.
  perf: Fix event leak upon exec and file release
  perf: Fix event leak upon exit
  task_work: Introduce task_work_cancel() again
  task_work: s/task_work_cancel()/task_work_cancel_func()/
  perf/x86/amd/uncore: Fix DF and UMC domain identification
  perf/x86/amd/uncore: Avoid PMU registration if counters are unavailable
  perf/x86/intel: Support Perfmon MSRs aliasing
  perf/x86/intel: Support PERFEVTSEL extension
  perf/x86: Add config_mask to represent EVENTSEL bitmask
  ...
2024-07-16 17:13:31 -07:00
Linus Torvalds
4a996d90b9 Scheduler changes for v6.11:
- Update Daniel Bristot de Oliveira's entry in MAINTAINERS,
    and credit him in CREDITS.
 
  - Harmonize the lock-yielding behavior on dynamically selected
    preemption models with static ones.
 
  - Reorganize the code a bit: split out sched/syscalls.c to reduce
    the size of sched/core.c
 
  - Micro-optimize psi_group_change()
 
  - Fix set_load_weight() for SCHED_IDLE tasks
 
  - Misc cleanups & fixes
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmaVtVARHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iqTQ/9GLNzNBnl0oBWCiybeQjyWsZ6BiZi48R0
 C1g9/RKy++OyGOjn/yqYK0Kg8cdfoGzHGioMMAucHFW1nXZwVw17xAJK127N0apF
 83up7AnFJw/JGr1bI0FwuozqHAs4Z5KzHTv2KBxhYuO77lyYna6/t0liRUbF8ZUZ
 I/nqav7wDB8RBIB5hEJ/uYLDX7qWdUlyFB+mcvV4ANA99yr++OgipCp6Ob3Rz3cP
 O676nKJY4vpNbZ/B6bpKg8ezULRP8re2qD3GJRf2huS63uu/Z5ct7ouLVZ1DwN53
 mFDBTYUMI2ToV0pseikuqwnmrjxAKcEajTyZpD3vckafd2TlWIopkQZoQ9XLLlIZ
 DxO+KoekaHTSVy8FWlO8O+iE3IAdUUgECEpNveX45Pb7nFP+5dtFqqnVIdNqCq5e
 zEuQvizaa5m+A1POZhZKya+z9jbLXXx+gtPCbbADTBWtuyl8azUIh3vjn0bykmv4
 IVV/wvUm+BPEIhnKusZZOgB0vLtxUdntBBfUSxqoSOad9L+0/UtSKoKI6wvW00q8
 ZkW+85yS3YFiN9W61276RLis2j7OAjE0eDJ96wfhooma2JRDJU4Wmg5oWg8x3WuA
 JRmK0s63Qik5gpwG5rHQsR5jNqYWTj5Lp7So+M1kRfFsOM/RXQ/AneSXZu/P7d65
 LnYWzbKu76c=
 =lLab
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2024-07-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:

 - Update Daniel Bristot de Oliveira's entry in MAINTAINERS,
   and credit him in CREDITS

 - Harmonize the lock-yielding behavior on dynamically selected
   preemption models with static ones

 - Reorganize the code a bit: split out sched/syscalls.c to reduce
   the size of sched/core.c

 - Micro-optimize psi_group_change()

 - Fix set_load_weight() for SCHED_IDLE tasks

 - Misc cleanups & fixes

* tag 'sched-core-2024-07-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Update MAINTAINERS and CREDITS
  sched/fair: set_load_weight() must also call reweight_task() for SCHED_IDLE tasks
  sched/psi: Optimise psi_group_change a bit
  sched/core: Drop spinlocks on contention iff kernel is preemptible
  sched/core: Move preempt_model_*() helpers from sched.h to preempt.h
  sched/balance: Skip unnecessary updates to idle load balancer's flags
  idle: Remove stale RCU comment
  sched/headers: Move struct pre-declarations to the beginning of the header
  sched/core: Clean up kernel/sched/sched.h a bit
  sched/core: Simplify prefetch_curr_exec_start()
  sched: Fix spelling in comments
  sched/syscalls: Split out kernel/sched/syscalls.c from kernel/sched/core.c
2024-07-16 17:00:50 -07:00
Linus Torvalds
151647ab58 Locking changes for v6.11:
- Jump label fixes, including a perf events fix that originally
    manifested as jump label failures, but was a serialization bug
    at the usage site.
 
  - Mark down_write*() helpers as __always_inline, to improve
    WCHAN debuggability.
 
  - Misc cleanups and fixes.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmaVEV8RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iWbQ/+KBa29udVx0qiX/1R/+4EBNKzFDyvhobu
 9XuAUZJZ9g46PFqnXlnA/VhGff52M2I+1scva11OPnb2NnOhAN1yUFQJaZh0HM+y
 GTsaLQEdJfNKv1Z0i05AnYE7BKFEnoksWInsOg6Or5KoML5RS6vIyLWsBQpCjelM
 ZxAYNHVpI5qeeUT7dqy9bkBxUXgOpc5X5+qZmmBnBDTGvhsPviV88gL1Ol2tjmGi
 hhALK9RdLMltrNEejDc9sBcLw/qRA5QOUDDxSoBykSOBaljJB+3BcTCuXdUJT24Y
 xpufbESHyl7HAmdvRU+n4ypo4kuKSIcyF1+V6LcMmNi0jSsyUG7x+SuhJetQHd2z
 kLsYD18JbHHBVz7/047DnVTtyTcifsDpJi4MqEYYrUhMw1ns/goEiLRgSEYdB/FT
 eCjdZxbCzkKzmV42/gqkM/mv8/pWkhvAqZ1/LNkHSOR/JSa5X3NeR0qrj18OKSrS
 Du+ycGDqk1PqVcqiC1heBayIU/QoY/sEkdktTtbAKSujKCJ+tYhPtqMki6JaOh5I
 VPp9ekcbs+Vvuu1qNDcueoe3TqxO2wfchKMNcSD8+I68EZtxHWZN3iHtTOlqGYVr
 uZuG0V2WSgGvUi7k0UIPpnnmZ2+XejA++VnjVitXM3Z+xf7MlrPNy/4fzUbkgTQn
 T8/Tt7SytQE=
 =T1KT
 -----END PGP SIGNATURE-----

Merge tag 'locking-core-2024-07-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking updates from Ingo Molnar:

 - Jump label fixes, including a perf events fix that originally
   manifested as jump label failures, but was a serialization bug at the
   usage site

 - Mark down_write*() helpers as __always_inline, to improve WCHAN
   debuggability

 - Misc cleanups and fixes

* tag 'locking-core-2024-07-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/rwsem: Add __always_inline annotation to __down_write_common() and inlined callers
  jump_label: Simplify and clarify static_key_fast_inc_cpus_locked()
  jump_label: Clarify condition in static_key_fast_inc_not_disabled()
  jump_label: Fix concurrency issues in static_key_slow_dec()
  perf/x86: Serialize set_attr_rdpmc()
  cleanup: Standardize the header guard define's name
2024-07-16 16:42:37 -07:00
Linus Torvalds
f8a8b94d06 sysctl changes for 6.11-rc1
Summary
 
 * Remove "->procname == NULL" check when iterating through sysctl table arrays
 
     Removing sentinels in ctl_table arrays reduces the build time size and
     runtime memory consumed by ~64 bytes per array. With all ctl_table
     sentinels gone, the additional check for ->procname == NULL that worked in
     tandem with the ARRAY_SIZE to calculate the size of the ctl_table arrays is
     no longer needed and has been removed. The sysctl register functions now
     returns an error if a sentinel is used.
 
 * Preparation patches for sysctl constification
 
     Constifying ctl_table structs prevents the modification of proc_handler
     function pointers as they would reside in .rodata. The ctl_table arguments
     in sysctl utility functions are const qualified in preparation for a future
     treewide proc_handler argument constification commit.
 
 * Misc fixes
 
     Increase robustness of set_ownership by providing sane default ownership
     values in case the callee doesn't set them. Bound check proc_dou8vec_minmax
     to avoid loading buggy modules and give sysctl testing module a name to
     avoid compiler complaints.
 
 Testing
 
   * This got push to linux-next in v6.10-rc2, so it has had more than a month
     of testing
 -----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEErkcJVyXmMSXOyyeQupfNUreWQU8FAmaWdz4ACgkQupfNUreW
 QU/WKQwAkSuUz42yCQye77BK+Z8ANcTF1f3aI/wfv2nahq1GaSrNBpqUiXvEe9Tt
 KD2lM1PWiQfizVLIDPh96yxa5q69GQrPPOA/V1jwIXmk/HRpjjoONCFNNXVRCTls
 VCqDz/RatuXvzO35Yn87MnWnxv6PiX7X/zq/3WikVsUI381kvTgC6OwZxdFM52w4
 ESwOa3LeOovtRnqV5dpHr6DCQKyd0N52nPxgXvaerjlsJsv7PlezN7z9YyLOOfmW
 xUD7X6LQcJq7HcEukaB6I9o2GQOi4yYXL2YOzed7qu9Thu+lasEoN3Bd7P+ilXkc
 JY6EXJ5o+d69PewKRuJ1QvD7wrHIkhNMNbMtvehNay124wAHDy3KtonFzyvlX4wE
 qCHBYc6rySJNhSqwVp9MoksOZfDM99pVIOs9YVIjc90Zzu5J7tORgYWRVOHTcAtj
 fd8nMdkK3+ZANapygFCyew6GueIzaqlQwveVgLGw4vc5L3ClknmURit3y487Pzdg
 B+BEVlsp
 =bs2G
 -----END PGP SIGNATURE-----

Merge tag 'sysctl-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl

Pull sysctl updates from Joel Granados:

 - Remove "->procname == NULL" check when iterating through sysctl table
   arrays

   Removing sentinels in ctl_table arrays reduces the build time size
   and runtime memory consumed by ~64 bytes per array. With all
   ctl_table sentinels gone, the additional check for ->procname == NULL
   that worked in tandem with the ARRAY_SIZE to calculate the size of
   the ctl_table arrays is no longer needed and has been removed. The
   sysctl register functions now returns an error if a sentinel is used.

 - Preparation patches for sysctl constification

   Constifying ctl_table structs prevents the modification of
   proc_handler function pointers as they would reside in .rodata. The
   ctl_table arguments in sysctl utility functions are const qualified
   in preparation for a future treewide proc_handler argument
   constification commit.

 - Misc fixes

   Increase robustness of set_ownership by providing sane default
   ownership values in case the callee doesn't set them. Bound check
   proc_dou8vec_minmax to avoid loading buggy modules and give sysctl
   testing module a name to avoid compiler complaints.

* tag 'sysctl-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl:
  sysctl: Warn on an empty procname element
  sysctl: Remove ctl_table sentinel code comments
  sysctl: Remove "child" sysctl code comments
  sysctl: Remove superfluous empty allocations from sysctl internals
  sysctl: Replace nr_entries with ctl_table_size in new_links
  sysctl: Remove check for sentinel element in ctl_table arrays
  mm profiling: Remove superfluous sentinel element from ctl_table
  locking: Remove superfluous sentinel element from kern_lockdep_table
  sysctl: Add module description to sysctl-testing
  sysctl: constify ctl_table arguments of utility function
  utsname: constify ctl_table arguments of utility function
  sysctl: move the extra1/2 boundary check of u8 to sysctl_check_table_array
  sysctl: always initialize i_uid/i_gid
2024-07-16 14:24:29 -07:00
Linus Torvalds
1ca995edf8 seccomp updates for v6.11-rc1
- interrupt SECCOMP_IOCTL_NOTIF_RECV when all users exit (Andrei Vagin)
 
 - Update selftests to check for expected NOTIF_RECV exits (Andrei Vagin)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmaVTOkACgkQiXL039xt
 wCayUA//fJTZUq+idihdKMql9SEwYh0uIJQpGOxxBEesUksZUM4alTCswzmheLFL
 q9BxlWWJJfUsp3djeZK+0vnDv3izaR+LfA1JPEJf64ImbympbEenVXK0ZkmVrTqg
 bNAlD5c/LVpzYtB7cnOaglq18uUja6/E+EQvNYz5NLHrIhYYCieJZIiFATkHQ9Lj
 3Wq3g9FWEa5pZxpKbEI3UA2HllADnmHeb/Z78Zdvyue5lOOvsBIheQfL4m0pW38x
 xgBWNglIg7b+X+YgwYSv8w50Lhn4SJVtIynWnwzBz19qFJRQL7oJRj1zyFHZPCwZ
 ajHVIj5LOuts/BYxSiGzczxVqZaAqeOyCY5e8G+Mjk5ZD5kLYznbbcrIFIUIaHpx
 rpRD/TVVwJ3PHsOIpWHwrXKgoKnbe/0n8lJT+Ehnm/2lLrlGyZj9hLyl6/+JsizE
 dGIWgE2emykYI+52IRRYSZaw4hLb+CU52d1vd5a35wUk1ie5fcVGZAWnaul23x0I
 maQtXcyB6tYuhX3oPnxoxFVqGvCKGi3Tc5N+Vg4JR/RzTy2H7fZ02DRZq8Vs0QST
 EgO3cpCD3035qxkK6ivaV4ebPJjkL158D/+uFyne+PWQlSfrmXJvhfggPFA+Oqtb
 y8PTUV73+HxDiruXbGZ7MD8C7d+ZvGI/D+xheohYrijhivcXxcc=
 =tlFN
 -----END PGP SIGNATURE-----

Merge tag 'seccomp-v6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull seccomp updates from Kees Cook:

 - interrupt SECCOMP_IOCTL_NOTIF_RECV when all users exit (Andrei Vagin)

 - Update selftests to check for expected NOTIF_RECV exits (Andrei
   Vagin)

* tag 'seccomp-v6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  selftests/seccomp: check that a zombie leader doesn't affect others
  selftests/seccomp: add test for NOTIF_RECV and unused filters
  seccomp: release task filters when the task exits
  seccomp: interrupt SECCOMP_IOCTL_NOTIF_RECV when all users have exited
2024-07-16 13:12:16 -07:00
Linus Torvalds
f83e38fc9f xen: branch for v6.11-rc1
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCZpS2TQAKCRCAXGG7T9hj
 vjryAQDy08vSiCNYnO4K0AXO9KhCLsXMpbdevTE0yHdtSW/IwQD/eOmrntgBArA4
 PfQanbzM3Rj+h6p1zsfvW98DgmFrfAQ=
 =tG6C
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-6.11-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull xen updates from Juergen Gross:

 - some trivial cleanups

 - a fix for the Xen timer

 - add boot time selectable debug capability to the Xen multicall
   handling

 - two fixes for the recently added Xen irqfd handling

* tag 'for-linus-6.11-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  x86/xen: remove deprecated xen_nopvspin boot parameter
  x86/xen: eliminate some private header files
  x86/xen: make some functions static
  xen: make multicall debug boot time selectable
  xen/arm: Convert comma to semicolon
  xen: privcmd: Fix possible access to a freed kirqfd instance
  xen: privcmd: Switch from mutex to spinlock for irqfds
  xen: add missing MODULE_DESCRIPTION() macros
  x86/xen: Convert comma to semicolon
  x86/xen/time: Reduce Xen timer tick
  xen/manage: Constify struct shutdown_handler
2024-07-16 12:30:57 -07:00
Linus Torvalds
d80f2996b8 asm-generic updates for 6.11
Most of this is part of my ongoing work to clean up the system call
 tables. In this bit, all of the newer architectures are converted to
 use the machine readable syscall.tbl format instead in place of complex
 macros in include/uapi/asm-generic/unistd.h.
 
 This follows an earlier series that fixed various API mismatches
 and in turn is used as the base for planned simplifications.
 
 The other two patches are dead code removal and a warning fix.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEiK/NIGsWEZVxh/FrYKtH/8kJUicFAmaVB1cACgkQYKtH/8kJ
 UicMqxAAnYKOxfjoMIhYYK6bl126wg/vIcDcjIR9cNWH21Nhn3qxn11ZXau3S7xv
 3l/HreEhyEQr4gC2a70IlXyHUadYOlrk+83OURrunWk1oKPmZlMKcfPVbtp8GL7x
 PUNXQfwM1XZLveKwufY24hoZdwKC+Y/5WLc1t0ReznJuAqgeO2rM9W5dnV5bAfCp
 he3F5hFcr196Dz3/GJjJIWrY+cbwfmZWsNtj1vFTL5/r/LuCu8HTkqhsGj8tE5BJ
 NGVEEXbp5eaVTCIGqJWhnuZcsnKN9kM51M7CtdwWf8OTckUVuJap5OsDVKQkWkGl
 bLPbd2jhDltph0sah51hAIvv4WdkThW76u9FRW7KR3fo7ra67eF7l5j7wc1lE2JB
 GwLJ1X56Bxe1GhvvNTlDmb7DrnlP/DMPuRv3Z6xyH6l8iZ2pMGlnAxuw6Bs1s6Y5
 WSs36ZpnS0ctgjfx37ZITsZSvbKFPpQFJP4siwS8aRNv/NFALNNdFyOCY5lNzspZ
 0dxwjn6/7UpHE4MKh6/hvCg2QwupXXBTRytibw+75/rOsR+EYlmtuONtyq2sLUHe
 ktJ5pg+8XuZm27+wLffuluzmY7sv2F8OU4cTYeM60Ynmc6pRzwUY6/VhG52S1/mU
 Ua4VgYIpzOtlLrYmz5QTWIZpdSFSVbIc/3pLriD6hn4Mvg+BwdA=
 =XOhL
 -----END PGP SIGNATURE-----

Merge tag 'asm-generic-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic

Pull asm-generic updates from Arnd Bergmann:
 "Most of this is part of my ongoing work to clean up the system call
  tables. In this bit, all of the newer architectures are converted to
  use the machine readable syscall.tbl format instead in place of
  complex macros in include/uapi/asm-generic/unistd.h.

  This follows an earlier series that fixed various API mismatches and
  in turn is used as the base for planned simplifications.

  The other two patches are dead code removal and a warning fix"

* tag 'asm-generic-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
  vmlinux.lds.h: catch .bss..L* sections into BSS")
  fixmap: Remove unused set_fixmap_offset_io()
  riscv: convert to generic syscall table
  openrisc: convert to generic syscall table
  nios2: convert to generic syscall table
  loongarch: convert to generic syscall table
  hexagon: use new system call table
  csky: convert to generic syscall table
  arm64: rework compat syscall macros
  arm64: generate 64-bit syscall.tbl
  arm64: convert unistd_32.h to syscall.tbl format
  arc: convert to generic syscall table
  clone3: drop __ARCH_WANT_SYS_CLONE3 macro
  kbuild: add syscall table generation to scripts/Makefile.asm-headers
  kbuild: verify asm-generic header list
  loongarch: avoid generating extra header files
  um: don't generate asm/bpf_perf_event.h
  csky: drop asm/gpio.h wrapper
  syscalls: add generic scripts/syscall.tbl
2024-07-16 12:09:03 -07:00
Linus Torvalds
98896d8795 - Unrelated x86/cc changes queued here to avoid ugly cross-merges and
conflicts:
 
    - Carve out CPU hotplug function declarations into a separate header
      with the goal to be able to use the lockdep assertions in a more
      flexible manner
 
    - As a result, refactor cacheinfo code after carving out a function
      to return the cache ID associated with a given cache level
 
    -  Cleanups
 
 - Add support to be able to kexec TDX guests. For that
 
    - Expand ACPI MADT CPU offlining support
 
    - Add machinery to prepare CoCo guests memory before kexec-ing into a new
      kernel
 
    - Cleanup, readjust and massage related code
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmaVCYoACgkQEsHwGGHe
 VUoi6g//Up/4vMzcjqzrndXfl0aP+NpK4zNud+ZPP4Qza2yPhKydniMvkWVQ8DTx
 jQaGk/tJDeFG6ofOzGkmBGyuZzuO4D7E0XFyXZZeVgSvdk2Af5vaWu1D3e4i4MiM
 Ox4H8NtWnC4MozP0hos4qB0vtYaBWVJkNvIXDVF6162zLwEmbuyrpFe3glscwIxv
 hMZR/C47RHcEeOb7yA4m/gJ+AqMe9OKradoNJkkfDpnYr6CYsbmpY09or2WYuvoI
 0gevkIe6Q9HMcq3CQl6/pR8IgbA5VmGi7iCiE1ihgTPwR3AaU8llzBqYdSgezFrk
 68A7oGeUZQeifQgjwkreZclMtsGEeGWVOB0Bh3Jgr6uaWGFXtpydi/hc73wbTz+F
 IazKQcKQYjaPW/9UG+0+cFTQlCgQ+WxwqAsN1uqzL6gMgmC9B+TM//xzk5nVxpOd
 ouf8T85tyceIPCKepGE/bWEHYYCjfbqBMyQT6RHmxUKbb1/PIsbzN26cenkZmPXT
 cpwurWVG7mRQJRqTrsS+D+opP1h/jOdkpwGlBfl1s0sX6RZuMFBk+7TlMMs61Cyo
 PWtrLV7Dr369cuXE72wIgfBAao2AS8kFshc7Atokq7/XfL9cCWHeqIcu7yvParP5
 WY43YQv8XPGI7ZnPqULByTY0Wxg8TFk8whamx97kEp8uy2HmbQU=
 =k+T+
 -----END PGP SIGNATURE-----

Merge tag 'x86_cc_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 confidential computing updates from Borislav Petkov:
 "Unrelated x86/cc changes queued here to avoid ugly cross-merges and
  conflicts:

   - Carve out CPU hotplug function declarations into a separate header
     with the goal to be able to use the lockdep assertions in a more
     flexible manner

   - As a result, refactor cacheinfo code after carving out a function
     to return the cache ID associated with a given cache level

   - Cleanups

  Add support to be able to kexec TDX guests:

   - Expand ACPI MADT CPU offlining support

   - Add machinery to prepare CoCo guests memory before kexec-ing into a
     new kernel

   - Cleanup, readjust and massage related code"

* tag 'x86_cc_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
  ACPI: tables: Print MULTIPROC_WAKEUP when MADT is parsed
  x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method
  x86/mm: Introduce kernel_ident_mapping_free()
  x86/smp: Add smp_ops.stop_this_cpu() callback
  x86/acpi: Do not attempt to bring up secondary CPUs in the kexec case
  x86/acpi: Rename fields in the acpi_madt_multiproc_wakeup structure
  x86/mm: Do not zap page table entries mapping unaccepted memory table during kdump
  x86/mm: Make e820__end_ram_pfn() cover E820_TYPE_ACPI ranges
  x86/tdx: Convert shared memory back to private on kexec
  x86/mm: Add callbacks to prepare encrypted memory for kexec
  x86/tdx: Account shared memory
  x86/mm: Return correct level from lookup_address() if pte is none
  x86/mm: Make x86_platform.guest.enc_status_change_*() return an error
  x86/kexec: Keep CR4.MCE set during kexec for TDX guest
  x86/relocate_kernel: Use named labels for less confusion
  cpu/hotplug, x86/acpi: Disable CPU offlining for ACPI MADT wakeup
  cpu/hotplug: Add support for declaring CPU offlining not supported
  x86/apic: Mark acpi_mp_wake_* variables as __ro_after_init
  x86/acpi: Extract ACPI MADT wakeup code into a separate file
  x86/kexec: Remove spurious unconditional JMP from from identity_mapped()
  ...
2024-07-15 19:36:01 -07:00
Linus Torvalds
b3c0eccb48 gpio updates for v6.11-rc1
GPIOLIB core:
 - rework kfifo handling rework in the character device code
 - improve the labeling of GPIOs requested as interrupts and show more info
   on interrupt-only GPIOs in debugfs
 - remove unused APIs
 - unexport interfaces that are only used from the core GPIO code
 - drop the return value from gpiochip_set_desc_names() as it cannot fail
 - move a string array definition out of a header and into a specific
   compilation unit
 - convert the last user of gpiochip_get_desc() other than GPIO core to using
   a safer alternative
 - use array_index_nospec() where applicable
 
 New drivers:
 - add a "virtual GPIO consumer" module that allows requesting GPIOs from actual
   hardware and driving tests of the in-kernel GPIO API from user-space over
   debugfs
 - add a GPIO-based "sloppy" logic analyzer module useful for "first glance"
   debugging on remote boards
 
 Driver improvements:
 - add support for a new model to gpio-pca953x
 - lock GPIOs as interrupts in gpio-sim when the lines are requested as irqs
   via the simulator domain + some other minor improvements
 - improve error reporting in gpio-syscon
 - convert gpio-ath79 to using dynamic GPIO base and range
 - use pcibios_err_to_errno() for converting PCIBIOS error codes to errno
   vaues in gpio-amd8111 and gpio-rdc321x
 - allow building gpio-brcmstb for the BCM2835 architecture
 
 DT bindings:
 - convert DT bindings for lsi,zevio, mpc8xxx, and atmel to DT schema
 - document new properties for aspeed,gpio, fsl,qoriq-gpio and gpio-vf610
 - document new compatibles for pca953x and fsl,qoriq-gpio
 
 Documentation:
 - document stricter behavior of the GPIO character device uAPI with regards to
   reconfiguring requested line without direction set
 - clarify the effect of the active-low flag on line values and edges
 - remove documentation for the legacy GPIO API in order to stop tempting
   people to use it
 - document the preference for using pread() for reading edge events in the
   sysfs API
 
 Other:
 - add an extended initializer to the interrupt simulator allowing to specify
   a number of callbacks callers can use to be notified about irqs being
   requested and released
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEFp3rbAvDxGAT0sefEacuoBRx13IFAmaU3n4ACgkQEacuoBRx
 13LnYxAAgIfV2MQxdlB8+I3+PrObiom9uykNpoj8Ho3FiGroJEmSi2Vg9NOP5j8n
 7THjBDFqKk0USMdzGgWDe+u0oCpql8ONLd+lxPiKzRxkebMVzlumeNzWNEE3wqxO
 MdV3AOs9DLM1a4MAuv9E8PgooBVR8Cyqs3tc3wwpZRKoSZBIzwrjFL3tO1P8Dezv
 9xoPqIiMJRBZr8jifU/ZRdLG3gYKqgQH1Mha7bm94ebUwA6q/hxtGYAtc2a3Q+dF
 6lPrPONJBN6/YwvmoDddm2ppoiyWN7QdX9DQjJvKBcNRTZSE1EAggdh8kNnCoa1d
 +PeClIAJLl8ZSkdMS8yvMZIpduK4gTl7yEBMkER1d0JoJLkTowqKsvONgU12Npr2
 3rwbpACt/kVt5v0lRwaafj5vnD3NgiiVCeuZZz99ICbrXqe6rYszMIemKDYWWlTn
 kEFrTM5ql+dwAfvp8Y9JZf4oOgInHbF3LBKM34PKMW9D0a4aQC/HTfmtHobeNHzn
 FmY9ysHjMG6fvuwnkpojW6N3/LLwt+TX8jik9x0O42AE7qXn6a8U2g6RUg6rJOdd
 mUiIX3+rn+AaI6eKPvUNp2h391jH1K3hBCAca4cNAIKpqPuE/A/B5RyZZnL5Q7HQ
 Iz2G3hSlTBVPf7QWMkBUfMzQMwmvqfoKsZljC5y7YgafJc5cf4k=
 =kK88
 -----END PGP SIGNATURE-----

Merge tag 'gpio-updates-for-v6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux

Pull gpio updates from Bartosz Golaszewski:
 "The majority of added lines are two new modules: the GPIO virtual
  consumer module that improves our ability to add automated tests for
  the kernel API and the "sloppy" logic analyzer module that uses the
  GPIO API to implement a coarse-grained debugging tool for useful for
  remote development.

  Other than that we have the usual assortment of various driver
  extensions, improvements to the core GPIO code, DT-bindings and other
  documentation updates as well as an extension to the interrupt
  simulator:

  GPIOLIB core:
   - rework kfifo handling rework in the character device code
   - improve the labeling of GPIOs requested as interrupts and show more
     info on interrupt-only GPIOs in debugfs
   - remove unused APIs
   - unexport interfaces that are only used from the core GPIO code
   - drop the return value from gpiochip_set_desc_names() as it cannot
     fail
   - move a string array definition out of a header and into a specific
     compilation unit
   - convert the last user of gpiochip_get_desc() other than GPIO core
     to using a safer alternative
   - use array_index_nospec() where applicable

  New drivers:
   - add a "virtual GPIO consumer" module that allows requesting GPIOs
     from actual hardware and driving tests of the in-kernel GPIO API
     from user-space over debugfs
   - add a GPIO-based "sloppy" logic analyzer module useful for "first
     glance" debugging on remote boards

  Driver improvements:
   - add support for a new model to gpio-pca953x
   - lock GPIOs as interrupts in gpio-sim when the lines are requested
     as irqs via the simulator domain + some other minor improvements
   - improve error reporting in gpio-syscon
   - convert gpio-ath79 to using dynamic GPIO base and range
   - use pcibios_err_to_errno() for converting PCIBIOS error codes to
     errno vaues in gpio-amd8111 and gpio-rdc321x
   - allow building gpio-brcmstb for the BCM2835 architecture

  DT bindings:
   - convert DT bindings for lsi,zevio, mpc8xxx, and atmel to DT schema
   - document new properties for aspeed,gpio, fsl,qoriq-gpio and
     gpio-vf610
   - document new compatibles for pca953x and fsl,qoriq-gpio

  Documentation:
   - document stricter behavior of the GPIO character device uAPI with
     regards to reconfiguring requested line without direction set
   - clarify the effect of the active-low flag on line values and edges
   - remove documentation for the legacy GPIO API in order to stop
     tempting people to use it
   - document the preference for using pread() for reading edge events
     in the sysfs API

  Other:
   - add an extended initializer to the interrupt simulator allowing to
     specify a number of callbacks callers can use to be notified about
     irqs being requested and released"

* tag 'gpio-updates-for-v6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux: (41 commits)
  gpio: mc33880: Convert comma to semicolon
  gpio: virtuser: actually use the "trimmed" local variable
  dt-bindings: gpio: convert Atmel GPIO to json-schema
  gpio: virtuser: new virtual testing driver for the GPIO API
  dt-bindings: gpio: vf610: Allow gpio-line-names to be set
  gpio: sim: lock GPIOs as interrupts when they are requested
  genirq/irq_sim: add an extended irq_sim initializer
  dt-bindings: gpio: fsl,qoriq-gpio: Add compatible string fsl,ls1046a-gpio
  gpiolib: unexport gpiochip_get_desc()
  gpio: add sloppy logic analyzer using polling
  Documentation: gpio: Reconfiguration with unset direction (uAPI v2)
  Documentation: gpio: Reconfiguration with unset direction (uAPI v1)
  dt-bindings: gpio: fsl,qoriq-gpio: add common property gpio-line-names
  gpio: ath79: convert to dynamic GPIO base allocation
  pinctrl: da9062: replace gpiochip_get_desc() with gpio_device_get_desc()
  gpiolib: put gpio_suffixes in a single compilation unit
  Documentation: gpio: Clarify effect of active low flag on line edges
  Documentation: gpio: Clarify effect of active low flag on line values
  gpiolib: Remove data-less gpiochip_add() function
  gpio: sim: use devm_mutex_init()
  ...
2024-07-15 17:53:24 -07:00
Linus Torvalds
d8764c1931 workqueue: Fixes for v6.11-rc1
cpus_read_lock() is dropped from workqueue creation path but there were
 still remaining lockdep_assert_cpus_held() triggering spurious lockdep
 failures. Remove them.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZpW6Zw4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGWLmAQCKp0zrzmJYfsU7KYFFKyUy1PukIe8NZ2kaqNcX
 7FyDZwD6A5k6SZfS/efXWYEwbQZo0MCOS5V1/M7KjqGR/ZeIEgc=
 =7t+F
 -----END PGP SIGNATURE-----

Merge tag 'wq-for-6.11-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq

Pull workqueue fix from Tejun Heo:
 "cpus_read_lock() was dropped from workqueue creation path but there
  were still remaining lockdep_assert_cpus_held() triggering spurious
  lockdep failures. Remove them"

* tag 'wq-for-6.11-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Remove unneeded lockdep_assert_cpus_held()
2024-07-15 17:23:32 -07:00
Linus Torvalds
c89d780cc1 arm64 updates for 6.11:
* Virtual CPU hotplug support for arm64 ACPI systems
 
 * cpufeature infrastructure cleanups and making the FEAT_ECBHB ID bits
   visible to guests
 
 * CPU errata: expand the speculative SSBS workaround to more CPUs
 
 * arm64 ACPI:
 
   - acpi=nospcr option to disable SPCR as default console for arm64
 
   - Move some ACPI code (cpuidle, FFH) to drivers/acpi/arm64/
 
 * GICv3, use compile-time PMR values: optimise the way regular IRQs are
   masked/unmasked when GICv3 pseudo-NMIs are used, removing the need for
   a static key in fast paths by using a priority value chosen
   dynamically at boot time
 
 * arm64 perf updates:
 
   - Rework of the IMX PMU driver to enable support for I.MX95
 
   - Enable support for tertiary match groups in the CMN PMU driver
 
   - Initial refactoring of the CPU PMU code to prepare for the fixed
     instruction counter introduced by Arm v9.4
 
   - Add missing PMU driver MODULE_DESCRIPTION() strings
 
   - Hook up DT compatibles for recent CPU PMUs
 
 * arm64 kselftest updates:
 
   - Kernel mode NEON fp-stress
 
   - Cleanups, spelling mistakes
 
 * arm64 Documentation update with a minor clarification on TBI
 
 * Miscellaneous:
 
   - Fix missing IPI statistics
 
   - Implement raw_smp_processor_id() using thread_info rather than a
     per-CPU variable (better code generation)
 
   - Make MTE checking of in-kernel asynchronous tag faults conditional
     on KASAN being enabled
 
   - Minor cleanups, typos
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAmaQKN4ACgkQa9axLQDI
 XvE0Nw/+JZ6OEQ+DMUHXZfbWanvn1p0nVOoEV3MYVpOeQK1ILYCoDapatLNIlet0
 wcja7tohKbL1ifc7GOqlkitu824LMlotncrdOBycRqb/4C5KuJ+XhygFv5hGfX0T
 Uh2zbo4w52FPPEUMICfEAHrKT3QB9tv7f66xeUNbWWFqUn3rY02/ZVQVVdw6Zc0e
 fVYWGUUoQDR7+9hRkk6tnYw3+9YFVAUAbLWk+DGrW7WsANi6HuJ/rBMibwFI6RkG
 SZDZHum6vnwx0Dj9H7WrYaQCvUMm7AlckhQGfPbIFhUk6pWysfJtP5Qk49yiMl7p
 oRk/GrSXpiKumuetgTeOHbokiE1Nb8beXx0OcsjCu4RrIaNipAEpH1AkYy5oiKoT
 9vKZErMDtQgd96JHFVaXc+A3D2kxVfkc1u7K3TEfVRnZFV7CN+YL+61iyZ+uLxVi
 d9xrAmwRsWYFVQzlZG3NWvSeQBKisUA1L8JROlzWc/NFDwTqDGIt/zS4pZNL3+OM
 EXW0LyKt7Ijl6vPXKCXqrODRrPlcLc66VMZxofZOl0/dEqyJ+qLL4GUkWZu8lTqO
 BqydYnbTSjiDg/ntWjTrD0uJ8c40Qy7KTPEdaPqEIQvkDEsUGlOnhAQjHrnGNb9M
 psZtpDW2xm7GykEOcd6rgSz4Xeky2iLsaR4Wc7FTyDS0YRmeG44=
 =ob2k
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Catalin Marinas:
 "The biggest part is the virtual CPU hotplug that touches ACPI,
  irqchip. We also have some GICv3 optimisation for pseudo-NMIs that has
  been queued via the arm64 tree. Otherwise the usual perf updates,
  kselftest, various small cleanups.

  Core:

   - Virtual CPU hotplug support for arm64 ACPI systems

   - cpufeature infrastructure cleanups and making the FEAT_ECBHB ID
     bits visible to guests

   - CPU errata: expand the speculative SSBS workaround to more CPUs

   - GICv3, use compile-time PMR values: optimise the way regular IRQs
     are masked/unmasked when GICv3 pseudo-NMIs are used, removing the
     need for a static key in fast paths by using a priority value
     chosen dynamically at boot time

  ACPI:

   - 'acpi=nospcr' option to disable SPCR as default console for arm64

   - Move some ACPI code (cpuidle, FFH) to drivers/acpi/arm64/

  Perf updates:

   - Rework of the IMX PMU driver to enable support for I.MX95

   - Enable support for tertiary match groups in the CMN PMU driver

   - Initial refactoring of the CPU PMU code to prepare for the fixed
     instruction counter introduced by Arm v9.4

   - Add missing PMU driver MODULE_DESCRIPTION() strings

   - Hook up DT compatibles for recent CPU PMUs

  Kselftest updates:

   - Kernel mode NEON fp-stress

   - Cleanups, spelling mistakes

  Miscellaneous:

   - arm64 Documentation update with a minor clarification on TBI

   - Fix missing IPI statistics

   - Implement raw_smp_processor_id() using thread_info rather than a
     per-CPU variable (better code generation)

   - Make MTE checking of in-kernel asynchronous tag faults conditional
     on KASAN being enabled

   - Minor cleanups, typos"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (69 commits)
  selftests: arm64: tags: remove the result script
  selftests: arm64: tags_test: conform test to TAP output
  perf: add missing MODULE_DESCRIPTION() macros
  arm64: smp: Fix missing IPI statistics
  irqchip/gic-v3: Fix 'broken_rdists' unused warning when !SMP and !ACPI
  ACPI: Add acpi=nospcr to disable ACPI SPCR as default console on ARM64
  Documentation: arm64: Update memory.rst for TBI
  arm64/cpufeature: Replace custom macros with fields from ID_AA64PFR0_EL1
  KVM: arm64: Replace custom macros with fields from ID_AA64PFR0_EL1
  perf: arm_pmuv3: Include asm/arm_pmuv3.h from linux/perf/arm_pmuv3.h
  perf: arm_v6/7_pmu: Drop non-DT probe support
  perf/arm: Move 32-bit PMU drivers to drivers/perf/
  perf: arm_pmuv3: Drop unnecessary IS_ENABLED(CONFIG_ARM64) check
  perf: arm_pmuv3: Avoid assigning fixed cycle counter with threshold
  arm64: Kconfig: Fix dependencies to enable ACPI_HOTPLUG_CPU
  perf: imx_perf: add support for i.MX95 platform
  perf: imx_perf: fix counter start and config sequence
  perf: imx_perf: refactor driver for imx93
  perf: imx_perf: let the driver manage the counter usage rather the user
  perf: imx_perf: add macro definitions for parsing config attr
  ...
2024-07-15 17:06:19 -07:00
Lai Jiangshan
aa8684755a workqueue: Remove unneeded lockdep_assert_cpus_held()
The commit 19af457573 ("workqueue: Remove cpus_read_lock() from
apply_wqattrs_lock()") removes the unneed cpus_read_lock() after the pwq
creations and installations have been reworked based on wq_online_cpumask
rather than cpu_online_mask making cpus_read_lock() is unneeded during
wqattrs changes.

But it desn't remove the lockdep_assert_cpus_held() checks during wqattrs
changes, which leads to complaints from lockdep reported by kernel test
robot:

[   15.726567][  T131] ------------[ cut here ]------------
[ 15.728117][ T131] WARNING: CPU: 1 PID: 131 at kernel/cpu.c:525 lockdep_assert_cpus_held (kernel/cpu.c:525)
[   15.731191][  T131] Modules linked in: floppy(+) parport_pc(+) parport qemu_fw_cfg rtc_cmos
[   15.733423][  T131] CPU: 1 PID: 131 Comm: systemd-udevd Tainted: G                T  6.10.0-rc2-00254-g19af45757383 #1 df6f039f42e8818bf9a534449362ebad1aad32e2
[   15.737011][  T131] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 15.739760][ T131] EIP: lockdep_assert_cpus_held (kernel/cpu.c:525)
[ 15.741326][ T131] Code: 97 c2 03 72 20 83 3d f4 73 97 c2 00 74 17 55 89 e5 b8 fc bd 4d c2 ba ff ff ff ff e8 e4 57 d1 00 85 c0 74 06 5d 31 c0 31 d2 c3 <0f> 0b eb f6 90 90 90 90 90 90 90 90 90 90 90 90 90 90 55 89 e5 b8

Fix it by removing the unneeded lockdep_assert_cpus_held().
Also remove the unneed cpus_read_lock() from wq_affn_dfl_set().

tj: Dropped the removal of cpus_read_lock/unlock() in wq_affn_dfl_set() to
    keep this patch fix only.

Cc: kernel test robot <oliver.sang@intel.com>
Fixes: 19af45757383("workqueue: Remove cpus_read_lock() from apply_wqattrs_lock()")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202407141846.665c0446-lkp@intel.com
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-15 14:01:14 -10:00
Linus Torvalds
b02c520fee workqueue: Changes for v6.11
- Lai fixed a bug where CPU hotplug and workqueue attribute changes race
   leaving some workqueues not fully updated. This involved refactoring and
   changing how online CPUs are tracked. The resulting code is cleaner.
 
 - Workqueue watchdog touch operation was causing too much cacheline
   contention on very large machines. Nicholas improved scalabililty by
   avoiding unnecessary global updates.
 
 - Code cleanups and minor rescuer behavior improvement.
 
 - The last commit 58629d4871 ("workqueue: Always queue work items to the
   newest PWQ for order workqueues") is a cherry-picked straggler commit from
   for-6.10-fixes, a fix for a bug which may not actually trigger.
   Unfortunately, maybe because for-6.10-fixes was branched off at a
   different point from for-6.11, I couldn't persuade git request-pull to
   generate clean diffstat if pulled into for-6.11.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZpSn2g4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGe4UAQCc4Zo2qShh6M97r7K5Epa0XmHKqGcOvafiEZQx
 qwXfyAD9HWBrwPfdA+L1hHnMRTtZJapvWnBRK/swdg8Rou5Grgk=
 =t60j
 -----END PGP SIGNATURE-----

Merge tag 'wq-for-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq

Pull workqueue updates from Tejun Heo:

 - Lai fixed a bug where CPU hotplug and workqueue attribute changes
   race leaving some workqueues not fully updated. This involved
   refactoring and changing how online CPUs are tracked. The resulting
   code is cleaner.

 - Workqueue watchdog touch operation was causing too much cacheline
   contention on very large machines. Nicholas improved scalabililty by
   avoiding unnecessary global updates.

 - Code cleanups and minor rescuer behavior improvement.

 - The last commit 58629d4871 ("workqueue: Always queue work items to
   the newest PWQ for order workqueues") is a cherry-picked straggler
   commit from for-6.10-fixes, a fix for a bug which may not actually
   trigger.

* tag 'wq-for-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (24 commits)
  workqueue: Always queue work items to the newest PWQ for order workqueues
  workqueue: Rename wq_update_pod() to unbound_wq_update_pwq()
  workqueue: Remove the arguments @hotplug_cpu and @online from wq_update_pod()
  workqueue: Remove the argument @cpu_going_down from wq_calc_pod_cpumask()
  workqueue: Remove the unneeded cpumask empty check in wq_calc_pod_cpumask()
  workqueue: Remove cpus_read_lock() from apply_wqattrs_lock()
  workqueue: Simplify wq_calc_pod_cpumask() with wq_online_cpumask
  workqueue: Add wq_online_cpumask
  workqueue: Init rescuer's affinities as the wq's effective cpumask
  workqueue: Put PWQ allocation and WQ enlistment in the same lock C.S.
  workqueue: Move kthread_flush_worker() out of alloc_and_link_pwqs()
  workqueue: Make rescuer initialization as the last step of the creation of a new wq
  workqueue: Register sysfs after the whole creation of the new wq
  workqueue: Simplify goto statement
  workqueue: Update cpumasks after only applying it successfully
  workqueue: Improve scalability of workqueue watchdog touch
  workqueue: wq_watchdog_touch is always called with valid CPU
  workqueue: Remove useless pool->dying_workers
  workqueue: Detach workers directly in idle_cull_fn()
  workqueue: Don't bind the rescuer in the last working cpu
  ...
2024-07-15 16:51:22 -07:00
Linus Torvalds
895b9b1207 cgroup: Changes for v6.11
- Added Michal Koutný as a maintainer.
 
 - Counters in pids.events were behaving inconsistently. pids.events made
   properly hierarchical and pids.events.local added.
 
 - misc.peak and misc.events.local added.
 
 - cpuset remote partition creation and cpuset.cpus.exclusive handling
   improved.
 
 - Code cleanups, non-critical fixes, doc updates.
 
 - for-6.10-fixes is merged in to receive two non-critical fixes that didn't
   trigger pull.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZpSsdw4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGSEMAQDQ5VfcRz+rW20ez5IAgyN3EKIwSbW6pY6jojgj
 bJtJSQD/TzA8DoRxcCvTdHcZcwJ2e2zBVcuM8NkZHfSCNiPrrgs=
 =5f3I
 -----END PGP SIGNATURE-----

Merge tag 'cgroup-for-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup updates from Tejun Heo:

 - Added Michal Koutný as a maintainer

 - Counters in pids.events were behaving inconsistently. pids.events
   made properly hierarchical and pids.events.local added

 - misc.peak and misc.events.local added

 - cpuset remote partition creation and cpuset.cpus.exclusive handling
   improved

 - Code cleanups, non-critical fixes, doc updates

 - for-6.10-fixes is merged in to receive two non-critical fixes that
   didn't trigger pull

* tag 'cgroup-for-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (23 commits)
  cgroup: Add Michal Koutný as a maintainer
  cgroup/misc: Introduce misc.events.local
  cgroup/rstat: add force idle show helper
  cgroup: Protect css->cgroup write under css_set_lock
  cgroup/misc: Introduce misc.peak
  cgroup_misc: add kernel-doc comments for enum misc_res_type
  cgroup/cpuset: Prevent UAF in proc_cpuset_show()
  selftest/cgroup: Update test_cpuset_prs.sh to match changes
  cgroup/cpuset: Make cpuset.cpus.exclusive independent of cpuset.cpus
  cgroup/cpuset: Delay setting of CS_CPU_EXCLUSIVE until valid partition
  selftest/cgroup: Fix test_cpuset_prs.sh problems reported by test robot
  cgroup/cpuset: Fix remote root partition creation problem
  cgroup: avoid the unnecessary list_add(dying_tasks) in cgroup_exit()
  cgroup/cpuset: Optimize isolated partition only generate_sched_domains() calls
  cgroup/cpuset: Reduce the lock protecting CS_SCHED_LOAD_BALANCE
  kernel/cgroup: cleanup cgroup_base_files when fail to add cgroup_psi_files
  selftests: cgroup: Add basic tests for pids controller
  selftests: cgroup: Lexicographic order in Makefile
  cgroup/pids: Add pids.events.local
  cgroup/pids: Make event counters hierarchical
  ...
2024-07-15 16:41:32 -07:00
Linus Torvalds
e4b2b0b1e4 kcsan: Add __data_racy documentation and module description
This series contains on commit that improves the documentation for the
 new __data_racy type qualifier to the data_race() macro's kernel-doc
 header and to the LKMM's access-marking documentation.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmaRubITHHBhdWxtY2tA
 a2VybmVsLm9yZwAKCRCevxLzctn7jLWlD/99wMLUIfPh1cvWVVyhv8tytWuLJn6y
 olwukg9pOCz6WGucECfjAC2kFivSQYxS3b54K697tF4hnABEtpsx7ozkv11kgyR8
 niHNv4P+L+0J7CnM/g7gkLrAGosdo+PF7rUhX3u3kpeasVjJ5NyK379jUeTjrDrc
 ia34vjbVVNdJ5v0c4ITxbbV/NKecbsadRSqDLzjtNrFPkwo/yfFCrz8ddztadsTd
 4jftt/L4Up51QQ8NAgiHsHdp+iY/FRjwd/QtvEXSVUZ38sGseH+eNqn4hMuyTnka
 cjnHwAlTbEfAuR3Mdcz25ToiaI6qNxptvvM7E9+LtHuBhI+h6vEvdvPMmSFo48Rv
 kzdPix39O5dqpu49nz/Qsmmr/AWn9i3M5oht62YMJ5twoutDFAcc5MMppPT6sNsX
 Kq3fn8fjRvdHqgE3jBYOgDDWEZuTxdtcdv8wKfplID/SgyjXL48fghMEl62b2O1d
 X6PrHhKARk7RKMndAPk0UNz4m8T5oDmKu9Z9mF8UidjeZjYU0tVYPAoN1H/9lVdH
 Kn2DoTW5Oz2A21z37r5SGAro0QBE50ZMYA+xH+Ab3fyfFcd4l4ia/wu06c3AVJFc
 DqgKJ7f3YQ7tHGLf0AEpO6o/nNZGxwDHJ5Pjbteu2RwbCcUHmUtKxrM+McPXTvuv
 MjW/IAvvPdMxCw==
 =r0/X
 -----END PGP SIGNATURE-----

Merge tag 'kcsan.2024.07.12a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu

Pull KCSAN updates from Paul McKenney:

 - improve the documentation for the new __data_racy type qualifier
   to the data_race() macro's kernel-doc header and to the LKMM's
   access-marking documentation

 - add missing MODULE_DESCRIPTION

* tag 'kcsan.2024.07.12a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
  kcsan: Add missing MODULE_DESCRIPTION() macro
  kcsan: Add example to data_race() kerneldoc header
2024-07-15 15:44:40 -07:00
Linus Torvalds
b176e21d81 Torture-test updates for v6.11
This pull request adds MODULE_DESCRIPTION() to torture.c, locktorture.c,
 and scftorture.c, and also adds static to a global variable that is used
 only in scftorture.c.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmaRxOcTHHBhdWxtY2tA
 a2VybmVsLm9yZwAKCRCevxLzctn7jCa6D/9/Q/2MKzFMH3MzBVQXNtELbxw5r/IE
 7BWcNE3MUt6If2ZyybvF3KoyGp4cfjyyox0EZWDeOHc4yJ85DVflzG9kn52tc9xZ
 QBgLQxtovnyoqHGjCKt4ULYAUBl6KlL2H+YOAYc576w5EcL+kbomjXDJEH5HtbAe
 p7KT+bWCcYOkz2irEoK32mlv3nG8eVzySODAbtNKQQv9fkiwftPAD/cRTYbCsEKH
 jk9JLToAbBbN8LXqWGVa/FrTZeXIt8DieIffsJi1t7K//k6nnGojUqgOAej5A6rS
 bGQZs7Z/Yb3KCF8hbWOqV0ofgBOAHWo/kE8BvcCwvnFV8d6DrEucVT5i/FQugvrE
 HA/HwsS+Bymp70/1iZHtKA+p710/Snh99dr6Okylj8wUhMJYK9UmA0RIq9QIkhhU
 VNXg7SmRdT/oZnxHXR2iqWSVjrpgAEgzT6Jw0PLLeFpVS+e95LtOh0SFvPCaNVJ9
 bDyYedB1Rbfsovg0XShGeNNTNKZrgqWav7XfpHPz/YnwJNahWJb+xjZ7L3+wIjAv
 CIcAaqU6nk+TnjG5zXKRdtV/8/qcPbe8Pij9TPd2iJ4NvONLDAGkxDLy/aY0uVei
 NGxsmM4F+Bgd9ry8xrvQ8xQg54VRABI6BUwdRiYUo7+WPdZ2DPbjWiAQf5FtQ+ng
 HosxD6XdCSa0Iw==
 =WdlA
 -----END PGP SIGNATURE-----

Merge tag 'torture.2024.07.12a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu

Pull torture-test updates from Paul McKenney:
 "This adds MODULE_DESCRIPTION() to torture.c, locktorture.c, and
  scftorture.c, and also adds 'static' to a global variable that is used
  only in scftorture.c"

* tag 'torture.2024.07.12a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
  scftorture: Make torture_type static
  scftorture: Add MODULE_DESCRIPTION()
  locktorture: Add MODULE_DESCRIPTION()
  torture: Add MODULE_DESCRIPTION()
2024-07-15 15:39:25 -07:00
Linus Torvalds
9855e87328 RCU pull request for v6.11
doc.2024.06.06a: Update Tasks RCU and Tasks Rude RCU description in
 	Requirements.rst and clarify rcu_assign_pointer() and
 	rcu_dereference() ordering properties.
 
 fixes.2024.07.04a: Add lockdep assertions for RCU readers, limit inline
 	wakeups for callback-bypass synchronize_rcu(), add an
 	rcutree.nohz_full_patience_delay to reduce nohz_full OS jitter,
 	add Uladzislau Rezki as RCU maintainer, and fix a subtle
 	callback-migration memory-ordering issue.
 
 mb.2024.06.28a: Remove a number of redundant memory barriers.
 
 nocb.2024.06.03a: Remove unnecessary bypass-list lock-contention
 	mitigation, use parking API instead of open-coded ad-hoc
 	equivalent, and upgrade obsolete comments.
 
 rcu-tasks.2024.06.06a: Revert avoidance of a deadlock that can no
 	longer occur and properly synchronize Tasks Trace RCU checking
 	of runqueues.
 
 rcutorture.2024.06.06a: Add tests for handling of double-call_rcu()
 	bug, add missing MODULE_DESCRIPTION, and add a script that
 	histograms the number of calls to RCU updaters.
 
 srcu.2024.06.18a: Fill out SRCU polled-grace-period API.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmaR7/QTHHBhdWxtY2tA
 a2VybmVsLm9yZwAKCRCevxLzctn7jGwAEACJKef2LryG6khoJdorWbvRf1V2k23H
 19CxXexCE4UoGsgGST9z1/5rM8kBdNhdhQ0JB9CitW+zGlXpOM79/mO3gALKMj++
 YBPw9B5EM622H2cKJGFzoHFSO4X9nM1CCMeuFCo6bVsbWfMtX3ENqsYl2IQy1JkB
 pGiKqcNXGWU0mdUcZKs/8ilfLG1NhaLwrkfinlsP9V1+8z8LxxDH5Qh27AT3rIvu
 W87OITTZoHlUaDVHYTautHTZoqM381xv9kNoQlS9lpH/gcFOPiO9DLj8NcLjkJ4y
 S/OrxOwfQ+BGKwnk8daFQFAc3Nr9KeVAQH7CbOW7guARhj3z97J0+wPm6nZGEE2s
 tDzg8zLT9LtbmUypJLurl29+wFE4fPNsnd69XDONbMFN1Ox2tJM3dd/rPCsHSUvz
 kEOK9gUreHOv7/Ou6UIHlYVlHY7HHuD7TAsrhaaWk7CEmlY31UKwXG+fMl1FAnSy
 F3PcBF/1M687RRFWVeMlug/+0/+ghtc+kZ1YyR79KZR6dI0C7ueQbCBGztCCtFDz
 RjrHcDifS0Y2GNQO9+zAyrJvttidRATdYDeFstk+8nnta3CnYzxCp4rn5hs3Ss3N
 AJVJm244jR3AcoL4V/tQwiQlYh9ZYN5tZ7qxFiASdtV50Uc8HoIrWXeP0Ar+GHiV
 2z/f5fKF4+5clQ==
 =7a1C
 -----END PGP SIGNATURE-----

Merge tag 'rcu.2024.07.12a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu

Pull RCU updates from Paul McKenney:

 - Update Tasks RCU and Tasks Rude RCU description in Requirements.rst
   and clarify rcu_assign_pointer() and rcu_dereference() ordering
   properties

 - Add lockdep assertions for RCU readers, limit inline wakeups for
   callback-bypass synchronize_rcu(), add an
   rcutree.nohz_full_patience_delay to reduce nohz_full OS jitter, add
   Uladzislau Rezki as RCU maintainer, and fix a subtle
   callback-migration memory-ordering issue

 - Remove a number of redundant memory barriers

 - Remove unnecessary bypass-list lock-contention mitigation, use
   parking API instead of open-coded ad-hoc equivalent, and upgrade
   obsolete comments

 - Revert avoidance of a deadlock that can no longer occur and properly
   synchronize Tasks Trace RCU checking of runqueues

 - Add tests for handling of double-call_rcu() bug, add missing
   MODULE_DESCRIPTION, and add a script that histograms the number of
   calls to RCU updaters

 - Fill out SRCU polled-grace-period API

* tag 'rcu.2024.07.12a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (29 commits)
  rcu: Fix rcu_barrier() VS post CPUHP_TEARDOWN_CPU invocation
  rcu: Eliminate lockless accesses to rcu_sync->gp_count
  MAINTAINERS: Add Uladzislau Rezki as RCU maintainer
  rcu: Add rcutree.nohz_full_patience_delay to reduce nohz_full OS jitter
  rcu/exp: Remove redundant full memory barrier at the end of GP
  rcu: Remove full memory barrier on RCU stall printout
  rcu: Remove full memory barrier on boot time eqs sanity check
  rcu/exp: Remove superfluous full memory barrier upon first EQS snapshot
  rcu: Remove superfluous full memory barrier upon first EQS snapshot
  rcu: Remove full ordering on second EQS snapshot
  srcu: Fill out polled grace-period APIs
  srcu: Update cleanup_srcu_struct() comment
  srcu: Add NUM_ACTIVE_SRCU_POLL_OLDSTATE
  srcu: Disable interrupts directly in srcu_gp_end()
  rcu: Disable interrupts directly in rcu_gp_init()
  rcu/tree: Reduce wake up for synchronize_rcu() common case
  rcu/tasks: Fix stale task snaphot for Tasks Trace
  tools/rcu: Add rcu-updaters.sh script
  rcutorture: Add missing MODULE_DESCRIPTION() macros
  rcutorture: Fix rcu_torture_fwd_cb_cr() data race
  ...
2024-07-15 15:25:27 -07:00
Linus Torvalds
4fd9435641 Updates for timers, timekeeping and related functionality:
- Core:
 
     - Make the takeover of a hrtimer based broadcast timer reliable during
       CPU hot-unplug. The current implementation suffers from a race which
       can lead to broadcast timer starvation in the worst case.
 
     - VDSO related cleanups and simplifications
 
     - Small cleanups and enhancements all over the place
 
   - PTP:
 
     - Replace the architecture specific base clock to clocksource, e.g. ART
       to TSC, conversion function with generic functionality to avoid
       exposing such internals to drivers and convert all existing drivers
       over. This also allows to provide functionality which converts the
       other way round in the core code based on the same parameter set.
 
     - Provide a function to convert CLOCK_REALTIME to the base clock to
       support the upcoming PPS output driver on Intel platforms.
 
   - Drivers:
 
     - A set of Device Tree bindings for new hardware
 
     - Cleanups and enhancements all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmaUOM0THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYofolD/9kK+aYdDj1gCFuZXZ2wTgMMxFmf/91
 0UcsGRuBJiIXs3H3iizQ0Mb0cdTW6qZJoBp0jPlvUSm0BEKdEgE1uRX2RuAPZ/Gq
 4/54ZJVopKSgAqeJFmqQubRVSv2XdMRAAJT0o1oUG3jZ0c6u8vqArIh5ZCnu13l/
 tsNOeYLYzQFyA30eHSJ/KjQ2zHwAhJnl5a/b7pdAvxmlN37bGgKEpglv+9zwFiDB
 K/kWbpb/oED9WOmoQy5QYi8iSvLQHEhFGrqzXV3fegu/B/mBBf/bpsisVx7Z1m2R
 nzxNqg86RdMjNR6giwBETZjm7YxM+gKb9nCBNILjbjWZFC4tyrBkLGJ+KniTRNyZ
 M5R4X1oP/14h00qXmCgIEFWysXaJRewYI+TIm8R2rLXrR6Tf3c4oL6fHQJxy3X52
 7A+4Z/vOk/KX6PxYmLC+xQDukhFh2nirVYsP1oNM9yC9zR/wkBBXTTmUSAI+8m8l
 KphniSPS2HMSBI6TtgOT8SKY7lRUZTnafBZq7wRXCv0Zz8AXoofgQDmBkXC99BkB
 MjLvRotJVJvY9a8LtA7htjDg/jiEMa0wHRNAGNSbflKoAKrJzoE5WbFxFZKbq3vZ
 o8cEYRMAIP+X+qn+oymT45XXXQlifZiccJdAi9FqDTvplEib2jmTmH6Ae5Khkr4l
 Lbzh/nSKVN7lOg==
 =8GjP
 -----END PGP SIGNATURE-----

Merge tag 'timers-core-2024-07-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer updates from Thomas Gleixner:
 "Updates for timers, timekeeping and related functionality:

  Core:

   - Make the takeover of a hrtimer based broadcast timer reliable
     during CPU hot-unplug. The current implementation suffers from a
     race which can lead to broadcast timer starvation in the worst
     case.

   - VDSO related cleanups and simplifications

   - Small cleanups and enhancements all over the place

  PTP:

   - Replace the architecture specific base clock to clocksource, e.g.
     ART to TSC, conversion function with generic functionality to avoid
     exposing such internals to drivers and convert all existing drivers
     over. This also allows to provide functionality which converts the
     other way round in the core code based on the same parameter set.

   - Provide a function to convert CLOCK_REALTIME to the base clock to
     support the upcoming PPS output driver on Intel platforms.

  Drivers:

   - A set of Device Tree bindings for new hardware

   - Cleanups and enhancements all over the place"

* tag 'timers-core-2024-07-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
  clocksource/drivers/realtek: Add timer driver for rtl-otto platforms
  dt-bindings: timer: Add schema for realtek,otto-timer
  dt-bindings: timer: Add SOPHGO SG2002 clint
  dt-bindings: timer: renesas,tmu: Add R-Car Gen2 support
  dt-bindings: timer: renesas,tmu: Add RZ/G1 support
  dt-bindings: timer: renesas,tmu: Add R-Mobile APE6 support
  clocksource/drivers/mips-gic-timer: Correct sched_clock width
  clocksource/drivers/mips-gic-timer: Refine rating computation
  clocksource/drivers/sh_cmt: Address race condition for clock events
  clocksource/driver/arm_global_timer: Remove unnecessary ‘0’ values from err
  clocksource/drivers/arm_arch_timer: Remove unnecessary ‘0’ values from irq
  tick/broadcast: Make takeover of broadcast hrtimer reliable
  tick/sched: Combine WARN_ON_ONCE and print_once
  x86/vdso: Remove unused include
  x86/vgtod: Remove unused typedef gtod_long_t
  x86/vdso: Fix function reference in comment
  vdso: Add comment about reason for vdso struct ordering
  vdso/gettimeofday: Clarify comment about open coded function
  timekeeping: Add missing kernel-doc function comments
  tick: Remove unnused tick_nohz_get_idle_calls()
  ...
2024-07-15 15:03:09 -07:00
Linus Torvalds
0eff0491e7 A small set of SMP/CPU hotplug updates:
- Reverse the order of iteration when freezing secondary CPUs for
     hibernation.
 
     This avoids that drivers like the Intel uncore performance counter have
     to transfer the assignement of handling the per package uncore events
     for every CPU in a package, which is a considerable speedup on larger
     systems.
 
   - Add a missing destroy_work_on_stack() invocation in smp_call_on_cpu()
     to prevent debug objects to emit a false positive warning when the
     stack is freed.
 
   - Small cleanups in comments and a str_plural() conversion
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmaUNnYTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYocjbEACqfc8EkTEJczAMbAaq7Nq7jYGRt5e8
 UyWpbnT1re63TAiGUEAr7pPJd6ucXAwPOHBt0V+j1HIr2UsaGT2/cfnlSipR68xh
 UmqPxXpuba+JeZDS0Bf6gRXKYBogVWFgjP8cp+f9IKr6xMgwJc7ujB9mX0YKmIgz
 bDoaFS+NFSD1ZS7tuidLqfU9UmJYRDRhyZg124HwDXG20zHR2CHgP4QQ2bHOvxZU
 LKbdjoHBmGtkLomS+R1UxsT+onsnE0c1EN37LX0mUE5L1YbUTcbZlXLLG7T35jOO
 rvai+EKVbA2KUAtrM/LZ8WZ0Lt5DTjMrouyzv6of7N2WljijdlxMb04sXl7kdng+
 rohRfDB6yNhQhEnDx6fd+IP/JCpPCctCmkN/QbvMBfnRnTe1yNg/gmnWaY3RAHNM
 GBifAxSEosMn7AMnSs/or6DAVNHVxI3Ms+r4Xb/o1JvGx7PXiBULh1Nww5WdxmBl
 IraXNR/R0qXUokXmNtJrq3aG/SepKWAc0BJJc3b6zA9tf5rsizTaxetTVZLS6jOX
 /DHJOOgAlLRDZdE53YpdL3HVVTdM/BDSM0xpMO7uxdJ4laJ9s+7dFGk9KXxu24qM
 6dIG6hn7D9XT0q05sP+r7qO0ygIe0Qorg5bA+xgvpYdPNrVfQpjzkj5jwkspvk+l
 5/gpTUmVm8Vwhw==
 =M4eq
 -----END PGP SIGNATURE-----

Merge tag 'smp-core-2024-07-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull CPU hotplug updates from Thomas Gleixner:
 "A small set of SMP/CPU hotplug updates:

   - Reverse the order of iteration when freezing secondary CPUs for
     hibernation.

     This avoids that drivers like the Intel uncore performance counter
     have to transfer the assignement of handling the per package uncore
     events for every CPU in a package, which is a considerable speedup
     on larger systems.

   - Add a missing destroy_work_on_stack() invocation in
     smp_call_on_cpu() to prevent debug objects to emit a false positive
     warning when the stack is freed.

   - Small cleanups in comments and a str_plural() conversion"

* tag 'smp-core-2024-07-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  smp: Add missing destroy_work_on_stack() call in smp_call_on_cpu()
  cpu/hotplug: Reverse order of iteration in freeze_secondary_cpus()
  smp: Use str_plural() to fix Coccinelle warnings
  cpu/hotplug: Fix typo in comment
2024-07-15 14:55:30 -07:00
Linus Torvalds
3a56e24173 for-6.11/io_uring-20240714
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmaTgusQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpr+1EAC4I7pRAM341sfmhe/9QQKMM8VzGwy5Tlr1
 AFLO3BujRTl6X8S9fQjIjN1coW6u4F42I19+vVlxqvB7CUnqt9VWpexEjxe4K0FR
 R+hIZW+fWV9K/eMrcsLcI7oReN5kIihHOzzy3wz0rENoGB5dCl6JAZMHDUCSqP0/
 ZJJQ5ut8ah20Y/myHnzP5o4TfdE7nGo73Di2YoE2g3KqeX/dlAKW9+5hqKzzrHhM
 2U25k/6KLy0ROzKpy2qW0QRE3pT5udoHLK2ue9+XwXF8JWVTlfVkHBzGY7NstyyT
 z07SEzW1q4xV1HdCwGDAU7cL2NJMRXSG0p2WZTm8QyaVTdsZQvEx08GLsVdLvFH5
 Gg+oOaxVE+INzW+/Lwz7lFHgq6XEjdAlEAOXDtGkZoni6Rt6iCzFCW6RTf/guy8o
 Cub7tatMyegxai9+FTN/oFVoydRR0tsMf0OHrWnLOperh9CaxAwXvmKFeT/UTwiB
 KIuIOJop7aThJbiV42a/xwTrEjNMZRv6uVBBEtJX3rxpmIhqTbjcAv9rKMmgtLMk
 s6yX1MvYdOLhhEDyoUBX0dJdEETBf3KbnYIwi8kb4Sbkw/ZDgnkmSxFysom61wUF
 byAFEpah3ZFR8aES0uNKUE6UHK6i5qqp0Za/n6gA927E/WGCU9ndaS+01gyknog0
 8FqFYwruHQ==
 =50CO
 -----END PGP SIGNATURE-----

Merge tag 'for-6.11/io_uring-20240714' of git://git.kernel.dk/linux

Pull io_uring updates from Jens Axboe:
 "Here are the io_uring updates queued up for 6.11.

  Nothing major this time around, various minor improvements and
  cleanups/fixes. This contains:

   - Add bind/listen opcodes. Main motivation is to support direct
     descriptors, to avoid needing a regular fd just for doing these two
     operations (Gabriel)

   - Probe fixes (Gabriel)

   - Treat io-wq work flags as atomics. Not fixing a real issue, but may
     as well and it silences a KCSAN warning (me)

   - Cleanup of rsrc __set_current_state() usage (me)

   - Add 64-bit for {m,f}advise operations (me)

   - Improve performance of data ring messages (me)

   - Fix for ring message overflow posting (Pavel)

   - Fix for freezer interaction with TWA_NOTIFY_SIGNAL. Not strictly an
     io_uring thing, but since TWA_NOTIFY_SIGNAL was originally added
     for faster task_work signaling for io_uring, bundling it with this
     pull (Pavel)

   - Add Pavel as a co-maintainer

   - Various cleanups (me, Thorsten)"

* tag 'for-6.11/io_uring-20240714' of git://git.kernel.dk/linux: (28 commits)
  io_uring/net: check socket is valid in io_bind()/io_listen()
  kernel: rerun task_work while freezing in get_signal()
  io_uring/io-wq: limit retrying worker initialisation
  io_uring/napi: Remove unnecessary s64 cast
  io_uring/net: cleanup io_recv_finish() bundle handling
  io_uring/msg_ring: fix overflow posting
  MAINTAINERS: change Pavel Begunkov from io_uring reviewer to maintainer
  io_uring/msg_ring: use kmem_cache_free() to free request
  io_uring/msg_ring: check for dead submitter task
  io_uring/msg_ring: add an alloc cache for io_kiocb entries
  io_uring/msg_ring: improve handling of target CQE posting
  io_uring: add io_add_aux_cqe() helper
  io_uring: add remote task_work execution helper
  io_uring/msg_ring: tighten requirement for remote posting
  io_uring: Allocate only necessary memory in io_probe
  io_uring: Fix probe of disabled operations
  io_uring: Introduce IORING_OP_LISTEN
  io_uring: Introduce IORING_OP_BIND
  net: Split a __sys_listen helper for io_uring
  net: Split a __sys_bind helper for io_uring
  ...
2024-07-15 13:49:10 -07:00
levi.yun
7dc836187f trace/pid_list: Change gfp flags in pid_list_fill_irq()
pid_list_fill_irq() runs via irq_work.
When CONFIG_PREEMPT_RT is disabled, it would run in irq_context.
so it shouldn't sleep while memory allocation.

Change gfp flags from GFP_KERNEL to GFP_NOWAIT to prevent sleep in
irq_work.

This change wouldn't impact functionality in practice because the worst-size
is 2K.

Cc: stable@goodmis.org
Fixes: 8d6e90983a ("tracing: Create a sparse bitmask for pid filtering")
Link: https://lore.kernel.org/20240704150226.1359936-1-yeoreum.yun@arm.com
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: levi.yun <yeoreum.yun@arm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-07-15 15:07:14 -04:00
Linus Torvalds
b051320d6a vfs-6.11.misc
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZpEF0AAKCRCRxhvAZXjc
 oq0TAQDjfTLN75RwKQ34RIFtRun2q+OMfBQtSegtaccqazghyAD/QfmPuZDxB5DL
 rsI/5k5O4VupIKrEdIaqvNxmkmDsSAc=
 =bf7E
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.11.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
 "Features:

   - Support passing NULL along AT_EMPTY_PATH for statx().

     NULL paths with any flag value other than AT_EMPTY_PATH go the
     usual route and end up with -EFAULT to retain compatibility (Rust
     is abusing calls of the sort to detect availability of statx)

     This avoids path lookup code, lockref management, memory allocation
     and in case of NULL path userspace memory access (which can be
     quite expensive with SMAP on x86_64)

   - Don't block i_writecount during exec. Remove the
     deny_write_access() mechanism for executables

   - Relax open_by_handle_at() permissions in specific cases where we
     can prove that the caller had sufficient privileges to open a file

   - Switch timespec64 fields in struct inode to discrete integers
     freeing up 4 bytes

  Fixes:

   - Fix false positive circular locking warning in hfsplus

   - Initialize hfs_inode_info after hfs_alloc_inode() in hfs

   - Avoid accidental overflows in vfs_fallocate()

   - Don't interrupt fallocate with EINTR in tmpfs to avoid constantly
     restarting shmem_fallocate()

   - Add missing quote in comment in fs/readdir

  Cleanups:

   - Don't assign and test in an if statement in mqueue. Move the
     assignment out of the if statement

   - Reflow the logic in may_create_in_sticky()

   - Remove the usage of the deprecated ida_simple_xx() API from procfs

   - Reject FSCONFIG_CMD_CREATE_EXCL requets that depend on the new
     mount api early

   - Rename variables in copy_tree() to make it easier to understand

   - Replace WARN(down_read_trylock, ...) abuse with proper asserts in
     various places in the VFS

   - Get rid of user_path_at_empty() and drop the empty argument from
     getname_flags()

   - Check for error while copying and no path in one branch in
     getname_flags()

   - Avoid redundant smp_mb() for THP handling in do_dentry_open()

   - Rename parent_ino to d_parent_ino and make it use RCU

   - Remove unused header include in fs/readdir

   - Export in_group_capable() helper and switch f2fs and fuse over to
     it instead of open-coding the logic in both places"

* tag 'vfs-6.11.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (27 commits)
  ipc: mqueue: remove assignment from IS_ERR argument
  vfs: rename parent_ino to d_parent_ino and make it use RCU
  vfs: support statx(..., NULL, AT_EMPTY_PATH, ...)
  stat: use vfs_empty_path() helper
  fs: new helper vfs_empty_path()
  fs: reflow may_create_in_sticky()
  vfs: remove redundant smp_mb for thp handling in do_dentry_open
  fuse: Use in_group_or_capable() helper
  f2fs: Use in_group_or_capable() helper
  fs: Export in_group_or_capable()
  vfs: reorder checks in may_create_in_sticky
  hfs: fix to initialize fields of hfs_inode_info after hfs_alloc_inode()
  proc: Remove usage of the deprecated ida_simple_xx() API
  hfsplus: fix to avoid false alarm of circular locking
  Improve readability of copy_tree
  vfs: shave a branch in getname_flags
  vfs: retire user_path_at_empty and drop empty arg from getname_flags
  vfs: stop using user_path_at_empty in do_readlinkat
  tmpfs: don't interrupt fallocate with EINTR
  fs: don't block i_writecount during exec
  ...
2024-07-15 10:52:51 -07:00
Masahiro Yamada
c442db3f49 kbuild: remove PROVIDE() for kallsyms symbols
This reimplements commit 951bcae6c5 ("kallsyms: Avoid weak references
for kallsyms symbols") because I am not a big fan of PROVIDE().

As an alternative solution, this commit prepends one more kallsyms step.

    KSYMS   .tmp_vmlinux.kallsyms0.S          # added
    AS      .tmp_vmlinux.kallsyms0.o          # added
    LD      .tmp_vmlinux.btf
    BTF     .btf.vmlinux.bin.o
    LD      .tmp_vmlinux.kallsyms1
    NM      .tmp_vmlinux.kallsyms1.syms
    KSYMS   .tmp_vmlinux.kallsyms1.S
    AS      .tmp_vmlinux.kallsyms1.o
    LD      .tmp_vmlinux.kallsyms2
    NM      .tmp_vmlinux.kallsyms2.syms
    KSYMS   .tmp_vmlinux.kallsyms2.S
    AS      .tmp_vmlinux.kallsyms2.o
    LD      vmlinux

Step 0 takes /dev/null as input, and generates .tmp_vmlinux.kallsyms0.o,
which has a valid kallsyms format with the empty symbol list, and can be
linked to vmlinux. Since it is really small, the added compile-time cost
is negligible.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Nicolas Schier <nicolas@fjasle.eu>
2024-07-16 01:08:36 +09:00
Vlastimil Babka
436381eaf2 Merge branch 'slab/for-6.11/buckets' into slab/for-next
Merge all the slab patches previously collected on top of v6.10-rc1,
over cleanups/fixes that had to be based on rc6.
2024-07-15 10:44:16 +02:00
Lai Jiangshan
58629d4871 workqueue: Always queue work items to the newest PWQ for order workqueues
To ensure non-reentrancy, __queue_work() attempts to enqueue a work
item to the pool of the currently executing worker. This is not only
unnecessary for an ordered workqueue, where order inherently suggests
non-reentrancy, but it could also disrupt the sequence if the item is
not enqueued on the newest PWQ.

Just queue it to the newest PWQ and let order management guarantees
non-reentrancy.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Fixes: 4c065dbce1 ("workqueue: Enable unbound cpumask update on ordered workqueues")
Cc: stable@vger.kernel.org # v6.9+
Signed-off-by: Tejun Heo <tj@kernel.org>
(cherry picked from commit 74347be3edfd11277799242766edf844c43dd5d3)
2024-07-14 18:20:19 -10:00
Tejun Heo
9283ff5be1 Merge branch 'for-6.10-fixes' into for-6.11 2024-07-14 18:04:03 -10:00
Kent Overstreet
1a616c2fe9 lockdep: lockdep_set_notrack_class()
Add a new helper to disable lockdep tracking entirely for a given class.

This is needed for bcachefs, which takes too many btree node locks for
lockdep to track. Instead, we have a single lockdep_map for "btree_trans
has any btree nodes locked", which makes more since given that we have
centralized lock management and a cycle detector.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:16 -04:00
Linus Torvalds
365346980e - Fix a performance regression when measuring the CPU time of a thread
(clock_gettime(CLOCK_THREAD_CPUTIME_ID,...)) due to the addition of
   PSI IRQ time accounting in the hotpath
 
 - Fix a task_struct leak due to missing to decrement the refcount when
   the task is enqueued before the timer which is supposed to do that,
   expires
 
 - Revert an attempt to expedite detaching of movable tasks, as finding
   those could become very costly. Turns out the original issue wasn't
   even hit by anyone
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmaTmqUACgkQEsHwGGHe
 VUos3BAAgeZdeFiqop5TuNPURy7DDFpl/Ibwe9Wv1PGvZ70WHT0Aqf6S+woE91+g
 uRR9VZnyS7ODUEP4PD43zFeBHbrt6mZkKTyPRKxiylZpJGOp1KGfGmaxPEoi+kC+
 3rwphrs7F6cJ0H4mKvqj5+x1jA19L/RZ7LqZ4tZBicwkZXmBnk4Hy9mlO/5Neb2Q
 SqhzzCSVgpUW3mVvpPetst8N26R7BTYkejA3RWmCr8xFLB9nyzLBX5uGPtolv1QZ
 B5gRtK5ZY2tohdKaShFqdiSUDoUAKuO2WPLSS/ALZDfwK5a+Pue7uGt97OSHhVLt
 fTCcPcWDiNj5t/uA7FYXA9wiTQmATUzPtvj2urf/mVupaMLZJiadnQvX0Ya9YrCr
 9dyowk7me3326FvUzeqga12cyUtPxElfVsb3KzT1YzSyu7nd/Kezn8Lie41xzZJL
 cSqhex3Xl8Eoxvkf2gIy9FnBrx8t47B8SYCSK+JvjeGzIgeBpiHM4zsd5RE5xZ8d
 9m2uhlFXV+fIlDrqO1RA9k3yRfvbnuGrIHQUo0B2EKC/u/aSvrQVShQwnKzjM1u7
 mXMMyPUxyiDi2VLCTUcKqhLmf77TA0/1px+4dQ7E+ar6tvZxeBUrhoTM9jbNqDWl
 g5ShAHBWQXifoIytyhFNM5cfJnkRll97LyKm3LXyG6yZ7VK/JL8=
 =gxU0
 -----END PGP SIGNATURE-----

Merge tag 'sched_urgent_for_v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Borislav Petkov:

 - Fix a performance regression when measuring the CPU time of a thread
   (clock_gettime(CLOCK_THREAD_CPUTIME_ID,...)) due to the addition of
   PSI IRQ time accounting in the hotpath

 - Fix a task_struct leak due to missing to decrement the refcount when
   the task is enqueued before the timer which is supposed to do that,
   expires

 - Revert an attempt to expedite detaching of movable tasks, as finding
   those could become very costly. Turns out the original issue wasn't
   even hit by anyone

* tag 'sched_urgent_for_v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Move psi_account_irqtime() out of update_rq_clock_task() hotpath
  sched/deadline: Fix task_struct reference leak
  Revert "sched/fair: Make sure to try to detach at least one movable task"
2024-07-14 10:18:25 -07:00
Thomas Gleixner
b7625d67eb - Remove unnecessary local variables initialization as they will be
initialized in the code path anyway right after on the ARM arch
   timer and the ARM global timer (Li kunyu)
 
 - Fix a race condition in the interrupt leading to a deadlock on the
   SH CMT driver. Note that this fix was not tested on the platform
   using this timer but the fix seems reasonable enough to be picked
   confidently (Niklas Söderlund)
 
 - Increase the rating of the gic-timer and use the configured width
   clocksource register on the MIPS architecture (Jiaxun Yang)
 
 - Add the DT bindings for the TMU on the Renesas platforms (Geert
   Uytterhoeven)
 
 - Add the DT bindings for the SOPHGO SG2002 clint on RiscV (Thomas
   Bonnefille)
 
 - Add the rtl-otto timer driver along with the DT bindings for the
   Realtek platform (Chris Packham)
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEGn3N4YVz0WNVyHskqDIjiipP6E8FAmaRQh0ACgkQqDIjiipP
 6E+rfQgAqkAWZ9BjswxV8Fg+Hj+a1cSohKjDczqitQF5rJm25X5VvMwlXVa3XQGm
 yemh4tKPpll02LOiYCTyqOWzNrkVS9VsoBd5rrYjRX5aSv7UD35EXklLj4P/INwX
 O9CRGD6aK4Xbw66xxheYHSSh+2iRs2x2mq61+/VdcIBlAwpQo+vx7McRoJZZI+2t
 NFIXw8RF5dDlmmAaqiB0WnPAtcOK3SDo9fu1LEAX1ZAzvbZriLo7XLnL7ibySWVe
 BW1n7Ore6PN5Dvz7jMfTsOQsgAlVv6MPfp/s4EDqMfBLVqXNirzXrdhiee/ahnYP
 vyzQyU5HPCMiIYS45mhJF0OyDd3wyw==
 =wuYA
 -----END PGP SIGNATURE-----

Merge tag 'timers-v6.11-rc1' of https://git.linaro.org/people/daniel.lezcano/linux into timers/core

Pull clocksource/event driver updates from Daniel Lezcano:

  - Remove unnecessary local variables initialization as they will be
    initialized in the code path anyway right after on the ARM arch
    timer and the ARM global timer (Li kunyu)

  - Fix a race condition in the interrupt leading to a deadlock on the
    SH CMT driver. Note that this fix was not tested on the platform
    using this timer but the fix seems reasonable enough to be picked
    confidently (Niklas Söderlund)

  - Increase the rating of the gic-timer and use the configured width
    clocksource register on the MIPS architecture (Jiaxun Yang)

  - Add the DT bindings for the TMU on the Renesas platforms (Geert
    Uytterhoeven)

  - Add the DT bindings for the SOPHGO SG2002 clint on RiscV (Thomas
    Bonnefille)

  - Add the rtl-otto timer driver along with the DT bindings for the
    Realtek platform (Chris Packham)

Link: https://lore.kernel.org/all/91cd05de-4c5d-4242-a381-3b8a4fe6a2a2@linaro.org
2024-07-13 12:07:10 +02:00
Yang Li
b69bdba5a3 swiotlb: fix kernel-doc description for swiotlb_del_transient
Describe the pool argument in the kernel-doc comment for
swiotlb_del_transient.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-07-13 07:36:10 +02:00
Jakub Kicinski
26f453176a bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZpGVmAAKCRDbK58LschI
 gxB4AQCgquQis63yqTI36j4iXBT+TuxHEBNoQBSLyzYdrLS1dgD/S5DRJDA+3LD+
 394hn/VtB1qvX5vaqjsov4UIwSMyxA0=
 =OhSn
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2024-07-12

We've added 23 non-merge commits during the last 3 day(s) which contain
a total of 18 files changed, 234 insertions(+), 243 deletions(-).

The main changes are:

1) Improve BPF verifier by utilizing overflow.h helpers to check
   for overflows, from Shung-Hsi Yu.

2) Fix NULL pointer dereference in resolve_prog_type() for BPF_PROG_TYPE_EXT
   when attr->attach_prog_fd was not specified, from Tengda Wu.

3) Fix arm64 BPF JIT when generating code for BPF trampolines with
   BPF_TRAMP_F_CALL_ORIG which corrupted upper address bits,
   from Puranjay Mohan.

4) Remove test_run callback from lwt_seg6local_prog_ops which never worked
   in the first place and caused syzbot reports,
   from Sebastian Andrzej Siewior.

5) Relax BPF verifier to accept non-zero offset on KF_TRUSTED_ARGS/
   /KF_RCU-typed BPF kfuncs, from Matt Bobrowski.

6) Fix a long standing bug in libbpf with regards to handling of BPF
   skeleton's forward and backward compatibility, from Andrii Nakryiko.

7) Annotate btf_{seq,snprintf}_show functions with __printf,
   from Alan Maguire.

8) BPF selftest improvements to reuse common network helpers in sk_lookup
   test and dropping the open-coded inetaddr_len() and make_socket() ones,
   from Geliang Tang.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (23 commits)
  selftests/bpf: Test for null-pointer-deref bugfix in resolve_prog_type()
  bpf: Fix null pointer dereference in resolve_prog_type() for BPF_PROG_TYPE_EXT
  selftests/bpf: DENYLIST.aarch64: Skip fexit_sleep again
  bpf: use check_sub_overflow() to check for subtraction overflows
  bpf: use check_add_overflow() to check for addition overflows
  bpf: fix overflow check in adjust_jmp_off()
  bpf: Eliminate remaining "make W=1" warnings in kernel/bpf/btf.o
  bpf: annotate BTF show functions with __printf
  bpf, arm64: Fix trampoline for BPF_TRAMP_F_CALL_ORIG
  selftests/bpf: Close obj in error path in xdp_adjust_tail
  selftests/bpf: Null checks for links in bpf_tcp_ca
  selftests/bpf: Use connect_fd_to_fd in sk_lookup
  selftests/bpf: Use start_server_addr in sk_lookup
  selftests/bpf: Use start_server_str in sk_lookup
  selftests/bpf: Close fd in error path in drop_on_reuseport
  selftests/bpf: Add ASSERT_OK_FD macro
  selftests/bpf: Add backlog for network_helper_opts
  selftests/bpf: fix compilation failure when CONFIG_NF_FLOW_TABLE=m
  bpf: Remove tst_run from lwt_seg6local_prog_ops.
  bpf: relax zero fixed offset constraint on KF_TRUSTED_ARGS/KF_RCU
  ...
====================

Link: https://patch.msgid.link/20240712212448.5378-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-12 22:25:54 -07:00
Kees Cook
0fe2356434 tsacct: replace strncpy() with strscpy()
Replace the deprecated[1] use of strncpy() in bacct_add_tsk().  Since this
is UAPI, include trailing padding in the copy.

Link: https://github.com/KSPP/linux/issues/90 [1]
Link: https://lkml.kernel.org/r/20240711171308.work.995-kees@kernel.org
Signed-off-by: Kees Cook <kees@kernel.org>
Cc: "Dr. Thomas Orgis" <thomas.orgis@uni-hamburg.de>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Ismael Luceno <ismael@iodev.co.uk>
Cc: Peng Liu <liupeng256@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-12 16:39:53 -07:00
Christophe Leroy
18d095b255 mm: define __pte_leaf_size() to also take a PMD entry
On powerpc 8xx, when a page is 8M size, the information is in the PMD
entry.  So allow architectures to provide __pte_leaf_size() instead of
pte_leaf_size() and provide the PMD entry to that function.

When __pte_leaf_size() is not defined, define it as a pte_leaf_size() so
that architectures not interested in the PMD arguments are not impacted.

Only define a default pte_leaf_size() when __pte_leaf_size() is not
defined to make sure nobody adds new calls to pte_leaf_size() in the core.

Link: https://lkml.kernel.org/r/c7c008f0a314bf8029ad7288fdc908db1ec7e449.1719928057.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-12 15:52:15 -07:00
Xiu Jianfeng
6a26f9c689 cgroup/misc: Introduce misc.events.local
Currently the event counting provided by misc.events is hierarchical,
it's not practical if user is only concerned with events of a
specified cgroup. Therefore, introduce misc.events.local collect events
specific to the given cgroup.

This is analogous to memory.events.local and pids.events.local.

Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-12 06:45:23 -10:00
Shung-Hsi Yu
deac5871eb bpf: use check_sub_overflow() to check for subtraction overflows
Similar to previous patch that drops signed_add*_overflows() and uses
(compiler) builtin-based check_add_overflow(), do the same for
signed_sub*_overflows() and replace them with the generic
check_sub_overflow() to make future refactoring easier and have the
checks implemented more efficiently.

Unsigned overflow check for subtraction does not use helpers and are
simple enough already, so they're left untouched.

After the change GCC 13.3.0 generates cleaner assembly on x86_64:

	if (check_sub_overflow(*dst_smin, src_reg->smax_value, dst_smin) ||
   139bf:	mov    0x28(%r12),%rax
   139c4:	mov    %edx,0x54(%r12)
   139c9:	sub    %r11,%rax
   139cc:	mov    %rax,0x28(%r12)
   139d1:	jo     14627 <adjust_reg_min_max_vals+0x1237>
	    check_sub_overflow(*dst_smax, src_reg->smin_value, dst_smax)) {
   139d7:	mov    0x30(%r12),%rax
   139dc:	sub    %r9,%rax
   139df:	mov    %rax,0x30(%r12)
	if (check_sub_overflow(*dst_smin, src_reg->smax_value, dst_smin) ||
   139e4:	jo     14627 <adjust_reg_min_max_vals+0x1237>
   ...
		*dst_smin = S64_MIN;
   14627:	movabs $0x8000000000000000,%rax
   14631:	mov    %rax,0x28(%r12)
		*dst_smax = S64_MAX;
   14636:	sub    $0x1,%rax
   1463a:	mov    %rax,0x30(%r12)

Before the change it gives:

	if (signed_sub_overflows(dst_reg->smin_value, smax_val) ||
   13a50:	mov    0x28(%r12),%rdi
   13a55:	mov    %edx,0x54(%r12)
		dst_reg->smax_value = S64_MAX;
   13a5a:	movabs $0x7fffffffffffffff,%rdx
   13a64:	mov    %eax,0x50(%r12)
		dst_reg->smin_value = S64_MIN;
   13a69:	movabs $0x8000000000000000,%rax
	s64 res = (s64)((u64)a - (u64)b);
   13a73:	mov    %rdi,%rsi
   13a76:	sub    %rcx,%rsi
	if (b < 0)
   13a79:	test   %rcx,%rcx
   13a7c:	js     145ea <adjust_reg_min_max_vals+0x119a>
	if (signed_sub_overflows(dst_reg->smin_value, smax_val) ||
   13a82:	cmp    %rsi,%rdi
   13a85:	jl     13ac7 <adjust_reg_min_max_vals+0x677>
	    signed_sub_overflows(dst_reg->smax_value, smin_val)) {
   13a87:	mov    0x30(%r12),%r8
	s64 res = (s64)((u64)a - (u64)b);
   13a8c:	mov    %r8,%rax
   13a8f:	sub    %r9,%rax
	return res > a;
   13a92:	cmp    %rax,%r8
   13a95:	setl   %sil
	if (b < 0)
   13a99:	test   %r9,%r9
   13a9c:	js     147d1 <adjust_reg_min_max_vals+0x1381>
		dst_reg->smax_value = S64_MAX;
   13aa2:	movabs $0x7fffffffffffffff,%rdx
		dst_reg->smin_value = S64_MIN;
   13aac:	movabs $0x8000000000000000,%rax
	if (signed_sub_overflows(dst_reg->smin_value, smax_val) ||
   13ab6:	test   %sil,%sil
   13ab9:	jne    13ac7 <adjust_reg_min_max_vals+0x677>
		dst_reg->smin_value -= smax_val;
   13abb:	mov    %rdi,%rax
		dst_reg->smax_value -= smin_val;
   13abe:	mov    %r8,%rdx
		dst_reg->smin_value -= smax_val;
   13ac1:	sub    %rcx,%rax
		dst_reg->smax_value -= smin_val;
   13ac4:	sub    %r9,%rdx
   13ac7:	mov    %rax,0x28(%r12)
   ...
   13ad1:	mov    %rdx,0x30(%r12)
   ...
	if (signed_sub_overflows(dst_reg->smin_value, smax_val) ||
   145ea:	cmp    %rsi,%rdi
   145ed:	jg     13ac7 <adjust_reg_min_max_vals+0x677>
   145f3:	jmp    13a87 <adjust_reg_min_max_vals+0x637>

Suggested-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20240712080127.136608-4-shung-hsi.yu@suse.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-07-12 08:54:08 -07:00
Shung-Hsi Yu
28a4411076 bpf: use check_add_overflow() to check for addition overflows
signed_add*_overflows() was added back when there was no overflow-check
helper. With the introduction of such helpers in commit f0907827a8
("compiler.h: enable builtin overflow checkers and add fallback code"), we
can drop signed_add*_overflows() in kernel/bpf/verifier.c and use the
generic check_add_overflow() instead.

This will make future refactoring easier, and takes advantage of
compiler-emitted hardware instructions that efficiently implement these
checks.

After the change GCC 13.3.0 generates cleaner assembly on x86_64:

	err = adjust_scalar_min_max_vals(env, insn, dst_reg, *src_reg);
   13625:	mov    0x28(%rbx),%r9  /*  r9 = src_reg->smin_value */
   13629:	mov    0x30(%rbx),%rcx /* rcx = src_reg->smax_value */
   ...
	if (check_add_overflow(*dst_smin, src_reg->smin_value, dst_smin) ||
   141c1:	mov    %r9,%rax
   141c4:	add    0x28(%r12),%rax
   141c9:	mov    %rax,0x28(%r12)
   141ce:	jo     146e4 <adjust_reg_min_max_vals+0x1294>
	    check_add_overflow(*dst_smax, src_reg->smax_value, dst_smax)) {
   141d4:	add    0x30(%r12),%rcx
   141d9:	mov    %rcx,0x30(%r12)
	if (check_add_overflow(*dst_smin, src_reg->smin_value, dst_smin) ||
   141de:	jo     146e4 <adjust_reg_min_max_vals+0x1294>
   ...
		*dst_smin = S64_MIN;
   146e4:	movabs $0x8000000000000000,%rax
   146ee:	mov    %rax,0x28(%r12)
		*dst_smax = S64_MAX;
   146f3:	sub    $0x1,%rax
   146f7:	mov    %rax,0x30(%r12)

Before the change it gives:

	s64 smin_val = src_reg->smin_value;
     675:	mov    0x28(%rsi),%r8
	s64 smax_val = src_reg->smax_value;
	u64 umin_val = src_reg->umin_value;
	u64 umax_val = src_reg->umax_value;
     679:	mov    %rdi,%rax /* rax = dst_reg */
	if (signed_add_overflows(dst_reg->smin_value, smin_val) ||
     67c:	mov    0x28(%rdi),%rdi /* rdi = dst_reg->smin_value */
	u64 umin_val = src_reg->umin_value;
     680:	mov    0x38(%rsi),%rdx
	u64 umax_val = src_reg->umax_value;
     684:	mov    0x40(%rsi),%rcx
	s64 res = (s64)((u64)a + (u64)b);
     688:	lea    (%r8,%rdi,1),%r9 /* r9 = dst_reg->smin_value + src_reg->smin_value */
	return res < a;
     68c:	cmp    %r9,%rdi
     68f:	setg   %r10b /* r10b = (dst_reg->smin_value + src_reg->smin_value) > dst_reg->smin_value */
	if (b < 0)
     693:	test   %r8,%r8
     696:	js     72b <scalar_min_max_add+0xbb>
	    signed_add_overflows(dst_reg->smax_value, smax_val)) {
		dst_reg->smin_value = S64_MIN;
		dst_reg->smax_value = S64_MAX;
     69c:	movabs $0x7fffffffffffffff,%rdi
	s64 smax_val = src_reg->smax_value;
     6a6:	mov    0x30(%rsi),%r8
		dst_reg->smin_value = S64_MIN;
     6aa:	00 00 00 	movabs $0x8000000000000000,%rsi
	if (signed_add_overflows(dst_reg->smin_value, smin_val) ||
     6b4:	test   %r10b,%r10b /* (dst_reg->smin_value + src_reg->smin_value) > dst_reg->smin_value ? goto 6cb */
     6b7:	jne    6cb <scalar_min_max_add+0x5b>
	    signed_add_overflows(dst_reg->smax_value, smax_val)) {
     6b9:	mov    0x30(%rax),%r10   /* r10 = dst_reg->smax_value */
	s64 res = (s64)((u64)a + (u64)b);
     6bd:	lea    (%r10,%r8,1),%r11 /* r11 = dst_reg->smax_value + src_reg->smax_value */
	if (b < 0)
     6c1:	test   %r8,%r8
     6c4:	js     71e <scalar_min_max_add+0xae>
	if (signed_add_overflows(dst_reg->smin_value, smin_val) ||
     6c6:	cmp    %r11,%r10 /* (dst_reg->smax_value + src_reg->smax_value) <= dst_reg->smax_value ? goto 723 */
     6c9:	jle    723 <scalar_min_max_add+0xb3>
	} else {
		dst_reg->smin_value += smin_val;
		dst_reg->smax_value += smax_val;
	}
     6cb:	mov    %rsi,0x28(%rax)
     ...
     6d5:	mov    %rdi,0x30(%rax)
     ...
	if (signed_add_overflows(dst_reg->smin_value, smin_val) ||
     71e:	cmp    %r11,%r10
     721:	jl     6cb <scalar_min_max_add+0x5b>
		dst_reg->smin_value += smin_val;
     723:	mov    %r9,%rsi
		dst_reg->smax_value += smax_val;
     726:	mov    %r11,%rdi
     729:	jmp    6cb <scalar_min_max_add+0x5b>
		return res > a;
     72b:	cmp    %r9,%rdi
     72e:	setl   %r10b
     732:	jmp    69c <scalar_min_max_add+0x2c>
     737:	nopw   0x0(%rax,%rax,1)

Note: unlike adjust_ptr_min_max_vals() and scalar*_min_max_add(), it is
necessary to introduce intermediate variable in adjust_jmp_off() to keep
the functional behavior unchanged. Without an intermediate variable
imm/off will be altered even on overflow.

Suggested-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20240712080127.136608-3-shung-hsi.yu@suse.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-07-12 08:54:08 -07:00
Shung-Hsi Yu
4a04b4f0de bpf: fix overflow check in adjust_jmp_off()
adjust_jmp_off() incorrectly used the insn->imm field for all overflow check,
which is incorrect as that should only be done or the BPF_JMP32 | BPF_JA case,
not the general jump instruction case. Fix it by using insn->off for overflow
check in the general case.

Fixes: 5337ac4c9b ("bpf: Fix the corner case with may_goto and jump to the 1st insn.")
Signed-off-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20240712080127.136608-2-shung-hsi.yu@suse.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-07-12 08:54:07 -07:00
Alan Maguire
2454075f8e bpf: Eliminate remaining "make W=1" warnings in kernel/bpf/btf.o
As reported by Mirsad [1] we still see format warnings in kernel/bpf/btf.o
at W=1 warning level:

  CC      kernel/bpf/btf.o
./kernel/bpf/btf.c: In function ‘btf_type_seq_show_flags’:
./kernel/bpf/btf.c:7553:21: warning: assignment left-hand side might be a candidate for a format attribute [-Wsuggest-attribute=format]
 7553 |         sseq.showfn = btf_seq_show;
      |                     ^
./kernel/bpf/btf.c: In function ‘btf_type_snprintf_show’:
./kernel/bpf/btf.c:7604:31: warning: assignment left-hand side might be a candidate for a format attribute [-Wsuggest-attribute=format]
 7604 |         ssnprintf.show.showfn = btf_snprintf_show;
      |                               ^

Combined with CONFIG_WERROR=y these can halt the build.

The fix (annotating the structure field with __printf())
suggested by Mirsad resolves these. Apologies I missed this last time.
No other W=1 warnings were observed in kernel/bpf after this fix.

[1] https://lore.kernel.org/bpf/92c9d047-f058-400c-9c7d-81d4dc1ef71b@gmail.com/

Fixes: b3470da314 ("bpf: annotate BTF show functions with __printf")
Reported-by: Mirsad Todorovac <mtodorovac69@gmail.com>
Suggested-by: Mirsad Todorovac <mtodorovac69@gmail.com>
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240712092859.1390960-1-alan.maguire@oracle.com
2024-07-12 17:02:26 +02:00
Lai Jiangshan
b2b1f93384 workqueue: Rename wq_update_pod() to unbound_wq_update_pwq()
What wq_update_pod() does is just to update the pwq of the specific
cpu.  Rename it and update the comments.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-11 12:50:35 -10:00
Lai Jiangshan
d160a58de5 workqueue: Remove the arguments @hotplug_cpu and @online from wq_update_pod()
The arguments @hotplug_cpu and @online are not used in wq_update_pod()
since the functions called by wq_update_pod() don't need them.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-11 12:50:35 -10:00
Lai Jiangshan
88a41b185d workqueue: Remove the argument @cpu_going_down from wq_calc_pod_cpumask()
wq_calc_pod_cpumask() uses wq_online_cpumask, which excludes the cpu
going down, so the argument cpu_going_down is unused and can be removed.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-11 12:50:34 -10:00
Lai Jiangshan
2cb61f76be workqueue: Remove the unneeded cpumask empty check in wq_calc_pod_cpumask()
The cpumask empty check in wq_calc_pod_cpumask() has long been useless.
It just works purely as documents which states that the cpumask is not
possible empty after the function returns.

Now the code above is even more explicit that the cpumask is not empty,
so the document-only empty check can be removed.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-11 12:50:34 -10:00
Lai Jiangshan
19af457573 workqueue: Remove cpus_read_lock() from apply_wqattrs_lock()
1726a17135 ("workqueue: Put PWQ allocation and WQ enlistment in the same
lock C.S.") led to the following possible deadlock:

  WARNING: possible recursive locking detected
  6.10.0-rc5-00004-g1d4c6111406c #1 Not tainted
   --------------------------------------------
   swapper/0/1 is trying to acquire lock:
   c27760f4 (cpu_hotplug_lock){++++}-{0:0}, at: alloc_workqueue (kernel/workqueue.c:5152 kernel/workqueue.c:5730) 
  
   but task is already holding lock:
   c27760f4 (cpu_hotplug_lock){++++}-{0:0}, at: padata_alloc (kernel/padata.c:1007) 
   ...  
   stack backtrace:
   ...
   cpus_read_lock (include/linux/percpu-rwsem.h:53 kernel/cpu.c:488) 
   alloc_workqueue (kernel/workqueue.c:5152 kernel/workqueue.c:5730) 
   padata_alloc (kernel/padata.c:1007 (discriminator 1)) 
   pcrypt_init_padata (crypto/pcrypt.c:327 (discriminator 1)) 
   pcrypt_init (crypto/pcrypt.c:353) 
   do_one_initcall (init/main.c:1267) 
   do_initcalls (init/main.c:1328 (discriminator 1) init/main.c:1345 (discriminator 1)) 
   kernel_init_freeable (init/main.c:1364) 
   kernel_init (init/main.c:1469) 
   ret_from_fork (arch/x86/kernel/process.c:153) 
   ret_from_fork_asm (arch/x86/entry/entry_32.S:737) 
   entry_INT80_32 (arch/x86/entry/entry_32.S:944) 

This is caused by pcrypt allocating a workqueue while holding
cpus_read_lock(), so workqueue code can't do it again as that can lead to
deadlocks if down_write starts after the first down_read.

The pwq creations and installations have been reworked based on
wq_online_cpumask rather than cpu_online_mask making cpus_read_lock() is
unneeded during wqattrs changes. Fix the deadlock by removing
cpus_read_lock() from apply_wqattrs_lock().

tj: Updated changelog.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Fixes: 1726a17135 ("workqueue: Put PWQ allocation and WQ enlistment in the same lock C.S.")
Link: http://lkml.kernel.org/r/202407081521.83b627c1-lkp@intel.com
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-11 12:50:34 -10:00
Lai Jiangshan
fbb3d4c15d workqueue: Simplify wq_calc_pod_cpumask() with wq_online_cpumask
Avoid relying on cpu_online_mask for wqattrs changes so that
cpus_read_lock() can be removed from apply_wqattrs_lock().

And with wq_online_cpumask, attrs->__pod_cpumask doesn't need to be
reused as a temporary storage to calculate if the pod have any online
CPUs @attrs wants since @cpu_going_down is not in the wq_online_cpumask.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-11 12:50:34 -10:00
Lai Jiangshan
8d84baf760 workqueue: Add wq_online_cpumask
The new wq_online_mask mirrors the cpu_online_mask except during
hotplugging; specifically, it differs between the hotplugging stages
of workqueue_offline_cpu() and workqueue_online_cpu(), during which
the transitioning CPU is not represented in the mask.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-11 12:50:34 -10:00
Alan Maguire
b3470da314 bpf: annotate BTF show functions with __printf
-Werror=suggest-attribute=format warns about two functions
in kernel/bpf/btf.c [1]; add __printf() annotations to silence
these warnings since for CONFIG_WERROR=y they will trigger
build failures.

[1] https://lore.kernel.org/bpf/a8b20c72-6631-4404-9e1f-0410642d7d20@gmail.com/

Fixes: 31d0bc8163 ("bpf: Move to generic BTF show support, apply it to seq files/strings")
Reported-by: Mirsad Todorovac <mtodorovac69@gmail.com>
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Tested-by: Mirsad Todorovac <mtodorovac69@yahoo.com>
Link: https://lore.kernel.org/r/20240711182321.963667-1-alan.maguire@oracle.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-07-11 14:15:17 -07:00
Jakub Kicinski
7c8267275d Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

net/sched/act_ct.c
  26488172b0 ("net/sched: Fix UAF when resolving a clash")
  3abbd7ed8b ("act_ct: prepare for stolen verdict coming from conntrack and nat engine")

No adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-11 12:58:13 -07:00
Yu Liao
f7d43dd206 tick/broadcast: Make takeover of broadcast hrtimer reliable
Running the LTP hotplug stress test on a aarch64 machine results in
rcu_sched stall warnings when the broadcast hrtimer was owned by the
un-plugged CPU. The issue is the following:

CPU1 (owns the broadcast hrtimer)	CPU2

				tick_broadcast_enter()
				  // shutdown local timer device
				  broadcast_shutdown_local()
				...
				tick_broadcast_exit()
				  clockevents_switch_state(dev, CLOCK_EVT_STATE_ONESHOT)
				  // timer device is not programmed
				  cpumask_set_cpu(cpu, tick_broadcast_force_mask)

				initiates offlining of CPU1
take_cpu_down()
/*
 * CPU1 shuts down and does not
 * send broadcast IPI anymore
 */
				takedown_cpu()
				  hotplug_cpu__broadcast_tick_pull()
				    // move broadcast hrtimer to this CPU
				    clockevents_program_event()
				      bc_set_next()
					hrtimer_start()
					/*
					 * timer device is not programmed
					 * because only the first expiring
					 * timer will trigger clockevent
					 * device reprogramming
					 */

What happens is that CPU2 exits broadcast mode with force bit set, then the
local timer device is not reprogrammed and CPU2 expects to receive the
expired event by the broadcast IPI. But this does not happen because CPU1
is offlined by CPU2. CPU switches the clockevent device to ONESHOT state,
but does not reprogram the device.

The subsequent reprogramming of the hrtimer broadcast device does not
program the clockevent device of CPU2 either because the pending expiry
time is already in the past and the CPU expects the event to be delivered.
As a consequence all CPUs which wait for a broadcast event to be delivered
are stuck forever.

Fix this issue by reprogramming the local timer device if the broadcast
force bit of the CPU is set so that the broadcast hrtimer is delivered.

[ tglx: Massage comment and change log. Add Fixes tag ]

Fixes: 989dcb645c ("tick: Handle broadcast wakeup of multiple cpus")
Signed-off-by: Yu Liao <liaoyu15@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240711124843.64167-1-liaoyu15@huawei.com
2024-07-11 18:00:24 +02:00
Juergen Gross
9fe6a8c5b2 x86/xen: remove deprecated xen_nopvspin boot parameter
The xen_nopvspin boot parameter is deprecated since 2019. nopvspin
can be used instead.

Remove the xen_nopvspin boot parameter and replace the xen_pvspin
variable use cases with nopvspin.

This requires to move the nopvspin variable out of the .initdata
section, as it needs to be accessed for cpuhotplug, too.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Message-ID: <20240710110139.22300-1-jgross@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
2024-07-11 16:33:51 +02:00
Ingo Molnar
011b1134b8 Merge branch 'sched/urgent' into sched/core, to pick up fixes and refresh the branch
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-07-11 10:42:33 +02:00
Pavel Begunkov
943ad0b62e kernel: rerun task_work while freezing in get_signal()
io_uring can asynchronously add a task_work while the task is getting
freezed. TIF_NOTIFY_SIGNAL will prevent the task from sleeping in
do_freezer_trap(), and since the get_signal()'s relock loop doesn't
retry task_work, the task will spin there not being able to sleep
until the freezing is cancelled / the task is killed / etc.

Run task_works in the freezer path. Keep the patch small and simple
so it can be easily back ported, but we might need to do some cleaning
after and look if there are other places with similar problems.

Cc: stable@vger.kernel.org
Link: https://github.com/systemd/systemd/issues/33626
Fixes: 12db8b6900 ("entry: Add support for TIF_NOTIFY_SIGNAL")
Reported-by: Julian Orth <ju.orth@gmail.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/89ed3a52933370deaaf61a0a620a6ac91f1e754d.1720634146.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-11 01:51:44 -06:00
Kumar Kartikeya Dwivedi
a6fcd19d7e bpf: Defer work in bpf_timer_cancel_and_free
Currently, the same case as previous patch (two timer callbacks trying
to cancel each other) can be invoked through bpf_map_update_elem as
well, or more precisely, freeing map elements containing timers. Since
this relies on hrtimer_cancel as well, it is prone to the same deadlock
situation as the previous patch.

It would be sufficient to use hrtimer_try_to_cancel to fix this problem,
as the timer cannot be enqueued after async_cancel_and_free. Once
async_cancel_and_free has been done, the timer must be reinitialized
before it can be armed again. The callback running in parallel trying to
arm the timer will fail, and freeing bpf_hrtimer without waiting is
sufficient (given kfree_rcu), and bpf_timer_cb will return
HRTIMER_NORESTART, preventing the timer from being rearmed again.

However, there exists a UAF scenario where the callback arms the timer
before entering this function, such that if cancellation fails (due to
timer callback invoking this routine, or the target timer callback
running concurrently). In such a case, if the timer expiration is
significantly far in the future, the RCU grace period expiration
happening before it will free the bpf_hrtimer state and along with it
the struct hrtimer, that is enqueued.

Hence, it is clear cancellation needs to occur after
async_cancel_and_free, and yet it cannot be done inline due to deadlock
issues. We thus modify bpf_timer_cancel_and_free to defer work to the
global workqueue, adding a work_struct alongside rcu_head (both used at
_different_ points of time, so can share space).

Update existing code comments to reflect the new state of affairs.

Fixes: b00628b1c7 ("bpf: Introduce bpf timers.")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20240709185440.1104957-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-07-10 15:59:44 -07:00
Kumar Kartikeya Dwivedi
d4523831f0 bpf: Fail bpf_timer_cancel when callback is being cancelled
Given a schedule:

timer1 cb			timer2 cb

bpf_timer_cancel(timer2);	bpf_timer_cancel(timer1);

Both bpf_timer_cancel calls would wait for the other callback to finish
executing, introducing a lockup.

Add an atomic_t count named 'cancelling' in bpf_hrtimer. This keeps
track of all in-flight cancellation requests for a given BPF timer.
Whenever cancelling a BPF timer, we must check if we have outstanding
cancellation requests, and if so, we must fail the operation with an
error (-EDEADLK) since cancellation is synchronous and waits for the
callback to finish executing. This implies that we can enter a deadlock
situation involving two or more timer callbacks executing in parallel
and attempting to cancel one another.

Note that we avoid incrementing the cancelling counter for the target
timer (the one being cancelled) if bpf_timer_cancel is not invoked from
a callback, to avoid spurious errors. The whole point of detecting
cur->cancelling and returning -EDEADLK is to not enter a busy wait loop
(which may or may not lead to a lockup). This does not apply in case the
caller is in a non-callback context, the other side can continue to
cancel as it sees fit without running into errors.

Background on prior attempts:

Earlier versions of this patch used a bool 'cancelling' bit and used the
following pattern under timer->lock to publish cancellation status.

lock(t->lock);
t->cancelling = true;
mb();
if (cur->cancelling)
	return -EDEADLK;
unlock(t->lock);
hrtimer_cancel(t->timer);
t->cancelling = false;

The store outside the critical section could overwrite a parallel
requests t->cancelling assignment to true, to ensure the parallely
executing callback observes its cancellation status.

It would be necessary to clear this cancelling bit once hrtimer_cancel
is done, but lack of serialization introduced races. Another option was
explored where bpf_timer_start would clear the bit when (re)starting the
timer under timer->lock. This would ensure serialized access to the
cancelling bit, but may allow it to be cleared before in-flight
hrtimer_cancel has finished executing, such that lockups can occur
again.

Thus, we choose an atomic counter to keep track of all outstanding
cancellation requests and use it to prevent lockups in case callbacks
attempt to cancel each other while executing in parallel.

Reported-by: Dohyun Kim <dohyunkim@google.com>
Reported-by: Neel Natu <neelnatu@google.com>
Fixes: b00628b1c7 ("bpf: Introduce bpf timers.")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20240709185440.1104957-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-07-10 15:59:44 -07:00
Mohammad Shehar Yaar Tausif
af253aef18 bpf: fix order of args in call to bpf_map_kvcalloc
The original function call passed size of smap->bucket before the number of
buckets which raises the error 'calloc-transposed-args' on compilation.

Vlastimil Babka added:

The order of parameters can be traced back all the way to 6ac99e8f23
("bpf: Introduce bpf sk local storage") accross several refactorings,
and that's why the commit is used as a Fixes: tag.

In v6.10-rc1, a different commit 2c321f3f70 ("mm: change inlined
allocation helpers to account at the call site") however exposed the
order of args in a way that gcc-14 has enough visibility to start
warning about it, because (in !CONFIG_MEMCG case) bpf_map_kvcalloc is
then a macro alias for kvcalloc instead of a static inline wrapper.

To sum up the warning happens when the following conditions are all met:

- gcc-14 is used (didn't see it with gcc-13)
- commit 2c321f3f70 is present
- CONFIG_MEMCG is not enabled in .config
- CONFIG_WERROR turns this from a compiler warning to error

Fixes: 6ac99e8f23 ("bpf: Introduce bpf sk local storage")
Reviewed-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Christian Kujau <lists@nerdbynature.de>
Signed-off-by: Mohammad Shehar Yaar Tausif <sheharyaar48@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://lore.kernel.org/r/20240710100521.15061-2-vbabka@suse.cz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-07-10 15:31:19 -07:00
Zqiang
77aeb1b685 smp: Add missing destroy_work_on_stack() call in smp_call_on_cpu()
For CONFIG_DEBUG_OBJECTS_WORK=y kernels sscs.work defined by
INIT_WORK_ONSTACK() is initialized by debug_object_init_on_stack() for
the debug check in __init_work() to work correctly.

But this lacks the counterpart to remove the tracked object from debug
objects again, which will cause a debug object warning once the stack is
freed.

Add the missing destroy_work_on_stack() invocation to cure that.

[ tglx: Massaged changelog ]

Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20240704065213.13559-1-qiang.zhang1211@gmail.com
2024-07-10 22:40:39 +02:00
Wei Yang
9325585288 kernel/fork.c: put set_max_threads()/task_struct_whitelist() in __init section
The functions set_max_threads() and task_struct_whitelist() are only used
by fork_init() during bootup.

Let's add __init tag to them.

Link: https://lkml.kernel.org/r/20240701013410.17260-2-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-10 12:14:54 -07:00
Wei Yang
66b4aaf733 kernel/fork.c: get totalram_pages from memblock to calculate max_threads
Since we plan to move the accounting into __free_pages_core(),
totalram_pages may not represent the total usable pages on system at this
point when defer_init is enabled.

Instead we can get the total usable pages from memblock directly.

Link: https://lkml.kernel.org/r/20240701013410.17260-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-10 12:14:54 -07:00
Johannes Weiner
3a3b7fec39 mm: remove CONFIG_MEMCG_KMEM
CONFIG_MEMCG_KMEM used to be a user-visible option for whether slab
tracking is enabled.  It has been default-enabled and equivalent to
CONFIG_MEMCG for almost a decade.  We've only grown more kernel memory
accounting sites since, and there is no imaginable cgroup usecase going
forward that wants to track user pages but not the multitude of
user-drivable kernel allocations.

Link: https://lkml.kernel.org/r/20240701153148.452230-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-10 12:14:54 -07:00
Arnd Bergmann
505d66d1ab clone3: drop __ARCH_WANT_SYS_CLONE3 macro
When clone3() was introduced, it was not obvious how each architecture
deals with setting up the stack and keeping the register contents in
a fork()-like system call, so this was left for the architecture
maintainers to implement, with __ARCH_WANT_SYS_CLONE3 defined by those
that already implement it.

Five years later, we still have a few architectures left that are missing
clone3(), and the macro keeps getting in the way as it's fundamentally
different from all the other __ARCH_WANT_SYS_* macros that are meant
to provide backwards-compatibility with applications using older
syscalls that are no longer provided by default.

Address this by reversing the polarity of the macro, adding an
__ARCH_BROKEN_SYS_CLONE3 macro to all architectures that don't
already provide the syscall, and remove __ARCH_WANT_SYS_CLONE3
from all the other ones.

Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2024-07-10 14:23:38 +02:00
Michael Kelley
7296f2301a swiotlb: reduce swiotlb pool lookups
With CONFIG_SWIOTLB_DYNAMIC enabled, each round-trip map/unmap pair
in the swiotlb results in 6 calls to swiotlb_find_pool(). In multiple
places, the pool is found and used in one function, and then must
be found again in the next function that is called because only the
tlb_addr is passed as an argument. These are the six call sites:

dma_direct_map_page:
 1. swiotlb_map -> swiotlb_tbl_map_single -> swiotlb_bounce

dma_direct_unmap_page:
 2. dma_direct_sync_single_for_cpu -> is_swiotlb_buffer
 3. dma_direct_sync_single_for_cpu -> swiotlb_sync_single_for_cpu ->
	swiotlb_bounce
 4. is_swiotlb_buffer
 5. swiotlb_tbl_unmap_single -> swiotlb_del_transient
 6. swiotlb_tbl_unmap_single -> swiotlb_release_slots

Reduce the number of calls by finding the pool at a higher level, and
passing it as an argument instead of searching again. A key change is
for is_swiotlb_buffer() to return a pool pointer instead of a boolean,
and then pass this pool pointer to subsequent swiotlb functions.

There are 9 occurrences of is_swiotlb_buffer() used to test if a buffer
is a swiotlb buffer before calling a swiotlb function. To reduce code
duplication in getting the pool pointer and passing it as an argument,
introduce inline wrappers for this pattern. The generated code is
essentially unchanged.

Since is_swiotlb_buffer() no longer returns a boolean, rename some
functions to reflect the change:

 * swiotlb_find_pool() becomes __swiotlb_find_pool()
 * is_swiotlb_buffer() becomes swiotlb_find_pool()
 * is_xen_swiotlb_buffer() becomes xen_swiotlb_find_pool()

With these changes, a round-trip map/unmap pair requires only 2 pool
lookups (listed using the new names and wrappers):

dma_direct_unmap_page:
 1. dma_direct_sync_single_for_cpu -> swiotlb_find_pool
 2. swiotlb_tbl_unmap_single -> swiotlb_find_pool

These changes come from noticing the inefficiencies in a code review,
not from performance measurements. With CONFIG_SWIOTLB_DYNAMIC,
__swiotlb_find_pool() is not trivial, and it uses an RCU read lock,
so avoiding the redundant calls helps performance in a hot path.
When CONFIG_SWIOTLB_DYNAMIC is *not* set, the code size reduction
is minimal and the perf benefits are likely negligible, but no
harm is done.

No functional change is intended.

Signed-off-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Petr Tesarik <petr@tesarici.cz>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-07-10 07:59:03 +02:00
Matt Bobrowski
605c96997d bpf: relax zero fixed offset constraint on KF_TRUSTED_ARGS/KF_RCU
Currently, BPF kfuncs which accept trusted pointer arguments
i.e. those flagged as KF_TRUSTED_ARGS, KF_RCU, or KF_RELEASE, all
require an original/unmodified trusted pointer argument to be supplied
to them. By original/unmodified, it means that the backing register
holding the trusted pointer argument that is to be supplied to the BPF
kfunc must have its fixed offset set to zero, or else the BPF verifier
will outright reject the BPF program load. However, this zero fixed
offset constraint that is currently enforced by the BPF verifier onto
BPF kfuncs specifically flagged to accept KF_TRUSTED_ARGS or KF_RCU
trusted pointer arguments is rather unnecessary, and can limit their
usability in practice. Specifically, it completely eliminates the
possibility of constructing a derived trusted pointer from an original
trusted pointer. To put it simply, a derived pointer is a pointer
which points to one of the nested member fields of the object being
pointed to by the original trusted pointer.

This patch relaxes the zero fixed offset constraint that is enforced
upon BPF kfuncs which specifically accept KF_TRUSTED_ARGS, or KF_RCU
arguments. Although, the zero fixed offset constraint technically also
applies to BPF kfuncs accepting KF_RELEASE arguments, relaxing this
constraint for such BPF kfuncs has subtle and unwanted
side-effects. This was discovered by experimenting a little further
with an initial version of this patch series [0]. The primary issue
with relaxing the zero fixed offset constraint on BPF kfuncs accepting
KF_RELEASE arguments is that it'd would open up the opportunity for
BPF programs to supply both trusted pointers and derived trusted
pointers to them. For KF_RELEASE BPF kfuncs specifically, this could
be problematic as resources associated with the backing pointer could
be released by the backing BPF kfunc and cause instabilities for the
rest of the kernel.

With this new fixed offset semantic in-place for BPF kfuncs accepting
KF_TRUSTED_ARGS and KF_RCU arguments, we now have more flexibility
when it comes to the BPF kfuncs that we're able to introduce moving
forward.

Early discussions covering the possibility of relaxing the zero fixed
offset constraint can be found using the link below. This will provide
more context on where all this has stemmed from [1].

Notably, pre-existing tests have been updated such that they provide
coverage for the updated zero fixed offset
functionality. Specifically, the nested offset test was converted from
a negative to positive test as it was already designed to assert zero
fixed offset semantics of a KF_TRUSTED_ARGS BPF kfunc.

[0] https://lore.kernel.org/bpf/ZnA9ndnXKtHOuYMe@google.com/
[1] https://lore.kernel.org/bpf/ZhkbrM55MKQ0KeIV@google.com/

Signed-off-by: Matt Bobrowski <mattbobrowski@google.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20240709210939.1544011-1-mattbobrowski@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-07-09 19:11:47 -07:00
Masami Hiramatsu (Google)
b10545b6b8 tracing/kprobes: Fix build error when find_module() is not available
The kernel test robot reported that the find_module() is not available
if CONFIG_MODULES=n.
Fix this error by hiding find_modules() in #ifdef CONFIG_MODULES with
related rcu locks as try_module_get_by_name().

Link: https://lore.kernel.org/all/172056819167.201571.250053007194508038.stgit@devnote2/

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202407070744.RcLkn8sq-lkp@intel.com/
Closes: https://lore.kernel.org/oe-kbuild-all/202407070917.VVUCBlaS-lkp@intel.com/
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-07-10 09:47:00 +09:00
Paolo Abeni
7b769adc26 bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZoxN0AAKCRDbK58LschI
 g0c5AQDa3ZV9gfbN42y1zSDoM1uOgO60fb+ydxyOYh8l3+OiQQD/fLfpTY3gBFSY
 9yi/pZhw/QdNzQskHNIBrHFGtJbMxgs=
 =p1Zz
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2024-07-08

The following pull-request contains BPF updates for your *net-next* tree.

We've added 102 non-merge commits during the last 28 day(s) which contain
a total of 127 files changed, 4606 insertions(+), 980 deletions(-).

The main changes are:

1) Support resilient split BTF which cuts down on duplication and makes BTF
   as compact as possible wrt BTF from modules, from Alan Maguire & Eduard Zingerman.

2) Add support for dumping kfunc prototypes from BTF which enables both detecting
   as well as dumping compilable prototypes for kfuncs, from Daniel Xu.

3) Batch of s390x BPF JIT improvements to add support for BPF arena and to implement
   support for BPF exceptions, from Ilya Leoshkevich.

4) Batch of riscv64 BPF JIT improvements in particular to add 12-argument support
   for BPF trampolines and to utilize bpf_prog_pack for the latter, from Pu Lehui.

5) Extend BPF test infrastructure to add a CHECKSUM_COMPLETE validation option
   for skbs and add coverage along with it, from Vadim Fedorenko.

6) Inline bpf_get_current_task/_btf() helpers in the arm64 BPF JIT which gives
   a small 1% performance improvement in micro-benchmarks, from Puranjay Mohan.

7) Extend the BPF verifier to track the delta between linked registers in order
   to better deal with recent LLVM code optimizations, from Alexei Starovoitov.

8) Fix bpf_wq_set_callback_impl() kfunc signature where the third argument should
   have been a pointer to the map value, from Benjamin Tissoires.

9) Extend BPF selftests to add regular expression support for test output matching
   and adjust some of the selftest when compiled under gcc, from Cupertino Miranda.

10) Simplify task_file_seq_get_next() and remove an unnecessary loop which always
    iterates exactly once anyway, from Dan Carpenter.

11) Add the capability to offload the netfilter flowtable in XDP layer through
    kfuncs, from Florian Westphal & Lorenzo Bianconi.

12) Various cleanups in networking helpers in BPF selftests to shave off a few
    lines of open-coded functions on client/server handling, from Geliang Tang.

13) Properly propagate prog->aux->tail_call_reachable out of BPF verifier, so
    that x86 JIT does not need to implement detection, from Leon Hwang.

14) Fix BPF verifier to add a missing check_func_arg_reg_off() to prevent an
    out-of-bounds memory access for dynpointers, from Matt Bobrowski.

15) Fix bpf_session_cookie() kfunc to return __u64 instead of long pointer as
    it might lead to problems on 32-bit archs, from Jiri Olsa.

16) Enhance traffic validation and dynamic batch size support in xsk selftests,
    from Tushar Vyavahare.

bpf-next-for-netdev

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (102 commits)
  selftests/bpf: DENYLIST.aarch64: Remove fexit_sleep
  selftests/bpf: amend for wrong bpf_wq_set_callback_impl signature
  bpf: helpers: fix bpf_wq_set_callback_impl signature
  libbpf: Add NULL checks to bpf_object__{prev_map,next_map}
  selftests/bpf: Remove exceptions tests from DENYLIST.s390x
  s390/bpf: Implement exceptions
  s390/bpf: Change seen_reg to a mask
  bpf: Remove unnecessary loop in task_file_seq_get_next()
  riscv, bpf: Optimize stack usage of trampoline
  bpf, devmap: Add .map_alloc_check
  selftests/bpf: Remove arena tests from DENYLIST.s390x
  selftests/bpf: Add UAF tests for arena atomics
  selftests/bpf: Introduce __arena_global
  s390/bpf: Support arena atomics
  s390/bpf: Enable arena
  s390/bpf: Support address space cast instruction
  s390/bpf: Support BPF_PROBE_MEM32
  s390/bpf: Land on the next JITed instruction after exception
  s390/bpf: Introduce pre- and post- probe functions
  s390/bpf: Get rid of get_probe_mem_regno()
  ...
====================

Link: https://patch.msgid.link/20240708221438.10974-1-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-07-09 17:01:46 +02:00
Sebastian Andrzej Siewior
2b84def990 perf: Split __perf_pending_irq() out of perf_pending_irq()
perf_pending_irq() invokes perf_event_wakeup() and __perf_pending_irq().
The former is in charge of waking any tasks which waits to be woken up
while the latter disables perf-events.

The irq_work perf_pending_irq(), while this an irq_work, the callback
is invoked in thread context on PREEMPT_RT. This is needed because all
the waking functions (wake_up_all(), kill_fasync()) acquire sleep locks
which must not be used with disabled interrupts.
Disabling events, as done by __perf_pending_irq(), expects a hardirq
context and disabled interrupts. This requirement is not fulfilled on
PREEMPT_RT.

Split functionality based on perf_event::pending_disable into irq_work
named `pending_disable_irq' and invoke it in hardirq context on
PREEMPT_RT. Rename the split out callback to perf_pending_disable().

Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marco Elver <elver@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: https://lore.kernel.org/r/20240704170424.1466941-8-bigeasy@linutronix.de
2024-07-09 13:26:37 +02:00
Sebastian Andrzej Siewior
16b9569df9 perf: Don't disable preemption in perf_pending_task().
perf_pending_task() is invoked in task context and disables preemption
because perf_swevent_get_recursion_context() used to access per-CPU
variables. The other reason is to create a RCU read section while
accessing the perf_event.

The recursion counter is no longer a per-CPU accounter so disabling
preemption is no longer required. The RCU section is needed and must be
created explicit.

Replace the preemption-disable section with a explicit RCU-read section.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marco Elver <elver@google.com>
Link: https://lore.kernel.org/r/20240704170424.1466941-7-bigeasy@linutronix.de
2024-07-09 13:26:36 +02:00
Sebastian Andrzej Siewior
0d40a6d83e perf: Move swevent_htable::recursion into task_struct.
The swevent_htable::recursion counter is used to avoid creating an
swevent while an event is processed to avoid recursion. The counter is
per-CPU and preemption must be disabled to have a stable counter.
perf_pending_task() disables preemption to access the counter and then
signal. This is problematic on PREEMPT_RT because sending a signal uses
a spinlock_t which must not be acquired in atomic on PREEMPT_RT because
it becomes a sleeping lock.

The atomic context can be avoided by moving the counter into the
task_struct. There is a 4 byte hole between futex_state (usually always
on) and the following perf pointer (perf_event_ctxp). After the
recursion lost some weight it fits perfectly.

Move swevent_htable::recursion into task_struct.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marco Elver <elver@google.com>
Link: https://lore.kernel.org/r/20240704170424.1466941-6-bigeasy@linutronix.de
2024-07-09 13:26:36 +02:00
Sebastian Andrzej Siewior
5af42f928f perf: Shrink the size of the recursion counter.
There are four recursion counter, one for each context. The type of the
counter is `int' but the counter is used as `bool' since it is only
incremented if zero.
The main goal here is to shrink the whole struct into 32bit int which
can later be added task_struct into an existing hole.

Reduce the type of the recursion counter to an unsigned char, keep the
increment/ decrement operation.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marco Elver <elver@google.com>
Link: https://lore.kernel.org/r/20240704170424.1466941-5-bigeasy@linutronix.de
2024-07-09 13:26:35 +02:00
Sebastian Andrzej Siewior
c5d93d23a2 perf: Enqueue SIGTRAP always via task_work.
A signal is delivered by raising irq_work() which works from any context
including NMI. irq_work() can be delayed if the architecture does not
provide an interrupt vector. In order not to lose a signal, the signal
is injected via task_work during event_sched_out().

Instead going via irq_work, the signal could be added directly via
task_work. The signal is sent to current and can be enqueued on its
return path to userland.

Queue signal via task_work and consider possible NMI context. Remove
perf_event::pending_sigtrap and and use perf_event::pending_work
instead.

Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marco Elver <elver@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: https://lore.kernel.org/r/20240704170424.1466941-4-bigeasy@linutronix.de
2024-07-09 13:26:35 +02:00
Sebastian Andrzej Siewior
466e4d801c task_work: Add TWA_NMI_CURRENT as an additional notify mode.
Adding task_work from NMI context requires the following:
- The kasan_record_aux_stack() is not NMU safe and must be avoided.
- Using TWA_RESUME is NMI safe. If the NMI occurs while the CPU is in
  userland then it will continue in userland and not invoke the `work'
  callback.

Add TWA_NMI_CURRENT as an additional notify mode. In this mode skip
kasan and use irq_work in hardirq-mode to for needed interrupt. Set
TIF_NOTIFY_RESUME within the irq_work callback due to k[ac]san
instrumentation in test_and_set_bit() which does not look NMI safe in
case of a report.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240704170424.1466941-3-bigeasy@linutronix.de
2024-07-09 13:26:34 +02:00
Sebastian Andrzej Siewior
058244c683 perf: Move irq_work_queue() where the event is prepared.
Only if perf_event::pending_sigtrap is zero, the irq_work accounted by
increminging perf_event::nr_pending. The member perf_event::pending_addr
might be overwritten by a subsequent event if the signal was not yet
delivered and is expected. The irq_work will not be enqeueued again
because it has a check to be only enqueued once.

Move irq_work_queue() to where the counter is incremented and
perf_event::pending_sigtrap is set to make it more obvious that the
irq_work is scheduled once.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marco Elver <elver@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: https://lore.kernel.org/r/20240704170424.1466941-2-bigeasy@linutronix.de
2024-07-09 13:26:34 +02:00
Frederic Weisbecker
3a5465418f perf: Fix event leak upon exec and file release
The perf pending task work is never waited upon the matching event
release. In the case of a child event, released via free_event()
directly, this can potentially result in a leaked event, such as in the
following scenario that doesn't even require a weak IRQ work
implementation to trigger:

schedule()
   prepare_task_switch()
=======> <NMI>
      perf_event_overflow()
         event->pending_sigtrap = ...
         irq_work_queue(&event->pending_irq)
<======= </NMI>
      perf_event_task_sched_out()
          event_sched_out()
              event->pending_sigtrap = 0;
              atomic_long_inc_not_zero(&event->refcount)
              task_work_add(&event->pending_task)
   finish_lock_switch()
=======> <IRQ>
   perf_pending_irq()
      //do nothing, rely on pending task work
<======= </IRQ>

begin_new_exec()
   perf_event_exit_task()
      perf_event_exit_event()
         // If is child event
         free_event()
            WARN(atomic_long_cmpxchg(&event->refcount, 1, 0) != 1)
            // event is leaked

Similar scenarios can also happen with perf_event_remove_on_exec() or
simply against concurrent perf_event_release().

Fix this with synchonizing against the possibly remaining pending task
work while freeing the event, just like is done with remaining pending
IRQ work. This means that the pending task callback neither need nor
should hold a reference to the event, preventing it from ever beeing
freed.

Fixes: 517e6a301f ("perf: Fix perf_pending_task() UaF")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240621091601.18227-5-frederic@kernel.org
2024-07-09 13:26:33 +02:00
Frederic Weisbecker
2fd5ad3f31 perf: Fix event leak upon exit
When a task is scheduled out, pending sigtrap deliveries are deferred
to the target task upon resume to userspace via task_work.

However failures while adding an event's callback to the task_work
engine are ignored. And since the last call for events exit happen
after task work is eventually closed, there is a small window during
which pending sigtrap can be queued though ignored, leaking the event
refcount addition such as in the following scenario:

    TASK A
    -----

    do_exit()
       exit_task_work(tsk);

       <IRQ>
       perf_event_overflow()
          event->pending_sigtrap = pending_id;
          irq_work_queue(&event->pending_irq);
       </IRQ>
    =========> PREEMPTION: TASK A -> TASK B
       event_sched_out()
          event->pending_sigtrap = 0;
          atomic_long_inc_not_zero(&event->refcount)
          // FAILS: task work has exited
          task_work_add(&event->pending_task)
       [...]
       <IRQ WORK>
       perf_pending_irq()
          // early return: event->oncpu = -1
       </IRQ WORK>
       [...]
    =========> TASK B -> TASK A
       perf_event_exit_task(tsk)
          perf_event_exit_event()
             free_event()
                WARN(atomic_long_cmpxchg(&event->refcount, 1, 0) != 1)
                // leak event due to unexpected refcount == 2

As a result the event is never released while the task exits.

Fix this with appropriate task_work_add()'s error handling.

Fixes: 517e6a301f ("perf: Fix perf_pending_task() UaF")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240621091601.18227-4-frederic@kernel.org
2024-07-09 13:26:33 +02:00
Frederic Weisbecker
f409530e4d task_work: Introduce task_work_cancel() again
Re-introduce task_work_cancel(), this time to cancel an actual callback
and not *any* callback pointing to a given function. This is going to be
needed for perf events event freeing.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240621091601.18227-3-frederic@kernel.org
2024-07-09 13:26:32 +02:00
Frederic Weisbecker
68cbd415dd task_work: s/task_work_cancel()/task_work_cancel_func()/
A proper task_work_cancel() API that actually cancels a callback and not
*any* callback pointing to a given function is going to be needed for
perf events event freeing. Do the appropriate rename to prepare for
that.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240621091601.18227-2-frederic@kernel.org
2024-07-09 13:26:31 +02:00
John Stultz
e81859fe64 locking/rwsem: Add __always_inline annotation to __down_write_common() and inlined callers
Apparently despite it being marked inline, the compiler
may not inline __down_write_common() which makes it difficult
to identify the cause of lock contention, as the wchan of the
blocked function will always be listed as __down_write_common().

So add __always_inline annotation to the common function (as
well as the inlined helper callers) to force it to be inlined
so a more useful blocking function will be listed (via wchan).

This mirrors commit 92cc5d00a4 ("locking/rwsem: Add
__always_inline annotation to __down_read_common() and inlined
callers") which did the same for __down_read_common.

I sort of worry that I'm playing wack-a-mole here, and talking
with compiler people, they tell me inline means nothing, which
makes me want to cry a little. So I'm wondering if we need to
replace all the inlines with __always_inline, or remove them
because either we mean something by it, or not.

Fixes: c995e638cc ("locking/rwsem: Fold __down_{read,write}*()")
Reported-by: Tim Murray <timmurray@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lkml.kernel.org/r/20240709060831.495366-1-jstultz@google.com
2024-07-09 13:26:26 +02:00
Yicong Yang
54624acf88 dma-mapping: benchmark: Don't starve others when doing the test
The test thread will start N benchmark kthreads and then schedule out
until the test time finished and notify the benchmark kthreads to stop.
The benchmark kthreads will keep running until notified to stop.
There's a problem with current implementation when the benchmark
kthreads number is equal to the CPUs on a non-preemptible kernel:
since the scheduler will balance the kthreads across the CPUs and
when the test time's out the test thread won't get a chance to be
scheduled on any CPU then cannot notify the benchmark kthreads to stop.

This can be easily reproduced on a VM (simulated with 16 CPUs) with
PREEMPT_VOLUNTARY:
estuary:/mnt$ ./dma_map_benchmark -t 16 -s 1
 rcu: INFO: rcu_sched self-detected stall on CPU
 rcu:     10-...!: (5221 ticks this GP) idle=ed24/1/0x4000000000000000 softirq=142/142 fqs=0
 rcu:     (t=5254 jiffies g=-559 q=45 ncpus=16)
 rcu: rcu_sched kthread starved for 5255 jiffies! g-559 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=12
 rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
 rcu: RCU grace-period kthread stack dump:
 task:rcu_sched       state:R  running task     stack:0     pid:16    tgid:16    ppid:2      flags:0x00000008
 Call trace
  __switch_to+0xec/0x138
  __schedule+0x2f8/0x1080
  schedule+0x30/0x130
  schedule_timeout+0xa0/0x188
  rcu_gp_fqs_loop+0x128/0x528
  rcu_gp_kthread+0x1c8/0x208
  kthread+0xec/0xf8
  ret_from_fork+0x10/0x20
 Sending NMI from CPU 10 to CPUs 0:
 NMI backtrace for cpu 0
 CPU: 0 PID: 332 Comm: dma-map-benchma Not tainted 6.10.0-rc1-vanilla-LSE #8
 Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
 pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
 pc : arm_smmu_cmdq_issue_cmdlist+0x218/0x730
 lr : arm_smmu_cmdq_issue_cmdlist+0x488/0x730
 sp : ffff80008748b630
 x29: ffff80008748b630 x28: 0000000000000000 x27: ffff80008748b780
 x26: 0000000000000000 x25: 000000000000bc70 x24: 000000000001bc70
 x23: ffff0000c12af080 x22: 0000000000010000 x21: 000000000000ffff
 x20: ffff80008748b700 x19: ffff0000c12af0c0 x18: 0000000000010000
 x17: 0000000000000001 x16: 0000000000000040 x15: ffffffffffffffff
 x14: 0001ffffffffffff x13: 000000000000ffff x12: 00000000000002f1
 x11: 000000000001ffff x10: 0000000000000031 x9 : ffff800080b6b0b8
 x8 : ffff0000c2a48000 x7 : 000000000001bc71 x6 : 0001800000000000
 x5 : 00000000000002f1 x4 : 01ffffffffffffff x3 : 000000000009aaf1
 x2 : 0000000000000018 x1 : 000000000000000f x0 : ffff0000c12af18c
 Call trace:
  arm_smmu_cmdq_issue_cmdlist+0x218/0x730
  __arm_smmu_tlb_inv_range+0xe0/0x1a8
  arm_smmu_iotlb_sync+0xc0/0x128
  __iommu_dma_unmap+0x248/0x320
  iommu_dma_unmap_page+0x5c/0xe8
  dma_unmap_page_attrs+0x38/0x1d0
  map_benchmark_thread+0x118/0x2c0
  kthread+0xec/0xf8
  ret_from_fork+0x10/0x20

Solve this by adding scheduling point in the kthread loop,
so if there're other threads in the system they may have
a chance to run, especially the thread to notify the test
end. However this may degrade the test concurrency so it's
recommended to run this on an idle system.

Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Acked-by: Barry Song <baohua@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-07-09 07:48:32 +02:00
Benjamin Tissoires
f56f4d541e bpf: helpers: fix bpf_wq_set_callback_impl signature
I realized this while having a map containing both a struct bpf_timer and
a struct bpf_wq: the third argument provided to the bpf_wq callback is
not the struct bpf_wq pointer itself, but the pointer to the value in
the map.

Which means that the users need to double cast the provided "value" as
this is not a struct bpf_wq *.

This is a change of API, but there doesn't seem to be much users of bpf_wq
right now, so we should be able to go with this right now.

Fixes: 81f1d7a583 ("bpf: wq: add bpf_wq_set_callback_impl")
Signed-off-by: Benjamin Tissoires <bentiss@kernel.org>
Link: https://lore.kernel.org/r/20240708-fix-wq-v2-1-667e5c9fbd99@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-07-08 10:01:48 -07:00
Dan Carpenter
bc239eb271 bpf: Remove unnecessary loop in task_file_seq_get_next()
After commit 0ede61d858 ("file: convert to SLAB_TYPESAFE_BY_RCU") this
loop always iterates exactly one time.  Delete the for statement and pull
the code in a tab.

Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/ZoWJF51D4zWb6f5t@stanley.mountain
2024-07-08 16:23:19 +02:00
Masami Hiramatsu (Google)
9d8616034f tracing/kprobes: Add symbol counting check when module loads
Currently, kprobe event checks whether the target symbol name is unique
or not, so that it does not put a probe on an unexpected place. But this
skips the check if the target is on a module because the module may not
be loaded.

To fix this issue, this patch checks the number of probe target symbols
in a target module when the module is loaded. If the probe is not on the
unique name symbols in the module, it will be rejected at that point.

Note that the symbol which has a unique name in the target module,
it will be accepted even if there are same-name symbols in the
kernel or other modules,

Link: https://lore.kernel.org/all/172016348553.99543.2834679315611882137.stgit@devnote2/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-07-06 09:27:47 +09:00
Lai Jiangshan
449b31ad29 workqueue: Init rescuer's affinities as the wq's effective cpumask
Make it consistent with apply_wqattrs_commit().

Link: https://lore.kernel.org/lkml/20240203154334.791910-5-longman@redhat.com/
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-05 09:14:40 -10:00
Lai Jiangshan
1726a17135 workqueue: Put PWQ allocation and WQ enlistment in the same lock C.S.
The PWQ allocation and WQ enlistment are not within the same lock-held
critical section; therefore, their states can become out of sync when
the user modifies the unbound mask or if CPU hotplug events occur in
the interim since those operations only update the WQs that are already
in the list.

Make the PWQ allocation and WQ enlistment atomic.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-05 09:14:40 -10:00
Lai Jiangshan
4e9a37389e workqueue: Move kthread_flush_worker() out of alloc_and_link_pwqs()
kthread_flush_worker() can't be called with wq_pool_mutex held.

Prepare for moving wq_pool_mutex and cpu hotplug lock out of
alloc_and_link_pwqs().

Cc: Zqiang <qiang.zhang1211@gmail.com>
Link: https://lore.kernel.org/lkml/20230920060704.24981-1-qiang.zhang1211@gmail.com/
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-05 09:14:40 -10:00
Lai Jiangshan
c5178e6ca6 workqueue: Make rescuer initialization as the last step of the creation of a new wq
For early wq allocation, rescuer initialization is the last step of the
creation of a new wq.  Make the behavior the same for all allocations.

Prepare for initializing rescuer's affinities with the default pwq's
affinities.

Prepare for moving the whole workqueue initializing procedure into
wq_pool_mutex and cpu hotplug locks.

Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-05 09:14:40 -10:00
Lai Jiangshan
c3138f3881 workqueue: Register sysfs after the whole creation of the new wq
workqueue creation includes adding it to the workqueue list.

Prepare for moving the whole workqueue initializing procedure into
wq_pool_mutex and cpu hotplug locks.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-05 09:14:40 -10:00
Chen Ridong
b824766504 cgroup/rstat: add force idle show helper
In the function cgroup_base_stat_cputime_show, there are five
instances of #ifdef, which makes the code not concise.
To address this, add the function cgroup_force_idle_show
to make the code more succinct.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-05 08:32:08 -10:00
Oleg Nesterov
8ac5dc6659 get_task_mm: check PF_KTHREAD lockless
Nowadays PF_KTHREAD is sticky and it was never protected by ->alloc_lock. 
Move the PF_KTHREAD check outside of task_lock() section to make this code
more understandable.

Link: https://lkml.kernel.org/r/20240626191017.GA20031@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04 18:05:58 -07:00
Oleg Nesterov
d73d003521 memcg: mm_update_next_owner: move for_each_thread() into try_to_set_owner()
mm_update_next_owner() checks the children / real_parent->children to
avoid the "everything else" loop in the likely case, but this won't work
if a child/sibling has a zombie leader with ->mm == NULL.

Move the for_each_thread() logic into try_to_set_owner(), if nothing else
this makes the children/siblings/everything searches more consistent.

Link: https://lkml.kernel.org/r/20240626152930.GA17936@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jinliang Zheng <alexjlzheng@tencent.com>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Tycho Andersen <tandersen@netflix.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04 18:05:58 -07:00
Oleg Nesterov
2a22b773b1 memcg: mm_update_next_owner: kill the "retry" logic
Add the new helper, try_to_set_owner(), which tries to update mm->owner
once we see c->mm == mm.  This way mm_update_next_owner() doesn't need to
restart the list_for_each_entry/for_each_process loops from the very
beginning if it races with exit/exec, it can just continue.

Unlike the current code, try_to_set_owner() re-checks tsk->mm == mm before
it drops tasklist_lock, so it doesn't need get/put_task_struct().

Link: https://lkml.kernel.org/r/20240626152924.GA17933@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jinliang Zheng <alexjlzheng@tencent.com>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Tycho Andersen <tandersen@netflix.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04 18:05:57 -07:00
Jakub Kicinski
76ed626479 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

drivers/net/phy/aquantia/aquantia.h
  219343755e ("net: phy: aquantia: add missing include guards")
  61578f6793 ("net: phy: aquantia: add support for PHY LEDs")

drivers/net/ethernet/wangxun/libwx/wx_hw.c
  bd07a98178 ("net: txgbe: remove separate irq request for MSI and INTx")
  b501d261a5 ("net: txgbe: add FDIR ATR support")
https://lore.kernel.org/all/20240703112936.483c1975@canb.auug.org.au/

include/linux/mlx5/mlx5_ifc.h
  048a403648 ("net/mlx5: IFC updates for changing max EQs")
  99be56171f ("net/mlx5e: SHAMPO, Re-enable HW-GRO")
https://lore.kernel.org/all/20240701133951.6926b2e3@canb.auug.org.au/

Adjacent changes:

drivers/net/wireless/intel/iwlwifi/mvm/mac80211.c
  4130c67cd1 ("wifi: iwlwifi: mvm: check vif for NULL/ERR_PTR before dereference")
  3f3126515f ("wifi: iwlwifi: mvm: add mvm-specific guard")

include/net/mac80211.h
  816c6bec09 ("wifi: mac80211: fix BSS_CHANGED_UNSOL_BCAST_PROBE_RESP")
  5a009b42e0 ("wifi: mac80211: track changes in AP's TPE")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-04 14:16:11 -07:00
Paul E. McKenney
02219caa92 Merge branches 'doc.2024.06.06a', 'fixes.2024.07.04a', 'mb.2024.06.28a', 'nocb.2024.06.03a', 'rcu-tasks.2024.06.06a', 'rcutorture.2024.06.06a' and 'srcu.2024.06.18a' into HEAD
doc.2024.06.06a: Documentation updates.
fixes.2024.07.04a: Miscellaneous fixes.
mb.2024.06.28a: Grace-period memory-barrier redundancy removal.
nocb.2024.06.03a: No-CB CPU updates.
rcu-tasks.2024.06.06a: RCU-Tasks updates.
rcutorture.2024.06.06a: Torture-test updates.
srcu.2024.06.18a: SRCU polled-grace-period updates.
2024-07-04 13:54:17 -07:00
Frederic Weisbecker
55d4669ef1 rcu: Fix rcu_barrier() VS post CPUHP_TEARDOWN_CPU invocation
When rcu_barrier() calls rcu_rdp_cpu_online() and observes a CPU off
rnp->qsmaskinitnext, it means that all accesses from the offline CPU
preceding the CPUHP_TEARDOWN_CPU are visible to RCU barrier, including
callbacks expiration and counter updates.

However interrupts can still fire after stop_machine() re-enables
interrupts and before rcutree_report_cpu_dead(). The related accesses
happening between CPUHP_TEARDOWN_CPU and rnp->qsmaskinitnext clearing
are _NOT_ guaranteed to be seen by rcu_barrier() without proper
ordering, especially when callbacks are invoked there to the end, making
rcutree_migrate_callback() bypass barrier_lock.

The following theoretical race example can make rcu_barrier() hang:

CPU 0                                               CPU 1
-----                                               -----
//cpu_down()
smpboot_park_threads()
//ksoftirqd is parked now
<IRQ>
rcu_sched_clock_irq()
   invoke_rcu_core()
do_softirq()
   rcu_core()
      rcu_do_batch()
         // callback storm
         // rcu_do_batch() returns
         // before completing all
         // of them
   // do_softirq also returns early because of
   // timeout. It defers to ksoftirqd but
   // it's parked
</IRQ>
stop_machine()
   take_cpu_down()
                                                    rcu_barrier()
                                                        spin_lock(barrier_lock)
                                                        // observes rcu_segcblist_n_cbs(&rdp->cblist) != 0
<IRQ>
do_softirq()
   rcu_core()
      rcu_do_batch()
         //completes all pending callbacks
         //smp_mb() implied _after_ callback number dec
</IRQ>

rcutree_report_cpu_dead()
   rnp->qsmaskinitnext &= ~rdp->grpmask;

rcutree_migrate_callback()
   // no callback, early return without locking
   // barrier_lock
                                                        //observes !rcu_rdp_cpu_online(rdp)
                                                        rcu_barrier_entrain()
                                                           rcu_segcblist_entrain()
                                                              // Observe rcu_segcblist_n_cbs(rsclp) == 0
                                                              // because no barrier between reading
                                                              // rnp->qsmaskinitnext and rsclp->len
                                                              rcu_segcblist_add_len()
                                                                 smp_mb__before_atomic()
                                                                 // will now observe the 0 count and empty
                                                                 // list, but too late, we enqueue regardless
                                                                 WRITE_ONCE(rsclp->len, rsclp->len + v);
                                                        // ignored barrier callback
                                                        // rcu barrier stall...

This could be solved with a read memory barrier, enforcing the message
passing between rnp->qsmaskinitnext and rsclp->len, matching the full
memory barrier after rsclp->len addition in rcu_segcblist_add_len()
performed at the end of rcu_do_batch().

However the rcu_barrier() is complicated enough and probably doesn't
need too many more subtleties. CPU down is a slowpath and the
barrier_lock seldom contended. Solve the issue with unconditionally
locking the barrier_lock on rcutree_migrate_callbacks(). This makes sure
that either rcu_barrier() sees the empty queue or its entrained
callback will be migrated.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2024-07-04 13:48:57 -07:00
Oleg Nesterov
6f4cec22c3 rcu: Eliminate lockless accesses to rcu_sync->gp_count
The rcu_sync structure's ->gp_count field is always accessed under the
protection of that same structure's ->rss_lock field, with the exception
of a pair of WARN_ON_ONCE() calls just prior to acquiring that lock in
functions rcu_sync_exit() and rcu_sync_dtor().  These lockless accesses
are unnecessary and impair KCSAN's ability to catch bugs that might be
inserted via other lockless accesses.

This commit therefore moves those WARN_ON_ONCE() calls under the lock.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2024-07-04 13:48:57 -07:00
Paul E. McKenney
68d124b099 rcu: Add rcutree.nohz_full_patience_delay to reduce nohz_full OS jitter
If a CPU is running either a userspace application or a guest OS in
nohz_full mode, it is possible for a system call to occur just as an
RCU grace period is starting.  If that CPU also has the scheduling-clock
tick enabled for any reason (such as a second runnable task), and if the
system was booted with rcutree.use_softirq=0, then RCU can add insult to
injury by awakening that CPU's rcuc kthread, resulting in yet another
task and yet more OS jitter due to switching to that task, running it,
and switching back.

In addition, in the common case where that system call is not of
excessively long duration, awakening the rcuc task is pointless.
This pointlessness is due to the fact that the CPU will enter an extended
quiescent state upon returning to the userspace application or guest OS.
In this case, the rcuc kthread cannot do anything that the main RCU
grace-period kthread cannot do on its behalf, at least if it is given
a few additional milliseconds (for example, given the time duration
specified by rcutree.jiffies_till_first_fqs, give or take scheduling
delays).

This commit therefore adds a rcutree.nohz_full_patience_delay kernel
boot parameter that specifies the grace period age (in milliseconds,
rounded to jiffies) before which RCU will refrain from awakening the
rcuc kthread.  Preliminary experimentation suggests a value of 1000,
that is, one second.  Increasing rcutree.nohz_full_patience_delay will
increase grace-period latency and in turn increase memory footprint,
so systems with constrained memory might choose a smaller value.
Systems with less-aggressive OS-jitter requirements might choose the
default value of zero, which keeps the traditional immediate-wakeup
behavior, thus avoiding increases in grace-period latency.

[ paulmck: Apply Leonardo Bras feedback.  ]

Link: https://lore.kernel.org/all/20240328171949.743211-1-leobras@redhat.com/

Reported-by: Leonardo Bras <leobras@redhat.com>
Suggested-by: Leonardo Bras <leobras@redhat.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Leonardo Bras <leobras@redhat.com>
2024-07-04 13:47:39 -07:00
Peter Zijlstra
0c8ea05e9b Merge branch 'tip/x86/cpu'
The Lunarlake patches rely on the new VFM stuff.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
2024-07-04 16:00:24 +02:00
Adrian Hunter
0ca4da2412 perf: Make rb_alloc_aux() return an error immediately if nr_pages <= 0
rb_alloc_aux() should not be called with nr_pages <= 0. Make it more robust
and readable by returning an error immediately in that case.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240624201101.60186-8-adrian.hunter@intel.com
2024-07-04 16:00:23 +02:00
Adrian Hunter
43deb76b19 perf: Fix default aux_watermark calculation
The default aux_watermark is half the AUX area buffer size. In general,
on a 64-bit architecture, the AUX area buffer size could be a bigger than
fits in a 32-bit type, but the calculation does not allow for that
possibility.

However the aux_watermark value is recorded in a u32, so should not be
more than U32_MAX either.

Fix by doing the calculation in a correctly sized type, and limiting the
result to U32_MAX.

Fixes: d68e6799a5 ("perf: Cap allocation order at aux_watermark")
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240624201101.60186-7-adrian.hunter@intel.com
2024-07-04 16:00:23 +02:00
Adrian Hunter
dbc48c8f41 perf: Prevent passing zero nr_pages to rb_alloc_aux()
nr_pages is unsigned long but gets passed to rb_alloc_aux() as an int,
and is stored as an int.

Only power-of-2 values are accepted, so if nr_pages is a 64_bit value, it
will be passed to rb_alloc_aux() as zero.

That is not ideal because:
 1. the value is incorrect
 2. rb_alloc_aux() is at risk of misbehaving, although it manages to
 return -ENOMEM in that case, it is a result of passing zero to get_order()
 even though the get_order() result is documented to be undefined in that
 case.

Fix by simply validating the maximum supported value in the first place.
Use -ENOMEM error code for consistency with the current error code that
is returned in that case.

Fixes: 45bfb2e504 ("perf: Add AUX area to ring buffer for raw data streams")
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240624201101.60186-6-adrian.hunter@intel.com
2024-07-04 16:00:22 +02:00
Adrian Hunter
3df94a5b10 perf: Fix perf_aux_size() for greater-than 32-bit size
perf_buffer->aux_nr_pages uses a 32-bit type, so a cast is needed to
calculate a 64-bit size.

Fixes: 45bfb2e504 ("perf: Add AUX area to ring buffer for raw data streams")
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240624201101.60186-5-adrian.hunter@intel.com
2024-07-04 16:00:22 +02:00
Tejun Heo
d329605287 sched/fair: set_load_weight() must also call reweight_task() for SCHED_IDLE tasks
When a task's weight is being changed, set_load_weight() is called with
@update_load set. As weight changes aren't trivial for the fair class,
set_load_weight() calls fair.c::reweight_task() for fair class tasks.

However, set_load_weight() first tests task_has_idle_policy() on entry and
skips calling reweight_task() for SCHED_IDLE tasks. This is buggy as
SCHED_IDLE tasks are just fair tasks with a very low weight and they would
incorrectly skip load, vlag and position updates.

Fix it by updating reweight_task() to take struct load_weight as idle weight
can't be expressed with prio and making set_load_weight() call
reweight_task() for SCHED_IDLE tasks too when @update_load is set.

Fixes: 9059393e4e ("sched/fair: Use reweight_entity() for set_user_nice()")
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org # v4.15+
Link: http://lkml.kernel.org/r/20240624102331.GI31592@noisy.programming.kicks-ass.net
2024-07-04 15:59:52 +02:00
Tvrtko Ursulin
0ec208ce98 sched/psi: Optimise psi_group_change a bit
The current code loops over the psi_states only to call a helper which
then resolves back to the action needed for each state using a switch
statement. That is effectively creating a double indirection of a kind
which, given how all the states need to be explicitly listed and handled
anyway, we can simply remove. Both the for loop and the switch statement
that is.

The benefit is both in the code size and CPU time spent in this function.
YMMV but on my Steam Deck, while in a game, the patch makes the CPU usage
go from ~2.4% down to ~1.2%. Text size at the same time went from 0x323 to
0x2c1.

Signed-off-by: Tvrtko Ursulin <tursulin@ursulin.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20240625135000.38652-1-tursulin@igalia.com
2024-07-04 15:59:52 +02:00
Tony Lindgren
7640f1a44e printk: Add match_devname_and_update_preferred_console()
Let's add match_devname_and_update_preferred_console() for driver
subsystems to call during init when the console is ready, and it's
character device name is known. For now, we use it only for the serial
layer to allow console=DEVNAME:0.0 style hardware based addressing for
consoles.

The earlier attempt on doing this caused a regression with the kernel
command line console order as it added calling __add_preferred_console()
again later on during init. A better approach was suggested by Petr where
we add the deferred console to the console_cmdline[] and update it later
on when the console is ready.

Suggested-by: Petr Mladek <pmladek@suse.com>
Co-developed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240703100615.118762-2-tony.lindgren@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-07-04 15:41:44 +02:00
Bartosz Golaszewski
011f583781 genirq/irq_sim: add an extended irq_sim initializer
Currently users of the interrupt simulator don't have any way of being
notified about interrupts from the simulated domain being requested or
released. This causes a problem for one of the users - the GPIO
simulator - which is unable to lock the pins as interrupts.

Define a structure containing callbacks to be executed on various
irq_sim-related events (for now: irq request and release) and provide an
extended function for creating simulated interrupt domains that takes it
and a pointer to custom user data (to be passed to said callbacks) as
arguments.

Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Link: https://lore.kernel.org/r/20240624093934.17089-2-brgl@bgdev.pl
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
2024-07-04 09:25:59 +02:00
Ilya Leoshkevich
c02525a339 ftrace: unpoison ftrace_regs in ftrace_ops_list_func()
Patch series "kmsan: Enable on s390", v7.


Architectures use assembly code to initialize ftrace_regs and call
ftrace_ops_list_func().  Therefore, from the KMSAN's point of view,
ftrace_regs is poisoned on ftrace_ops_list_func entry().  This causes
KMSAN warnings when running the ftrace testsuite.

Fix by trusting the architecture-specific assembly code and always
unpoisoning ftrace_regs in ftrace_ops_list_func.

The issue was not encountered on x86_64 so far only by accident:
assembly-allocated ftrace_regs was overlapping a stale partially
unpoisoned stack frame.  Poisoning stack frames before returns [1] makes
the issue appear on x86_64 as well.

[1] https://github.com/iii-i/llvm-project/commits/msan-poison-allocas-before-returning-2024-06-12/

Link: https://lkml.kernel.org/r/20240621113706.315500-1-iii@linux.ibm.com
Link: https://lkml.kernel.org/r/20240621113706.315500-2-iii@linux.ibm.com
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <kasan-dev@googlegroups.com>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03 19:30:21 -07:00
Barry Song
15bde4abab mm: extend rmap flags arguments for folio_add_new_anon_rmap
Patch series "mm: clarify folio_add_new_anon_rmap() and
__folio_add_anon_rmap()", v2.

This patchset is preparatory work for mTHP swapin.

folio_add_new_anon_rmap() assumes that new anon rmaps are always
exclusive.  However, this assumption doesn’t hold true for cases like
do_swap_page(), where a new anon might be added to the swapcache and is
not necessarily exclusive.

The patchset extends the rmap flags to allow folio_add_new_anon_rmap() to
handle both exclusive and non-exclusive new anon folios.  The
do_swap_page() function is updated to use this extended API with rmap
flags.  Consequently, all new anon folios now consistently use
folio_add_new_anon_rmap().  The special case for !folio_test_anon() in
__folio_add_anon_rmap() can be safely removed.

In conclusion, new anon folios always use folio_add_new_anon_rmap(),
regardless of exclusivity.  Old anon folios continue to use
__folio_add_anon_rmap() via folio_add_anon_rmap_pmd() and
folio_add_anon_rmap_ptes().


This patch (of 3):

In the case of a swap-in, a new anonymous folio is not necessarily
exclusive.  This patch updates the rmap flags to allow a new anonymous
folio to be treated as either exclusive or non-exclusive.  To maintain the
existing behavior, we always use EXCLUSIVE as the default setting.

[akpm@linux-foundation.org: cleanup and constifications per David and akpm]
[v-songbaohua@oppo.com: fix missing doc for flags of folio_add_new_anon_rmap()]
  Link: https://lkml.kernel.org/r/20240619210641.62542-1-21cnbao@gmail.com
[v-songbaohua@oppo.com: enhance doc for extend rmap flags arguments for folio_add_new_anon_rmap]
  Link: https://lkml.kernel.org/r/20240622030256.43775-1-21cnbao@gmail.com
Link: https://lkml.kernel.org/r/20240617231137.80726-1-21cnbao@gmail.com
Link: https://lkml.kernel.org/r/20240617231137.80726-2-21cnbao@gmail.com
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Tested-by: Shuai Yuan <yuanshuai@oppo.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03 19:30:18 -07:00
Jinliang Zheng
76ba6acfcc mm: optimize the redundant loop of mm_update_owner_next()
When mm_update_owner_next() is racing with swapoff (try_to_unuse()) or
/proc or ptrace or page migration (get_task_mm()), it is impossible to
find an appropriate task_struct in the loop whose mm_struct is the same as
the target mm_struct.

If the above race condition is combined with the stress-ng-zombie and
stress-ng-dup tests, such a long loop can easily cause a Hard Lockup in
write_lock_irq() for tasklist_lock.

Recognize this situation in advance and exit early.

Link: https://lkml.kernel.org/r/20240620122123.3877432-1-alexjlzheng@tencent.com
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tycho Andersen <tandersen@netflix.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03 19:30:15 -07:00
Barry Song
54f7a49c20 mm: remove the implementation of swap_free() and always use swap_free_nr()
To streamline maintenance efforts, we propose removing the implementation
of swap_free().  Instead, we can simply invoke swap_free_nr() with nr set
to 1.  swap_free_nr() is designed with a bitmap consisting of only one
long, resulting in overhead that can be ignored for cases where nr equals
1.

A prime candidate for leveraging swap_free_nr() lies within
kernel/power/swap.c.  Implementing this change facilitates the adoption of
batch processing for hibernation.

Link: https://lkml.kernel.org/r/20240529082824.150954-3-21cnbao@gmail.com
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Suggested-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Chris Li <chrisl@kernel.org>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Len Brown <len.brown@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Chuanhua Han <hanchuanhua@oppo.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Gao Xiang <xiang@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kairui Song <kasong@tencent.com>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03 19:30:01 -07:00
Anna-Maria Behnsen
59dbee7d4d tick/sched: Combine WARN_ON_ONCE and print_once
When the WARN_ON_ONCE() triggers, the printk() of the additional
information related to the warning will not happen in print level
"warn". When reading dmesg with a restriction to level "warn", the
information published by the printk_once() will not show up there.

Transform WARN_ON_ONCE() and printk_once() into a WARN_ONCE().

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20240610103552.25252-1-anna-maria@linutronix.de
2024-07-03 21:32:55 +02:00
Jinliang Zheng
cf3f9a593d mm: optimize the redundant loop of mm_update_owner_next()
When mm_update_owner_next() is racing with swapoff (try_to_unuse()) or
/proc or ptrace or page migration (get_task_mm()), it is impossible to
find an appropriate task_struct in the loop whose mm_struct is the same as
the target mm_struct.

If the above race condition is combined with the stress-ng-zombie and
stress-ng-dup tests, such a long loop can easily cause a Hard Lockup in
write_lock_irq() for tasklist_lock.

Recognize this situation in advance and exit early.

Link: https://lkml.kernel.org/r/20240620122123.3877432-1-alexjlzheng@tencent.com
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tycho Andersen <tandersen@netflix.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03 12:29:24 -07:00
Waiman Long
57b56d1680 cgroup: Protect css->cgroup write under css_set_lock
The writing of css->cgroup associated with the cgroup root in
rebind_subsystems() is currently protected only by cgroup_mutex.
However, the reading of css->cgroup in both proc_cpuset_show() and
proc_cgroup_show() is protected just by css_set_lock. That makes the
readers susceptible to racing problems like data tearing or caching.
It is also a problem that can be reported by KCSAN.

This can be fixed by using READ_ONCE() and WRITE_ONCE() to access
css->cgroup. Alternatively, the writing of css->cgroup can be moved
under css_set_lock as well which is done by this patch.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-03 08:59:06 -10:00
Xiu Jianfeng
1028f391d5 cgroup/misc: Introduce misc.peak
Introduce misc.peak to record the historical maximum usage of the
resource, as in some scenarios the value of misc.max could be
adjusted based on the peak usage of the resource.

Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-03 08:08:43 -10:00
Kees Cook
67f2df3b82 mm/slab: Plumb kmem_buckets into __do_kmalloc_node()
Introduce CONFIG_SLAB_BUCKETS which provides the infrastructure to
support separated kmalloc buckets (in the following kmem_buckets_create()
patches and future codetag-based separation). Since this will provide
a mitigation for a very common case of exploits, it is recommended to
enable this feature for general purpose distros. By default, the new
Kconfig will be enabled if CONFIG_SLAB_FREELIST_HARDENED is enabled (and
it is added to the hardening.config Kconfig fragment).

To be able to choose which buckets to allocate from, make the buckets
available to the internal kmalloc interfaces by adding them as the
second argument, rather than depending on the buckets being chosen from
the fixed set of global buckets. Where the bucket is not available,
pass NULL, which means "use the default system kmalloc bucket set"
(the prior existing behavior), as implemented in kmalloc_slab().

To avoid adding the extra argument when !CONFIG_SLAB_BUCKETS, only the
top-level macros and static inlines use the buckets argument (where
they are stripped out and compiled out respectively). The actual extern
functions can then be built without the argument, and the internals
fall back to the global kmalloc buckets unconditionally.

Co-developed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-07-03 12:24:19 +02:00
Lai Jiangshan
b3d209164d workqueue: Simplify goto statement
Use a simple if-statement to replace the cumbersome goto-statement in
workqueue_set_unbound_cpumask().

Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-02 07:17:22 -10:00
Lai Jiangshan
8416588323 workqueue: Update cpumasks after only applying it successfully
Make workqueue_unbound_exclude_cpumask() and workqueue_set_unbound_cpumask()
only update wq_isolated_cpumask and wq_requested_unbound_cpumask when
workqueue_apply_unbound_cpumask() returns successfully.

Fixes: fe28f631fa94("workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs from wq_unbound_cpumask")
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-02 07:14:33 -10:00
Florian Lehner
fd8db07705 bpf, devmap: Add .map_alloc_check
Use the .map_allock_check callback to perform allocation checks before
allocating memory for the devmap.

Signed-off-by: Florian Lehner <dev@der-flo.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240615101158.57889-1-dev@der-flo.net
2024-07-02 19:05:25 +02:00
Ilya Leoshkevich
df34ec9db6 bpf: Fix atomic probe zero-extension
Zero-extending results of atomic probe operations fails with:

    verifier bug. zext_dst is set, but no reg is defined

The problem is that insn_def_regno() handles BPF_ATOMICs, but not
BPF_PROBE_ATOMICs. Fix by adding the missing condition.

Fixes: d503a04f8b ("bpf: Add support for certain atomics in bpf_arena to x86 JIT")
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240701234304.14336-2-iii@linux.ibm.com
2024-07-02 18:31:35 +02:00
Yafang Shao
9205269280 livepatch: Replace snprintf() with sysfs_emit()
Let's use sysfs_emit() instead of snprintf().

Suggested-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Link: https://lore.kernel.org/r/20240625151123.2750-4-laoar.shao@gmail.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-07-02 16:56:18 +02:00
Yafang Shao
adb68ed26a livepatch: Add "replace" sysfs attribute
There are situations when it might make sense to combine livepatches
with and without the atomic replace on the same system. For example,
the livepatch without the atomic replace might provide a hotfix
or extra tuning.

Managing livepatches on such systems might be challenging. And the
information which of the installed livepatches do not use the atomic
replace would be useful.

Add new sysfs interface 'replace'. It works as follows:

   $ cat /sys/kernel/livepatch/livepatch-non_replace/replace
   0

   $ cat /sys/kernel/livepatch/livepatch-replace/replace
   1

[ commit log improved by Petr ]

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Link: https://lore.kernel.org/r/20240625151123.2750-2-laoar.shao@gmail.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-07-02 16:56:18 +02:00
Sebastian Andrzej Siewior
e3d69f585d net: Move flush list retrieval to where it is used.
The bpf_net_ctx_get_.*_flush_list() are used at the top of the function.
This means the variable is always assigned even if unused. By moving the
function to where it is used, it is possible to delay the initialisation
until it is unavoidable.
Not sure how much this gains in reality but by looking at bq_enqueue()
(in devmap.c) gcc pushes one register less to the stack. \o/.

 Move flush list retrieval to where it is used.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-07-02 15:26:57 +02:00
Sebastian Andrzej Siewior
d839a73179 net: Optimize xdp_do_flush() with bpf_net_context infos.
Every NIC driver utilizing XDP should invoke xdp_do_flush() after
processing all packages. With the introduction of the bpf_net_context
logic the flush lists (for dev, CPU-map and xsk) are lazy initialized
only if used. However xdp_do_flush() tries to flush all three of them so
all three lists are always initialized and the likely empty lists are
"iterated".
Without the usage of XDP but with CONFIG_DEBUG_NET the lists are also
initialized due to xdp_do_check_flushed().

Jakub suggest to utilize the hints in bpf_net_context and avoid invoking
the flush function. This will also avoiding initializing the lists which
are otherwise unused.

Introduce bpf_net_ctx_get_all_used_flush_lists() to return the
individual list if not-empty. Use the logic in xdp_do_flush() and
xdp_do_check_flushed(). Remove the not needed .*_check_flush().

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-07-02 15:26:57 +02:00
Sebastian Andrzej Siewior
2896624be3 net: Remove task_struct::bpf_net_context init on fork.
There is no clone() invocation within a bpf_net_ctx_…() block. Therefore
the task_struct::bpf_net_context has always to be NULL and an explicit
initialisation is not required.

Remove the NULL assignment in the clone() path.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-07-02 15:26:57 +02:00
Jiapeng Chong
b576d375b5 fgraph: Use str_plural() in test_graph_storage_single()
Use existing str_plural() function rather than duplicating its
implementation.

./kernel/trace/trace_selftest.c:880:56-60: opportunity for str_plural(size).

Link: https://lore.kernel.org/linux-trace-kernel/20240618072014.20855-1-jiapeng.chong@linux.alibaba.com

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=9349
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-07-01 19:57:51 -04:00
Luis Claudio R. Goncalves
c40583e19e rtla/osnoise: set the default threshold to 1us
Change the default threshold for osnoise to 1us, so that any noise
equal or above this value is recorded. Let the user set a higher
threshold if necessary.

Link: https://lore.kernel.org/linux-trace-kernel/Zmb-QhiiiI6jM9To@uudg.org

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Suggested-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Reviewed-by: Clark Williams <williams@redhat.com>
Signed-off-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-07-01 18:54:31 -04:00
Pu Lehui
d1a426171d bpf: Use precise image size for struct_ops trampoline
For trampoline using bpf_prog_pack, we need to generate a rw_image
buffer with size of (image_end - image). For regular trampoline, we use
the precise image size generated by arch_bpf_trampoline_size to allocate
rw_image. But for struct_ops trampoline, we allocate rw_image directly
using close to PAGE_SIZE size. We do not need to allocate for that much,
as the patch size is usually much smaller than PAGE_SIZE. Let's use
precise image size for it too.

Signed-off-by: Pu Lehui <pulehui@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Björn Töpel <bjorn@rivosinc.com> #riscv
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/bpf/20240622030437.3973492-2-pulehui@huaweicloud.com
2024-07-01 17:10:46 +02:00
John Stultz
ddae0ca2a8 sched: Move psi_account_irqtime() out of update_rq_clock_task() hotpath
It was reported that in moving to 6.1, a larger then 10%
regression was seen in the performance of
clock_gettime(CLOCK_THREAD_CPUTIME_ID,...).

Using a simple reproducer, I found:
5.10:
100000000 calls in 24345994193 ns => 243.460 ns per call
100000000 calls in 24288172050 ns => 242.882 ns per call
100000000 calls in 24289135225 ns => 242.891 ns per call

6.1:
100000000 calls in 28248646742 ns => 282.486 ns per call
100000000 calls in 28227055067 ns => 282.271 ns per call
100000000 calls in 28177471287 ns => 281.775 ns per call

The cause of this was finally narrowed down to the addition of
psi_account_irqtime() in update_rq_clock_task(), in commit
52b1364ba0 ("sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ
pressure").

In my initial attempt to resolve this, I leaned towards moving
all accounting work out of the clock_gettime() call path, but it
wasn't very pretty, so it will have to wait for a later deeper
rework. Instead, Peter shared this approach:

Rework psi_account_irqtime() to use its own psi_irq_time base
for accounting, and move it out of the hotpath, calling it
instead from sched_tick() and __schedule().

In testing this, we found the importance of ensuring
psi_account_irqtime() is run under the rq_lock, which Johannes
Weiner helpfully explained, so also add some lockdep annotations
to make that requirement clear.

With this change the performance is back in-line with 5.10:
6.1+fix:
100000000 calls in 24297324597 ns => 242.973 ns per call
100000000 calls in 24318869234 ns => 243.189 ns per call
100000000 calls in 24291564588 ns => 242.916 ns per call

Reported-by: Jimmy Shiu <jimmyshiu@google.com>
Originally-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Reviewed-by: Qais Yousef <qyousef@layalina.io>
Link: https://lore.kernel.org/r/20240618215909.4099720-1-jstultz@google.com
2024-07-01 13:01:44 +02:00
Wander Lairson Costa
b58652db66 sched/deadline: Fix task_struct reference leak
During the execution of the following stress test with linux-rt:

stress-ng --cyclic 30 --timeout 30 --minimize --quiet

kmemleak frequently reported a memory leak concerning the task_struct:

unreferenced object 0xffff8881305b8000 (size 16136):
  comm "stress-ng", pid 614, jiffies 4294883961 (age 286.412s)
  object hex dump (first 32 bytes):
    02 40 00 00 00 00 00 00 00 00 00 00 00 00 00 00  .@..............
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  debug hex dump (first 16 bytes):
    53 09 00 00 00 00 00 00 00 00 00 00 00 00 00 00  S...............
  backtrace:
    [<00000000046b6790>] dup_task_struct+0x30/0x540
    [<00000000c5ca0f0b>] copy_process+0x3d9/0x50e0
    [<00000000ced59777>] kernel_clone+0xb0/0x770
    [<00000000a50befdc>] __do_sys_clone+0xb6/0xf0
    [<000000001dbf2008>] do_syscall_64+0x5d/0xf0
    [<00000000552900ff>] entry_SYSCALL_64_after_hwframe+0x6e/0x76

The issue occurs in start_dl_timer(), which increments the task_struct
reference count and sets a timer. The timer callback, dl_task_timer,
is supposed to decrement the reference count upon expiration. However,
if enqueue_task_dl() is called before the timer expires and cancels it,
the reference count is not decremented, leading to the leak.

This patch fixes the reference leak by ensuring the task_struct
reference count is properly decremented when the timer is canceled.

Fixes: feff2e65ef ("sched/deadline: Unthrottle PI boosted threads while enqueuing")
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/20240620125618.11419-1-wander@redhat.com
2024-07-01 13:01:44 +02:00
Josh Don
2feab2492d Revert "sched/fair: Make sure to try to detach at least one movable task"
This reverts commit b0defa7ae0.

b0defa7ae0 changed the load balancing logic to ignore env.max_loop if
all tasks examined to that point were pinned. The goal of the patch was
to make it more likely to be able to detach a task buried in a long list
of pinned tasks. However, this has the unfortunate side effect of
creating an O(n) iteration in detach_tasks(), as we now must fully
iterate every task on a cpu if all or most are pinned. Since this load
balance code is done with rq lock held, and often in softirq context, it
is very easy to trigger hard lockups. We observed such hard lockups with
a user who affined O(10k) threads to a single cpu.

When I discussed this with Vincent he initially suggested that we keep
the limit on the number of tasks to detach, but increase the number of
tasks we can search. However, after some back and forth on the mailing
list, he recommended we instead revert the original patch, as it seems
likely no one was actually getting hit by the original issue.

Fixes: b0defa7ae0 ("sched/fair: Make sure to try to detach at least one movable task")
Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20240620214450.316280-1-joshdon@google.com
2024-07-01 13:01:43 +02:00
Linus Torvalds
3e334486ec TTY/Serial/Console fixes for 6.10-rc6
Here are a bunch of fixes/reverts for 6.10-rc6.  Include in here are:
   - revert the bunch of tty/serial/console changes that landed in -rc1
     that didn't quite work properly yet.  Everyone agreed to just revert
     them for now and will work on making them better for a future
     release instead of trying to quick fix the existing changes this
     late in the release cycle
   - 8250 driver port count bugfix
   - Other tiny serial port bugfixes for reported issues
 
 All of these have been in linux-next this week with no reported issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZoFmvg8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ymziACgvoDTxuDHHfPOd6h/1qrHqYpFK1YAn2IDMJGj
 Ng4/I/gwnkJeeHQC5JSn
 =g9o4
 -----END PGP SIGNATURE-----

Merge tag 'tty-6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty

Pull tty / serial / console fixes from Greg KH:
 "Here are a bunch of fixes/reverts for 6.10-rc6.  Include in here are:

   - revert the bunch of tty/serial/console changes that landed in -rc1
     that didn't quite work properly yet.

     Everyone agreed to just revert them for now and will work on making
     them better for a future release instead of trying to quick fix the
     existing changes this late in the release cycle

   - 8250 driver port count bugfix

   - Other tiny serial port bugfixes for reported issues

  All of these have been in linux-next this week with no reported
  issues"

* tag 'tty-6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  Revert "printk: Save console options for add_preferred_console_match()"
  Revert "printk: Don't try to parse DEVNAME:0.0 console options"
  Revert "printk: Flag register_console() if console is set on command line"
  Revert "serial: core: Add support for DEVNAME:0.0 style naming for kernel console"
  Revert "serial: core: Handle serial console options"
  Revert "serial: 8250: Add preferred console in serial8250_isa_init_ports()"
  Revert "Documentation: kernel-parameters: Add DEVNAME:0.0 format for serial ports"
  Revert "serial: 8250: Fix add preferred console for serial8250_isa_init_ports()"
  Revert "serial: core: Fix ifdef for serial base console functions"
  serial: bcm63xx-uart: fix tx after conversion to uart_port_tx_limited()
  serial: core: introduce uart_port_tx_limited_flags()
  Revert "serial: core: only stop transmit when HW fifo is empty"
  serial: imx: set receiver level before starting uart
  tty: mcf: MCF54418 has 10 UARTS
  serial: 8250_omap: Implementation of Errata i2310
  tty: serial: 8250: Fix port count mismatch with the device
2024-06-30 08:57:43 -07:00
Linus Torvalds
3ffea9a7a6 - Fix "nosmp" and "maxcpus=0" after the parallel CPU bringup work went in and broke them
- Make sure CPU hotplug dynamic prepare states are actually executed
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmaBF4QACgkQEsHwGGHe
 VUq3JA//UOmjzHAdcRnGNnh6h1dMKyQW4KH98eiQMXaSuvDeBOCAd6Y4tq6YF/Om
 AIxHgLlhOY5O1OSVJZhtxf/lALkolCAEIAkIvGvn6EpPjog5UtNoIf6XAzwLzMn3
 O8WVASO2fkypaAYBY+tUEQoLY6CAfkxogV0lzNA8HGMr6Yf/YWueiK2GO63z9Bgt
 n0h0362xqACMdUbFnPGrX2wpMDA+WuhHwl8Z1Z1TB0rprYiA/tFCMLcVkT3Fezjh
 hx7sYMwBM8cunMya8p9ucd4kBUJROrfNo4SfHWfG0lsitW/cflTgRXOfLp4GFLvp
 z0OI9oeSHQyRATOU9yiXrWcbO8M3rFRw4/YcdRZ+5mlydJWDM00DZPqPcuxs4R3Q
 nH3gE82CvzWchLU5InHwYhi5oqwNUq1N8mz2bN4T9Yjtaj7zArSLqjqIafhxpJqV
 9DllV9gGroAUawlRSgo5dpl2XvPcbr9Sx8bIJqwn36esuBb2qZwL6pOtVJIBr88O
 QWamnvUH6NnIqweUUR9lRRjO5WjR3Xf2ECpEt5rNnqHXLn92usNaphEhBDo3tvrG
 +O3pjNER3sTEgF43yYpDX0gMZmHuXfmN+fT6QDcDGk764As+/UawIHStyI3nustI
 R7gM6SUx8Fv3883LuzZtQ7KNLuhPvLxf8YD2I626HpTtLA9tn5k=
 =qGvT
 -----END PGP SIGNATURE-----

Merge tag 'smp_urgent_for_v6.10_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull smp fixes from Borislav Petkov:

 - Fix "nosmp" and "maxcpus=0" after the parallel CPU bringup work went
   in and broke them

 - Make sure CPU hotplug dynamic prepare states are actually executed

* tag 'smp_urgent_for_v6.10_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  cpu: Fix broken cmdline "nosmp" and "maxcpus=0"
  cpu/hotplug: Fix dynstate assignment in __cpuhp_setup_state_cpuslocked()
2024-06-30 08:41:42 -07:00
Linus Torvalds
03c8b0bd46 - Warn when an hrtimer doesn't get a callback supplied
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmaBFDwACgkQEsHwGGHe
 VUqPnw//aRU3MsjXkoBKmK98O7M+6qoHL2rFRGcvw1o0GxzVB4gODgE2mZWeirG7
 JLRp/lVX4xhR85NSBpKlmsnnkC8UCDnpXLRpO24ZTdlc84xEyJGsN0gHqJPjpm9M
 GkBLRPOwDiSEBzL++6IyR/m3f88WDucQJXVyFa/LQIkSiFdzPBbLwX4otuIieD19
 6niyXlqQQ+iAyvkDIH7tNELrOHxivPpH3+QQEfAdtE7TWamv5dkQpu9Kbf811vQb
 DUsaD4E2+kQUY9ulevvz9OnsGpyhd3m30PUOHKdsrUfaE9bM/RTBDpnQ1dR3lPFD
 kEb4OXsrcM0z++eIUUTBMpRATVjxl17nSgkDg5S6GLTq/Om4KQP33Co7iXE5D4sI
 ephbA9jlnHAOtaNh/C1/95pIBidMBHw5HE63XcHJGei1x1pRtFx1apI9UezGGc9H
 IwRzpKR2UorojCcJedZFiXGt54nJL9UUg7d7sybiVurlKOxIxnaB7cfg1MgeG9ke
 yUGj6ElXvEAoEmnaf7ScAQnQ5VmkyJYTE8PUlR8h8dumQ3tyBEHanOUxqkQAlZ2P
 TzVqNeCymh8XGChKKs9pHHUeySkQKYMBOZhEGwSGte0kw8JLJuEsTFud+vcONkda
 4MqkH73ebPUdsH5pBNDX7eeDFLvrbpwPNh9u3wQtAnpMGLTKH4k=
 =et0y
 -----END PGP SIGNATURE-----

Merge tag 'timers_urgent_for_v6.10_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fix from Borislav Petkov:

 - Warn when an hrtimer doesn't get a callback supplied

* tag 'timers_urgent_for_v6.10_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  hrtimer: Prevent queuing of hrtimer without a function callback
2024-06-30 08:31:08 -07:00
Jeff Johnson
6073496a20 resource: add missing MODULE_DESCRIPTION()
Fix the 'make W=1' warning:
WARNING: modpost: missing MODULE_DESCRIPTION() in kernel/resource_kunit.o

Link: https://lkml.kernel.org/r/20240529-md-kernel-resource_kunit-v1-1-bb719784b714@quicinc.com
Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-06-28 19:36:30 -07:00
James Morse
4e1a7df454 cpumask: Add enabled cpumask for present CPUs that can be brought online
The 'offline' file in sysfs shows all offline CPUs, including those
that aren't present. User-space is expected to remove not-present CPUs
from this list to learn which CPUs could be brought online.

CPUs can be present but not-enabled. These CPUs can't be brought online
until the firmware policy changes, which comes with an ACPI notification
that will register the CPUs.

With only the offline and present files, user-space is unable to
determine which CPUs it can try to bring online. Add a new CPU mask
that shows this based on all the registered CPUs.

Signed-off-by: James Morse <james.morse@arm.com>
Tested-by: Miguel Luis <miguel.luis@oracle.com>
Tested-by: Vishnu Pajjuri <vishnu@os.amperecomputing.com>
Tested-by: Jianyong Wu <jianyong.wu@arm.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://lore.kernel.org/r/20240529133446.28446-20-Jonathan.Cameron@huawei.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-06-28 18:38:33 +01:00
Chen Ridong
1be59c97c8 cgroup/cpuset: Prevent UAF in proc_cpuset_show()
An UAF can happen when /proc/cpuset is read as reported in [1].

This can be reproduced by the following methods:
1.add an mdelay(1000) before acquiring the cgroup_lock In the
 cgroup_path_ns function.
2.$cat /proc/<pid>/cpuset   repeatly.
3.$mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset/
$umount /sys/fs/cgroup/cpuset/   repeatly.

The race that cause this bug can be shown as below:

(umount)		|	(cat /proc/<pid>/cpuset)
css_release		|	proc_cpuset_show
css_release_work_fn	|	css = task_get_css(tsk, cpuset_cgrp_id);
css_free_rwork_fn	|	cgroup_path_ns(css->cgroup, ...);
cgroup_destroy_root	|	mutex_lock(&cgroup_mutex);
rebind_subsystems	|
cgroup_free_root 	|
			|	// cgrp was freed, UAF
			|	cgroup_path_ns_locked(cgrp,..);

When the cpuset is initialized, the root node top_cpuset.css.cgrp
will point to &cgrp_dfl_root.cgrp. In cgroup v1, the mount operation will
allocate cgroup_root, and top_cpuset.css.cgrp will point to the allocated
&cgroup_root.cgrp. When the umount operation is executed,
top_cpuset.css.cgrp will be rebound to &cgrp_dfl_root.cgrp.

The problem is that when rebinding to cgrp_dfl_root, there are cases
where the cgroup_root allocated by setting up the root for cgroup v1
is cached. This could lead to a Use-After-Free (UAF) if it is
subsequently freed. The descendant cgroups of cgroup v1 can only be
freed after the css is released. However, the css of the root will never
be released, yet the cgroup_root should be freed when it is unmounted.
This means that obtaining a reference to the css of the root does
not guarantee that css.cgrp->root will not be freed.

Fix this problem by using rcu_read_lock in proc_cpuset_show().
As cgroup_root is kfree_rcu after commit d23b5c5777
("cgroup: Make operations on the cgroup root_list RCU safe"),
css->cgroup won't be freed during the critical section.
To call cgroup_path_ns_locked, css_set_lock is needed, so it is safe to
replace task_get_css with task_css.

[1] https://syzkaller.appspot.com/bug?extid=9b1ff7be974a403aa4cd

Fixes: a79a908fd2 ("cgroup: introduce cgroup namespaces")
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-06-28 07:10:31 -10:00
Andrei Vagin
bfafe5efa9 seccomp: release task filters when the task exits
Previously, seccomp filters were released in release_task(), which
required the process to exit and its zombie to be collected. However,
exited threads/processes can't trigger any seccomp events, making it
more logical to release filters upon task exits.

This adjustment simplifies scenarios where a parent is tracing its child
process. The parent process can now handle all events from a seccomp
listening descriptor and then call wait to collect a child zombie.

seccomp_filter_release takes the siglock to avoid races with
seccomp_sync_threads. There was an idea to bypass taking the lock by
checking PF_EXITING, but it can be set without holding siglock if
threads have SIGNAL_GROUP_EXIT. This means it can happen concurently
with seccomp_filter_release.

This change also fixes another minor problem. Suppose that a group
leader installs the new filter without SECCOMP_FILTER_FLAG_TSYNC, exits,
and becomes a zombie. Without this change, SECCOMP_FILTER_FLAG_TSYNC
from any other thread can never succeed, seccomp_can_sync_threads() will
check a zombie leader and is_ancestor() will fail.

Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrei Vagin <avagin@google.com>
Link: https://lore.kernel.org/r/20240628021014.231976-3-avagin@google.com
Reviewed-by: Tycho Andersen <tandersen@netflix.com>
Signed-off-by: Kees Cook <kees@kernel.org>
2024-06-28 09:37:11 -07:00
Andrei Vagin
95036a79e7 seccomp: interrupt SECCOMP_IOCTL_NOTIF_RECV when all users have exited
SECCOMP_IOCTL_NOTIF_RECV promptly returns when a seccomp filter becomes
unused, as a filter without users can't trigger any events.

Previously, event listeners had to rely on epoll to detect when all
processes had exited.

The change is based on the 'commit 99cdb8b9a5 ("seccomp: notify about
unused filter")' which implemented (E)POLLHUP notifications.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Andrei Vagin <avagin@google.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20240628021014.231976-2-avagin@google.com
Reviewed-by: Tycho Andersen <tandersen@netflix.com>
Signed-off-by: Kees Cook <kees@kernel.org>
2024-06-28 09:37:11 -07:00
Frederic Weisbecker
677ab23bdf rcu/exp: Remove redundant full memory barrier at the end of GP
A full memory barrier is necessary at the end of the expedited grace
period to order:

1) The grace period completion (pictured by the GP sequence
   number) with all preceding accesses. This pairs with rcu_seq_end()
   performed by the concurrent kworker.

2) The grace period completion and subsequent post-GP update side
   accesses. Pairs again against rcu_seq_end().

This full barrier is already provided by the final sync_exp_work_done()
test, making the subsequent explicit one redundant. Remove it and
improve comments.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-06-28 06:44:12 -07:00
Frederic Weisbecker
55911a9f42 rcu: Remove full memory barrier on RCU stall printout
RCU stall printout fetches the EQS state of a CPU with a preceding full
memory barrier. However there is nothing to order this read against at
this debugging stage. It is inherently racy when performed remotely.

Do a plain read instead.

This was the last user of rcu_dynticks_snap().

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-06-28 06:44:12 -07:00
Frederic Weisbecker
e7a3c8ea6e rcu: Remove full memory barrier on boot time eqs sanity check
When the boot CPU initializes the per-CPU data on behalf of all possible
CPUs, a sanity check is performed on each of them to make sure none is
initialized in an extended quiescent state.

This check involves a full memory barrier which is useless at this early
boot stage.

Do a plain access instead.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-06-28 06:44:12 -07:00
Frederic Weisbecker
33c0860bf7 rcu/exp: Remove superfluous full memory barrier upon first EQS snapshot
When the grace period kthread checks the extended quiescent state
counter of a CPU, full ordering is necessary to ensure that either:

* If the GP kthread observes the remote target in an extended quiescent
  state, then that target must observe all accesses prior to the current
  grace period, including the current grace period sequence number, once
  it exits that extended quiescent state.

or:

* If the GP kthread observes the remote target NOT in an extended
  quiescent state, then the target further entering in an extended
  quiescent state must observe all accesses prior to the current
  grace period, including the current grace period sequence number, once
  it enters that extended quiescent state.

This ordering is enforced through a full memory barrier placed right
before taking the first EQS snapshot. However this is superfluous
because the snapshot is taken while holding the target's rnp lock which
provides the necessary ordering through its chain of
smp_mb__after_unlock_lock().

Remove the needless explicit barrier before the snapshot and put a
comment about the implicit barrier newly relied upon here.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2024-06-28 06:43:34 -07:00
Frederic Weisbecker
9a7e73c9be rcu: Remove superfluous full memory barrier upon first EQS snapshot
When the grace period kthread checks the extended quiescent state
counter of a CPU, full ordering is necessary to ensure that either:

* If the GP kthread observes the remote target in an extended quiescent
  state, then that target must observe all accesses prior to the current
  grace period, including the current grace period sequence number, once
  it exits that extended quiescent state.

or:

* If the GP kthread observes the remote target NOT in an extended
  quiescent state, then the target further entering in an extended
  quiescent state must observe all accesses prior to the current
  grace period, including the current grace period sequence number, once
  it enters that extended quiescent state.

This ordering is enforced through a full memory barrier placed right
before taking the first EQS snapshot. However this is superfluous
because the snapshot is taken while holding the target's rnp lock which
provides the necessary ordering through its chain of
smp_mb__after_unlock_lock().

Remove the needless explicit barrier before the snapshot and put a
comment about the implicit barrier newly relied upon here.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2024-06-28 06:43:34 -07:00
Frederic Weisbecker
0a5e9bd31e rcu: Remove full ordering on second EQS snapshot
When the grace period kthread checks the extended quiescent state
counter of a CPU, full ordering is necessary to ensure that either:

* If the GP kthread observes the remote target in an extended quiescent
  state, then that target must observe all accesses prior to the current
  grace period, including the current grace period sequence number, once
  it exits that extended quiescent state. Also the GP kthread must
  observe all accesses performed by the target prior it entering in
  EQS.

or:

* If the GP kthread observes the remote target NOT in an extended
  quiescent state, then the target further entering in an extended
  quiescent state must observe all accesses prior to the current
  grace period, including the current grace period sequence number, once
  it enters that extended quiescent state. Also the GP kthread later
  observing that EQS must also observe all accesses performed by the
  target prior it entering in EQS.

This ordering is explicitly performed both on the first EQS snapshot
and on the second one as well through the combination of a preceding
full barrier followed by an acquire read. However the second snapshot's
full memory barrier is redundant and not needed to enforce the above
guarantees:

    GP kthread                  Remote target
    ----                        -----
    // Access prior GP
    WRITE_ONCE(A, 1)
    // first snapshot
    smp_mb()
    x = smp_load_acquire(EQS)
                               // Access prior GP
                               WRITE_ONCE(B, 1)
                               // EQS enter
                               // implied full barrier by atomic_add_return()
                               atomic_add_return(RCU_DYNTICKS_IDX, EQS)
                               // implied full barrier by atomic_add_return()
                               READ_ONCE(A)
    // second snapshot
    y = smp_load_acquire(EQS)
    z = READ_ONCE(B)

If the GP kthread above fails to observe the remote target in EQS
(x not in EQS), the remote target will observe A == 1 after further
entering in EQS. Then the second snapshot taken by the GP kthread only
need to be an acquire read in order to observe z == 1.

Therefore remove the needless full memory barrier on second snapshot.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2024-06-28 06:43:33 -07:00
Jakub Kicinski
193b9b2002 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts.

Adjacent changes:
  e3f02f32a0 ("ionic: fix kernel panic due to multi-buffer handling")
  d9c0420999 ("ionic: Mark error paths in the data path as unlikely")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-27 12:14:11 -07:00
Linus Torvalds
adfbe3640b asm-generic fixes for 6.10
These are some bugfixes for system call ABI issues I found while
 working on a cleanup series. None of these are urgent since these
 bugs have gone unnoticed for many years, but I think we probably
 want to backport them all to stable kernels, so it makes sense
 to have the fixes included as early as possible.
 
 One more fix addresses a compile-time warning in kallsyms that was
 uncovered by a patch I did to enable additional warnings in 6.10. I had
 mistakenly thought that this fix was already merged through the module
 tree, but as Geert pointed out it was still missing.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEiK/NIGsWEZVxh/FrYKtH/8kJUicFAmZ9iRQACgkQYKtH/8kJ
 UicHIxAA0ej8dMJ3znHovc/CQYkZMpb88bxLlqLotOYuOItEzvR6wd7vnu4cPeZf
 nHguBiP9RAnzCZhL3F7AS3p8NNJ+P1OZo+sj6tZOANO955mzj1VQ5p2fbSRw+WI3
 4Oc1HKvP6UMhHGjU3wHY0+Odd5bpoepN9/fnoiQcHPzq0LbUFM8e4D9KGr51I7fV
 r7tuDMy9xykEfs6umuDu9wOXih3JkpV9eSmefmjvzgxG3hKLdsvTbWVsVmnKXhZm
 xdFiTROOmiNvttfkQh0ruBd0drBl8aVhzCKPqIe0vQqS9rBmcf9WTkcJzpihq/fI
 BA3QjVQFvmHeXs+viaLZf4r/y0qabaTPRBMQxZyEFE0QgtwfxT4/ZnNEbH2s3pIC
 Pcm0JltLlHLbZs7V63drL6txCoFVndiPXdEBTBsqBwnuDHXCj/tvDcO3tuVTfYoz
 9G8TTOsYNEDLYmn8AmzzhJOh75gp6O6A2ui3TtcD9KFNaoTQqqzPJWp8IoxBfxcb
 3+rzRWQvXAhfSRBIaejv1quo2ZxoZk3KO3i+ysRITTUF1MLz7b0/Yy/8r74CqmOu
 8Iw2Q0BaFPtj1x+VjneQnL++iYWYPEh+ZBEg7AD/z6QHwMLz33SyHlD+/RgRkthV
 J/L9xUBs5HagWJxRYkVc+l0LOVclTqVJieKD2AWONZ5OFRB+CCI=
 =ieQy
 -----END PGP SIGNATURE-----

Merge tag 'asm-generic-fixes-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic

Pull asm-generic fixes from Arnd Bergmann:
 "These are some bugfixes for system call ABI issues I found while
  working on a cleanup series. None of these are urgent since these bugs
  have gone unnoticed for many years, but I think we probably want to
  backport them all to stable kernels, so it makes sense to have the
  fixes included as early as possible.

  One more fix addresses a compile-time warning in kallsyms that was
  uncovered by a patch I did to enable additional warnings in 6.10. I
  had mistakenly thought that this fix was already merged through the
  module tree, but as Geert pointed out it was still missing"

* tag 'asm-generic-fixes-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
  kallsyms: rework symbol lookup return codes
  linux/syscalls.h: add missing __user annotations
  syscalls: mmap(): use unsigned offset type consistently
  s390: remove native mmap2() syscall
  hexagon: fix fadvise64_64 calling conventions
  csky, hexagon: fix broken sys_sync_file_range
  sh: rework sync_file_range ABI
  powerpc: restore some missing spu syscalls
  parisc: use generic sys_fanotify_mark implementation
  parisc: use correct compat recv/recvfrom syscalls
  sparc: fix compat recv/recvfrom syscalls
  sparc: fix old compat_sys_select()
  syscalls: fix compat_sys_io_pgetevents_time64 usage
  ftruncate: pass a signed offset
2024-06-27 10:53:52 -07:00
Linus Torvalds
fd19d4a492 Including fixes from can, bpf and netfilter.
Current release - regressions:
 
   - core: add softirq safety to netdev_rename_lock
 
   - tcp: fix tcp_rcv_fastopen_synack() to enter TCP_CA_Loss for failed TFO
 
   - batman-adv: fix RCU race at module unload time
 
 Current release - new code bugs:
 
 Previous releases - regressions:
 
   - openvswitch: get related ct labels from its master if it is not confirmed
 
   - eth: bonding: fix incorrect software timestamping report
 
   - eth: mlxsw: fix memory corruptions on spectrum-4 systems
 
   - eth: ionic: use dev_consume_skb_any outside of napi
 
 Previous releases - always broken:
 
   - netfilter: fully validate NFT_DATA_VALUE on store to data registers
 
   - unix: several fixes for OoB data
 
   - tcp: fix race for duplicate reqsk on identical SYN
 
   - bpf:
     - fix may_goto with negative offset.
     - fix the corner case with may_goto and jump to the 1st insn.
     - fix overrunning reservations in ringbuf
 
   - can:
     - j1939: recover socket queue on CAN bus error during BAM transmission
     - mcp251xfd: fix infinite loop when xmit fails
 
   - dsa: microchip: monitor potential faults in half-duplex mode
 
   - eth: vxlan: pull inner IP header in vxlan_xmit_one()
 
   - eth: ionic: fix kernel panic due to multi-buffer handling
 
 Misc:
 
   - selftest: unix tests refactor and a lot of new cases added
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmZ9ZlQSHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOkawoQAKLTWHswqM790uaAAgqP6jGuC4/waRS8
 MowEt5rHlwdMXcHhLrDSrLQoDJAZRsWmjniIgbsaeX+HtY4HXfF0tfDMPKiws3vx
 Z51qVj7zYjdT7IoZ7Yc8Zlwmt2kVgO4ba6gSigQSORQO9Qq/WNSb0q8BM6cDaYXT
 cXC7ikPeMlLnxKxsFRpZ3CUD06dI/aJFp/pefPEm7/X/EbROlSs5y+2GshPdp5t7
 tzOUsLHs6ORVq/6jg2nRHH+0D+LMuQG0Z0yCMmYerJMJNtRIxyW6tTYeAsWXeyn3
 UN3gaoQ/SIURDrNRZvHsaVDNO/u4rbYtFLoK7S5uPffPWqsGJY59FcH+xYFukFCD
 P5Lca4kKBr8xOahsRfSiO0uFbwQfQAauzNiz9Ue39n1hj+ZhZ/CliBLhUeoBl6Y6
 jSsxq+/8CZCQ7beek96cyLx83skAcWAU5BEC9xOVlOTuTL91Gxr9UzSx/FqLI34h
 Smgw9ZUPzJgvFLgB/OBQ/WYne9LfJ5RYQHZoAXObiozO3TX7NgBUfa0e1T9dLE3F
 TalysSO3/goiZNK5a/UNJcj3fAcSEs4M2z9UIK790i3P3GuRigs1sJEtTUqyowWk
 aaTFmWCXE0wdoshJjux3syh3Vk6phJWpOlMLYjy0v5s0BF/ZOfDaKQT/dGsvV1HE
 AFGpKpybizNV
 =BYgZ
 -----END PGP SIGNATURE-----

Merge tag 'net-6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Including fixes from can, bpf and netfilter.

  There are a bunch of regressions addressed here, but hopefully nothing
  spectacular. We are still waiting the driver fix from Intel, mentioned
  by Jakub in the previous networking pull.

  Current release - regressions:

   - core: add softirq safety to netdev_rename_lock

   - tcp: fix tcp_rcv_fastopen_synack() to enter TCP_CA_Loss for failed
     TFO

   - batman-adv: fix RCU race at module unload time

  Previous releases - regressions:

   - openvswitch: get related ct labels from its master if it is not
     confirmed

   - eth: bonding: fix incorrect software timestamping report

   - eth: mlxsw: fix memory corruptions on spectrum-4 systems

   - eth: ionic: use dev_consume_skb_any outside of napi

  Previous releases - always broken:

   - netfilter: fully validate NFT_DATA_VALUE on store to data registers

   - unix: several fixes for OoB data

   - tcp: fix race for duplicate reqsk on identical SYN

   - bpf:
       - fix may_goto with negative offset
       - fix the corner case with may_goto and jump to the 1st insn
       - fix overrunning reservations in ringbuf

   - can:
       - j1939: recover socket queue on CAN bus error during BAM
         transmission
       - mcp251xfd: fix infinite loop when xmit fails

   - dsa: microchip: monitor potential faults in half-duplex mode

   - eth: vxlan: pull inner IP header in vxlan_xmit_one()

   - eth: ionic: fix kernel panic due to multi-buffer handling

  Misc:

   - selftest: unix tests refactor and a lot of new cases added"

* tag 'net-6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (61 commits)
  net: mana: Fix possible double free in error handling path
  selftest: af_unix: Check SIOCATMARK after every send()/recv() in msg_oob.c.
  af_unix: Fix wrong ioctl(SIOCATMARK) when consumed OOB skb is at the head.
  selftest: af_unix: Check EPOLLPRI after every send()/recv() in msg_oob.c
  selftest: af_unix: Check SIGURG after every send() in msg_oob.c
  selftest: af_unix: Add SO_OOBINLINE test cases in msg_oob.c
  af_unix: Don't stop recv() at consumed ex-OOB skb.
  selftest: af_unix: Add non-TCP-compliant test cases in msg_oob.c.
  af_unix: Don't stop recv(MSG_DONTWAIT) if consumed OOB skb is at the head.
  af_unix: Stop recv(MSG_PEEK) at consumed OOB skb.
  selftest: af_unix: Add msg_oob.c.
  selftest: af_unix: Remove test_unix_oob.c.
  tracing/net_sched: NULL pointer dereference in perf_trace_qdisc_reset()
  netfilter: nf_tables: fully validate NFT_DATA_VALUE on store to data registers
  net: usb: qmi_wwan: add Telit FN912 compositions
  tcp: fix tcp_rcv_fastopen_synack() to enter TCP_CA_Loss for failed TFO
  ionic: use dev_consume_skb_any outside of napi
  net: dsa: microchip: fix wrong register write when masking interrupt
  Fix race for duplicate reqsk on identical SYN
  ibmvnic: Add tx check to prevent skb leak
  ...
2024-06-27 10:05:35 -07:00
Arnd Bergmann
7e1f4eb9a6 kallsyms: rework symbol lookup return codes
Building with W=1 in some configurations produces a false positive
warning for kallsyms:

kernel/kallsyms.c: In function '__sprint_symbol.isra':
kernel/kallsyms.c:503:17: error: 'strcpy' source argument is the same as destination [-Werror=restrict]
  503 |                 strcpy(buffer, name);
      |                 ^~~~~~~~~~~~~~~~~~~~

This originally showed up while building with -O3, but later started
happening in other configurations as well, depending on inlining
decisions. The underlying issue is that the local 'name' variable is
always initialized to the be the same as 'buffer' in the called functions
that fill the buffer, which gcc notices while inlining, though it could
see that the address check always skips the copy.

The calling conventions here are rather unusual, as all of the internal
lookup functions (bpf_address_lookup, ftrace_mod_address_lookup,
ftrace_func_address_lookup, module_address_lookup and
kallsyms_lookup_buildid) already use the provided buffer and either return
the address of that buffer to indicate success, or NULL for failure,
but the callers are written to also expect an arbitrary other buffer
to be returned.

Rework the calling conventions to return the length of the filled buffer
instead of its address, which is simpler and easier to follow as well
as avoiding the warning. Leave only the kallsyms_lookup() calling conventions
unchanged, since that is called from 16 different functions and
adapting this would be a much bigger change.

Link: https://lore.kernel.org/lkml/20200107214042.855757-1-arnd@arndb.de/
Link: https://lore.kernel.org/lkml/20240326130647.7bfb1d92@gandalf.local.home/
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2024-06-27 17:43:40 +02:00