This reflects the functionality better. No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124230314.103554618@linutronix.de
In preparation of the upcoming per device multi MSI domain support, change
the interface to support lookups based on domain id and zero based index
within the domain.
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124230314.044613697@linutronix.de
To support multiple MSI interrupt domains per device it is necessary to
segment the xarray MSI descriptor storage. Each domain gets up to
MSI_MAX_INDEX entries.
Change the iterators so they operate with domain ids and take the domain
offsets into account.
The publicly available iterators which are mostly used in legacy
implementations and the PCI/MSI core default to MSI_DEFAULT_DOMAIN (0)
which is the id for the existing "global" domains.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124230313.985498981@linutronix.de
With the upcoming per device MSI interrupt domain support it is necessary
to store the domain pointers per device.
Instead of delegating that storage to device drivers or subsystems add a
domain pointer to the msi_dev_domain array in struct msi_device_data.
This pointer is also used to take care of tearing down the irq domains when
msi_device_data is cleaned up via devres.
The interfaces into the MSI core will be changed from irqdomain pointer
based interfaces to domain id based interfaces to support multiple MSI
domains on a single device (e.g. PCI/MSI[-X] and PCI/IMS.
Once the per device domain support is complete the irq domain pointer in
struct device::msi.domain will not longer contain a pointer to the "global"
MSI domain. It will contain a pointer to the MSI parent domain instead.
It would be a horrible maze of conditionals to evaluate all over the place
which domain pointer should be used, i.e. the "global" one in
device::msi::domain or one from the internal pointer array.
To avoid this evaluate in msi_setup_device_data() whether the irq domain
which is associated to a device is a "global" or a parent MSI domain. If it
is global then copy the pointer into the first entry of the msi_dev_domain
array.
This allows to convert interfaces and implementation to domain ids while
keeping everything existing working.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124230313.923860399@linutronix.de
The upcoming support for multiple MSI domains per device requires storage
for the MSI descriptors and in a second step storage for the irqdomain
pointers.
Move the xarray into a separate data structure msi_dev_domain and create an
array with size 1 in msi_device_data, which can be expanded later when the
support for per device domains is implemented.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124230313.864887773@linutronix.de
In the upcoming per device MSI domain concept the MSI parent domains are
not allowed to be used as regular MSI domains where the MSI allocation/free
operations are applicable.
Add appropriate checks.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124230313.806128070@linutronix.de
irq_domain::dev is a misnomer as it's usually the rule that a device
pointer points to something which is directly related to the instance.
irq_domain::dev can point to some other device for power management to
ensure that this underlying device is not powered down when an interrupt is
allocated.
The upcoming per device MSI domains really require a pointer to the device
which instantiated the irq domain and not to some random other device which
is required for power management down the chain.
Rename irq_domain::dev to irq_domain::pm_dev and fixup the few sites which
use that pointer.
Conversion was done with the help of coccinelle.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124230313.574541683@linutronix.de
It's truly a MSI only flag and for the upcoming per device MSI domains this
must be in the MSI flags so it can be set during domain setup without
exposing this quirk outside of x86.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124230313.454246167@linutronix.de
Fault injection tests trigger warnings like this:
kernfs: can not remove 'chip_name', no directory
WARNING: CPU: 0 PID: 253 at fs/kernfs/dir.c:1616 kernfs_remove_by_name_ns+0xce/0xe0
RIP: 0010:kernfs_remove_by_name_ns+0xce/0xe0
Call Trace:
<TASK>
remove_files.isra.1+0x3f/0xb0
sysfs_remove_group+0x68/0xe0
sysfs_remove_groups+0x41/0x70
__kobject_del+0x45/0xc0
kobject_del+0x29/0x40
free_desc+0x42/0x70
irq_free_descs+0x5e/0x90
The reason is that the interrupt descriptor sysfs handling does not roll
back on a failing kobject_add() during allocation. If the descriptor is
freed later on, kobject_del() is invoked with a not added kobject resulting
in the above warnings.
A proper rollback in case of a kobject_add() failure would be the straight
forward solution. But this is not possible due to the way how interrupt
descriptor sysfs handling works.
Interrupt descriptors are allocated before sysfs becomes available. So the
sysfs files for the early allocated descriptors are added later in the boot
process. At this point there can be nothing useful done about a failing
kobject_add(). For consistency the interrupt descriptor allocation always
treats kobject_add() failures as non-critical and just emits a warning.
To solve this problem, keep track in the interrupt descriptor whether
kobject_add() was successful or not and make the invocation of
kobject_del() conditional on that.
[ tglx: Massage changelog, comments and use a state bit. ]
Fixes: ecb3f394c5 ("genirq: Expose interrupt information through sysfs")
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20221128151612.1786122-1-yangyingliang@huawei.com
Adjust to reality and remove another layer of pointless Kconfig
indirection. CONFIG_GENERIC_MSI_IRQ is good enough to serve
all purposes.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20221111122014.524842979@linutronix.de
Add a bus token member to struct msi_domain_info and let
msi_create_irq_domain() set the bus token.
That allows to remove the bus token updates at the call sites.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20221111122014.294554462@linutronix.de
Now that the last user is gone, confine it to the core code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20221111122014.179595843@linutronix.de
To prepare for removing the exposure of __msi_domain_free_irqs() provide a
post_free() callback in the MSI domain ops which can be used to solve
the problem of the only user of __msi_domain_free_irqs() in arch/powerpc.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20221111122014.063153448@linutronix.de
When a range of descriptors is freed then all of them are not associated to
a linux interrupt. Remove the filter and add a warning to the free function.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20221111122013.888850936@linutronix.de
There are no associated MSI descriptors in the requested range when the MSI
descriptor allocation fails. Use MSI_DESC_ALL as the filter which prepares
the next step to get rid of the filter for freeing.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20221111122013.831151822@linutronix.de
Current release - new code bugs:
- can: af_can: can_exit(): add missing dev_remove_pack() of canxl_packet
Previous releases - regressions:
- bpf, sockmap: fix the sk->sk_forward_alloc warning
- wifi: mac80211: fix general-protection-fault in
ieee80211_subif_start_xmit()
- can: af_can: fix NULL pointer dereference in can_rx_register()
- can: dev: fix skb drop check, avoid o-o-b access
- nfnetlink: fix potential dead lock in nfnetlink_rcv_msg()
Previous releases - always broken:
- bpf: fix wrong reg type conversion in release_reference()
- gso: fix panic on frag_list with mixed head alloc types
- wifi: brcmfmac: fix buffer overflow in brcmf_fweh_event_worker()
- wifi: mac80211: set TWT Information Frame Disabled bit as 1
- eth: macsec offload related fixes, make sure to clear the keys
from memory
- tun: fix memory leaks in the use of napi_get_frags
- tun: call napi_schedule_prep() to ensure we own a napi
- tcp: prohibit TCP_REPAIR_OPTIONS if data was already sent
- ipv6: addrlabel: fix infoleak when sending struct ifaddrlblmsg
to network
- tipc: fix a msg->req tlv length check
- sctp: clear out_curr if all frag chunks of current msg are pruned,
avoid list corruption
- mctp: fix an error handling path in mctp_init(), avoid leaks
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmNtnlEACgkQMUZtbf5S
IrvSfg//axNePPwFiAdbYUmSNmnnv2Zpyz1l9a2/WvKKMeyAH3d4zuQGyTz7VgoJ
at4k1fr14vm+3qBhlL0UFdd+h/wBewwuuWLiogIfhgqDO7KavZsbTJWQ59DSHH08
ujihvt7dF9ByVd3hOpUDjrYGd2rPghqXk8l/2gpPp/KIrbj1jSW0DdF7Y48/0RRw
PYzNYZ9tqICw1crBT52ZilNEebGaUuWpPLzV2owlhJpzqyRLcgd9GWN9DkKieiiw
wF0Wi7A8b/+cR/Wo93RAXtvEayN9vp/t6iyiI1opv3Yg6bhAMlzDUX/v79ccnAM6
wJ3b8bKyLgph5ZTNmbL8GwC2pwl/20hOgCVLb/Haykqrk4oO2+xD39fjKniFP/71
IBYuLCethi0zmiSyR8yO4iyrfJCnkJffoxtcG8O5x+FuCfMI1xQWx44bSc34KlqT
vDw/VmnIfXH9K3F+QdWtlZfLiM0F6vd7RNGIxX0cC2wQCwaubCo0LOs5vl2+jpR8
Xclo+OquQtX5XRqGGQDtA7kCM9jfuc/DWla1v10wy7ZagiKkdfrV7Zu7r431Dtwn
BWeKZAA38o9WNRb4FD5GGUN0dK5R5V25LmbpvYuerq5Ub3pGJgHMsdA15LqsqTnW
MGIokGFhu7ToAQEnaRkF96jh3c3yoMU/sWXsqh7x/G6Tir7JGUw=
=WPta
-----END PGP SIGNATURE-----
Merge tag 'net-6.1-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from netfilter, wifi, can and bpf.
Current release - new code bugs:
- can: af_can: can_exit(): add missing dev_remove_pack() of
canxl_packet
Previous releases - regressions:
- bpf, sockmap: fix the sk->sk_forward_alloc warning
- wifi: mac80211: fix general-protection-fault in
ieee80211_subif_start_xmit()
- can: af_can: fix NULL pointer dereference in can_rx_register()
- can: dev: fix skb drop check, avoid o-o-b access
- nfnetlink: fix potential dead lock in nfnetlink_rcv_msg()
Previous releases - always broken:
- bpf: fix wrong reg type conversion in release_reference()
- gso: fix panic on frag_list with mixed head alloc types
- wifi: brcmfmac: fix buffer overflow in brcmf_fweh_event_worker()
- wifi: mac80211: set TWT Information Frame Disabled bit as 1
- eth: macsec offload related fixes, make sure to clear the keys from
memory
- tun: fix memory leaks in the use of napi_get_frags
- tun: call napi_schedule_prep() to ensure we own a napi
- tcp: prohibit TCP_REPAIR_OPTIONS if data was already sent
- ipv6: addrlabel: fix infoleak when sending struct ifaddrlblmsg to
network
- tipc: fix a msg->req tlv length check
- sctp: clear out_curr if all frag chunks of current msg are pruned,
avoid list corruption
- mctp: fix an error handling path in mctp_init(), avoid leaks"
* tag 'net-6.1-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (101 commits)
eth: sp7021: drop free_netdev() from spl2sw_init_netdev()
MAINTAINERS: Move Vivien to CREDITS
net: macvlan: fix memory leaks of macvlan_common_newlink
ethernet: tundra: free irq when alloc ring failed in tsi108_open()
net: mv643xx_eth: disable napi when init rxq or txq failed in mv643xx_eth_open()
ethernet: s2io: disable napi when start nic failed in s2io_card_up()
net: atlantic: macsec: clear encryption keys from the stack
net: phy: mscc: macsec: clear encryption keys when freeing a flow
stmmac: dwmac-loongson: fix missing of_node_put() while module exiting
stmmac: dwmac-loongson: fix missing pci_disable_device() in loongson_dwmac_probe()
stmmac: dwmac-loongson: fix missing pci_disable_msi() while module exiting
cxgb4vf: shut down the adapter when t4vf_update_port_info() failed in cxgb4vf_open()
mctp: Fix an error handling path in mctp_init()
stmmac: intel: Update PCH PTP clock rate from 200MHz to 204.8MHz
net: cxgb3_main: disable napi when bind qsets failed in cxgb_up()
net: cpsw: disable napi in cpsw_ndo_open()
iavf: Fix VF driver counting VLAN 0 filters
ice: Fix spurious interrupt during removal of trusted VF
net/mlx5e: TC, Fix slab-out-of-bounds in parse_tc_actions
net/mlx5e: E-Switch, Fix comparing termination table instance
...
fixed microcode revisions checking quirk
- Update Icelake and Sapphire Rapids events constraints
- Use the standard energy unit for Sapphire Rapids in RAPL
- Fix the hw_breakpoint test to fail more graciously on !SMP configs
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmNnr/4ACgkQEsHwGGHe
VUrFRg//dyB0lnQcdvIaPd7DWn3WGop+MeZv0NZI7uYk+SqjtJ3yJ/c4ktcaIgJV
MhTk8Q/gxHvuT+MZarC/f1QYtTqzRQ//rKD2aO/l9Gr813Hu4R0z2AEwrNKDmzyd
BYy3O5GXGeBAiLxtmKZ2bDlS5z8a9L3dlbLCWqjq6iGIVncljWmEDmNQmA3YPury
v8f+V8EqfSE4iWcpnNsZOdrmkMkXEzA8X5vRswQ9l2y6qMmnEeUk9Hn9mFlG+QK4
VDyxkQEB+vZVfWL2UjD3dpEaH5LVyfCQBwOaVdFfHhMmLhoTO2VmRMLza3Qd9ejZ
RIE1hlRibqGMqyHDTjZvnkPgnz4QQqayDf8UIIwVdaMVdIaZmxcIQwfsbQS12E5b
9EBzbaD6TJx42E56WuQHM+ZYt6nz0ktPz0IeBFJIwbU30gqJwdi0uIz2kXNpkthC
eX4Bq/iM9C41A58mj9+uerF9jshi/DJU74KcMGUZiJ7IeGDJgL9CfViOTueMOjr2
OI8nvLOtwBpj8X3AO1nEVkevSt4KPoTD+NVCNpXmjVm9DNFvMRo2EUsRHHrCkLJN
EO7iF14rTlSI7IAE+qxNgRsmXPCyuVBhB3S3/3YmCqsH1kQXqlgxT/2eOJN6kCGz
tlaWnD3TEaifH/DQQVGmv9nNFjS0C49MSxrZ7Oe7phnmSn3vaGY=
=midC
-----END PGP SIGNATURE-----
Merge tag 'perf_urgent_for_v6.1_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Borislav Petkov:
- Add Cooper Lake's stepping to the PEBS guest/host events isolation
fixed microcode revisions checking quirk
- Update Icelake and Sapphire Rapids events constraints
- Use the standard energy unit for Sapphire Rapids in RAPL
- Fix the hw_breakpoint test to fail more graciously on !SMP configs
* tag 'perf_urgent_for_v6.1_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Add Cooper Lake stepping to isolation_ucodes[]
perf/x86/intel: Fix pebs event constraints for SPR
perf/x86/intel: Fix pebs event constraints for ICL
perf/x86/rapl: Use standard Energy Unit for SPR Dram RAPL domain
perf/hw_breakpoint: test: Skip the test if dependencies unmet
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCY2RS7QAKCRDbK58LschI
g6RVAQC1FdSXMrhn369NGCG1Vox1QYn2/5P32LSIV1BKqiQsywEAsxgYNrdCPTua
ie91Q5IJGT9pFl1UR50UrgL11DI5BgI=
=sdhO
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
bpf 2022-11-04
We've added 8 non-merge commits during the last 3 day(s) which contain
a total of 10 files changed, 113 insertions(+), 16 deletions(-).
The main changes are:
1) Fix memory leak upon allocation failure in BPF verifier's stack state
tracking, from Kees Cook.
2) Fix address leakage when BPF progs release reference to an object,
from Youlin Li.
3) Fix BPF CI breakage from buggy in.h uapi header dependency,
from Andrii Nakryiko.
4) Fix bpftool pin sub-command's argument parsing, from Pu Lehui.
5) Fix BPF sockmap lockdep warning by cancelling psock work outside
of socket lock, from Cong Wang.
6) Follow-up for BPF sockmap to fix sk_forward_alloc accounting,
from Wang Yufen.
bpf-for-netdev
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Add verifier test for release_reference()
bpf: Fix wrong reg type conversion in release_reference()
bpf, sock_map: Move cancel_work_sync() out of sock lock
tools/headers: Pull in stddef.h to uapi to fix BPF selftests build in CI
net/ipv4: Fix linux/in.h header dependencies
bpftool: Fix NULL pointer dereference when pin {PROG, MAP, LINK} without FILE
bpf, sockmap: Fix the sk->sk_forward_alloc warning of sk_stream_kill_queues
bpf, verifier: Fix memory leak in array reallocation for stack state
====================
Link: https://lore.kernel.org/r/20221104000445.30761-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since commit ab51e15d53 ("fprobe: Introduce FPROBE_FL_KPROBE_SHARED flag
for fprobe") introduced fprobe_kprobe_handler() for fprobe::ops::func,
unregister_fprobe() fails to unregister the registered if user specifies
FPROBE_FL_KPROBE_SHARED flag.
Moreover, __register_ftrace_function() is possible to change the
ftrace_ops::func, thus we have to check fprobe::ops::saved_func instead.
To check it correctly, it should confirm the fprobe::ops::saved_func is
either fprobe_handler() or fprobe_kprobe_handler().
Link: https://lore.kernel.org/all/166677683946.1459107.15997653945538644683.stgit@devnote3/
Fixes: cad9931f64 ("fprobe: Add ftrace based probe APIs")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Check if fp->rethook succeeded to be allocated. Otherwise, if
rethook_alloc() fails, then we end up dereferencing a NULL pointer in
rethook_add_node().
Link: https://lore.kernel.org/all/20221025031209.954836-1-rafaelmendsr@gmail.com/
Fixes: 5b0ab78998 ("fprobe: Add exit_handler support")
Cc: stable@vger.kernel.org
Signed-off-by: Rafael Mendonca <rafaelmendsr@gmail.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
In aggregate kprobe case, when arm_kprobe failed,
we need set the kp->flags with KPROBE_FLAG_DISABLED again.
If not, the 'kp' kprobe will been considered as enabled
but it actually not enabled.
Link: https://lore.kernel.org/all/20220902155820.34755-1-liq3ea@163.com/
Fixes: 12310e3437 ("kprobes: Propagate error from arm_kprobe_ftrace()")
Cc: stable@vger.kernel.org
Signed-off-by: Li Qiang <liq3ea@163.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Some helper functions will allocate memory. To avoid memory leaks, the
verifier requires the eBPF program to release these memories by calling
the corresponding helper functions.
When a resource is released, all pointer registers corresponding to the
resource should be invalidated. The verifier use release_references() to
do this job, by apply __mark_reg_unknown() to each relevant register.
It will give these registers the type of SCALAR_VALUE. A register that
will contain a pointer value at runtime, but of type SCALAR_VALUE, which
may allow the unprivileged user to get a kernel pointer by storing this
register into a map.
Using __mark_reg_not_init() while NOT allow_ptr_leaks can mitigate this
problem.
Fixes: fd978bf7fd ("bpf: Add reference tracking to verifier")
Signed-off-by: Youlin Li <liulin063@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20221103093440.3161-1-liulin063@gmail.com
KASAN reported a use-after-free with ftrace ops [1]. It was found from
vmcore that perf had registered two ops with the same content
successively, both dynamic. After unregistering the second ops, a
use-after-free occurred.
In ftrace_shutdown(), when the second ops is unregistered, the
FTRACE_UPDATE_CALLS command is not set because there is another enabled
ops with the same content. Also, both ops are dynamic and the ftrace
callback function is ftrace_ops_list_func, so the
FTRACE_UPDATE_TRACE_FUNC command will not be set. Eventually the value
of 'command' will be 0 and ftrace_shutdown() will skip the rcu
synchronization.
However, ftrace may be activated. When the ops is released, another CPU
may be accessing the ops. Add the missing synchronization to fix this
problem.
[1]
BUG: KASAN: use-after-free in __ftrace_ops_list_func kernel/trace/ftrace.c:7020 [inline]
BUG: KASAN: use-after-free in ftrace_ops_list_func+0x2b0/0x31c kernel/trace/ftrace.c:7049
Read of size 8 at addr ffff56551965bbc8 by task syz-executor.2/14468
CPU: 1 PID: 14468 Comm: syz-executor.2 Not tainted 5.10.0 #7
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace+0x0/0x40c arch/arm64/kernel/stacktrace.c:132
show_stack+0x30/0x40 arch/arm64/kernel/stacktrace.c:196
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1b4/0x248 lib/dump_stack.c:118
print_address_description.constprop.0+0x28/0x48c mm/kasan/report.c:387
__kasan_report mm/kasan/report.c:547 [inline]
kasan_report+0x118/0x210 mm/kasan/report.c:564
check_memory_region_inline mm/kasan/generic.c:187 [inline]
__asan_load8+0x98/0xc0 mm/kasan/generic.c:253
__ftrace_ops_list_func kernel/trace/ftrace.c:7020 [inline]
ftrace_ops_list_func+0x2b0/0x31c kernel/trace/ftrace.c:7049
ftrace_graph_call+0x0/0x4
__might_sleep+0x8/0x100 include/linux/perf_event.h:1170
__might_fault mm/memory.c:5183 [inline]
__might_fault+0x58/0x70 mm/memory.c:5171
do_strncpy_from_user lib/strncpy_from_user.c:41 [inline]
strncpy_from_user+0x1f4/0x4b0 lib/strncpy_from_user.c:139
getname_flags+0xb0/0x31c fs/namei.c:149
getname+0x2c/0x40 fs/namei.c:209
[...]
Allocated by task 14445:
kasan_save_stack+0x24/0x50 mm/kasan/common.c:48
kasan_set_track mm/kasan/common.c:56 [inline]
__kasan_kmalloc mm/kasan/common.c:479 [inline]
__kasan_kmalloc.constprop.0+0x110/0x13c mm/kasan/common.c:449
kasan_kmalloc+0xc/0x14 mm/kasan/common.c:493
kmem_cache_alloc_trace+0x440/0x924 mm/slub.c:2950
kmalloc include/linux/slab.h:563 [inline]
kzalloc include/linux/slab.h:675 [inline]
perf_event_alloc.part.0+0xb4/0x1350 kernel/events/core.c:11230
perf_event_alloc kernel/events/core.c:11733 [inline]
__do_sys_perf_event_open kernel/events/core.c:11831 [inline]
__se_sys_perf_event_open+0x550/0x15f4 kernel/events/core.c:11723
__arm64_sys_perf_event_open+0x6c/0x80 kernel/events/core.c:11723
[...]
Freed by task 14445:
kasan_save_stack+0x24/0x50 mm/kasan/common.c:48
kasan_set_track+0x24/0x34 mm/kasan/common.c:56
kasan_set_free_info+0x20/0x40 mm/kasan/generic.c:358
__kasan_slab_free.part.0+0x11c/0x1b0 mm/kasan/common.c:437
__kasan_slab_free mm/kasan/common.c:445 [inline]
kasan_slab_free+0x2c/0x40 mm/kasan/common.c:446
slab_free_hook mm/slub.c:1569 [inline]
slab_free_freelist_hook mm/slub.c:1608 [inline]
slab_free mm/slub.c:3179 [inline]
kfree+0x12c/0xc10 mm/slub.c:4176
perf_event_alloc.part.0+0xa0c/0x1350 kernel/events/core.c:11434
perf_event_alloc kernel/events/core.c:11733 [inline]
__do_sys_perf_event_open kernel/events/core.c:11831 [inline]
__se_sys_perf_event_open+0x550/0x15f4 kernel/events/core.c:11723
[...]
Link: https://lore.kernel.org/linux-trace-kernel/20221103031010.166498-1-lihuafei1@huawei.com
Fixes: edb096e007 ("ftrace: Fix memleak when unregistering dynamic ops when tracing disabled")
Cc: stable@vger.kernel.org
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Li Huafei <lihuafei1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
On some machines the number of listed CPUs may be bigger than the actual
CPUs that exist. The tracing subsystem allocates a per_cpu directory with
access to the per CPU ring buffer via a cpuX file. But to save space, the
ring buffer will only allocate buffers for online CPUs, even though the
CPU array will be as big as the nr_cpu_ids.
With the addition of waking waiters on the ring buffer when closing the
file, the ring_buffer_wake_waiters() now needs to make sure that the
buffer is allocated (with the irq_work allocated with it) before trying to
wake waiters, as it will cause a NULL pointer dereference.
While debugging this, I added a NULL check for the buffer itself (which is
OK to do), and also NULL pointer checks against buffer->buffers (which is
not fine, and will WARN) as well as making sure the CPU number passed in
is within the nr_cpu_ids (which is also not fine if it isn't).
Link: https://lore.kernel.org/all/87h6zklb6n.wl-tiwai@suse.de/
Link: https://lore.kernel.org/all/CAM6Wdxc0KRJMXVAA0Y=u6Jh2V=uWB-_Fn6M4xRuNppfXzL1mUg@mail.gmail.com/
Link: https://lkml.kernel.org/linux-trace-kernel/20221101191009.1e7378c8@rorschach.local.home
Cc: stable@vger.kernel.org
Cc: Steven Noonan <steven.noonan@gmail.com>
Bugzilla: https://bugzilla.opensuse.org/show_bug.cgi?id=1204705
Reported-by: Takashi Iwai <tiwai@suse.de>
Reported-by: Roland Ruckerbauer <roland.rucky@gmail.com>
Fixes: f3ddb74ad0 ("tracing: Wake up ring buffer waiters on closing of the file")
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Running the test currently fails on non-SMP systems, despite being
enabled by default. This means that running the test with:
./tools/testing/kunit/kunit.py run --arch x86_64 hw_breakpoint
results in every hw_breakpoint test failing with:
# test_one_cpu: failed to initialize: -22
not ok 1 - test_one_cpu
Instead, use kunit_skip(), which will mark the test as skipped, and give
a more comprehensible message:
ok 1 - test_one_cpu # SKIP not enough cpus
This makes it more obvious that the test is not suited to the test
environment, and so wasn't run, rather than having run and failed.
Signed-off-by: David Gow <davidgow@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Latypov <dlatypov@google.com>
Acked-by: Marco Elver <elver@google.com>
Link: https://lore.kernel.org/r/20221026141040.1609203-1-davidgow@google.com
If an error (NULL) is returned by krealloc(), callers of realloc_array()
were setting their allocation pointers to NULL, but on error krealloc()
does not touch the original allocation. This would result in a memory
resource leak. Instead, free the old allocation on the error handling
path.
The memory leak information is as follows as also reported by Zhengchao:
unreferenced object 0xffff888019801800 (size 256):
comm "bpf_repo", pid 6490, jiffies 4294959200 (age 17.170s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<00000000b211474b>] __kmalloc_node_track_caller+0x45/0xc0
[<0000000086712a0b>] krealloc+0x83/0xd0
[<00000000139aab02>] realloc_array+0x82/0xe2
[<00000000b1ca41d1>] grow_stack_state+0xfb/0x186
[<00000000cd6f36d2>] check_mem_access.cold+0x141/0x1341
[<0000000081780455>] do_check_common+0x5358/0xb350
[<0000000015f6b091>] bpf_check.cold+0xc3/0x29d
[<000000002973c690>] bpf_prog_load+0x13db/0x2240
[<00000000028d1644>] __sys_bpf+0x1605/0x4ce0
[<00000000053f29bd>] __x64_sys_bpf+0x75/0xb0
[<0000000056fedaf5>] do_syscall_64+0x35/0x80
[<000000002bd58261>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
Fixes: c69431aab6 ("bpf: verifier: Improve function state reallocation")
Reported-by: Zhengchao Shao <shaozhengchao@huawei.com>
Reported-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Bill Wendling <morbo@google.com>
Cc: Lorenz Bauer <oss@lmb.io>
Link: https://lore.kernel.org/bpf/20221029025433.2533810-1-keescook@chromium.org
- Add Alder and Raptor Lakes support to RAPL
- Make sure raw sample data is output with tracepoints
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmNeRnEACgkQEsHwGGHe
VUpmTRAAmLvQhTN15L4qr6BSIUhlOk1xmM4pKtUXfpzX9Nki+bhPvH8sczaUXg1N
90u6pD8+uOIFsd2s+bUVyR/h3cWnjpy9Or1oSYlNTTPxwlqC1XsLqsWjy7/AA91d
YAUZNfmIsBNTUDtjygslnZ2yZIIPWXGI5utvrkS3W2cbfZtQhuDVTo5KAnx3+0fC
inKfiO+lAEouNu9l/+GdqPhgiDVB+oK12ROMosAr9++Ewuf61Jnk0nVEynNVoGT0
OLxbNT6xU3TlOm/n2zwmWnM95ZJ9sM5SEJg+c55VZ9biTAgayd+7Hw8H3CAqIhdD
utFoxkQpblp7Lq6IporcfjpGISA4WdbaiJaMN56azucGcZsk6VXUzNk6AimXvqjP
d8z7nVYDGDxYoIWyoSfO7XuIhqek38KRTEbl3qvyRZoF/FRjaWCvZeir9W32mRbx
bVKPTQ8FgSUtkBLhGZrldHP8PRsw1nf60wJb19p8s5aWNMzimgUN0As0kf78k6l+
fapTvhuU84EDVjiUS7BTrMq1r3ieaZiN2Ofi4EAG8c4R3S4C3hHKFQH3suIOp3vf
UpCyYi+29LfdTgiuNX+efklUSu5T2EccJXke07CJQM5BBppvuPqeG7YJET5/YDuz
tSPIbTZ9lxomeNWJSu9cyuynqPIS0f/j6FtpodA7MY1f/AX7OHM=
=zS3P
-----END PGP SIGNATURE-----
Merge tag 'perf_urgent_for_v6.1_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Borislav Petkov:
- Rename a perf memory level event define to denote it is of CXL type
- Add Alder and Raptor Lakes support to RAPL
- Make sure raw sample data is output with tracepoints
* tag 'perf_urgent_for_v6.1_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/mem: Rename PERF_MEM_LVLNUM_EXTN_MEM to PERF_MEM_LVLNUM_CXL
perf/x86/rapl: Add support for Intel Raptor Lake
perf/x86/rapl: Add support for Intel AlderLake-N
perf: Fix missing raw data on tracepoint events
- Make intel_pstate use what is known about the hardware instead of
relying on information from the platform firmware (ACPI CPPC in
particular) to establish the relationship between the HWP CPU
performance levels and frequencies on all hybrid platforms
available to date (Rafael Wysocki).
- Allow hybrid sleep to use suspend-to-idle as a system suspend method
if it is the current suspend method of choice (Mario Limonciello).
- Fix handling of unavailable/disabled idle states in the generic
power domains code (Sudeep Holla).
- Update the pm-graph suite of utilities to version 5.10 which is
fixes-mostly and does not add any new features (Todd Brandt).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmNb7MoSHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxfeMQAIHJ+nLjWGPfoW1WxVYuDECx121+zdXa
zI0O2q6tREdT3YsvPjLG3UePXaTR5cI02JM+5CTgFylfHTRN3xe+L8LJwm2gDadd
+0DC5Q+YLJFVzqo7jsgKPOb6KhjWI77qrFNhvQxac2N/xMEE5z2p7iosP3qewInd
i43UzbfypGkIBin+HD5nM38lHuhp14geKEHocV6ftUzoPMcnuzQVkPIkbpfj7YOo
XGsGz4Seei9raWgquBPUnaM/sOWEwSOb86HBsFTwdiepQJPsfmPO60yBbno7C5hX
kx7b9vG+lh5rosTWructhkSPrLent0QiWd6J3B6glMt4rzlQf8h39hZrOgot7Ald
txNjXhrBTa9tpJanB3lKelgQwj2+6mKhkcFo8uA44jlX0nhOaFTnNKDGj92si8xS
Emj/M7jQozomE/4zXLpeb+Ovpk54svrCsKykE2aeo5sWsL7IZduzAk0ZvgItQ2a0
oIuqxUbnx/JTYqpxzAyZAJtVDfcum12uXmXNk0IcXtI4ewW9mw3YRQpbB9uir5+y
cMnyBgATt65f6I0tr+cJyQmiUxRRiAekZyeJtbF89iLM/nDeTCwI03LyNVAobw/R
FF0ctXdIjZvnWUXyz68+J0a+kMeMwQeIw5TlE8kxfQfIiEubj2V9xIDOe3cn520R
SzrO8xQXB/ay
=1teO
-----END PGP SIGNATURE-----
Merge tag 'pm-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These make the intel_pstate driver work as expected on all hybrid
platforms to date (regardless of possible platform firmware issues),
fix hybrid sleep on systems using suspend-to-idle by default, make the
generic power domains code handle disabled idle states properly and
update pm-graph.
Specifics:
- Make intel_pstate use what is known about the hardware instead of
relying on information from the platform firmware (ACPI CPPC in
particular) to establish the relationship between the HWP CPU
performance levels and frequencies on all hybrid platforms
available to date (Rafael Wysocki)
- Allow hybrid sleep to use suspend-to-idle as a system suspend
method if it is the current suspend method of choice (Mario
Limonciello)
- Fix handling of unavailable/disabled idle states in the generic
power domains code (Sudeep Holla)
- Update the pm-graph suite of utilities to version 5.10 which is
fixes-mostly and does not add any new features (Todd Brandt)"
* tag 'pm-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: domains: Fix handling of unavailable/disabled idle states
pm-graph v5.10
cpufreq: intel_pstate: hybrid: Use known scaling factor for P-cores
cpufreq: intel_pstate: Read all MSRs on the target CPU
PM: hibernate: Allow hybrid sleep to work with s2idle
Since commit 838d9bb62d ("perf: Use sample_flags for raw_data")
raw data is not being output on tracepoints due to the PERF_SAMPLE_RAW
field not being set. Fix this by setting it for tracepoint events.
This fixes the following test failure:
perf test "sched_switch" -vvv
35: Track with sched_switch
--- start ---
test child forked, pid 1828
...
Using CPUID 0x00000000410fd400
sched_switch: cpu: 2 prev_tid -14687 next_tid 0
sched_switch: cpu: 2 prev_tid -14687 next_tid 0
Missing sched_switch events
4613 events recorded
test child finished with -1
---- end ----
Track with sched_switch: FAILED!
Fixes: 838d9bb62d ("perf: Use sample_flags for raw_data")
Signed-off-by: James Clark <james.clark@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: SeongJae Park <sj@kernel.org>
Tested-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Link: https://lore.kernel.org/r/20221012143857.48198-1-james.clark@arm.com
Hybrid sleep is currently hardcoded to only operate with S3 even
on systems that might not support it.
Instead of assuming this mode is what the user wants to use, for
hybrid sleep follow the setting of `mem_sleep_current` which
will respect mem_sleep_default kernel command line and policy
decisions made by the presence of the FADT low power idle bit.
Fixes: 81d45bdf89 ("PM / hibernate: Untangle power_down()")
Reported-and-tested-by: kolAflash <kolAflash@kolahilft.de>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216574
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Current release - regressions:
- eth: fman: re-expose location of the MAC address to userspace,
apparently some udev scripts depended on the exact value
Current release - new code bugs:
- bpf:
- wait for busy refill_work when destroying bpf memory allocator
- allow bpf_user_ringbuf_drain() callbacks to return 1
- fix dispatcher patchable function entry to 5 bytes nop
Previous releases - regressions:
- net-memcg: avoid stalls when under memory pressure
- tcp: fix indefinite deferral of RTO with SACK reneging
- tipc: fix a null-ptr-deref in tipc_topsrv_accept
- eth: macb: specify PHY PM management done by MAC
- tcp: fix a signed-integer-overflow bug in tcp_add_backlog()
Previous releases - always broken:
- eth: amd-xgbe: SFP fixes and compatibility improvements
Misc:
- docs: netdev: offer performance feedback to contributors
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmNW024ACgkQMUZtbf5S
IrvX7w//SP/zKZwgzC13zd2rrCP16TX2QvkHPmSLvcldQDXdCypmsoc5Vb8UNpkG
jwAuy2pxqPy2oxTwTBQv9TNRT2oqEFOsFTK+w410whlL7g1wZ02aXU8qFhV2XumW
o4gRtM+UISPUKFbOnawdK1XlrNdeLF3bjETvW2GP9zxCb0iqoQXtDDNKxv2B2iQA
MSyTtzHA4n9GS7LKGtPgsP2Ose7h1Z+AjTIpQH1nvfEHJUf/wmxUdCK+fuwfeLjY
PhmYaPG/333j1bfBk1Ms/nUYA5KRXlEj9A/7jDtxhxNEwaTNKyLB19a6oVxXxpSQ
x/k+nZP1RColn5xeco5a1X9aHHQ46PJQ8wVAmxYDIeIA5XPMgShNmhAyjrq1ac+o
9vYeYpmnMGSTLdBMvGbWpynWHe7SddgF8LkbnYf2HLKbxe4bgkOnmxOUH4q9iinZ
MfVSknjax4DP0C7X1kGgR6WyltWnkrahOdUkINsIUNxj0KxJa/eStpJIbJrfkxNV
gHbOjB2/bF3SXENrS4A0IJCgsbO9YugN83Eyu0WDWQOw9wVgopzxOJx9R+H0wkVH
XpGGP8qi1DZiTE3iQiq1LHj6f6kirFmtt9QFH5yzaqtKBaqXakHaXwUO4VtD+BI9
NPFKvFL6jrp8EAn0PTM/RrvhJZN+V0bFXiyiMe0TLx+aR0UMxGc=
=dD6N
-----END PGP SIGNATURE-----
Merge tag 'net-6.1-rc3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bpf.
The net-memcg fix stands out, the rest is very run-off-the-mill. Maybe
I'm biased.
Current release - regressions:
- eth: fman: re-expose location of the MAC address to userspace,
apparently some udev scripts depended on the exact value
Current release - new code bugs:
- bpf:
- wait for busy refill_work when destroying bpf memory allocator
- allow bpf_user_ringbuf_drain() callbacks to return 1
- fix dispatcher patchable function entry to 5 bytes nop
Previous releases - regressions:
- net-memcg: avoid stalls when under memory pressure
- tcp: fix indefinite deferral of RTO with SACK reneging
- tipc: fix a null-ptr-deref in tipc_topsrv_accept
- eth: macb: specify PHY PM management done by MAC
- tcp: fix a signed-integer-overflow bug in tcp_add_backlog()
Previous releases - always broken:
- eth: amd-xgbe: SFP fixes and compatibility improvements
Misc:
- docs: netdev: offer performance feedback to contributors"
* tag 'net-6.1-rc3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (37 commits)
net-memcg: avoid stalls when under memory pressure
tcp: fix indefinite deferral of RTO with SACK reneging
tcp: fix a signed-integer-overflow bug in tcp_add_backlog()
net: lantiq_etop: don't free skb when returning NETDEV_TX_BUSY
net: fix UAF issue in nfqnl_nf_hook_drop() when ops_init() failed
docs: netdev: offer performance feedback to contributors
kcm: annotate data-races around kcm->rx_wait
kcm: annotate data-races around kcm->rx_psock
net: fman: Use physical address for userspace interfaces
net/mlx5e: Cleanup MACsec uninitialization routine
atlantic: fix deadlock at aq_nic_stop
nfp: only clean `sp_indiff` when application firmware is unloaded
amd-xgbe: add the bit rate quirk for Molex cables
amd-xgbe: fix the SFP compliance codes check for DAC cables
amd-xgbe: enable PLL_CTL for fixed PHY modes only
amd-xgbe: use enums for mailbox cmd and sub_cmds
amd-xgbe: Yellow carp devices do not need rrc
bpf: Use __llist_del_all() whenever possbile during memory draining
bpf: Wait for busy refill_work when destroying bpf memory allocator
MAINTAINERS: add keyword match on PTP
...
This pull request contains a commit that fixes bf95b2bc3e ("rcu: Switch
polled grace-period APIs to ->gp_seq_polled"), which could incorrectly
leave interrupts enabled after an early-boot call to synchronize_rcu().
Such synchronize_rcu() calls must acquire leaf rcu_node locks in order to
properly interact with polled grace periods, but the code did not take
into account the possibility of synchronize_rcu() being invoked from
the portion of the boot sequence during which interrupts are disabled.
This commit therefore switches the lock acquisition and release from
irq to irqsave/irqrestore.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmNS54gTHHBhdWxtY2tA
a2VybmVsLm9yZwAKCRCevxLzctn7jI5pD/0aURo25lH/gxCxt1A5gi/UI+/rlz+O
7rV51ZXW3cAPY1O7EP/57D8sXtgcALxYyRZ2JVh1WH+ScP+Se9YENs+UBp5zwy75
NIOkJ2LRxMCeX+T7Vw8jtsYd4bsgSQ1q2IXiGHCrRxtUwXewYMxLEVx4Bd89P45F
P49NJx66s5uXmKH2VV6UUcJT7yfyn8+USYxTaDmhhnEGE1frKBoYVqsqoaVfa0ON
r8O50/06bj6VdSRooGGw9vUcuoUlsrmnnXDd2aITkWqtBkeBOA6S3Nx+LUiRRXaw
m9e0s0yLulTlMsXVx/UAM+eBaXP3SDKK7wF1xXoyCqLdPKtZ4ABM0RQzXMsFBQQy
xs04Ba/0CBe1wVv9BijXuR1WgmcRUtoqxAFKXR6bIwh7uPtFDINRO2hfi9vEEVSQ
+4XbnoqsyHaul8tvBWV7O1Fo7+hNgFgJwAZq9I2xOJzh8wbVCc1+5E13K0Oktbq4
HERDAzeqA8Cr+6VxicrXt8/5xw7GsUJbYoXMpttB1WQBIps3/ayUxww1RyK9qegi
DnpUSAUdd0bF1ZfVtkh7V59uk+BBsL9MKNowTRrT1nVWO2VcAbf9IVvONHBFlIw4
d2IKn4IbpGsQR9ZJUfo/6y7GkZYBLU46lknduW+n7vuP+7j2B3KkFQw7z71e1c7u
LSEgmtPZS/c7SA==
=o6hI
-----END PGP SIGNATURE-----
Merge tag 'rcu-urgent.2022.10.20a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU fix from Paul McKenney:
"Fix a regression caused by commit bf95b2bc3e ("rcu: Switch polled
grace-period APIs to ->gp_seq_polled"), which could incorrectly leave
interrupts enabled after an early-boot call to synchronize_rcu().
Such synchronize_rcu() calls must acquire leaf rcu_node locks in order
to properly interact with polled grace periods, but the code did not
take into account the possibility of synchronize_rcu() being invoked
from the portion of the boot sequence during which interrupts are
disabled.
This commit therefore switches the lock acquisition and release from
irq to irqsave/irqrestore"
* tag 'rcu-urgent.2022.10.20a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
rcu: Keep synchronize_rcu() from enabling irqs in early boot
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmNVkYkACgkQ6rmadz2v
bTqzHw/+NYMwfLm5Ck+BK0+HiYU5VVLoG4jp8G7B3sJL/6nUDduajzoqa+nM19Xl
+HEjbMza7CizmhkCRkzIs1VVtx8mtvKdTxbhvm77SU2+GBn+X1es+XhtFd4EOpok
MINNHs+cOC/HlnPD/QbFgvxKiKkjyjWxInjUp6Y/mLMcKCn7l9KOkc07/la9Dj3j
RI0gXCywq1pJaPuTCnt0/wcYLJvzn6QsZnKmmksQwt59GQqOd11HWid3rBWZhDp6
beEoHDIMGHROtu60vm4DB0p4l6tauZfeXyPCeu3Tx5ZSsypJIyU1iTdKiIUjG963
ilpy55nrX9bWxadB7LIKHyYfW3in4o+D1ZZaUvLIato/69CZJZ7Uc4kU1RF4Ay1F
Df1Fmal2WeNAxxETPmQPvVeCePvQvwLHl4KNogdZZvd/67cyc1cDhnuTJp37iPak
FALHaaw0VOrTdTvxsWym7yEbkhPbCHpPrKYFZFHgGrRTFk/GM2k38mM07lcLxFGw
aKyooS+eoIZMEgtK5Hma2wpeIVSlkJiJk1d0K20OxdnIUyYEsMXmI+uV1gMxq/8z
EHNi0+296xOoxy22I1Bd5Tu7fIeecHFN44q7YFmpGsB54UNLpFsP0vYUmYT/6hLI
Y0KVZu4c3oQDX7ttifMvkeOCURDJBPrZx37bpNpNXF55fB5ehNk=
=eV7W
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Alexei Starovoitov says:
====================
pull-request: bpf 2022-10-23
We've added 7 non-merge commits during the last 18 day(s) which contain
a total of 8 files changed, 69 insertions(+), 5 deletions(-).
The main changes are:
1) Wait for busy refill_work when destroying bpf memory allocator, from Hou.
2) Allow bpf_user_ringbuf_drain() callbacks to return 1, from David.
3) Fix dispatcher patchable function entry to 5 bytes nop, from Jiri.
4) Prevent decl_tag from being referenced in func_proto, from Stanislav.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Use __llist_del_all() whenever possbile during memory draining
bpf: Wait for busy refill_work when destroying bpf memory allocator
bpf: Fix dispatcher patchable function entry to 5 bytes nop
bpf: prevent decl_tag from being referenced in func_proto
selftests/bpf: Add reproducer for decl_tag in func_proto return type
selftests/bpf: Make bpf_user_ringbuf_drain() selftest callback return 1
bpf: Allow bpf_user_ringbuf_drain() callbacks to return 1
====================
Link: https://lore.kernel.org/r/20221023192244.81137-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
- Rework how SIGTRAPs get delivered to events to address a bunch of
problems with it. Add a selftest for that too
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmNVE9UACgkQEsHwGGHe
VUpodRAAsn3s+xX02qfxRG0mYX/1nnOmnFnDhmUEPAZpDXjB6g3PFXAh26F9tw92
dLm8fbfp8W3tK7TQrGyOGWMnb7oj4gEoXAcQBXvsq2KOJMVwRHHKwCeSSJ89DLAW
zZEl2nzgea2cGyZn2RZJvhmbm8YdFON3v++Vv34yovs1MMoCcD7FOPLLGWkD2gk5
6PK6lIlWEI/+vYKWhZdxgk6/PanBInXRUnbaBVj42US1XwfDLzPBEi9yyUUJQrht
CQhfTpHn4Z5MC/hJTlFat4Jlaajql4JBcQ1SS5LW59M+6gdlBK4tL6zFt10zvU2m
+kywOOIWiYLRRgFf6idGO45P5BuWOdmsXEaEg5KW7b6nJfGvgkd7WUYgCVOgtEY1
r4Mf4hAQUunDHGQ4e8eHk7XemFJsoBSweYCTQ2O0yr/QzO2M6QBi/BR9PzUajyH+
yShKEfrxXt4595BMH0nonSMpTKcE4Zxdj06LZSnGecEN8UUlx/n49uYwhFNdUkqM
s6Wz6kSR76YRlKUmYnNzP1gkY6nJZ1nR6z7SjmMkioxf3VxhT9SY8K587r6hRUlr
/NVA69iUhJy75VdttxZEmZzB03A7AjdudmZEisF0ImEmB1hxzYLHcDKJMTIj/r4/
f8OXCg5ACKhFlnx1SdBVRtA+6+5ab368Fs2rItJQ4dzdxRi6VVY=
=DXG7
-----END PGP SIGNATURE-----
Merge tag 'perf_urgent_for_v6.1_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Borislav Petkov:
- Fix raw data handling when perf events are used in bpf
- Rework how SIGTRAPs get delivered to events to address a bunch of
problems with it. Add a selftest for that too
* tag 'perf_urgent_for_v6.1_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
bpf: Fix sample_flags for bpf_perf_event_output
selftests/perf_events: Add a SIGTRAP stress test with disables
perf: Fix missing SIGTRAPs
Except for waiting_for_gp list, there are no concurrent operations on
free_by_rcu, free_llist and free_llist_extra lists, so use
__llist_del_all() instead of llist_del_all(). waiting_for_gp list can be
deleted by RCU callback concurrently, so still use llist_del_all().
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20221021114913.60508-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
A busy irq work is an unfinished irq work and it can be either in the
pending state or in the running state. When destroying bpf memory
allocator, refill_work may be busy for PREEMPT_RT kernel in which irq
work is invoked in a per-CPU RT-kthread. It is also possible for kernel
with arch_irq_work_has_interrupt() being false (e.g. 1-cpu arm32 host or
mips) and irq work is inovked in timer interrupt.
The busy refill_work leads to various issues. The obvious one is that
there will be concurrent operations on free_by_rcu and free_list between
irq work and memory draining. Another one is call_rcu_in_progress will
not be reliable for the checking of pending RCU callback because
do_call_rcu() may have not been invoked by irq work yet. The other is
there will be use-after-free if irq work is freed before the callback
of irq work is invoked as shown below:
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor instruction fetch in kernel mode
#PF: error_code(0x0010) - not-present page
PGD 12ab94067 P4D 12ab94067 PUD 1796b4067 PMD 0
Oops: 0010 [#1] PREEMPT_RT SMP
CPU: 5 PID: 64 Comm: irq_work/5 Not tainted 6.0.0-rt11+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
RIP: 0010:0x0
Code: Unable to access opcode bytes at 0xffffffffffffffd6.
RSP: 0018:ffffadc080293e78 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffffcdc07fb6a388 RCX: ffffa05000a2e000
RDX: ffffa05000a2e000 RSI: ffffffff96cc9827 RDI: ffffcdc07fb6a388
......
Call Trace:
<TASK>
irq_work_single+0x24/0x60
irq_work_run_list+0x24/0x30
run_irq_workd+0x23/0x30
smpboot_thread_fn+0x203/0x300
kthread+0x126/0x150
ret_from_fork+0x1f/0x30
</TASK>
Considering the ease of concurrency handling, no overhead for
irq_work_sync() under non-PREEMPT_RT kernel and has-irq-work-interrupt
kernel and the short wait time used for irq_work_sync() under PREEMPT_RT
(When running two test_maps on PREEMPT_RT kernel and 72-cpus host, the
max wait time is about 8ms and the 99th percentile is 10us), just using
irq_work_sync() to wait for busy refill_work to complete before memory
draining and memory freeing.
Fixes: 7c8199e24f ("bpf: Introduce any context BPF specific memory allocator.")
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20221021114913.60508-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
post-6.0 issues.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY1IgYgAKCRDdBJ7gKXxA
jpyRAQDkfa1LDkfbA4dQBZShkUhBX1k3AyRO1NWMjwwTxP3H8wD9HUz1BB3ynoKc
ipzQs7q5jbBvndczEksHiG2AC7SvQAI=
=wD9I
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2022-10-20' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morron:
"Seventeen hotfixes, mainly for MM.
Five are cc:stable and the remainder address post-6.0 issues"
* tag 'mm-hotfixes-stable-2022-10-20' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
nouveau: fix migrate_to_ram() for faulting page
mm/huge_memory: do not clobber swp_entry_t during THP split
hugetlb: fix memory leak associated with vma_lock structure
mm/page_alloc: reduce potential fragmentation in make_alloc_exact()
mm: /proc/pid/smaps_rollup: fix maple tree search
mm,hugetlb: take hugetlb_lock before decrementing h->resv_huge_pages
mm/mmap: fix MAP_FIXED address return on VMA merge
mm/mmap.c: __vma_adjust(): suppress uninitialized var warning
mm/mmap: undo ->mmap() when mas_preallocate() fails
init: Kconfig: fix spelling mistake "satify" -> "satisfy"
ocfs2: clear dinode links count in case of error
ocfs2: fix BUG when iput after ocfs2_mknod fails
gcov: support GCC 12.1 and newer compilers
zsmalloc: zs_destroy_pool: add size_class NULL check
mm/mempolicy: fix mbind_range() arguments to vma_merge()
mailmap: update email for Qais Yousef
mailmap: update Dan Carpenter's email address
Starting with GCC 12.1, the created .gcda format can't be read by gcov
tool. There are 2 significant changes to the .gcda file format that
need to be supported:
a) [gcov: Use system IO buffering]
(23eb66d1d46a34cb28c4acbdf8a1deb80a7c5a05) changed that all sizes in
the format are in bytes and not in words (4B)
b) [gcov: make profile merging smarter]
(72e0c742bd01f8e7e6dcca64042b9ad7e75979de) add a new checksum to the
file header.
Tested with GCC 7.5, 10.4, 12.2 and the current master.
Link: https://lkml.kernel.org/r/624bda92-f307-30e9-9aaa-8cc678b2dfb2@suse.cz
Signed-off-by: Martin Liska <mliska@suse.cz>
Tested-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The patchable_function_entry(5) might output 5 single nop
instructions (depends on toolchain), which will clash with
bpf_arch_text_poke check for 5 bytes nop instruction.
Adding early init call for dispatcher that checks and change
the patchable entry into expected 5 nop instruction if needed.
There's no need to take text_mutex, because we are using it
in early init call which is called at pre-smp time.
Fixes: ceea991a01 ("bpf: Move bpf_dispatcher function out of ftrace locations")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20221018075934.574415-1-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Making polled RCU grace periods account for expedited grace periods
required acquiring the leaf rcu_node structure's lock during early boot,
but after rcu_init() was called. This lock is irq-disabled, but the
code incorrectly assumes that irqs are always disabled when invoking
synchronize_rcu(). The exception is early boot before the scheduler has
started, which means that upon return from synchronize_rcu(), irqs will
be incorrectly enabled.
This commit fixes this bug by using irqsave/irqrestore locking primitives.
Fixes: bf95b2bc3e ("rcu: Switch polled grace-period APIs to ->gp_seq_polled")
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
As previous commit, 'blk_trace_cleanup' will stop block trace if
block trace's state is 'Blktrace_running'.
So remove unnessary stop block trace in 'blk_trace_shutdown'.
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221019033602.752383-4-yebin@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When test as follows:
step1: ioctl(sda, BLKTRACESETUP, &arg)
step2: ioctl(sda, BLKTRACESTART, NULL)
step3: ioctl(sda, BLKTRACETEARDOWN, NULL)
step4: ioctl(sda, BLKTRACESETUP, &arg)
Got issue as follows:
debugfs: File 'dropped' in directory 'sda' already present!
debugfs: File 'msg' in directory 'sda' already present!
debugfs: File 'trace0' in directory 'sda' already present!
And also find syzkaller report issue like "KASAN: use-after-free Read in relay_switch_subbuf"
"https://syzkaller.appspot.com/bug?id=13849f0d9b1b818b087341691be6cc3ac6a6bfb7"
If remove block trace without stop(BLKTRACESTOP) block trace, '__blk_trace_remove'
will just set 'q->blk_trace' with NULL. However, debugfs file isn't removed, so
will report file already present when call BLKTRACESETUP.
static int __blk_trace_remove(struct request_queue *q)
{
struct blk_trace *bt;
bt = rcu_replace_pointer(q->blk_trace, NULL,
lockdep_is_held(&q->debugfs_mutex));
if (!bt)
return -EINVAL;
if (bt->trace_state != Blktrace_running)
blk_trace_cleanup(q, bt);
return 0;
}
If do test as follows:
step1: ioctl(sda, BLKTRACESETUP, &arg)
step2: ioctl(sda, BLKTRACESTART, NULL)
step3: ioctl(sda, BLKTRACETEARDOWN, NULL)
step4: remove sda
There will remove debugfs directory which will remove recursively all file
under directory.
>> blk_release_queue
>> debugfs_remove_recursive(q->debugfs_dir)
So all files which created in 'do_blk_trace_setup' are removed, and
'dentry->d_inode' is NULL. But 'q->blk_trace' is still in 'running_trace_lock',
'trace_note_tsk' will traverse 'running_trace_lock' all nodes.
>>trace_note_tsk
>> trace_note
>> relay_reserve
>> relay_switch_subbuf
>> d_inode(buf->dentry)->i_size
To solve above issues, reference commit '5afedf670caf', call 'blk_trace_cleanup'
unconditionally in '__blk_trace_remove' and first stop block trace in
'blk_trace_cleanup'.
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221019033602.752383-3-yebin@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>