If the user has requested a counting threshold for the CPU cycles event,
then the fixed cycle counter can't be assigned as it lacks threshold
support. Currently, the thresholds will work or not randomly depending
on which counter the event is assigned.
While using thresholds for CPU cycles doesn't make much sense, it can be
useful for testing purposes.
Fixes: 816c267544 ("arm64: perf: Add support for event counting threshold")
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20240626-arm-pmu-3-9-icntr-v2-1-c9784b4f4065@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
i.MX95 has a DDR PMU which is almostly same as i.MX93, it now supports
read beat and write beat filter capabilities. This will add support for
i.MX95 and enhance the driver to support specific filter handling for it.
Usage:
For read beat:
~# perf stat -a -I 1000 -e imx9_ddr0/eddrtq_pm_rd_beat_filt2,axi_mask=ID_MASK,axi_id=ID/
~# perf stat -a -I 1000 -e imx9_ddr0/eddrtq_pm_rd_beat_filt1,axi_mask=ID_MASK,axi_id=ID/
~# perf stat -a -I 1000 -e imx9_ddr0/eddrtq_pm_rd_beat_filt0,axi_mask=ID_MASK,axi_id=ID/
eg: For edma2: perf stat -a -I 1000 -e imx9_ddr0/eddrtq_pm_rd_beat_filt0,axi_mask=0x00f,axi_id=0x00c/
For write beat:
~# perf stat -a -I 1000 -e imx9_ddr0/eddrtq_pm_wr_beat_filt,axi_mask=ID_MASK,axi_id=ID/
eg: For edma2: perf stat -a -I 1000 -e imx9_ddr0/eddrtq_pm_wr_beat_filt,axi_mask=0x00f,axi_id=0x00c/
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Xu Yang <xu.yang_2@nxp.com>
Link: https://lore.kernel.org/r/20240529080358.703784-6-xu.yang_2@nxp.com
Signed-off-by: Will Deacon <will@kernel.org>
In current driver, the counter will start firstly and then be configured.
This sequence is not correct for AXI filter events since the correct
AXI_MASK and AXI_ID are not set yet. Then the results may be inaccurate.
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Fixes: 55691f99d4 ("drivers/perf: imx_ddr: Add support for NXP i.MX9 SoC DDRC PMU driver")
cc: stable@vger.kernel.org
Signed-off-by: Xu Yang <xu.yang_2@nxp.com>
Link: https://lore.kernel.org/r/20240529080358.703784-5-xu.yang_2@nxp.com
Signed-off-by: Will Deacon <will@kernel.org>
This driver is initinally used to support imx93 Soc and now it's time to
add support for imx95 Soc. However, some macro definitions and events are
different on these two Socs. For preparing imx95 supports, this will
refactor driver for imx93.
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Xu Yang <xu.yang_2@nxp.com>
Link: https://lore.kernel.org/r/20240529080358.703784-4-xu.yang_2@nxp.com
Signed-off-by: Will Deacon <will@kernel.org>
In current design, the user of perf app needs to input counter ID to count
events. However, this is not user-friendly since the user needs to lookup
the map table to find the counter. Instead of letting the user to input
the counter, let this driver to manage the counters in this patch.
This will be implemented by:
1. allocate counter 0 for cycle event.
2. find unused counter from 1-10 for reference events.
3. allocate specific counter for counter-specific events.
In this patch, counter attr will be kept for back-compatible but all the
value passed down by counter=<n> will be ignored. To mark counter-specific
events, counter ID will be encoded into perf_pmu_events_attr.id.
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Xu Yang <xu.yang_2@nxp.com>
Link: https://lore.kernel.org/r/20240529080358.703784-3-xu.yang_2@nxp.com
Signed-off-by: Will Deacon <will@kernel.org>
The user can set event and counter in cmdline and the driver need to parse
it using 'config' attr value. This will add macro definitions to avoid
hard-code in driver.
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Xu Yang <xu.yang_2@nxp.com>
Link: https://lore.kernel.org/r/20240529080358.703784-2-xu.yang_2@nxp.com
Signed-off-by: Will Deacon <will@kernel.org>
Add support for the Arm Cortex-A725, Cortex-X925, Neoverse N3,
Neoverse V2, Neoverse V3 and Neoverse V3AE.
This just adds the names and connects them with their DT compatible
strings.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Link: https://lore.kernel.org/r/20240628145612.1291329-3-andre.przywara@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
Previously, wp_config0/2 registers were used for primary match group and
wp_config1/3 registers for secondary match group. In order to support
tertiary match group, this patch decouples the registers and the groups.
Signed-off-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Link: https://lore.kernel.org/r/20240618005056.3092866-2-ilkka@os.amperecomputing.com
Signed-off-by: Will Deacon <will@kernel.org>
- Fix broken FP register state tracking which resulted in filesystem
corruption when dm-crypt is used
- Workarounds for Arm CPU errata affecting the SSBS Spectre mitigation
- Fix lockdep assertion in DMC620 memory controller PMU driver
- Fix alignment of BUG table when CONFIG_DEBUG_BUGVERBOSE is disabled
-----BEGIN PGP SIGNATURE-----
iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmZN3xcQHHdpbGxAa2Vy
bmVsLm9yZwAKCRC3rHDchMFjNMWjCACBIwegWWitCxgvujTPzOc0AwbxJjJWVGF4
0Y3sthbirIJc8e5K7HYv4wbbCHbaqHX4T9noAKx3wvskEomcNqYyI5Wzr/KTR82f
OHWHeMebFCAvo+UKTBa71JZcjgB4wi4+UuXIV1tViuMvGRKJW3nXKSwIt4SSQOYM
VmS8bvqyyJZtnpNDgniY6QHRCWatagHpQFNFePkvsJiSoi78+FZWb2k2h55rz0iE
EG2Vuzw5r1MNqXHCpPaU7fNwsLFbNYiJz3CQYisBLondyDDMsK1XUkLWoxWgGJbK
SNbE3becd0C2SlOTwllV4R59AsmMPvA7tOHbD41aGOSBlKY1Hi91
=ivar
-----END PGP SIGNATURE-----
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
"The major fix here is for a filesystem corruption issue reported on
Apple M1 as a result of buggy management of the floating point
register state introduced in 6.8. I initially reverted one of the
offending patches, but in the end Ard cooked a proper fix so there's a
revert+reapply in the series.
Aside from that, we've got some CPU errata workarounds and misc other
fixes.
- Fix broken FP register state tracking which resulted in filesystem
corruption when dm-crypt is used
- Workarounds for Arm CPU errata affecting the SSBS Spectre
mitigation
- Fix lockdep assertion in DMC620 memory controller PMU driver
- Fix alignment of BUG table when CONFIG_DEBUG_BUGVERBOSE is
disabled"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64/fpsimd: Avoid erroneous elide of user state reload
Reapply "arm64: fpsimd: Implement lazy restore for kernel mode FPSIMD"
arm64: asm-bug: Add .align 2 to the end of __BUG_ENTRY
perf/arm-dmc620: Fix lockdep assert in ->event_init()
Revert "arm64: fpsimd: Implement lazy restore for kernel mode FPSIMD"
arm64: errata: Add workaround for Arm errata 3194386 and 3312417
arm64: cputype: Add Neoverse-V3 definitions
arm64: cputype: Add Cortex-X4 definitions
arm64: barrier: Restore spec_bar() macro
Here is the small set of driver core and kernfs changes for 6.10-rc1.
Nothing major here at all, just a small set of changes for some driver
core apis, and minor fixups. Included in here are:
- sysfs_bin_attr_simple_read() helper added and used
- device_show_string() helper added and used
All usages of these were acked by the various maintainers. Also in here
are:
- kernfs minor cleanup
- removed unused functions
- typo fix in documentation
- pay attention to sysfs_create_link() failures in module.c finally.
All of these have been in linux-next for a very long time with no
reported problems.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZk3+hQ8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ylfTwCfUyHWkDZuZ7ehdtjzfmcd4EKZBK8An3AAV99G
ox8PXMxuFTaUEdT/69FQ
=2sEo
-----END PGP SIGNATURE-----
Merge tag 'driver-core-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the small set of driver core and kernfs changes for 6.10-rc1.
Nothing major here at all, just a small set of changes for some driver
core apis, and minor fixups. Included in here are:
- sysfs_bin_attr_simple_read() helper added and used
- device_show_string() helper added and used
All usages of these were acked by the various maintainers. Also in
here are:
- kernfs minor cleanup
- removed unused functions
- typo fix in documentation
- pay attention to sysfs_create_link() failures in module.c finally
All of these have been in linux-next for a very long time with no
reported problems"
* tag 'driver-core-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
device property: Fix a typo in the description of device_get_child_node_count()
kernfs: mount: Remove unnecessary ‘NULL’ values from knparent
scsi: Use device_show_string() helper for sysfs attributes
platform/x86: Use device_show_string() helper for sysfs attributes
perf: Use device_show_string() helper for sysfs attributes
IB/qib: Use device_show_string() helper for sysfs attributes
hwmon: Use device_show_string() helper for sysfs attributes
driver core: Add device_show_string() helper for sysfs attributes
treewide: Use sysfs_bin_attr_simple_read() helper
sysfs: Add sysfs_bin_attr_simple_read() helper
module: don't ignore sysfs_create_link() failures
driver core: Remove unused platform_notify, platform_notify_remove
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAmZLzNIUHGJoZWxnYWFz
QGdvb2dsZS5jb20ACgkQWYigwDrT+vwr/Q//STe2XGKI8bAKqP2wbbkzm+ISnK4A
Lqf3FEAIXunxDRspszfXKKV2p4vaIkmOFiwIdtp/kWvd0DQn5+ATXJ/iQtp8aFX/
R+6BQ7EZc2G7fN5fbQuK54+CvmWEpkKEMbXYbd6ivQ14Cijdb3Nbu+w+DYFjS+6C
k2a9lS1bTW7Xcy0fyiO1w6GQiWqtmOH8U3OlQtIrI0EVkDG9OG1LsLuc92/FgkOo
REN+sU+hX1K5fHrvm2CtjYDn/9/B6bJ/It22H1dPgUL9nKvKC67fYzosMtUCOX1M
6XSPjZIuXOmQGeZXHhpSlVwaidxoUjYO98I7nMquxKdCy6yct3geK7ULG/xeQCgD
ML7MGQB4+sTiSWalXUQaziKqF1FIDEvU3HMGXFWnoBL5l56eRp8KS1EI9Eqk9pU3
pk9fJaCkcFnkzPtMFzqPOm5q9zUZ6bGbfYb0hs72TUKplmVDhFo2T1YsW2AOyHZ7
mjuDzUYZX0H7uM1tntA56IgZX+oNOrLvhBt5L5M/BQeCsZFBBUfIcAEaYoL9LwXO
AYgIG3jdqzHHyAUzutJF+XHKinJLMHm0XVYbFmO6saPhFzrUJSNHqT7NzW1DGGTl
OnO8e1WNMX1EcnKvnc6fXyGmM3SgVwy45FsbG/zRnhn4uBKqKtjrh6uX/myA22LK
CSeqSUK9XmXxFNA=
=xjoS
-----END PGP SIGNATURE-----
Merge tag 'pci-v6.10-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Pull pci updates from Bjorn Helgaas:
"Enumeration:
- Skip E820 checks for MCFG ECAM regions for new (2016+) machines,
since there's no requirement to describe them in E820 and some
platforms require ECAM to work (Bjorn Helgaas)
- Rename PCI_IRQ_LEGACY to PCI_IRQ_INTX to be more specific (Damien
Le Moal)
- Remove last user and pci_enable_device_io() (Heiner Kallweit)
- Wait for Link Training==0 to avoid possible race (Ilpo Järvinen)
- Skip waiting for devices that have been disconnected while
suspended (Ilpo Järvinen)
- Clear Secondary Status errors after enumeration since Master Aborts
and Unsupported Request errors are an expected part of enumeration
(Vidya Sagar)
MSI:
- Remove unused IMS (Interrupt Message Store) support (Bjorn Helgaas)
Error handling:
- Mask Genesys GL975x SD host controller Replay Timer Timeout
correctable errors caused by a hardware defect; the errors cause
interrupts that prevent system suspend (Kai-Heng Feng)
- Fix EDR-related _DSM support, which previously evaluated revision 5
but assumed revision 6 behavior (Kuppuswamy Sathyanarayanan)
ASPM:
- Simplify link state definitions and mask calculation (Ilpo
Järvinen)
Power management:
- Avoid D3cold for HP Pavilion 17 PC/1972 PCIe Ports, where BIOS
apparently doesn't know how to put them back in D0 (Mario
Limonciello)
CXL:
- Support resetting CXL devices; special handling required because
CXL Ports mask Secondary Bus Reset by default (Dave Jiang)
DOE:
- Support DOE Discovery Version 2 (Alexey Kardashevskiy)
Endpoint framework:
- Set endpoint BAR to be 64-bit if the driver says that's all the
device supports, in addition to doing so if the size is >2GB
(Niklas Cassel)
- Simplify endpoint BAR allocation and setting interfaces (Niklas
Cassel)
Cadence PCIe controller driver:
- Drop DT binding redundant msi-parent and pci-bus.yaml (Krzysztof
Kozlowski)
Cadence PCIe endpoint driver:
- Configure endpoint BARs to be 64-bit based on the BAR type, not the
BAR value (Niklas Cassel)
Freescale Layerscape PCIe controller driver:
- Convert DT binding to YAML (Frank Li)
MediaTek MT7621 PCIe controller driver:
- Add DT binding missing 'reg' property for child Root Ports
(Krzysztof Kozlowski)
- Fix theoretical string truncation in PHY name (Sergio Paracuellos)
NVIDIA Tegra194 PCIe controller driver:
- Return success for endpoint probe instead of falling through to the
failure path (Vidya Sagar)
Renesas R-Car PCIe controller driver:
- Add DT binding missing IOMMU properties (Geert Uytterhoeven)
- Add DT binding R-Car V4H compatible for host and endpoint mode
(Yoshihiro Shimoda)
Rockchip PCIe controller driver:
- Configure endpoint BARs to be 64-bit based on the BAR type, not the
BAR value (Niklas Cassel)
- Add DT binding missing maxItems to ep-gpios (Krzysztof Kozlowski)
- Set the Subsystem Vendor ID, which was previously zero because it
was masked incorrectly (Rick Wertenbroek)
Synopsys DesignWare PCIe controller driver:
- Restructure DBI register access to accommodate devices where this
requires Refclk to be active (Manivannan Sadhasivam)
- Remove the deinit() callback, which was only need by the
pcie-rcar-gen4, and do it directly in that driver (Manivannan
Sadhasivam)
- Add dw_pcie_ep_cleanup() so drivers that support PERST# can clean
up things like eDMA (Manivannan Sadhasivam)
- Rename dw_pcie_ep_exit() to dw_pcie_ep_deinit() to make it parallel
to dw_pcie_ep_init() (Manivannan Sadhasivam)
- Rename dw_pcie_ep_init_complete() to dw_pcie_ep_init_registers() to
reflect the actual functionality (Manivannan Sadhasivam)
- Call dw_pcie_ep_init_registers() directly from all the glue
drivers, not just those that require active Refclk from the host
(Manivannan Sadhasivam)
- Remove the "core_init_notifier" flag, which was an obscure way for
glue drivers to indicate that they depend on Refclk from the host
(Manivannan Sadhasivam)
TI J721E PCIe driver:
- Add DT binding J784S4 SoC Device ID (Siddharth Vadapalli)
- Add DT binding J722S SoC support (Siddharth Vadapalli)
TI Keystone PCIe controller driver:
- Add DT binding missing num-viewport, phys and phy-name properties
(Jan Kiszka)
Miscellaneous:
- Constify and annotate with __ro_after_init (Heiner Kallweit)
- Convert DT bindings to YAML (Krzysztof Kozlowski)
- Check for kcalloc() failure in of_pci_prop_intr_map() (Duoming
Zhou)"
* tag 'pci-v6.10-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: (97 commits)
PCI: Do not wait for disconnected devices when resuming
x86/pci: Skip early E820 check for ECAM region
PCI: Remove unused pci_enable_device_io()
ata: pata_cs5520: Remove unnecessary call to pci_enable_device_io()
PCI: Update pci_find_capability() stub return types
PCI: Remove PCI_IRQ_LEGACY
scsi: vmw_pvscsi: Do not use PCI_IRQ_LEGACY instead of PCI_IRQ_LEGACY
scsi: pmcraid: Use PCI_IRQ_INTX instead of PCI_IRQ_LEGACY
scsi: mpt3sas: Use PCI_IRQ_INTX instead of PCI_IRQ_LEGACY
scsi: megaraid_sas: Use PCI_IRQ_INTX instead of PCI_IRQ_LEGACY
scsi: ipr: Use PCI_IRQ_INTX instead of PCI_IRQ_LEGACY
scsi: hpsa: Use PCI_IRQ_INTX instead of PCI_IRQ_LEGACY
scsi: arcmsr: Use PCI_IRQ_INTX instead of PCI_IRQ_LEGACY
wifi: rtw89: Use PCI_IRQ_INTX instead of PCI_IRQ_LEGACY
dt-bindings: PCI: rockchip,rk3399-pcie: Add missing maxItems to ep-gpios
Revert "genirq/msi: Provide constants for PCI/IMS support"
Revert "x86/apic/msi: Enable PCI/IMS"
Revert "iommu/vt-d: Enable PCI/IMS"
Revert "iommu/amd: Enable PCI/IMS"
Revert "PCI/MSI: Provide IMS (Interrupt Message Store) support"
...
for_each_sibling_event() checks leader's ctx but it doesn't have the ctx
yet if it's the leader. Like in perf_event_validate_size(), we should
skip checking siblings in that case.
Acked-by: Mark Rutland <mark.rutland@arm.com>
Fixes: f3c0eba287 ("perf: Add a few assertions")
Reported-by: Greg Thelen <gthelen@google.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Tuan Phan <tuanphan@os.amperecomputing.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/20240514180050.182454-1-namhyung@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
* Move a lot of state that was previously stored on a per vcpu
basis into a per-CPU area, because it is only pertinent to the
host while the vcpu is loaded. This results in better state
tracking, and a smaller vcpu structure.
* Add full handling of the ERET/ERETAA/ERETAB instructions in
nested virtualisation. The last two instructions also require
emulating part of the pointer authentication extension.
As a result, the trap handling of pointer authentication has
been greatly simplified.
* Turn the global (and not very scalable) LPI translation cache
into a per-ITS, scalable cache, making non directly injected
LPIs much cheaper to make visible to the vcpu.
* A batch of pKVM patches, mostly fixes and cleanups, as the
upstreaming process seems to be resuming. Fingers crossed!
* Allocate PPIs and SGIs outside of the vcpu structure, allowing
for smaller EL2 mapping and some flexibility in implementing
more or less than 32 private IRQs.
* Purge stale mpidr_data if a vcpu is created after the MPIDR
map has been created.
* Preserve vcpu-specific ID registers across a vcpu reset.
* Various minor cleanups and improvements.
LoongArch:
* Add ParaVirt IPI support.
* Add software breakpoint support.
* Add mmio trace events support.
RISC-V:
* Support guest breakpoints using ebreak
* Introduce per-VCPU mp_state_lock and reset_cntx_lock
* Virtualize SBI PMU snapshot and counter overflow interrupts
* New selftests for SBI PMU and Guest ebreak
* Some preparatory work for both TDX and SNP page fault handling.
This also cleans up the page fault path, so that the priorities
of various kinds of fauls (private page, no memory, write
to read-only slot, etc.) are easier to follow.
x86:
* Minimize amount of time that shadow PTEs remain in the special
REMOVED_SPTE state. This is a state where the mmu_lock is held for
reading but concurrent accesses to the PTE have to spin; shortening
its use allows other vCPUs to repopulate the zapped region while
the zapper finishes tearing down the old, defunct page tables.
* Advertise the max mappable GPA in the "guest MAXPHYADDR" CPUID field,
which is defined by hardware but left for software use. This lets KVM
communicate its inability to map GPAs that set bits 51:48 on hosts
without 5-level nested page tables. Guest firmware is expected to
use the information when mapping BARs; this avoids that they end up at
a legal, but unmappable, GPA.
* Fixed a bug where KVM would not reject accesses to MSR that aren't
supposed to exist given the vCPU model and/or KVM configuration.
* As usual, a bunch of code cleanups.
x86 (AMD):
* Implement a new and improved API to initialize SEV and SEV-ES VMs, which
will also be extendable to SEV-SNP. The new API specifies the desired
encryption in KVM_CREATE_VM and then separately initializes the VM.
The new API also allows customizing the desired set of VMSA features;
the features affect the measurement of the VM's initial state, and
therefore enabling them cannot be done tout court by the hypervisor.
While at it, the new API includes two bugfixes that couldn't be
applied to the old one without a flag day in userspace or without
affecting the initial measurement. When a SEV-ES VM is created with
the new VM type, KVM_GET_REGS/KVM_SET_REGS and friends are
rejected once the VMSA has been encrypted. Also, the FPU and AVX
state will be synchronized and encrypted too.
* Support for GHCB version 2 as applicable to SEV-ES guests. This, once
more, is only accessible when using the new KVM_SEV_INIT2 flow for
initialization of SEV-ES VMs.
x86 (Intel):
* An initial bunch of prerequisite patches for Intel TDX were merged.
They generally don't do anything interesting. The only somewhat user
visible change is a new debugging mode that checks that KVM's MMU
never triggers a #VE virtualization exception in the guest.
* Clear vmcs.EXIT_QUALIFICATION when synthesizing an EPT Misconfig VM-Exit to
L1, as per the SDM.
Generic:
* Use vfree() instead of kvfree() for allocations that always use vcalloc()
or __vcalloc().
* Remove .change_pte() MMU notifier - the changes to non-KVM code are
small and Andrew Morton asked that I also take those through the KVM
tree. The callback was only ever implemented by KVM (which was also the
original user of MMU notifiers) but it had been nonfunctional ever since
calls to set_pte_at_notify were wrapped with invalidate_range_start
and invalidate_range_end... in 2012.
Selftests:
* Enhance the demand paging test to allow for better reporting and stressing
of UFFD performance.
* Convert the steal time test to generate TAP-friendly output.
* Fix a flaky false positive in the xen_shinfo_test due to comparing elapsed
time across two different clock domains.
* Skip the MONITOR/MWAIT test if the host doesn't actually support MWAIT.
* Avoid unnecessary use of "sudo" in the NX hugepage test wrapper shell
script, to play nice with running in a minimal userspace environment.
* Allow skipping the RSEQ test's sanity check that the vCPU was able to
complete a reasonable number of KVM_RUNs, as the assert can fail on a
completely valid setup. If the test is run on a large-ish system that is
otherwise idle, and the test isn't affined to a low-ish number of CPUs, the
vCPU task can be repeatedly migrated to CPUs that are in deep sleep states,
which results in the vCPU having very little net runtime before the next
migration due to high wakeup latencies.
* Define _GNU_SOURCE for all selftests to fix a warning that was introduced by
a change to kselftest_harness.h late in the 6.9 cycle, and because forcing
every test to #define _GNU_SOURCE is painful.
* Provide a global pseudo-RNG instance for all tests, so that library code can
generate random, but determinstic numbers.
* Use the global pRNG to randomly force emulation of select writes from guest
code on x86, e.g. to help validate KVM's emulation of locked accesses.
* Allocate and initialize x86's GDT, IDT, TSS, segments, and default exception
handlers at VM creation, instead of forcing tests to manually trigger the
related setup.
Documentation:
* Fix a goof in the KVM_CREATE_GUEST_MEMFD documentation.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmZE878UHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroOukQf+LcvZsWtrC7Wd5K9SQbYXaS4Rk6P6
JHoQW2d0hUN893J2WibEw+l1J/0vn5JumqHXyZgJ7CbaMtXkWWQTwDSDLuURUKpv
XNB3Sb17G87NH+s1tOh0tA9h5upbtlHVHvrtIwdbb9+XHgQ6HTL4uk+HdfO/p9fW
cWBEZAKoWcCIa99Numv3pmq5vdrvBlNggwBugBS8TH69EKMw+V1Vu1SFkIdNDTQk
NJJ28cohoP3wnwlIHaXSmU4RujipPH3Lm/xupyA5MwmzO713eq2yUqV49jzhD5/I
MA4Ruvgrdm4wpp89N9lQMyci91u6q7R9iZfMu0tSg2qYI3UPKIdstd8sOA==
=2lED
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"ARM:
- Move a lot of state that was previously stored on a per vcpu basis
into a per-CPU area, because it is only pertinent to the host while
the vcpu is loaded. This results in better state tracking, and a
smaller vcpu structure.
- Add full handling of the ERET/ERETAA/ERETAB instructions in nested
virtualisation. The last two instructions also require emulating
part of the pointer authentication extension. As a result, the trap
handling of pointer authentication has been greatly simplified.
- Turn the global (and not very scalable) LPI translation cache into
a per-ITS, scalable cache, making non directly injected LPIs much
cheaper to make visible to the vcpu.
- A batch of pKVM patches, mostly fixes and cleanups, as the
upstreaming process seems to be resuming. Fingers crossed!
- Allocate PPIs and SGIs outside of the vcpu structure, allowing for
smaller EL2 mapping and some flexibility in implementing more or
less than 32 private IRQs.
- Purge stale mpidr_data if a vcpu is created after the MPIDR map has
been created.
- Preserve vcpu-specific ID registers across a vcpu reset.
- Various minor cleanups and improvements.
LoongArch:
- Add ParaVirt IPI support
- Add software breakpoint support
- Add mmio trace events support
RISC-V:
- Support guest breakpoints using ebreak
- Introduce per-VCPU mp_state_lock and reset_cntx_lock
- Virtualize SBI PMU snapshot and counter overflow interrupts
- New selftests for SBI PMU and Guest ebreak
- Some preparatory work for both TDX and SNP page fault handling.
This also cleans up the page fault path, so that the priorities of
various kinds of fauls (private page, no memory, write to read-only
slot, etc.) are easier to follow.
x86:
- Minimize amount of time that shadow PTEs remain in the special
REMOVED_SPTE state.
This is a state where the mmu_lock is held for reading but
concurrent accesses to the PTE have to spin; shortening its use
allows other vCPUs to repopulate the zapped region while the zapper
finishes tearing down the old, defunct page tables.
- Advertise the max mappable GPA in the "guest MAXPHYADDR" CPUID
field, which is defined by hardware but left for software use.
This lets KVM communicate its inability to map GPAs that set bits
51:48 on hosts without 5-level nested page tables. Guest firmware
is expected to use the information when mapping BARs; this avoids
that they end up at a legal, but unmappable, GPA.
- Fixed a bug where KVM would not reject accesses to MSR that aren't
supposed to exist given the vCPU model and/or KVM configuration.
- As usual, a bunch of code cleanups.
x86 (AMD):
- Implement a new and improved API to initialize SEV and SEV-ES VMs,
which will also be extendable to SEV-SNP.
The new API specifies the desired encryption in KVM_CREATE_VM and
then separately initializes the VM. The new API also allows
customizing the desired set of VMSA features; the features affect
the measurement of the VM's initial state, and therefore enabling
them cannot be done tout court by the hypervisor.
While at it, the new API includes two bugfixes that couldn't be
applied to the old one without a flag day in userspace or without
affecting the initial measurement. When a SEV-ES VM is created with
the new VM type, KVM_GET_REGS/KVM_SET_REGS and friends are rejected
once the VMSA has been encrypted. Also, the FPU and AVX state will
be synchronized and encrypted too.
- Support for GHCB version 2 as applicable to SEV-ES guests.
This, once more, is only accessible when using the new
KVM_SEV_INIT2 flow for initialization of SEV-ES VMs.
x86 (Intel):
- An initial bunch of prerequisite patches for Intel TDX were merged.
They generally don't do anything interesting. The only somewhat
user visible change is a new debugging mode that checks that KVM's
MMU never triggers a #VE virtualization exception in the guest.
- Clear vmcs.EXIT_QUALIFICATION when synthesizing an EPT Misconfig
VM-Exit to L1, as per the SDM.
Generic:
- Use vfree() instead of kvfree() for allocations that always use
vcalloc() or __vcalloc().
- Remove .change_pte() MMU notifier - the changes to non-KVM code are
small and Andrew Morton asked that I also take those through the
KVM tree.
The callback was only ever implemented by KVM (which was also the
original user of MMU notifiers) but it had been nonfunctional ever
since calls to set_pte_at_notify were wrapped with
invalidate_range_start and invalidate_range_end... in 2012.
Selftests:
- Enhance the demand paging test to allow for better reporting and
stressing of UFFD performance.
- Convert the steal time test to generate TAP-friendly output.
- Fix a flaky false positive in the xen_shinfo_test due to comparing
elapsed time across two different clock domains.
- Skip the MONITOR/MWAIT test if the host doesn't actually support
MWAIT.
- Avoid unnecessary use of "sudo" in the NX hugepage test wrapper
shell script, to play nice with running in a minimal userspace
environment.
- Allow skipping the RSEQ test's sanity check that the vCPU was able
to complete a reasonable number of KVM_RUNs, as the assert can fail
on a completely valid setup.
If the test is run on a large-ish system that is otherwise idle,
and the test isn't affined to a low-ish number of CPUs, the vCPU
task can be repeatedly migrated to CPUs that are in deep sleep
states, which results in the vCPU having very little net runtime
before the next migration due to high wakeup latencies.
- Define _GNU_SOURCE for all selftests to fix a warning that was
introduced by a change to kselftest_harness.h late in the 6.9
cycle, and because forcing every test to #define _GNU_SOURCE is
painful.
- Provide a global pseudo-RNG instance for all tests, so that library
code can generate random, but determinstic numbers.
- Use the global pRNG to randomly force emulation of select writes
from guest code on x86, e.g. to help validate KVM's emulation of
locked accesses.
- Allocate and initialize x86's GDT, IDT, TSS, segments, and default
exception handlers at VM creation, instead of forcing tests to
manually trigger the related setup.
Documentation:
- Fix a goof in the KVM_CREATE_GUEST_MEMFD documentation"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (225 commits)
selftests/kvm: remove dead file
KVM: selftests: arm64: Test vCPU-scoped feature ID registers
KVM: selftests: arm64: Test that feature ID regs survive a reset
KVM: selftests: arm64: Store expected register value in set_id_regs
KVM: selftests: arm64: Rename helper in set_id_regs to imply VM scope
KVM: arm64: Only reset vCPU-scoped feature ID regs once
KVM: arm64: Reset VM feature ID regs from kvm_reset_sys_regs()
KVM: arm64: Rename is_id_reg() to imply VM scope
KVM: arm64: Destroy mpidr_data for 'late' vCPU creation
KVM: arm64: Use hVHE in pKVM by default on CPUs with VHE support
KVM: arm64: Fix hvhe/nvhe early alias parsing
KVM: SEV: Allow per-guest configuration of GHCB protocol version
KVM: SEV: Add GHCB handling for termination requests
KVM: SEV: Add GHCB handling for Hypervisor Feature Support requests
KVM: SEV: Add support to handle AP reset MSR protocol
KVM: x86: Explicitly zero kvm_caps during vendor module load
KVM: x86: Fully re-initialize supported_mce_cap on vendor module load
KVM: x86: Fully re-initialize supported_vm_types on vendor module load
KVM: x86/mmu: Sanity check that __kvm_faultin_pfn() doesn't create noslot pfns
KVM: x86/mmu: Initialize kvm_page_fault's pfn and hva to error values
...
Move PCI_DVSEC_VENDOR_ID_CXL in CXL private code to PCI_VENDOR_ID_CXL in
pci_ids.h in order to be utilized in PCI subsystem.
While the CXL Vendor ID (0x1e98) is not listed in the PCI SIG "Member
Companies" database at https://pcisig.com/membership/member-companies, the
SIG has confirmed that it is reserved by CXL.
Link: https://lore.kernel.org/r/20240502165851.1948523-2-dave.jiang@intel.com
Suggested-by: Bjorn Helgaas <helgaas@kernel.org>
Link: https://lore.kernel.org/linux-cxl/20240402172323.GA1818777@bhelgaas/
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
[bhelgaas: update commit log]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Deduplicate sysfs ->show() callbacks which expose a string at a static
memory location. Use the newly introduced device_show_string() helper
in the driver core instead by declaring those sysfs attributes with
DEVICE_STRING_ATTR_RO().
No functional change intended.
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Link: https://lore.kernel.org/r/3a297850312b4ecb62d6872121de04496900f502.1713608122.git.lukas@wunner.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
pci_alloc_irq_vectors() allocates an irq vector. When devm_add_action()
fails, the irq vector is not freed, which leads to a memory leak.
Replace the devm_add_action with devm_add_action_or_reset to ensure
the irq vector can be destroyed when it fails.
Fixes: 66637ab137 ("drivers/perf: hisi: add driver for HNS3 PMU")
Signed-off-by: Hao Chen <chenhao418@huawei.com>
Signed-off-by: Junhao He <hejunhao3@huawei.com>
Reviewed-by: Jijie Shao <shaojijie@huawei.com>
Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://lore.kernel.org/r/20240425124627.13764-4-hejunhao3@huawei.com
Signed-off-by: Will Deacon <will@kernel.org>
The perf tool allows users to create event groups through following
cmd [1], but the driver does not check whether the array index is out
of bounds when writing data to the event_group array. If the number of
events in an event_group is greater than HNS3_PMU_MAX_HW_EVENTS, the
memory write overflow of event_group array occurs.
Add array index check to fix the possible array out of bounds violation,
and return directly when write new events are written to array bounds.
There are 9 different events in an event_group.
[1] perf stat -e '{pmu/event1/, ... ,pmu/event9/}
Fixes: 66637ab137 ("drivers/perf: hisi: add driver for HNS3 PMU")
Signed-off-by: Junhao He <hejunhao3@huawei.com>
Signed-off-by: Hao Chen <chenhao418@huawei.com>
Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Jijie Shao <shaojijie@huawei.com>
Link: https://lore.kernel.org/r/20240425124627.13764-3-hejunhao3@huawei.com
Signed-off-by: Will Deacon <will@kernel.org>
The perf tool allows users to create event groups through following
cmd [1], but the driver does not check whether the array index is out of
bounds when writing data to the event_group array. If the number of events
in an event_group is greater than HISI_PCIE_MAX_COUNTERS, the memory write
overflow of event_group array occurs.
Add array index check to fix the possible array out of bounds violation,
and return directly when write new events are written to array bounds.
There are 9 different events in an event_group.
[1] perf stat -e '{pmu/event1/, ... ,pmu/event9/}'
Fixes: 8404b0fbc7 ("drivers/perf: hisi: Add driver for HiSilicon PCIe PMU")
Signed-off-by: Junhao He <hejunhao3@huawei.com>
Reviewed-by: Jijie Shao <shaojijie@huawei.com>
Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://lore.kernel.org/r/20240425124627.13764-2-hejunhao3@huawei.com
Signed-off-by: Will Deacon <will@kernel.org>
SBI v2.0 SBI introduced PMU snapshot feature which adds the following
features.
1. Read counter values directly from the shared memory instead of
csr read.
2. Start multiple counters with initial values with one SBI call.
These functionalities optimizes the number of traps to the higher
privilege mode. If the kernel is in VS mode while the hypervisor
deploy trap & emulate method, this would minimize all the hpmcounter
CSR read traps. If the kernel is running in S-mode, the benefits
reduced to CSR latency vs DRAM/cache latency as there is no trap
involved while accessing the hpmcounter CSRs.
In both modes, it does saves the number of ecalls while starting
multiple counter together with an initial values. This is a likely
scenario if multiple counters overflow at the same time.
Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Reviewed-by: Samuel Holland <samuel.holland@sifive.com>
Signed-off-by: Atish Patra <atishp@rivosinc.com>
Link: https://lore.kernel.org/r/20240420151741.962500-10-atishp@rivosinc.com
Signed-off-by: Anup Patel <anup@brainfault.org>
For RV32, used_hw_ctrs can have more than 1 word if the firmware chooses
to interleave firmware/hardware counters indicies. Even though it's a
unlikely scenario, handle that case by iterating over all the words
instead of just using the first word.
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
Signed-off-by: Atish Patra <atishp@rivosinc.com>
Link: https://lore.kernel.org/r/20240420151741.962500-9-atishp@rivosinc.com
Signed-off-by: Anup Patel <anup@brainfault.org>
It is a good practice to use BIT() instead of (1 << x).
Replace the current usages with BIT().
Take this opportunity to replace few (1UL << x) with BIT() as well
for consistency.
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
Signed-off-by: Atish Patra <atishp@rivosinc.com>
Link: https://lore.kernel.org/r/20240420151741.962500-5-atishp@rivosinc.com
Signed-off-by: Anup Patel <anup@brainfault.org>
SBI v2.0 introduced a explicit function to read the upper 32 bits
for any firmware counter width that is longer than 32bits.
This is only applicable for RV32 where firmware counter can be
64 bit.
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Atish Patra <atishp@rivosinc.com>
Link: https://lore.kernel.org/r/20240420151741.962500-4-atishp@rivosinc.com
Signed-off-by: Anup Patel <anup@brainfault.org>
Now that perf supports giving the PMU device a parent, we can use our
platform device to make the relationship between CMN instances and PMU
IDs trivially discoverable, from either nominal direction:
root@crazy-taxi:~# ls /sys/devices/platform/ARMHC600:00 | grep cmn
arm_cmn_0
root@crazy-taxi:~# realpath /sys/bus/event_source/devices/arm_cmn_0/..
/sys/devices/platform/ARMHC600:00
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/25d4428df1ddad966c74a3ed60171cd3ca6c8b66.1712682917.git.robin.murphy@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
In general it's preferable to avoid placing cpumasks on the stack, as
for large values of NR_CPUS these can consume significant amounts of
stack space and make stack overflows more likely.
Use cpumask_any_and_but() to avoid the need for a temporary cpumask on
the stack.
Suggested-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Dawei Li <dawei.li@shingroup.cn>
Link: https://lore.kernel.org/r/20240403155950.2068109-11-dawei.li@shingroup.cn
Signed-off-by: Will Deacon <will@kernel.org>
In general it's preferable to avoid placing cpumasks on the stack, as
for large values of NR_CPUS these can consume significant amounts of
stack space and make stack overflows more likely.
Use cpumask_any_and_but() to avoid the need for a temporary cpumask on
the stack.
Suggested-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Dawei Li <dawei.li@shingroup.cn>
Link: https://lore.kernel.org/r/20240403155950.2068109-10-dawei.li@shingroup.cn
Signed-off-by: Will Deacon <will@kernel.org>
In general it's preferable to avoid placing cpumasks on the stack, as
for large values of NR_CPUS these can consume significant amounts of
stack space and make stack overflows more likely.
Use cpumask_any_and_but() to avoid the need for a temporary cpumask on
the stack.
Suggested-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Dawei Li <dawei.li@shingroup.cn>
Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://lore.kernel.org/r/20240403155950.2068109-9-dawei.li@shingroup.cn
Signed-off-by: Will Deacon <will@kernel.org>
In general it's preferable to avoid placing cpumasks on the stack, as
for large values of NR_CPUS these can consume significant amounts of
stack space and make stack overflows more likely.
Use cpumask_any_and_but() to avoid the need for a temporary cpumask on
the stack.
Suggested-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Dawei Li <dawei.li@shingroup.cn>
Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://lore.kernel.org/r/20240403155950.2068109-8-dawei.li@shingroup.cn
Signed-off-by: Will Deacon <will@kernel.org>
In general it's preferable to avoid placing cpumasks on the stack, as
for large values of NR_CPUS these can consume significant amounts of
stack space and make stack overflows more likely.
Use cpumask_any_and_but() to avoid the need for a temporary cpumask on
the stack.
Suggested-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Dawei Li <dawei.li@shingroup.cn>
Reviewed-by: Shuai Xue <xueshuai@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240403155950.2068109-7-dawei.li@shingroup.cn
Signed-off-by: Will Deacon <will@kernel.org>
In general it's preferable to avoid placing cpumasks on the stack, as
for large values of NR_CPUS these can consume significant amounts of
stack space and make stack overflows more likely.
Use cpumask_any_and_but() to avoid the need for a temporary cpumask on
the stack.
Suggested-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Dawei Li <dawei.li@shingroup.cn>
Link: https://lore.kernel.org/r/20240403155950.2068109-6-dawei.li@shingroup.cn
Signed-off-by: Will Deacon <will@kernel.org>
In general it's preferable to avoid placing cpumasks on the stack, as
for large values of NR_CPUS these can consume significant amounts of
stack space and make stack overflows more likely.
Use cpumask_any_and_but() to avoid the need for a temporary cpumask on
the stack.
Suggested-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Dawei Li <dawei.li@shingroup.cn>
Link: https://lore.kernel.org/r/20240403155950.2068109-5-dawei.li@shingroup.cn
Signed-off-by: Will Deacon <will@kernel.org>