We have a cross dependency between KVM and VFIO when using
s390 vfio_pci_zdev extensions for PCI passthrough
To be able to keep both subsystem modular we add a registering
hook inside the S390 core code.
This fixes a build problem when VFIO is built-in and KVM is built
as a module.
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Pierre Morel <pmorel@linux.ibm.com>
Fixes: 09340b2fca ("KVM: s390: pci: add routines to start/stop interpretive execution")
Cc: <stable@vger.kernel.org>
Acked-by: Janosch Frank <frankja@linux.ibm.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Link: https://lore.kernel.org/r/20220819122945.9309-1-pmorel@linux.ibm.com
Message-Id: <20220819122945.9309-1-pmorel@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
- Cleanup use of extern in function prototypes (Alex Williamson)
- Simplify bus_type usage and convert to device IOMMU interfaces
(Robin Murphy)
- Check missed return value and fix comment typos (Bo Liu)
- Split migration ops from device ops and fix races in mlx5 migration
support (Yishai Hadas)
- Fix missed return value check in noiommu support (Liam Ni)
- Hardening to clear buffer pointer to avoid use-after-free (Schspa Shi)
- Remove requirement that only the same mm can unmap a previously
mapped range (Li Zhe)
- Adjust semaphore release vs device open counter (Yi Liu)
- Remove unused arg from SPAPR support code (Deming Wang)
- Rework vfio-ccw driver to better fit new mdev framework (Eric Farman,
Michael Kawano)
- Replace DMA unmap notifier with callbacks (Jason Gunthorpe)
- Clarify SPAPR support comment relative to iommu_ops (Alexey Kardashevskiy)
- Revise page pinning API towards compatibility with future iommufd support
(Nicolin Chen)
- Resolve issues in vfio-ccw, including use of DMA unmap callback
(Eric Farman)
-----BEGIN PGP SIGNATURE-----
iQJPBAABCAA5FiEEQvbATlQL0amee4qQI5ubbjuwiyIFAmLqvYMbHGFsZXgud2ls
bGlhbXNvbkByZWRoYXQuY29tAAoJECObm247sIsiHM0P/1n/bszel20PRC7x+NLI
P7b/0aonW4Qtei2HORwowmaznb4NgRE5GCm5RU+a9+AwQKnK44j3lqy0skcfgZXr
f4viFlxOyd0H4blOhUZ+FuPNkUMAyz6HerzvJ9jQFG426pL5vr7UKWBuJPYB5RCT
4jEy3EUTSH8/Zt8ApLysFTyR64xN3Sk7vSUcj9rEhu5T3FWq8t9+jb3tE/HW/Xaw
pMwdC+ctYzYaBD/oA7Ns2IebNS9AUIUjKMXC25oCmc83WGgGOqgLB2mAthQ2NKB5
5capKBYuYl7PWERvpGpsPILEWvR6m+Rxh8r4Pqjcoyfq4k7vp+A/AFKiD7AEYBdy
BtfLWO59w6vuRQ5XXOa6Hu4ef6BcMvH4StrHxlHkKcgI4PJA0QscIXiJPQSt7Crr
m+kCNgPPgrfZDu7lmZTiWbXOYSkJR3Mxkhf2iNHudW9SsJT9pUAVEiGVVA/kC1Y/
fNBziRQeVF6JUW8M4pveXEWEbA8iE1HQeJA6aVRonxAkJk1KBaQgm/GKJlPXCHIR
R6lI90NXZHz/3ndIX1znKOm0qli+8auX/FH8iWUffZxGmtINOGGMYebD6YxFdCCJ
sWalL8vlQNCams2MZdovu/5BowXWtwOMm6KNG9RXSyWIWZEcNVbAzhTr+rrDdHZd
AJiUNCGO9UlO9FZM+ntfQTSr
=4BE8
-----END PGP SIGNATURE-----
Merge tag 'vfio-v6.0-rc1' of https://github.com/awilliam/linux-vfio
Pull VFIO updates from Alex Williamson:
- Cleanup use of extern in function prototypes (Alex Williamson)
- Simplify bus_type usage and convert to device IOMMU interfaces (Robin
Murphy)
- Check missed return value and fix comment typos (Bo Liu)
- Split migration ops from device ops and fix races in mlx5 migration
support (Yishai Hadas)
- Fix missed return value check in noiommu support (Liam Ni)
- Hardening to clear buffer pointer to avoid use-after-free (Schspa
Shi)
- Remove requirement that only the same mm can unmap a previously
mapped range (Li Zhe)
- Adjust semaphore release vs device open counter (Yi Liu)
- Remove unused arg from SPAPR support code (Deming Wang)
- Rework vfio-ccw driver to better fit new mdev framework (Eric Farman,
Michael Kawano)
- Replace DMA unmap notifier with callbacks (Jason Gunthorpe)
- Clarify SPAPR support comment relative to iommu_ops (Alexey
Kardashevskiy)
- Revise page pinning API towards compatibility with future iommufd
support (Nicolin Chen)
- Resolve issues in vfio-ccw, including use of DMA unmap callback (Eric
Farman)
* tag 'vfio-v6.0-rc1' of https://github.com/awilliam/linux-vfio: (40 commits)
vfio/pci: fix the wrong word
vfio/ccw: Check return code from subchannel quiesce
vfio/ccw: Remove FSM Close from remove handlers
vfio/ccw: Add length to DMA_UNMAP checks
vfio: Replace phys_pfn with pages for vfio_pin_pages()
vfio/ccw: Add kmap_local_page() for memcpy
vfio: Rename user_iova of vfio_dma_rw()
vfio/ccw: Change pa_pfn list to pa_iova list
vfio/ap: Change saved_pfn to saved_iova
vfio: Pass in starting IOVA to vfio_pin/unpin_pages API
vfio/ccw: Only pass in contiguous pages
vfio/ap: Pass in physical address of ind to ap_aqic()
drm/i915/gvt: Replace roundup with DIV_ROUND_UP
vfio: Make vfio_unpin_pages() return void
vfio/spapr_tce: Fix the comment
vfio: Replace the iommu notifier with a device list
vfio: Replace the DMA unmapping notifier with a callback
vfio/ccw: Move FSM open/close to MDEV open/close
vfio/ccw: Refactor vfio_ccw_mdev_reset
vfio/ccw: Create a CLOSE FSM event
...
* Unwinder implementations for both nVHE modes (classic and
protected), complete with an overflow stack
* Rework of the sysreg access from userspace, with a complete
rewrite of the vgic-v3 view to allign with the rest of the
infrastructure
* Disagregation of the vcpu flags in separate sets to better track
their use model.
* A fix for the GICv2-on-v3 selftest
* A small set of cosmetic fixes
RISC-V:
* Track ISA extensions used by Guest using bitmap
* Added system instruction emulation framework
* Added CSR emulation framework
* Added gfp_custom flag in struct kvm_mmu_memory_cache
* Added G-stage ioremap() and iounmap() functions
* Added support for Svpbmt inside Guest
s390:
* add an interface to provide a hypervisor dump for secure guests
* improve selftests to use TAP interface
* enable interpretive execution of zPCI instructions (for PCI passthrough)
* First part of deferred teardown
* CPU Topology
* PV attestation
* Minor fixes
x86:
* Permit guests to ignore single-bit ECC errors
* Intel IPI virtualization
* Allow getting/setting pending triple fault with KVM_GET/SET_VCPU_EVENTS
* PEBS virtualization
* Simplify PMU emulation by just using PERF_TYPE_RAW events
* More accurate event reinjection on SVM (avoid retrying instructions)
* Allow getting/setting the state of the speaker port data bit
* Refuse starting the kvm-intel module if VM-Entry/VM-Exit controls are inconsistent
* "Notify" VM exit (detect microarchitectural hangs) for Intel
* Use try_cmpxchg64 instead of cmpxchg64
* Ignore benign host accesses to PMU MSRs when PMU is disabled
* Allow disabling KVM's "MONITOR/MWAIT are NOPs!" behavior
* Allow NX huge page mitigation to be disabled on a per-vm basis
* Port eager page splitting to shadow MMU as well
* Enable CMCI capability by default and handle injected UCNA errors
* Expose pid of vcpu threads in debugfs
* x2AVIC support for AMD
* cleanup PIO emulation
* Fixes for LLDT/LTR emulation
* Don't require refcounted "struct page" to create huge SPTEs
* Miscellaneous cleanups:
** MCE MSR emulation
** Use separate namespaces for guest PTEs and shadow PTEs bitmasks
** PIO emulation
** Reorganize rmap API, mostly around rmap destruction
** Do not workaround very old KVM bugs for L0 that runs with nesting enabled
** new selftests API for CPUID
Generic:
* Fix races in gfn->pfn cache refresh; do not pin pages tracked by the cache
* new selftests API using struct kvm_vcpu instead of a (vm, id) tuple
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmLnyo4UHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroMtQQf/XjVWiRcWLPR9dqzRM/vvRXpiG+UL
jU93R7m6ma99aqTtrxV/AE+kHgamBlma3Cwo+AcWk9uCVNbIhFjv2YKg6HptKU0e
oJT3zRYp+XIjEo7Kfw+TwroZbTlG6gN83l1oBLFMqiFmHsMLnXSI2mm8MXyi3dNB
vR2uIcTAl58KIprqNNsYJ2dNn74ogOMiXYx9XzoA9/5Xb6c0h4rreHJa5t+0s9RO
Gz7Io3PxumgsbJngjyL1Ve5oxhlIAcZA8DU0PQmjxo3eS+k6BcmavGFd45gNL5zg
iLpCh4k86spmzh8CWkAAwWPQE4dZknK6jTctJc0OFVad3Z7+X7n0E8TFrA==
=PM8o
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"Quite a large pull request due to a selftest API overhaul and some
patches that had come in too late for 5.19.
ARM:
- Unwinder implementations for both nVHE modes (classic and
protected), complete with an overflow stack
- Rework of the sysreg access from userspace, with a complete rewrite
of the vgic-v3 view to allign with the rest of the infrastructure
- Disagregation of the vcpu flags in separate sets to better track
their use model.
- A fix for the GICv2-on-v3 selftest
- A small set of cosmetic fixes
RISC-V:
- Track ISA extensions used by Guest using bitmap
- Added system instruction emulation framework
- Added CSR emulation framework
- Added gfp_custom flag in struct kvm_mmu_memory_cache
- Added G-stage ioremap() and iounmap() functions
- Added support for Svpbmt inside Guest
s390:
- add an interface to provide a hypervisor dump for secure guests
- improve selftests to use TAP interface
- enable interpretive execution of zPCI instructions (for PCI
passthrough)
- First part of deferred teardown
- CPU Topology
- PV attestation
- Minor fixes
x86:
- Permit guests to ignore single-bit ECC errors
- Intel IPI virtualization
- Allow getting/setting pending triple fault with
KVM_GET/SET_VCPU_EVENTS
- PEBS virtualization
- Simplify PMU emulation by just using PERF_TYPE_RAW events
- More accurate event reinjection on SVM (avoid retrying
instructions)
- Allow getting/setting the state of the speaker port data bit
- Refuse starting the kvm-intel module if VM-Entry/VM-Exit controls
are inconsistent
- "Notify" VM exit (detect microarchitectural hangs) for Intel
- Use try_cmpxchg64 instead of cmpxchg64
- Ignore benign host accesses to PMU MSRs when PMU is disabled
- Allow disabling KVM's "MONITOR/MWAIT are NOPs!" behavior
- Allow NX huge page mitigation to be disabled on a per-vm basis
- Port eager page splitting to shadow MMU as well
- Enable CMCI capability by default and handle injected UCNA errors
- Expose pid of vcpu threads in debugfs
- x2AVIC support for AMD
- cleanup PIO emulation
- Fixes for LLDT/LTR emulation
- Don't require refcounted "struct page" to create huge SPTEs
- Miscellaneous cleanups:
- MCE MSR emulation
- Use separate namespaces for guest PTEs and shadow PTEs bitmasks
- PIO emulation
- Reorganize rmap API, mostly around rmap destruction
- Do not workaround very old KVM bugs for L0 that runs with nesting enabled
- new selftests API for CPUID
Generic:
- Fix races in gfn->pfn cache refresh; do not pin pages tracked by
the cache
- new selftests API using struct kvm_vcpu instead of a (vm, id)
tuple"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (606 commits)
selftests: kvm: set rax before vmcall
selftests: KVM: Add exponent check for boolean stats
selftests: KVM: Provide descriptive assertions in kvm_binary_stats_test
selftests: KVM: Check stat name before other fields
KVM: x86/mmu: remove unused variable
RISC-V: KVM: Add support for Svpbmt inside Guest/VM
RISC-V: KVM: Use PAGE_KERNEL_IO in kvm_riscv_gstage_ioremap()
RISC-V: KVM: Add G-stage ioremap() and iounmap() functions
KVM: Add gfp_custom flag in struct kvm_mmu_memory_cache
RISC-V: KVM: Add extensible CSR emulation framework
RISC-V: KVM: Add extensible system instruction emulation framework
RISC-V: KVM: Factor-out instruction emulation into separate sources
RISC-V: KVM: move preempt_disable() call in kvm_arch_vcpu_ioctl_run
RISC-V: KVM: Make kvm_riscv_guest_timer_init a void function
RISC-V: KVM: Fix variable spelling mistake
RISC-V: KVM: Improve ISA extension by using a bitmap
KVM, x86/mmu: Fix the comment around kvm_tdp_mmu_zap_leafs()
KVM: SVM: Dump Virtual Machine Save Area (VMSA) to klog
KVM: x86/mmu: Treat NX as a valid SPTE bit for NPT
KVM: x86: Do not block APIC write for non ICR registers
...
When doing load/store interpretation, the maximum store block length is
determined by the underlying firmware, not the host kernel API. Reflect
that in the associated Query PCI Function Group clp capability and let
userspace decide which is appropriate to present to the guest.
Reviewed-by: Pierre Morel <pmorel@linux.ibm.com>
Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Link: https://lore.kernel.org/r/20220606203325.110625-20-mjrosato@linux.ibm.com
Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
The function handle is a system-wide unique identifier for a zPCI
device. With zPCI instruction interpretation, the host will no
longer be executing the zPCI instructions on behalf of the guest.
As a result, the guest needs to use the real function handle in
order for firmware to associate the instruction with the proper
PCI function. Let's provide that handle to the guest.
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Reviewed-by: Pierre Morel <pmorel@linux.ibm.com>
Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Link: https://lore.kernel.org/r/20220606203325.110625-19-mjrosato@linux.ibm.com
Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
During vfio-pci open_device, pass the KVM associated with the vfio group
(if one exists). This is needed in order to pass a special indicator
(GISA) to firmware to allow zPCI interpretation facilities to be used
for only the specific KVM associated with the vfio-pci device. During
vfio-pci close_device, unregister the notifier.
Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Pierre Morel <pmorel@linux.ibm.com>
Link: https://lore.kernel.org/r/20220606203325.110625-18-mjrosato@linux.ibm.com
Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
The current contents of vfio-pci-zdev are today only useful in a KVM
environment; let's tie everything currently under vfio-pci-zdev to
this Kconfig statement and require KVM in this case, reducing complexity
(e.g. symbol lookups).
Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Pierre Morel <pmorel@linux.ibm.com>
Link: https://lore.kernel.org/r/20220606203325.110625-11-mjrosato@linux.ibm.com
Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
vfio core checks whether the driver sets some migration op (e.g.
set_state/get_state) and accordingly calls its op.
However, currently mlx5 driver sets the above ops without regards to its
migration caps.
This might lead to unexpected usage/Oops if user space may call to the
above ops even if the driver doesn't support migration. As for example,
the migration state_mutex is not initialized in that case.
The cleanest way to manage that seems to split the migration ops from
the main device ops, this will let the driver setting them separately
from the main ops when it's applicable.
As part of that, validate ops construction on registration and include a
check for VFIO_MIGRATION_STOP_COPY since the uAPI claims it must be set
in migration_flags.
HISI driver was changed as well to match this scheme.
This scheme may enable down the road to come with some extra group of
ops (e.g. DMA log) that can be set without regards to the other options
based on driver caps.
Fixes: 6fadb02126 ("vfio/mlx5: Implement vfio_pci driver for mlx5 devices")
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20220628155910.171454-3-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Protect mlx5vf_disable_fds() upon close device to be called under the
state mutex as done in all other places.
This will prevent a race with any other flow which calls
mlx5vf_disable_fds() as of health/recovery upon
MLX5_PF_NOTIFY_DISABLE_VF event.
Encapsulate this functionality in a separate function named
mlx5vf_cmd_close_migratable() to consider migration caps and for further
usage upon close device.
Fixes: 6fadb02126 ("vfio/mlx5: Implement vfio_pci driver for mlx5 devices")
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20220628155910.171454-2-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Console drivers can create conflicts with PCI resources resulting in
userspace getting mmap failures to memory BARs. This is especially
evident when trying to re-use the system primary console for userspace
drivers. Use the aperture helpers to remove these conflicts.
v3:
* call aperture_remove_conflicting_pci_devices()
Reported-by: Laszlo Ersek <lersek@redhat.com>
Suggested-by: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Reviewed-by: Javier Martinez Canillas <javierm@redhat.com>
Tested-by: Laszlo Ersek <lersek@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220622140134.12763-4-tzimmermann@suse.de
When the iommu series adding driver_managed_dma was rebased it missed that
new VFIO drivers were added and did not update them too.
Without this vfio will claim the groups are not viable.
Add driver_managed_dma to mlx5 and hisi.
Fixes: 70693f4708 ("vfio: Set DMA ownership for VFIO devices")
Reported-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/0-v1-f9dfa642fab0+2b3-vfio_managed_dma_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Currently, there is very limited power management support
available in the upstream vfio_pci_core based drivers. If there
are no users of the device, then the PCI device will be moved into
D3hot state by writing directly into PCI PM registers. This D3hot
state help in saving power but we can achieve zero power consumption
if we go into the D3cold state. The D3cold state cannot be possible
with native PCI PM. It requires interaction with platform firmware
which is system-specific. To go into low power states (including D3cold),
the runtime PM framework can be used which internally interacts with PCI
and platform firmware and puts the device into the lowest possible
D-States.
This patch registers vfio_pci_core based drivers with the
runtime PM framework.
1. The PCI core framework takes care of most of the runtime PM
related things. For enabling the runtime PM, the PCI driver needs to
decrement the usage count and needs to provide 'struct dev_pm_ops'
at least. The runtime suspend/resume callbacks are optional and needed
only if we need to do any extra handling. Now there are multiple
vfio_pci_core based drivers. Instead of assigning the
'struct dev_pm_ops' in individual parent driver, the vfio_pci_core
itself assigns the 'struct dev_pm_ops'. There are other drivers where
the 'struct dev_pm_ops' is being assigned inside core layer
(For example, wlcore_probe() and some sound based driver, etc.).
2. This patch provides the stub implementation of 'struct dev_pm_ops'.
The subsequent patch will provide the runtime suspend/resume
callbacks. All the config state saving, and PCI power management
related things will be done by PCI core framework itself inside its
runtime suspend/resume callbacks (pci_pm_runtime_suspend() and
pci_pm_runtime_resume()).
3. Inside pci_reset_bus(), all the devices in dev_set needs to be
runtime resumed. vfio_pci_dev_set_pm_runtime_get() will take
care of the runtime resume and its error handling.
4. Inside vfio_pci_core_disable(), the device usage count always needs
to be decremented which was incremented in vfio_pci_core_enable().
5. Since the runtime PM framework will provide the same functionality,
so directly writing into PCI PM config register can be replaced with
the use of runtime PM routines. Also, the use of runtime PM can help
us in more power saving.
In the systems which do not support D3cold,
With the existing implementation:
// PCI device
# cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
D3hot
// upstream bridge
# cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
D0
With runtime PM:
// PCI device
# cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
D3hot
// upstream bridge
# cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
D3hot
So, with runtime PM, the upstream bridge or root port will also go
into lower power state which is not possible with existing
implementation.
In the systems which support D3cold,
// PCI device
# cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
D3hot
// upstream bridge
# cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
D0
With runtime PM:
// PCI device
# cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
D3cold
// upstream bridge
# cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
D3cold
So, with runtime PM, both the PCI device and upstream bridge will
go into D3cold state.
6. If 'disable_idle_d3' module parameter is set, then also the runtime
PM will be enabled, but in this case, the usage count should not be
decremented.
7. vfio_pci_dev_set_try_reset() return value is unused now, so this
function return type can be changed to void.
8. Use the runtime PM API's in vfio_pci_core_sriov_configure().
The device can be in low power state either with runtime
power management (when there is no user) or PCI_PM_CTRL register
write by the user. In both the cases, the PF should be moved to
D0 state. For preventing any runtime usage mismatch, pci_num_vf()
has been called explicitly during disable.
Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
Link: https://lore.kernel.org/r/20220518111612.16985-5-abhsahu@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
If any PME event will be generated by PCI, then it will be mostly
handled in the host by the root port PME code. For example, in the case
of PCIe, the PME event will be sent to the root port and then the PME
interrupt will be generated. This will be handled in
drivers/pci/pcie/pme.c at the host side. Inside this, the
pci_check_pme_status() will be called where PME_Status and PME_En bits
will be cleared. So, the guest OS which is using vfio-pci device will
not come to know about this PME event.
To handle these PME events inside guests, we need some framework so
that if any PME events will happen, then it needs to be forwarded to
virtual machine monitor. We can virtualize PME related registers bits
and initialize these bits to zero so vfio-pci device user will assume
that it is not capable of asserting the PME# signal from any power state.
Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
Link: https://lore.kernel.org/r/20220518111612.16985-4-abhsahu@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
According to [PCIe v5 9.6.2] for PF Device Power Management States
"The PF's power management state (D-state) has global impact on its
associated VFs. If a VF does not implement the Power Management
Capability, then it behaves as if it is in an equivalent
power state of its associated PF.
If a VF implements the Power Management Capability, the Device behavior
is undefined if the PF is placed in a lower power state than the VF.
Software should avoid this situation by placing all VFs in lower power
state before lowering their associated PF's power state."
From the vfio driver side, user can enable SR-IOV when the PF is in D3hot
state. If VF does not implement the Power Management Capability, then
the VF will be actually in D3hot state and then the VF BAR access will
fail. If VF implements the Power Management Capability, then VF will
assume that its current power state is D0 when the PF is D3hot and
in this case, the behavior is undefined.
To support PF power management, we need to create power management
dependency between PF and its VF's. The runtime power management support
may help with this where power management dependencies are supported
through device links. But till we have such support in place, we can
disallow the PF to go into low power state, if PF has VF enabled.
There can be a case, where user first enables the VF's and then
disables the VF's. If there is no user of PF, then the PF can put into
D3hot state again. But with this patch, the PF will still be in D0
state after disabling VF's since detecting this case inside
vfio_pci_core_sriov_configure() requires access to
struct vfio_device::open_count along with its locks. But the subsequent
patches related to runtime PM will handle this case since runtime PM
maintains its own usage count.
Also, vfio_pci_core_sriov_configure() can be called at any time
(with and without vfio pci device user), so the power state change
and SR-IOV enablement need to be protected with the required locks.
Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
Link: https://lore.kernel.org/r/20220518111612.16985-3-abhsahu@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
According to [PCIe v5 5.3.1.4.1] for D3hot state
"Configuration and Message requests are the only TLPs accepted by a
Function in the D3Hot state. All other received Requests must be
handled as Unsupported Requests, and all received Completions may
optionally be handled as Unexpected Completions."
Currently, if the vfio PCI device has been put into D3hot state and if
user makes non-config related read/write request in D3hot state, these
requests will be forwarded to the host and this access may cause
issues on a few systems.
This patch leverages the memory-disable support added in commit
'abafbc551fdd ("vfio-pci: Invalidate mmaps and block MMIO access on
disabled memory")' to generate page fault on mmap access and
return error for the direct read/write. If the device is D3hot state,
then the error will be returned for MMIO access. The IO access generally
does not make the system unresponsive so the IO access can still happen
in D3hot state. The default value should be returned in this case
without bringing down the complete system.
Also, the power related structure fields need to be protected so
we can use the same 'memory_lock' to protect these fields also.
This protection is mainly needed when user changes the PCI
power state by writing into PCI_PM_CTRL register.
vfio_lock_and_set_power_state() wrapper function will take the
required locks and then it will invoke the vfio_pci_set_power_state().
Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
Link: https://lore.kernel.org/r/20220518111612.16985-2-abhsahu@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
VFIO PCI does a security check as part of hot reset to prove that the user
has permission to manipulate all the devices that will be impacted by the
reset.
Use a new API vfio_file_has_dev() to perform this security check against
the struct file directly and remove the vfio_group from VFIO PCI.
Since VFIO PCI was the last user of vfio_group_get_external_user() and
vfio_group_put_external_user() remove it as well.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/8-v3-f7729924a7ea+25e33-vfio_kvm_no_group_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The last user of this function is in PCI callbacks that want to convert
their struct pci_dev to a vfio_device. Instead of searching use the
vfio_device available trivially through the drvdata.
When a callback in the device_driver is called, the caller must hold the
device_lock() on dev. The purpose of the device_lock is to prevent
remove() from being called (see __device_release_driver), and allow the
driver to safely interact with its drvdata without races.
The PCI core correctly follows this and holds the device_lock() when
calling error_detected (see report_error_detected) and
sriov_configure (see sriov_numvfs_store).
Further, since the drvdata holds a positive refcount on the vfio_device
any access of the drvdata, under the device_lock(), from a driver callback
needs no further protection or refcounting.
Thus the remark in the vfio_device_get_from_dev() comment does not apply
here, VFIO PCI drivers all call vfio_unregister_group_dev() from their
remove callbacks under the device_lock() and cannot race with the
remaining callers.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/2-v4-c841817a0349+8f-vfio_get_from_dev_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Having a consistent pointer in the drvdata will allow the next patch to
make use of the drvdata from some of the core code helpers.
Use a WARN_ON inside vfio_pci_core_register_device() to detect drivers
that miss this.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/1-v4-c841817a0349+8f-vfio_get_from_dev_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
From Yishai:
This series improves mlx5 live migration driver in few aspects as of
below.
Refactor to enable running migration commands in parallel over the PF
command interface.
To achieve that we exposed from mlx5_core an API to let the VF be
notified before that the PF command interface goes down/up. (e.g. PF
reload upon health recovery).
Once having the above functionality in place mlx5 vfio doesn't need any
more to obtain the global PF lock upon using the command interface but
can rely on the above mechanism to be in sync with the PF.
This can enable parallel VFs migration over the PF command interface
from kernel driver point of view.
In addition,
Moved to use the PF async command mode for the SAVE state command.
This enables returning earlier to user space upon issuing successfully
the command and improve latency by let things run in parallel.
Alex, as this series touches mlx5_core we may need to send this in a
pull request format to VFIO to avoid conflicts before acceptance.
Link: https://lore.kernel.org/all/20220510090206.90374-1-yishaih@nvidia.com
Signed-of-by: Leon Romanovsky <leonro@nvidia.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQT1m3YD37UfMCUQBNwp8NhrnBAZsQUCYntY3AAKCRAp8NhrnBAZ
scRWAP0QzEqg/Xqk/geUAQ3dliFrA2DZJm8v9B3x5tA5nEAazAD9HqC17MvDzY8T
6KBP7G37JNg2NCkxnKnt2gCIT+O4lgA=
=zwWT
-----END PGP SIGNATURE-----
Merge tag 'mlx5-lm-parallel' of https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux into v5.19/vfio/next
Improve mlx5 live migration driver
From Yishai:
This series improves mlx5 live migration driver in few aspects as of
below.
Refactor to enable running migration commands in parallel over the PF
command interface.
To achieve that we exposed from mlx5_core an API to let the VF be
notified before that the PF command interface goes down/up. (e.g. PF
reload upon health recovery).
Once having the above functionality in place mlx5 vfio doesn't need any
more to obtain the global PF lock upon using the command interface but
can rely on the above mechanism to be in sync with the PF.
This can enable parallel VFs migration over the PF command interface
from kernel driver point of view.
In addition,
Moved to use the PF async command mode for the SAVE state command.
This enables returning earlier to user space upon issuing successfully
the command and improve latency by let things run in parallel.
Alex, as this series touches mlx5_core we may need to send this in a
pull request format to VFIO to avoid conflicts before acceptance.
Link: https://lore.kernel.org/all/20220510090206.90374-1-yishaih@nvidia.com
Signed-of-by: Leon Romanovsky <leonro@nvidia.com>
Use the PF asynchronous command mode for the SAVE state command.
This enables returning earlier to user space upon issuing successfully
the command and improve latency by let things run in parallel.
Link: https://lore.kernel.org/r/20220510090206.90374-5-yishaih@nvidia.com
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Refactor to enable different VFs to run their commands over the PF
command interface in parallel and to not block one each other.
This is done by not using the global PF lock that was used before but
relying on the VF attach/detach mechanism to sync.
Link: https://lore.kernel.org/r/20220510090206.90374-4-yishaih@nvidia.com
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Manage the VF attach/detach callback from the PF.
This lets the driver to enable parallel VFs migration as will be
introduced in the next patch.
As part of this, reorganize the VF is migratable code to be in a
separate function and rename it to be set_migratable() to match its
functionality.
Link: https://lore.kernel.org/r/20220510090206.90374-3-yishaih@nvidia.com
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Claim group dma ownership when an IOMMU group is set to a container,
and release the dma ownership once the iommu group is unset from the
container.
This change disallows some unsafe bridge drivers to bind to non-ACS
bridges while devices under them are assigned to user space. This is an
intentional enhancement and possibly breaks some existing
configurations. The recommendation to such an affected user would be
that the previously allowed host bridge driver was unsafe for this use
case and to continue to enable assignment of devices within that group,
the driver should be unbound from the bridge device or replaced with the
pci-stub driver.
For any bridge driver, we consider it unsafe if it satisfies any of the
following conditions:
1) The bridge driver uses DMA. Calling pci_set_master() or calling any
kernel DMA API (dma_map_*() and etc.) is an indicate that the
driver is doing DMA.
2) If the bridge driver uses MMIO, it should be tolerant to hostile
userspace also touching the same MMIO registers via P2P DMA
attacks.
If the bridge driver turns out to be a safe one, it could be used as
before by setting the driver's .driver_managed_dma field, just like what
we have done in the pcieport driver.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Link: https://lore.kernel.org/r/20220418005000.897664-8-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
get_pf_vdev() tries to check if a PF is a VFIO PF by looking at the driver:
if (pci_dev_driver(physfn) != pci_dev_driver(vdev->pdev)) {
However now that we have multiple VF and PF drivers this is no longer
reliable.
This means that security tests realted to vf_token can be skipped by
mixing and matching different VFIO PCI drivers.
Instead of trying to use the driver core to find the PF devices maintain a
linked list of all PF vfio_pci_core_device's that we have called
pci_enable_sriov() on.
When registering a VF just search the list to see if the PF is present and
record the match permanently in the struct. PCI core locking prevents a PF
from passing pci_disable_sriov() while VF drivers are attached so the VFIO
owned PF becomes a static property of the VF.
In common cases where vfio does not own the PF the global list remains
empty and the VF's pointer is statically NULL.
This also fixes a lockdep splat from recursive locking of the
vfio_group::device_lock between vfio_device_get_from_name() and
vfio_device_get_from_dev(). If the VF and PF share the same group this
would deadlock.
Fixes: ff53edf6d6 ("vfio/pci: Split the pci_driver code out of vfio_pci_core.c")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/0-v3-876570980634+f2e8-vfio_vf_token_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Register private handler for pci_error_handlers.reset_done and update
state accordingly.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Longfang Liu <liulongfang@huawei.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Link: https://lore.kernel.org/r/20220308184902.2242-10-shameerali.kolothum.thodi@huawei.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
VMs assigned with HiSilicon ACC VF devices can now perform live migration
if the VF devices are bind to the hisi_acc_vfio_pci driver.
Just like ACC PF/VF drivers this VFIO driver also make use of the HiSilicon
QM interface. QM stands for Queue Management which is a generic IP used by
ACC devices. It provides a generic PCIe interface for the CPU and the ACC
devices to share a group of queues.
QM integrated into an accelerator provides queue management service.
Queues can be assigned to PF and VFs, and queues can be controlled by
unified mailboxes and doorbells.
The QM driver (drivers/crypto/hisilicon/qm.c) provides generic
interfaces to ACC drivers to manage the QM.
Signed-off-by: Longfang Liu <liulongfang@huawei.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Link: https://lore.kernel.org/r/20220308184902.2242-9-shameerali.kolothum.thodi@huawei.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
HiSilicon ACC VF device BAR2 region consists of both functional register
space and migration control register space. Unnecessarily exposing the
migration BAR region to the Guest has the potential to prevent/corrupt
the Guest migration.
Hence, introduce a separate struct vfio_device_ops for migration support
which will override the ioctl/read/write/mmap methods to hide the
migration region and limit the Guest access only to the functional
register space.
This will be used in subsequent patches when we add migration support
to the driver.
Please note that it is OK to export the entire VF BAR if migration is
not supported or required as this cannot affect the PF configurations.
Reviewed-by: Longfang Liu <liulongfang@huawei.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20220308184902.2242-6-shameerali.kolothum.thodi@huawei.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Add a vendor-specific vfio_pci driver for HiSilicon ACC devices.
This will be extended in subsequent patches to add support for VFIO
live migration feature.
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20220308184902.2242-5-shameerali.kolothum.thodi@huawei.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Register its own handler for pci_error_handlers.reset_done and update
state accordingly.
Link: https://lore.kernel.org/all/20220224142024.147653-16-yishaih@nvidia.com
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Expose vfio_pci_core_aer_err_detected() to be used by drivers as part of
their pci_error_handlers structure.
Next patch for mlx5 driver will use it.
Link: https://lore.kernel.org/all/20220224142024.147653-15-yishaih@nvidia.com
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
This patch adds support for vfio_pci driver for mlx5 devices.
It uses vfio_pci_core to register to the VFIO subsystem and then
implements the mlx5 specific logic in the migration area.
The migration implementation follows the definition from uapi/vfio.h and
uses the mlx5 VF->PF command channel to achieve it.
This patch implements the suspend/resume flows.
Link: https://lore.kernel.org/all/20220224142024.147653-14-yishaih@nvidia.com
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Expose migration commands over the device, it includes: suspend, resume,
get vhca id, query/save/load state.
As part of this adds the APIs and data structure that are needed to manage
the migration data.
Link: https://lore.kernel.org/all/20220224142024.147653-13-yishaih@nvidia.com
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Invoke a new device op 'device_feature' to handle just the data array
portion of the command. This lifts the ioctl validation to the core code
and makes it simpler for either the core code, or layered drivers, to
implement their own feature values.
Provide vfio_check_feature() to consolidate checking the flags/etc against
what the driver supports.
Link: https://lore.kernel.org/all/20220224142024.147653-9-yishaih@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
If 'vfio_pci_core_device::needs_pm_restore' is set (PCI device does
not have No_Soft_Reset bit set in its PMCSR config register), then the
current PCI state will be saved locally in
'vfio_pci_core_device::pm_save' during D0->D3hot transition and same
will be restored back during D3hot->D0 transition. For reset-related
functionalities, vfio driver uses PCI reset API's. These
API's internally change the PCI power state back to D0 first if
the device power state is non-D0. This state change to D0 will happen
without the involvement of vfio driver.
Let's consider the following example:
1. The device is in D3hot.
2. User invokes VFIO_DEVICE_RESET ioctl.
3. pci_try_reset_function() will be called which internally
invokes pci_dev_save_and_disable().
4. pci_set_power_state(dev, PCI_D0) will be called first.
5. pci_save_state() will happen then.
Now, for the devices which has NoSoftRst-, the pci_set_power_state()
can trigger soft reset and the original PCI config state will be lost
at step (4) and this state cannot be restored again. This original PCI
state can include any setting which is performed by SBIOS or host
linux kernel (for example LTR, ASPM L1 substates, etc.). When this
soft reset will be triggered, then all these settings will be reset,
and the device state saved at step (5) will also have this setting
cleared so it cannot be restored. Since the vfio driver only exposes
limited PCI capabilities to its user, so the vfio driver user also
won't have the option to save and restore these capabilities state
either and these original settings will be permanently lost.
For pci_reset_bus() also, we can have the above situation.
The other functions/devices can be in D3hot and the reset will change
the power state of all devices to D0 without the involvement of vfio
driver.
So, before calling any reset-related API's, we need to make sure that
the device state is D0. This is mainly to preserve the state around
soft reset.
For vfio_pci_core_disable(), we use __pci_reset_function_locked()
which internally can use pci_pm_reset() for the function reset.
pci_pm_reset() requires the device power state to be in D0, otherwise
it returns error.
This patch changes the device power state to D0 by invoking
vfio_pci_set_power_state() explicitly before calling any reset related
API's.
Fixes: 51ef3a004b ("vfio/pci: Restore device state on PM transition")
Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
Link: https://lore.kernel.org/r/20220217122107.22434-3-abhsahu@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
If 'vfio_pci_core_device::needs_pm_restore' is set (PCI device does
not have No_Soft_Reset bit set in its PMCSR config register), then
the current PCI state will be saved locally in
'vfio_pci_core_device::pm_save' during D0->D3hot transition and same
will be restored back during D3hot->D0 transition.
For saving the PCI state locally, pci_store_saved_state() is being
used and the pci_load_and_free_saved_state() will free the allocated
memory.
But for reset related IOCTLs, vfio driver calls PCI reset-related
API's which will internally change the PCI power state back to D0. So,
when the guest resumes, then it will get the current state as D0 and it
will skip the call to vfio_pci_set_power_state() for changing the
power state to D0 explicitly. In this case, the memory pointed by
'pm_save' will never be freed. In a malicious sequence, the state changing
to D3hot followed by VFIO_DEVICE_RESET/VFIO_DEVICE_PCI_HOT_RESET can be
run in a loop and it can cause an OOM situation.
This patch frees the earlier allocated memory first before overwriting
'pm_save' to prevent the mentioned memory leak.
Fixes: 51ef3a004b ("vfio/pci: Restore device state on PM transition")
Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
Link: https://lore.kernel.org/r/20220217122107.22434-2-abhsahu@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Sparse warns:
sparse warnings: (new ones prefixed by >>)
>> drivers/vfio/pci/vfio_pci_igd.c:146:21: sparse: sparse: incorrect type in assignment (different base types) @@ expected unsigned short [addressable] [usertype] val @@ got restricted __le16 [usertype] @@
drivers/vfio/pci/vfio_pci_igd.c:146:21: sparse: expected unsigned short [addressable] [usertype] val
drivers/vfio/pci/vfio_pci_igd.c:146:21: sparse: got restricted __le16 [usertype]
>> drivers/vfio/pci/vfio_pci_igd.c:161:21: sparse: sparse: incorrect type in assignment (different base types) @@ expected unsigned int [addressable] [usertype] val @@ got restricted __le32 [usertype] @@
drivers/vfio/pci/vfio_pci_igd.c:161:21: sparse: expected unsigned int [addressable] [usertype] val
drivers/vfio/pci/vfio_pci_igd.c:161:21: sparse: got restricted __le32 [usertype]
drivers/vfio/pci/vfio_pci_igd.c:176:21: sparse: sparse: incorrect type in assignment (different base types) @@ expected unsigned short [addressable] [usertype] val @@ got restricted __le16 [usertype] @@
drivers/vfio/pci/vfio_pci_igd.c:176:21: sparse: expected unsigned short [addressable] [usertype] val
drivers/vfio/pci/vfio_pci_igd.c:176:21: sparse: got restricted __le16 [usertype]
These are due to trying to use an unsigned to store the result of
a cpu_to_leXX() conversion. These are small variables, so pointer
tricks are wasteful and casting just generates different sparse
warnings. Store to and copy results from a separate little endian
variable.
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/r/202111290026.O3vehj03-lkp@intel.com/
Link: https://lore.kernel.org/r/163840226123.138003.7668320168896210328.stgit@omen
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This is to fix incorrect pointer arithmetic which caused wrong
OpRegion version returned, then VM driver got error to get wanted
VBT block. We need to be safe to return correct data, so force
pointer type for byte access.
Fixes: 49ba1a2976 ("vfio/pci: Add OpRegion 2.0+ Extended VBT support.")
Cc: Colin Xu <colin.xu@gmail.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Dmitry Torokhov <dtor@chromium.org>
Cc: "Xu, Terrence" <terrence.xu@intel.com>
Cc: "Gao, Fred" <fred.gao@intel.com>
Acked-by: Colin Xu <colin.xu@gmail.com>
Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
Link: https://lore.kernel.org/r/20211125051328.3359902-1-zhenyuw@linux.intel.com
[aw: line wrap]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Due to historical reason, some legacy shipped system doesn't follow
OpRegion 2.1 spec but still stick to OpRegion 2.0, in which the extended
VBT is not contiguous after OpRegion in physical address, but any
location pointed by RVDA via absolute address. Also although current
OpRegion 2.1+ systems appears that the extended VBT follows OpRegion,
RVDA is the relative address to OpRegion head, the extended VBT location
may change to non-contiguous to OpRegion. In both cases, it's impossible
to map a contiguous range to hold both OpRegion and the extended VBT and
expose via one vfio region.
The only difference between OpRegion 2.0 and 2.1 is where extended
VBT is stored: For 2.0, RVDA is the absolute address of extended VBT
while for 2.1, RVDA is the relative address of extended VBT to OpRegion
baes, and there is no other difference between OpRegion 2.0 and 2.1.
To support the non-contiguous region case as described, the updated read
op will patch OpRegion version and RVDA on-the-fly accordingly. So that
from vfio igd OpRegion view, only 2.1+ with contiguous extended VBT
after OpRegion is exposed, regardless the underneath host OpRegion is
2.0 or 2.1+. The mechanism makes it possible to support legacy OpRegion
2.0 extended VBT systems with on the market, and support OpRegion 2.1+
where the extended VBT isn't contiguous after OpRegion.
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Hang Yuan <hang.yuan@linux.intel.com>
Cc: Swee Yee Fonn <swee.yee.fonn@intel.com>
Cc: Fred Gao <fred.gao@intel.com>
Signed-off-by: Colin Xu <colin.xu@intel.com>
Link: https://lore.kernel.org/r/20211012124855.52463-1-colin.xu@gmail.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
We don't need to hold a reference to the group in the driver as well as
obtain a reference to the same group as the first thing
vfio_register_group_dev() does.
Since the drivers never use the group move this all into the core code.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20210924155705.4258-2-hch@lst.de
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The function prototype is missing an identifier name. Add one.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20210902212631.54260-1-colin.king@canonical.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
- Fix dma-valid return WAITED implementation (Anthony Yznaga)
- SPDX license cleanups (Cai Huoqing)
- Split vfio-pci-core from vfio-pci and enhance PCI driver matching
to support future vendor provided vfio-pci variants (Yishai Hadas,
Max Gurtovoy, Jason Gunthorpe)
- Replace duplicated reflck with core support for managing first
open, last close, and device sets (Jason Gunthorpe, Max Gurtovoy,
Yishai Hadas)
- Fix non-modular mdev support and don't nag about request callback
support (Christoph Hellwig)
- Add semaphore to protect instruction intercept handler and replace
open-coded locks in vfio-ap driver (Tony Krowiak)
- Convert vfio-ap to vfio_register_group_dev() API (Jason Gunthorpe)
-----BEGIN PGP SIGNATURE-----
iQJPBAABCAA5FiEEQvbATlQL0amee4qQI5ubbjuwiyIFAmEvwWkbHGFsZXgud2ls
bGlhbXNvbkByZWRoYXQuY29tAAoJECObm247sIsi+1UP/3CRizghroINVYR+cJ99
Tjz7lB/wlzxmRfX+SL4NAVe1SSB2VeCgU4B0PF6kywELLS8OhCO3HXYXVsz244fW
Gk5UIns86+TFTrfCOMpwYBV0P86zuaa1ZnvCnkhMK1i2pTZ+oX8hUH1Yj5clHuU+
YgC7JfEuTIAX73q2bC/llLvNE9ke1QCoDX3+HAH87ttqutnRWcnnq56PTEqwe+EW
eMA+glB1UG6JAqXxoJET4155arNOny1/ZMprfBr3YXZTiXDF/lSzuMyUtbp526Sf
hsvlnqtE6TCdfKbog0Lxckl+8E9NCq8jzFBKiZhbccrQv3vVaoP6dOsPWcT35Kp1
IjzMLiHIbl4wXOL+Xap/biz3LCM5BMdT/OhW5LUC007zggK71ndRvb9F8ptW83Bv
0Uh9DNv7YIQ0su3JHZEsJ3qPFXQXceP199UiADOGSeV8U1Qig3YKsHUDMuALfFvN
t+NleeJ4qCWao+W4VCfyDfKurVnMj/cThXiDEWEeq5gMOO+6YKBIFWJVKFxUYDbf
MgGdg0nQTUECuXKXxLD4c1HAWH9xi207OnLvhW1Icywp20MsYqOWt0vhg+PRdMBT
DK6STxP18aQxCaOuQN9Vf81LjhXNTeg+xt3mMyViOZPcKfX6/wAC9qLt4MucJDdw
FBfOz2UL2F56dhAYT+1vHoUM
=nzK7
-----END PGP SIGNATURE-----
Merge tag 'vfio-v5.15-rc1' of git://github.com/awilliam/linux-vfio
Pull VFIO updates from Alex Williamson:
- Fix dma-valid return WAITED implementation (Anthony Yznaga)
- SPDX license cleanups (Cai Huoqing)
- Split vfio-pci-core from vfio-pci and enhance PCI driver matching to
support future vendor provided vfio-pci variants (Yishai Hadas, Max
Gurtovoy, Jason Gunthorpe)
- Replace duplicated reflck with core support for managing first open,
last close, and device sets (Jason Gunthorpe, Max Gurtovoy, Yishai
Hadas)
- Fix non-modular mdev support and don't nag about request callback
support (Christoph Hellwig)
- Add semaphore to protect instruction intercept handler and replace
open-coded locks in vfio-ap driver (Tony Krowiak)
- Convert vfio-ap to vfio_register_group_dev() API (Jason Gunthorpe)
* tag 'vfio-v5.15-rc1' of git://github.com/awilliam/linux-vfio: (37 commits)
vfio/pci: Introduce vfio_pci_core.ko
vfio: Use kconfig if XX/endif blocks instead of repeating 'depends on'
vfio: Use select for eventfd
PCI / VFIO: Add 'override_only' support for VFIO PCI sub system
PCI: Add 'override_only' field to struct pci_device_id
vfio/pci: Move module parameters to vfio_pci.c
vfio/pci: Move igd initialization to vfio_pci.c
vfio/pci: Split the pci_driver code out of vfio_pci_core.c
vfio/pci: Include vfio header in vfio_pci_core.h
vfio/pci: Rename ops functions to fit core namings
vfio/pci: Rename vfio_pci_device to vfio_pci_core_device
vfio/pci: Rename vfio_pci_private.h to vfio_pci_core.h
vfio/pci: Rename vfio_pci.c to vfio_pci_core.c
vfio/ap_ops: Convert to use vfio_register_group_dev()
s390/vfio-ap: replace open coded locks for VFIO_GROUP_NOTIFY_SET_KVM notification
s390/vfio-ap: r/w lock for PQAP interception handler function pointer
vfio/type1: Fix vfio_find_dma_valid return
vfio-pci/zdev: Remove repeated verbose license text
vfio: platform: reset: Convert to SPDX identifier
vfio: Remove struct vfio_device_ops open/release
...