Documentation: hyperv: Add overview of PCI pass-thru device support
Add documentation topic for PCI pass-thru devices in Linux guests on
Hyper-V and for the associated PCI controller driver (pci-hyperv.c).

Signed-off-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Link: https://lore.kernel.org/r/20240222200710.305259-1-mhklinux@outlook.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <20240222200710.305259-1-mhklinux@outlook.com>
parent 9645e74414
commit 04ed680e76
@@ -10,3 +10,4 @@ Hyper-V Enlightenments
   overview
   vmbus
   clocks
   vpci
Documentation/virt/hyperv/vpci.rst (normal file, 316 lines added)
@@ -0,0 +1,316 @@
.. SPDX-License-Identifier: GPL-2.0

PCI pass-thru devices
=========================
In a Hyper-V guest VM, PCI pass-thru devices (also called
virtual PCI devices, or vPCI devices) are physical PCI devices
that are mapped directly into the VM's physical address space.
Guest device drivers can interact directly with the hardware
without intermediation by the host hypervisor.  This approach
provides higher bandwidth access to the device with lower
latency, compared with devices that are virtualized by the
hypervisor.  The device should appear to the guest just as it
would when running on bare metal, so no changes are required
to the Linux device drivers for the device.

Hyper-V terminology for vPCI devices is "Discrete Device
Assignment" (DDA).  Public documentation for Hyper-V DDA is
available here: `DDA`_

.. _DDA: https://learn.microsoft.com/en-us/windows-server/virtualization/hyper-v/plan/plan-for-deploying-devices-using-discrete-device-assignment

DDA is typically used for storage controllers, such as NVMe,
and for GPUs.  A similar mechanism for NICs is called SR-IOV
and produces the same benefits by allowing a guest device
driver to interact directly with the hardware.  See Hyper-V
public documentation here: `SR-IOV`_

.. _SR-IOV: https://learn.microsoft.com/en-us/windows-hardware/drivers/network/overview-of-single-root-i-o-virtualization--sr-iov-

This discussion of vPCI devices includes DDA and SR-IOV
devices.

Device Presentation
-------------------
Hyper-V provides full PCI functionality for a vPCI device when
it is operating, so the Linux device driver for the device can
be used unchanged, provided it uses the correct Linux kernel
APIs for accessing PCI config space and for other integration
with Linux.  But the initial detection of the PCI device and
its integration with the Linux PCI subsystem must use Hyper-V
specific mechanisms.  Consequently, vPCI devices on Hyper-V
have a dual identity.  They are initially presented to Linux
guests as VMBus devices via the standard VMBus "offer"
mechanism, so they have a VMBus identity and appear under
/sys/bus/vmbus/devices.  The VMBus vPCI driver in Linux at
drivers/pci/controller/pci-hyperv.c handles a newly introduced
vPCI device by fabricating a PCI bus topology and creating all
the normal PCI device data structures in Linux that would
exist if the PCI device were discovered via ACPI on a
bare-metal system.  Once those data structures are set up, the
device also has a normal PCI identity in Linux, and the normal
Linux device driver for the vPCI device can function as if it
were running in Linux on bare metal.  Because vPCI devices are
presented dynamically through the VMBus offer mechanism, they
do not appear in the Linux guest's ACPI tables.  vPCI devices
may be added to a VM or removed from a VM at any time during
the life of the VM, and not just during initial boot.

With this approach, the vPCI device is a VMBus device and a
PCI device at the same time.  In response to the VMBus offer
message, the hv_pci_probe() function runs and establishes a
VMBus connection to the vPCI VSP on the Hyper-V host.  That
connection has a single VMBus channel.  The channel is used to
exchange messages with the vPCI VSP for the purpose of setting
up and configuring the vPCI device in Linux.  Once the device
is fully configured in Linux as a PCI device, the VMBus
channel is used only if Linux changes the vCPU to be interrupted
in the guest, or if the vPCI device is removed from
the VM while the VM is running.  The ongoing operation of the
device happens directly between the Linux device driver for
the device and the hardware, with VMBus and the VMBus channel
playing no role.

PCI Device Setup
----------------
PCI device setup follows a sequence that Hyper-V originally
created for Windows guests, and that can be ill-suited for
Linux guests due to differences in the overall structure of
the Linux PCI subsystem compared with Windows.  Nonetheless,
with a bit of hackery in the Hyper-V virtual PCI driver for
Linux, the virtual PCI device is set up in Linux so that
generic Linux PCI subsystem code and the Linux driver for the
device "just work".

Each vPCI device is set up in Linux to be in its own PCI
domain with a host bridge.  The PCI domainID is derived from
bytes 4 and 5 of the instance GUID assigned to the VMBus vPCI
device.  The Hyper-V host does not guarantee that these bytes
are unique, so hv_pci_probe() has an algorithm to resolve
collisions.  The collision resolution is intended to be stable
across reboots of the same VM so that the PCI domainIDs don't
change, as the domainID appears in the user space
configuration of some devices.

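The domainID derivation above can be sketched in user-space C.
This is a minimal illustration, not the driver's code: the names
domain_from_guid(), claim_domain() and domain_in_use are
hypothetical, and both the byte order (byte 5 as the high byte)
and the fallback policy (lowest free nonzero ID) are assumptions
made for the example.  See hv_pci_probe() for the real algorithm.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping: one flag per possible 16-bit domain ID. */
#define DOMAIN_SLOTS 0x10000u
static bool domain_in_use[DOMAIN_SLOTS];

/* Combine GUID bytes 4 and 5 (assumed here: byte 5 is the high byte). */
static uint16_t domain_from_guid(const uint8_t guid[16])
{
    return (uint16_t)((guid[5] << 8) | guid[4]);
}

/*
 * Claim the requested ID if it is free.  On a collision, fall back to
 * the lowest free nonzero ID (an assumed policy for this sketch).
 * Returns -1 if every ID is already taken.
 */
static int claim_domain(uint16_t requested)
{
    if (!domain_in_use[requested]) {
        domain_in_use[requested] = true;
        return requested;
    }
    for (uint32_t id = 1; id < DOMAIN_SLOTS; id++) {
        if (!domain_in_use[id]) {
            domain_in_use[id] = true;
            return (int)id;
        }
    }
    return -1;
}
```

With a deterministic fallback like this one, devices offered in
the same order receive the same domainIDs on every boot, which is
the stability property the text describes.
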
hv_pci_probe() allocates a guest MMIO range to be used as PCI
config space for the device.  This MMIO range is communicated
to the Hyper-V host over the VMBus channel as part of telling
the host that the device is ready to enter D0.  See
hv_pci_enter_d0().  When the guest subsequently accesses this
MMIO range, the Hyper-V host intercepts the accesses and maps
them to the physical device PCI config space.

hv_pci_probe() also gets BAR information for the device from
the Hyper-V host, and uses this information to allocate MMIO
space for the BARs.  That MMIO space is then set up to be
associated with the host bridge so that it works when generic
PCI subsystem code in Linux processes the BARs.

Finally, hv_pci_probe() creates the root PCI bus.  At this
point the Hyper-V virtual PCI driver hackery is done, and the
normal Linux PCI machinery for scanning the root bus works to
detect the device, to perform driver matching, and to
initialize the driver and device.

PCI Device Removal
------------------
A Hyper-V host may initiate removal of a vPCI device from a
guest VM at any time during the life of the VM.  The removal
is instigated by an admin action taken on the Hyper-V host and
is not under the control of the guest OS.

A guest VM is notified of the removal by an unsolicited
"Eject" message sent from the host to the guest over the VMBus
channel associated with the vPCI device.  Upon receipt of such
a message, the Hyper-V virtual PCI driver in Linux
asynchronously invokes Linux kernel PCI subsystem calls to
shut down and remove the device.  When those calls are
complete, an "Ejection Complete" message is sent back to
Hyper-V over the VMBus channel indicating that the device has
been removed.  At this point, Hyper-V sends a VMBus rescind
message to the Linux guest, which the VMBus driver in Linux
processes by removing the VMBus identity for the device.  Once
that processing is complete, all vestiges of the device having
been present are gone from the Linux kernel.  The rescind
message also indicates to the guest that Hyper-V has stopped
providing support for the vPCI device in the guest.  If the
guest were to attempt to access that device's MMIO space, it
would be an invalid reference. Hypercalls affecting the device
return errors, and any further messages sent in the VMBus
channel are ignored.

After sending the Eject message, Hyper-V allows the guest VM
60 seconds to cleanly shut down the device and respond with
Ejection Complete before sending the VMBus rescind
message.  If for any reason the Eject steps don't complete
within the allowed 60 seconds, the Hyper-V host forcibly
performs the rescind steps, which will likely result in
cascading errors in the guest because the device is now no
longer present from the guest standpoint and accessing the
device MMIO space will fail.

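The Eject, Ejection Complete, and rescind ordering described
above can be modeled as a small state machine.  The sketch below
is illustrative only: the type and function names are
hypothetical, the host's decision is reduced to a single
predicate, and none of it is taken from pci-hyperv.c.

```c
#include <stdbool.h>

/* Hypothetical lifecycle states for one vPCI device. */
enum vpci_state {
    VPCI_ACTIVE,
    VPCI_EJECTING,     /* Eject received, PCI removal in progress */
    VPCI_EJECT_DONE,   /* Ejection Complete sent to the host */
    VPCI_RESCINDED     /* VMBus identity removed */
};

#define EJECT_TIMEOUT_SEC 60

struct vpci_dev {
    enum vpci_state state;
    bool clean;        /* true if removal finished before the rescind */
};

/* Host -> guest: unsolicited Eject message. */
static void on_eject(struct vpci_dev *d)
{
    if (d->state == VPCI_ACTIVE)
        d->state = VPCI_EJECTING;
}

/* Guest -> host: PCI subsystem removal done, Ejection Complete sent. */
static void on_eject_complete(struct vpci_dev *d)
{
    if (d->state == VPCI_EJECTING)
        d->state = VPCI_EJECT_DONE;
}

/* Host side: rescind after Ejection Complete, or once the timeout expires. */
static bool host_should_rescind(const struct vpci_dev *d, int secs_since_eject)
{
    return d->state == VPCI_EJECT_DONE ||
           (d->state == VPCI_EJECTING &&
            secs_since_eject >= EJECT_TIMEOUT_SEC);
}

/* Host -> guest: VMBus rescind.  A forced rescind leaves clean == false. */
static void on_rescind(struct vpci_dev *d)
{
    d->clean = (d->state == VPCI_EJECT_DONE);
    d->state = VPCI_RESCINDED;
}
```

In this model a device that is still in VPCI_EJECTING when the
60 seconds elapse ends up rescinded with clean set to false,
which corresponds to the forced rescind case where cascading
errors are likely.
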
Because ejection is asynchronous and can happen at any point
during the guest VM lifecycle, proper synchronization in the
Hyper-V virtual PCI driver is very tricky.  Ejection has been
observed even before a newly offered vPCI device has been
fully set up.  The Hyper-V virtual PCI driver has been updated
several times over the years to fix race conditions when
ejections happen at inopportune times. Care must be taken when
modifying this code to prevent re-introducing such problems.
See comments in the code.

Interrupt Assignment
--------------------
The Hyper-V virtual PCI driver supports vPCI devices using
MSI, multi-MSI, or MSI-X.  Assigning the guest vCPU that will
receive the interrupt for a particular MSI or MSI-X message is
complex because of the way the Linux setup of IRQs maps onto
the Hyper-V interfaces.  For the single-MSI and MSI-X cases,
Linux calls hv_compose_msi_msg() twice, with the first call
containing a dummy vCPU and the second call containing the
real vCPU.  Furthermore, hv_irq_unmask() is finally called
(on x86) or the GICD registers are set (on arm64) to specify
the real vCPU again.  Each of these three calls interacts
with Hyper-V, which must decide which physical CPU should
receive the interrupt before it is forwarded to the guest VM.
Unfortunately, the Hyper-V decision-making process is a bit
limited, and can result in concentrating the physical
interrupts on a single CPU, causing a performance bottleneck.
See details about how this is resolved in the extensive
comment above the function hv_compose_msi_req_get_cpu().

The Hyper-V virtual PCI driver implements the
irq_chip.irq_compose_msi_msg function as hv_compose_msi_msg().
Unfortunately, on Hyper-V the implementation requires sending
a VMBus message to the Hyper-V host and awaiting an interrupt
indicating receipt of a reply message.  Since
irq_chip.irq_compose_msi_msg can be called with IRQ locks
held, it doesn't work to do the normal sleep until awakened by
the interrupt. Instead hv_compose_msi_msg() must send the
VMBus message, and then poll for the completion message. As
further complexity, the vPCI device could be ejected/rescinded
while the polling is in progress, so this scenario must be
detected as well.  See comments in the code regarding this
very tricky area.

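The send-then-poll pattern, including the check for a device
that disappears while polling, can be sketched as follows.  This
is a simplified user-space model: struct fake_channel and
send_and_poll() are hypothetical, the request is assumed to have
been sent already, and the bounded poll count with -ETIMEDOUT is
an addition made so the example always terminates.  It should
not be read as the behavior of hv_compose_msi_msg().

```c
#include <errno.h>

/* Hypothetical stand-in for the VMBus channel state seen while polling. */
struct fake_channel {
    int reply_after;    /* reply becomes visible on this poll iteration */
    int rescind_after;  /* device rescinded on this iteration; -1 = never */
};

/*
 * Poll for the completion of a request that was already sent.
 * Sleeping is not an option because the caller may hold IRQ locks,
 * so the loop spins and re-checks both conditions on each pass.
 */
static int send_and_poll(const struct fake_channel *ch, int max_polls)
{
    for (int poll = 0; poll < max_polls; poll++) {
        if (poll >= ch->reply_after)
            return 0;           /* completion message arrived */
        if (ch->rescind_after >= 0 && poll >= ch->rescind_after)
            return -ENODEV;     /* device ejected/rescinded mid-poll */
        /* a kernel implementation would insert a short delay here */
    }
    return -ETIMEDOUT;          /* bound added for this sketch only */
}
```

Checking for the reply before checking for the rescind means a
reply that arrived just before the device went away is still
consumed; that ordering is a choice made for this sketch.
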
Most of the code in the Hyper-V virtual PCI driver
(pci-hyperv.c) applies to Hyper-V and Linux guests running on x86
and on arm64 architectures.  But there are differences in how
interrupt assignments are managed.  On x86, the Hyper-V
virtual PCI driver in the guest must make a hypercall to tell
Hyper-V which guest vCPU should be interrupted by each
MSI/MSI-X interrupt, and the x86 interrupt vector number that
the x86_vector IRQ domain has picked for the interrupt.  This
hypercall is made by hv_arch_irq_unmask().  On arm64, the
Hyper-V virtual PCI driver manages the allocation of an SPI
for each MSI/MSI-X interrupt.  The Hyper-V virtual PCI driver
stores the allocated SPI in the architectural GICD registers,
which Hyper-V emulates, so no hypercall is necessary as with
x86.  Hyper-V does not support using LPIs for vPCI devices in
arm64 guest VMs because it does not emulate a GICv3 ITS.

The Hyper-V virtual PCI driver in Linux supports vPCI devices
whose drivers create managed or unmanaged Linux IRQs.  If the
smp_affinity for an unmanaged IRQ is updated via the /proc/irq
interface, the Hyper-V virtual PCI driver is called to tell
the Hyper-V host to change the interrupt targeting and
everything works properly.  However, on x86 if the x86_vector
IRQ domain needs to reassign an interrupt vector due to
running out of vectors on a CPU, there's no path to inform the
Hyper-V host of the change, and things break.  Fortunately,
guest VMs operate in a constrained device environment where
using all the vectors on a CPU doesn't happen. Since such a
problem is only a theoretical concern rather than a practical
concern, it has been left unaddressed.

DMA
---
By default, Hyper-V pins all guest VM memory in the host
when the VM is created, and programs the physical IOMMU to
allow the VM to have DMA access to all its memory.  Hence
it is safe to assign PCI devices to the VM, and allow the
guest operating system to program the DMA transfers.  The
physical IOMMU prevents a malicious guest from initiating
DMA to memory belonging to the host or to other VMs on the
host. From the Linux guest standpoint, such DMA transfers
are in "direct" mode since Hyper-V does not provide a virtual
IOMMU in the guest.

Hyper-V assumes that physical PCI devices always perform
cache-coherent DMA.  When running on x86, this behavior is
required by the architecture.  When running on arm64, the
architecture allows for both cache-coherent and
non-cache-coherent devices, with the behavior of each device
specified in the ACPI DSDT.  But when a PCI device is assigned
to a guest VM, that device does not appear in the DSDT, so the
Hyper-V VMBus driver propagates cache-coherency information
from the VMBus node in the ACPI DSDT to all VMBus devices,
including vPCI devices (since they have a dual identity as a VMBus
device and as a PCI device).  See vmbus_dma_configure().
Current Hyper-V versions always indicate that the VMBus is
cache coherent, so vPCI devices on arm64 always get marked as
cache coherent and the CPU does not perform any sync
operations as part of dma_map/unmap_*() calls.

vPCI protocol versions
----------------------
As previously described, during vPCI device setup and teardown,
messages are passed over a VMBus channel between the Hyper-V
host and the Hyper-V vPCI driver in the Linux guest.  Some
messages have been revised in newer versions of Hyper-V, so
the guest and host must agree on the vPCI protocol version to
be used.  The version is negotiated when communication over
the VMBus channel is first established.  See
hv_pci_protocol_negotiation(). Newer versions of the protocol
extend support to VMs with more than 64 vCPUs, and provide
additional information about the vPCI device, such as the
guest virtual NUMA node to which it is most closely affined in
the underlying hardware.

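One common way to negotiate a version is for the guest to offer
its supported versions from newest to oldest and stop at the
first one the host accepts.  The sketch below illustrates that
pattern under stated assumptions: the version constants, the
host_accepts_fn callback, and the two example hosts are
hypothetical and stand in for the VMBus message exchange.  Refer
to hv_pci_protocol_negotiation() for what the driver really does.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative version numbers (major << 16 | minor), newest first. */
static const uint32_t guest_versions[] = {
    0x00010004, 0x00010003, 0x00010002, 0x00010001,
};

/* Hypothetical stand-in for "send a version query, read the reply". */
typedef int (*host_accepts_fn)(uint32_t version);

/* Return the first version the host accepts, or 0 if none is accepted. */
static uint32_t negotiate(const uint32_t *versions, size_t count,
                          host_accepts_fn host_accepts)
{
    for (size_t i = 0; i < count; i++) {
        if (host_accepts(versions[i]))
            return versions[i];
    }
    return 0;
}

/* Example host that only understands versions up to 1.2. */
static int old_host_accepts(uint32_t version)
{
    return version <= 0x00010002;
}

/* Example host that rejects every version the guest offers. */
static int deaf_host_accepts(uint32_t version)
{
    (void)version;
    return 0;
}
```

Offering the newest version first means the guest ends up with
the most capable protocol both sides share, at the cost of one
extra round trip per version the host rejects.
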
Guest NUMA node affinity
------------------------
When the vPCI protocol version provides it, the guest NUMA
node affinity of the vPCI device is stored as part of the Linux
device information for subsequent use by the Linux driver. See
hv_pci_assign_numa_node().  If the negotiated protocol version
does not support the host providing NUMA affinity information,
the Linux guest defaults the device NUMA node to 0.  But even
when the negotiated protocol version includes NUMA affinity
information, the ability of the host to provide such
information depends on certain host configuration options.  If
the guest receives NUMA node value "0", it could mean NUMA
node 0, or it could mean "no information is available".
Unfortunately it is not possible to distinguish the two cases
from the guest side.

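The ambiguity of node value 0 is easy to see in a small model of
the selection logic.  The struct layout and the function
pick_numa_node() below are hypothetical and only mirror the
rules stated in the text; they are not the driver's data
structures.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of the device description sent by the host. */
struct dev_desc {
    bool has_numa_affinity;      /* host says the node field is valid */
    uint16_t virtual_numa_node;  /* guest virtual NUMA node */
};

/*
 * Pick the NUMA node recorded for the device.  Without protocol
 * support, or without host-provided affinity, the default is node 0,
 * which is indistinguishable from a genuine "node 0" answer.
 */
static int pick_numa_node(bool protocol_has_numa, const struct dev_desc *desc)
{
    if (!protocol_has_numa || !desc->has_numa_affinity)
        return 0;
    return desc->virtual_numa_node;
}
```

A caller that receives 0 from this function cannot tell which of
the three paths produced it, which is the limitation the text
describes.
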
PCI config space access in a CoCo VM
------------------------------------
Linux PCI device drivers access PCI config space using a
standard set of functions provided by the Linux PCI subsystem.
In Hyper-V guests these standard functions map to functions
hv_pcifront_read_config() and hv_pcifront_write_config()
in the Hyper-V virtual PCI driver.  In normal VMs,
these hv_pcifront_*() functions directly access the PCI config
space, and the accesses trap to Hyper-V to be handled.
But in CoCo VMs, memory encryption prevents Hyper-V
from reading the guest instruction stream to emulate the
access, so the hv_pcifront_*() functions must invoke
hypercalls with explicit arguments describing the access to be
made.

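The difference between the two access paths can be illustrated
with a user-space model.  Everything here is an assumption made
for the example: fake_cfg_space stands in for the device's config
space, direct_read32() models a trapped load in a normal VM, and
hypercall_mmio_read() models a hypercall that carries the offset
and size as explicit arguments.  None of these names exist in the
kernel.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the device's PCI config space. */
static uint8_t fake_cfg_space[256];

/* Normal VM: a plain load that the hypervisor traps and emulates. */
static uint32_t direct_read32(uint32_t off)
{
    uint32_t val;

    memcpy(&val, &fake_cfg_space[off], sizeof(val));
    return val;
}

/* CoCo VM: hypothetical hypercall with the access described explicitly. */
static int hypercall_mmio_read(uint32_t off, uint32_t size, void *out)
{
    if (size > 4 || off + size > sizeof(fake_cfg_space))
        return -1;
    memcpy(out, &fake_cfg_space[off], size);
    return 0;
}

/* Dispatch on VM type; a failed read returns all ones, as PCI reads do. */
static uint32_t cfg_read32(bool is_coco_vm, uint32_t off)
{
    uint32_t val = 0xffffffffu;

    if (!is_coco_vm)
        return direct_read32(off);
    if (hypercall_mmio_read(off, sizeof(val), &val) != 0)
        return 0xffffffffu;
    return val;
}
```

Both paths return the same data for a valid offset; only the
mechanism differs, which is why drivers above the
hv_pcifront_*() layer need no changes.
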
Config Block back-channel
-------------------------
The Hyper-V host and Hyper-V virtual PCI driver in Linux
together implement a non-standard back-channel communication
path between the host and guest.  The back-channel path uses
messages sent over the VMBus channel associated with the vPCI
device.  The functions hyperv_read_cfg_blk() and
hyperv_write_cfg_blk() are the primary interfaces provided to
other parts of the Linux kernel.  As of this writing, these
interfaces are used only by the Mellanox mlx5 driver to pass
diagnostic data to a Hyper-V host running in the Azure public
cloud.  The functions hyperv_read_cfg_blk() and
hyperv_write_cfg_blk() are implemented in a separate module
(pci-hyperv-intf.c, under CONFIG_PCI_HYPERV_INTERFACE) that
effectively stubs them out when running in non-Hyper-V
environments.
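
One way such stubbing can work is through a table of function
pointers that stays empty until a backend registers itself.  The
sketch below shows that pattern in user-space C.  It is an
assumption about the general technique, not a copy of
pci-hyperv-intf.c: struct block_ops, read_cfg_blk() and
demo_read_block() are hypothetical names with simplified
signatures.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical ops table; entries stay NULL until a backend registers. */
struct block_ops {
    int (*read_block)(void *buf, unsigned int buf_len,
                      unsigned int block_id, unsigned int *bytes_returned);
};

static struct block_ops block_ops;

/* Public entry point: behaves as a stub while no backend is present. */
static int read_cfg_blk(void *buf, unsigned int buf_len,
                        unsigned int block_id, unsigned int *bytes_returned)
{
    if (!block_ops.read_block)
        return -EOPNOTSUPP;
    return block_ops.read_block(buf, buf_len, block_id, bytes_returned);
}

/* Example backend that returns a fixed payload for any block ID. */
static int demo_read_block(void *buf, unsigned int buf_len,
                           unsigned int block_id, unsigned int *bytes_returned)
{
    const char payload[] = "diag";

    (void)block_id;
    if (buf_len < sizeof(payload))
        return -EINVAL;
    memcpy(buf, payload, sizeof(payload));
    *bytes_returned = sizeof(payload);
    return 0;
}
```

With this layout a caller can link against read_cfg_blk()
unconditionally and simply handle -EOPNOTSUPP when no backend has
filled in the table.
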