mirror of
https://github.com/torvalds/linux.git
synced 2024-12-12 06:02:38 +00:00
8b4a503d65
Convert all text files with s390 documentation to ReST format. Tried to preserve as much as possible the original document format. Still, some of the files required some work in order for it to be visible on both plain text and after converted to html. The conversion is actually: - add blank lines and identation in order to identify paragraphs; - fix tables markups; - add some lists markups; - mark literal blocks; - adjust title markups. At its new index.rst, let's add a :orphan: while this is not linked to the main index.rst file, in order to avoid build warnings. Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
327 lines
13 KiB
ReStructuredText
327 lines
13 KiB
ReStructuredText
==================================
|
|
vfio-ccw: the basic infrastructure
|
|
==================================
|
|
|
|
Introduction
|
|
------------
|
|
|
|
Here we describe the vfio support for I/O subchannel devices for
|
|
Linux/s390. Motivation for vfio-ccw is to passthrough subchannels to a
|
|
virtual machine, while vfio is the means.
|
|
|
|
Different than other hardware architectures, s390 has defined a unified
|
|
I/O access method, which is so called Channel I/O. It has its own access
|
|
patterns:
|
|
|
|
- Channel programs run asynchronously on a separate (co)processor.
|
|
- The channel subsystem will access any memory designated by the caller
|
|
in the channel program directly, i.e. there is no iommu involved.
|
|
|
|
Thus when we introduce vfio support for these devices, we realize it
|
|
with a mediated device (mdev) implementation. The vfio mdev will be
|
|
added to an iommu group, so as to make itself able to be managed by the
|
|
vfio framework. And we add read/write callbacks for special vfio I/O
|
|
regions to pass the channel programs from the mdev to its parent device
|
|
(the real I/O subchannel device) to do further address translation and
|
|
to perform I/O instructions.
|
|
|
|
This document does not intend to explain the s390 I/O architecture in
|
|
every detail. More information/reference could be found here:
|
|
|
|
- A good start to know Channel I/O in general:
|
|
https://en.wikipedia.org/wiki/Channel_I/O
|
|
- s390 architecture:
|
|
s390 Principles of Operation manual (IBM Form. No. SA22-7832)
|
|
- The existing QEMU code which implements a simple emulated channel
|
|
subsystem could also be a good reference. It makes it easier to follow
|
|
the flow.
|
|
qemu/hw/s390x/css.c
|
|
|
|
For vfio mediated device framework:
|
|
- Documentation/vfio-mediated-device.txt
|
|
|
|
Motivation of vfio-ccw
|
|
----------------------
|
|
|
|
Typically, a guest virtualized via QEMU/KVM on s390 only sees
|
|
paravirtualized virtio devices via the "Virtio Over Channel I/O
|
|
(virtio-ccw)" transport. This makes virtio devices discoverable via
|
|
standard operating system algorithms for handling channel devices.
|
|
|
|
However this is not enough. On s390 for the majority of devices, which
|
|
use the standard Channel I/O based mechanism, we also need to provide
|
|
the functionality of passing through them to a QEMU virtual machine.
|
|
This includes devices that don't have a virtio counterpart (e.g. tape
|
|
drives) or that have specific characteristics which guests want to
|
|
exploit.
|
|
|
|
For passing a device to a guest, we want to use the same interface as
|
|
everybody else, namely vfio. We implement this vfio support for channel
|
|
devices via the vfio mediated device framework and the subchannel device
|
|
driver "vfio_ccw".
|
|
|
|
Access patterns of CCW devices
|
|
------------------------------
|
|
|
|
s390 architecture has implemented a so called channel subsystem, that
|
|
provides a unified view of the devices physically attached to the
|
|
systems. Though the s390 hardware platform knows about a huge variety of
|
|
different peripheral attachments like disk devices (aka. DASDs), tapes,
|
|
communication controllers, etc. They can all be accessed by a well
|
|
defined access method and they are presenting I/O completion a unified
|
|
way: I/O interruptions.
|
|
|
|
All I/O requires the use of channel command words (CCWs). A CCW is an
|
|
instruction to a specialized I/O channel processor. A channel program is
|
|
a sequence of CCWs which are executed by the I/O channel subsystem. To
|
|
issue a channel program to the channel subsystem, it is required to
|
|
build an operation request block (ORB), which can be used to point out
|
|
the format of the CCW and other control information to the system. The
|
|
operating system signals the I/O channel subsystem to begin executing
|
|
the channel program with a SSCH (start sub-channel) instruction. The
|
|
central processor is then free to proceed with non-I/O instructions
|
|
until interrupted. The I/O completion result is received by the
|
|
interrupt handler in the form of interrupt response block (IRB).
|
|
|
|
Back to vfio-ccw, in short:
|
|
|
|
- ORBs and channel programs are built in guest kernel (with guest
|
|
physical addresses).
|
|
- ORBs and channel programs are passed to the host kernel.
|
|
- Host kernel translates the guest physical addresses to real addresses
|
|
and starts the I/O with issuing a privileged Channel I/O instruction
|
|
(e.g SSCH).
|
|
- channel programs run asynchronously on a separate processor.
|
|
- I/O completion will be signaled to the host with I/O interruptions.
|
|
And it will be copied as IRB to user space to pass it back to the
|
|
guest.
|
|
|
|
Physical vfio ccw device and its child mdev
|
|
-------------------------------------------
|
|
|
|
As mentioned above, we realize vfio-ccw with a mdev implementation.
|
|
|
|
Channel I/O does not have IOMMU hardware support, so the physical
|
|
vfio-ccw device does not have an IOMMU level translation or isolation.
|
|
|
|
Subchannel I/O instructions are all privileged instructions. When
|
|
handling the I/O instruction interception, vfio-ccw has the software
|
|
policing and translation how the channel program is programmed before
|
|
it gets sent to hardware.
|
|
|
|
Within this implementation, we have two drivers for two types of
|
|
devices:
|
|
|
|
- The vfio_ccw driver for the physical subchannel device.
|
|
This is an I/O subchannel driver for the real subchannel device. It
|
|
realizes a group of callbacks and registers to the mdev framework as a
|
|
parent (physical) device. As a consequence, mdev provides vfio_ccw a
|
|
generic interface (sysfs) to create mdev devices. A vfio mdev could be
|
|
created by vfio_ccw then and added to the mediated bus. It is the vfio
|
|
device that added to an IOMMU group and a vfio group.
|
|
vfio_ccw also provides an I/O region to accept channel program
|
|
request from user space and store I/O interrupt result for user
|
|
space to retrieve. To notify user space an I/O completion, it offers
|
|
an interface to setup an eventfd fd for asynchronous signaling.
|
|
|
|
- The vfio_mdev driver for the mediated vfio ccw device.
|
|
This is provided by the mdev framework. It is a vfio device driver for
|
|
the mdev that created by vfio_ccw.
|
|
It realizes a group of vfio device driver callbacks, adds itself to a
|
|
vfio group, and registers itself to the mdev framework as a mdev
|
|
driver.
|
|
It uses a vfio iommu backend that uses the existing map and unmap
|
|
ioctls, but rather than programming them into an IOMMU for a device,
|
|
it simply stores the translations for use by later requests. This
|
|
means that a device programmed in a VM with guest physical addresses
|
|
can have the vfio kernel convert that address to process virtual
|
|
address, pin the page and program the hardware with the host physical
|
|
address in one step.
|
|
For a mdev, the vfio iommu backend will not pin the pages during the
|
|
VFIO_IOMMU_MAP_DMA ioctl. Mdev framework will only maintain a database
|
|
of the iova<->vaddr mappings in this operation. And they export a
|
|
vfio_pin_pages and a vfio_unpin_pages interfaces from the vfio iommu
|
|
backend for the physical devices to pin and unpin pages by demand.
|
|
|
|
Below is a high Level block diagram::
|
|
|
|
+-------------+
|
|
| |
|
|
| +---------+ | mdev_register_driver() +--------------+
|
|
| | Mdev | +<-----------------------+ |
|
|
| | bus | | | vfio_mdev.ko |
|
|
| | driver | +----------------------->+ |<-> VFIO user
|
|
| +---------+ | probe()/remove() +--------------+ APIs
|
|
| |
|
|
| MDEV CORE |
|
|
| MODULE |
|
|
| mdev.ko |
|
|
| +---------+ | mdev_register_device() +--------------+
|
|
| |Physical | +<-----------------------+ |
|
|
| | device | | | vfio_ccw.ko |<-> subchannel
|
|
| |interface| +----------------------->+ | device
|
|
| +---------+ | callback +--------------+
|
|
+-------------+
|
|
|
|
The process of how these work together.
|
|
|
|
1. vfio_ccw.ko drives the physical I/O subchannel, and registers the
|
|
physical device (with callbacks) to mdev framework.
|
|
When vfio_ccw probing the subchannel device, it registers device
|
|
pointer and callbacks to the mdev framework. Mdev related file nodes
|
|
under the device node in sysfs would be created for the subchannel
|
|
device, namely 'mdev_create', 'mdev_destroy' and
|
|
'mdev_supported_types'.
|
|
2. Create a mediated vfio ccw device.
|
|
Use the 'mdev_create' sysfs file, we need to manually create one (and
|
|
only one for our case) mediated device.
|
|
3. vfio_mdev.ko drives the mediated ccw device.
|
|
vfio_mdev is also the vfio device drvier. It will probe the mdev and
|
|
add it to an iommu_group and a vfio_group. Then we could pass through
|
|
the mdev to a guest.
|
|
|
|
vfio-ccw I/O region
|
|
-------------------
|
|
|
|
An I/O region is used to accept channel program request from user
|
|
space and store I/O interrupt result for user space to retrieve. The
|
|
definition of the region is::
|
|
|
|
struct ccw_io_region {
|
|
#define ORB_AREA_SIZE 12
|
|
__u8 orb_area[ORB_AREA_SIZE];
|
|
#define SCSW_AREA_SIZE 12
|
|
__u8 scsw_area[SCSW_AREA_SIZE];
|
|
#define IRB_AREA_SIZE 96
|
|
__u8 irb_area[IRB_AREA_SIZE];
|
|
__u32 ret_code;
|
|
} __packed;
|
|
|
|
While starting an I/O request, orb_area should be filled with the
|
|
guest ORB, and scsw_area should be filled with the SCSW of the Virtual
|
|
Subchannel.
|
|
|
|
irb_area stores the I/O result.
|
|
|
|
ret_code stores a return code for each access of the region.
|
|
|
|
vfio-ccw operation details
|
|
--------------------------
|
|
|
|
vfio-ccw follows what vfio-pci did on the s390 platform and uses
|
|
vfio-iommu-type1 as the vfio iommu backend.
|
|
|
|
* CCW translation APIs
|
|
A group of APIs (start with `cp_`) to do CCW translation. The CCWs
|
|
passed in by a user space program are organized with their guest
|
|
physical memory addresses. These APIs will copy the CCWs into kernel
|
|
space, and assemble a runnable kernel channel program by updating the
|
|
guest physical addresses with their corresponding host physical addresses.
|
|
Note that we have to use IDALs even for direct-access CCWs, as the
|
|
referenced memory can be located anywhere, including above 2G.
|
|
|
|
* vfio_ccw device driver
|
|
This driver utilizes the CCW translation APIs and introduces
|
|
vfio_ccw, which is the driver for the I/O subchannel devices you want
|
|
to pass through.
|
|
vfio_ccw implements the following vfio ioctls::
|
|
|
|
VFIO_DEVICE_GET_INFO
|
|
VFIO_DEVICE_GET_IRQ_INFO
|
|
VFIO_DEVICE_GET_REGION_INFO
|
|
VFIO_DEVICE_RESET
|
|
VFIO_DEVICE_SET_IRQS
|
|
|
|
This provides an I/O region, so that the user space program can pass a
|
|
channel program to the kernel, to do further CCW translation before
|
|
issuing them to a real device.
|
|
This also provides the SET_IRQ ioctl to setup an event notifier to
|
|
notify the user space program the I/O completion in an asynchronous
|
|
way.
|
|
|
|
The use of vfio-ccw is not limited to QEMU, while QEMU is definitely a
|
|
good example to get understand how these patches work. Here is a little
|
|
bit more detail how an I/O request triggered by the QEMU guest will be
|
|
handled (without error handling).
|
|
|
|
Explanation:
|
|
|
|
- Q1-Q7: QEMU side process.
|
|
- K1-K5: Kernel side process.
|
|
|
|
Q1.
|
|
Get I/O region info during initialization.
|
|
|
|
Q2.
|
|
Setup event notifier and handler to handle I/O completion.
|
|
|
|
... ...
|
|
|
|
Q3.
|
|
Intercept a ssch instruction.
|
|
Q4.
|
|
Write the guest channel program and ORB to the I/O region.
|
|
|
|
K1.
|
|
Copy from guest to kernel.
|
|
K2.
|
|
Translate the guest channel program to a host kernel space
|
|
channel program, which becomes runnable for a real device.
|
|
K3.
|
|
With the necessary information contained in the orb passed in
|
|
by QEMU, issue the ccwchain to the device.
|
|
K4.
|
|
Return the ssch CC code.
|
|
Q5.
|
|
Return the CC code to the guest.
|
|
|
|
... ...
|
|
|
|
K5.
|
|
Interrupt handler gets the I/O result and write the result to
|
|
the I/O region.
|
|
K6.
|
|
Signal QEMU to retrieve the result.
|
|
|
|
Q6.
|
|
Get the signal and event handler reads out the result from the I/O
|
|
region.
|
|
Q7.
|
|
Update the irb for the guest.
|
|
|
|
Limitations
|
|
-----------
|
|
|
|
The current vfio-ccw implementation focuses on supporting basic commands
|
|
needed to implement block device functionality (read/write) of DASD/ECKD
|
|
device only. Some commands may need special handling in the future, for
|
|
example, anything related to path grouping.
|
|
|
|
DASD is a kind of storage device. While ECKD is a data recording format.
|
|
More information for DASD and ECKD could be found here:
|
|
https://en.wikipedia.org/wiki/Direct-access_storage_device
|
|
https://en.wikipedia.org/wiki/Count_key_data
|
|
|
|
Together with the corresponding work in QEMU, we can bring the passed
|
|
through DASD/ECKD device online in a guest now and use it as a block
|
|
device.
|
|
|
|
While the current code allows the guest to start channel programs via
|
|
START SUBCHANNEL, support for HALT SUBCHANNEL or CLEAR SUBCHANNEL is
|
|
not yet implemented.
|
|
|
|
vfio-ccw supports classic (command mode) channel I/O only. Transport
|
|
mode (HPF) is not supported.
|
|
|
|
QDIO subchannels are currently not supported. Classic devices other than
|
|
DASD/ECKD might work, but have not been tested.
|
|
|
|
Reference
|
|
---------
|
|
1. ESA/s390 Principles of Operation manual (IBM Form. No. SA22-7832)
|
|
2. ESA/390 Common I/O Device Commands manual (IBM Form. No. SA22-7204)
|
|
3. https://en.wikipedia.org/wiki/Channel_I/O
|
|
4. Documentation/s390/cds.rst
|
|
5. Documentation/vfio.txt
|
|
6. Documentation/vfio-mediated-device.txt
|