docs: add documentation for vfio-ccw
Add file Documentation/s390/vfio-ccw.txt that includes details of vfio-ccw.

Acked-by: Pierre Morel <pmorel@linux.vnet.ibm.com>
Signed-off-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Message-Id: <20170317031743.40128-15-bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
parent bbe37e4cb8
commit 25627ba389
@@ -22,5 +22,7 @@ qeth.txt
        - HiperSockets Bridge Port Support.
s390dbf.txt
        - information on using the s390 debug feature.
vfio-ccw.txt
        - information on the vfio-ccw I/O subchannel driver.
zfcpdump.txt
        - information on the s390 SCSI dump tool.

303
Documentation/s390/vfio-ccw.txt
Normal file
@@ -0,0 +1,303 @@
vfio-ccw: the basic infrastructure
==================================

Introduction
------------

Here we describe the vfio support for I/O subchannel devices for
Linux/s390. The motivation for vfio-ccw is to pass subchannels through
to a virtual machine, and vfio is the means to do so.

Unlike other hardware architectures, s390 defines a unified I/O access
method, called Channel I/O. It has its own access patterns:
- Channel programs run asynchronously on a separate (co)processor.
- The channel subsystem will access any memory designated by the caller
  in the channel program directly, i.e. there is no iommu involved.
Thus when we introduce vfio support for these devices, we realize it
with a mediated device (mdev) implementation. The vfio mdev will be
added to an iommu group, so that it can be managed by the vfio
framework. We also add read/write callbacks for special vfio I/O
regions to pass the channel programs from the mdev to its parent device
(the real I/O subchannel device), which does the further address
translation and performs the I/O instructions.

This document does not intend to explain the s390 I/O architecture in
every detail. More information and references can be found here:
- A good starting point for Channel I/O in general:
  https://en.wikipedia.org/wiki/Channel_I/O
- s390 architecture:
  s390 Principles of Operation manual (IBM Form. No. SA22-7832)
- The existing Qemu code, which implements a simple emulated channel
  subsystem, is also a good reference. It makes it easier to follow
  the flow.
  qemu/hw/s390x/css.c

For the vfio mediated device framework:
- Documentation/vfio-mediated-device.txt

Motivation of vfio-ccw
----------------------

Currently, a guest virtualized via qemu/kvm on s390 only sees
paravirtualized virtio devices via the "Virtio Over Channel I/O
(virtio-ccw)" transport. This makes virtio devices discoverable via
standard operating system algorithms for handling channel devices.

However, this is not enough. On s390, the majority of devices use the
standard Channel I/O based mechanism, and we also need to be able to
pass them through to a Qemu virtual machine.
This includes devices that don't have a virtio counterpart (e.g. tape
drives) or that have specific characteristics which guests want to
exploit.

For passing a device to a guest, we want to use the same interface as
everybody else, namely vfio. Thus, we would like to introduce vfio
support for channel devices, and we would like to name this new vfio
device "vfio-ccw".

Access patterns of CCW devices
------------------------------

The s390 architecture implements a so-called channel subsystem, which
provides a unified view of the devices physically attached to the
system. Though the s390 hardware platform knows about a huge variety of
different peripheral attachments like disk devices (aka. DASDs), tapes,
communication controllers, etc., they can all be accessed by a
well-defined access method, and they present I/O completion in a
unified way: I/O interruptions.

All I/O requires the use of channel command words (CCWs). A CCW is an
instruction to a specialized I/O channel processor. A channel program is
a sequence of CCWs which are executed by the I/O channel subsystem. To
issue a channel program to the channel subsystem, it is required to
build an operation request block (ORB), which tells the system the
format of the CCWs and other control information. The
operating system signals the I/O channel subsystem to begin executing
the channel program with a SSCH (start sub-channel) instruction. The
central processor is then free to proceed with non-I/O instructions
until interrupted. The I/O completion result is received by the
interrupt handler in the form of an interrupt response block (IRB).

Back to vfio-ccw, in short:
- ORBs and channel programs are built in the guest kernel (with guest
  physical addresses).
- ORBs and channel programs are passed to the host kernel.
- The host kernel translates the guest physical addresses to real
  addresses and starts the I/O by issuing a privileged Channel I/O
  instruction (e.g. SSCH).
- Channel programs run asynchronously on a separate processor.
- I/O completion is signaled to the host with I/O interruptions. The
  result is copied as an IRB to user space, which passes it back to the
  guest.

Physical vfio ccw device and its child mdev
-------------------------------------------

As mentioned above, we realize vfio-ccw with a mdev implementation.

Channel I/O does not have IOMMU hardware support, so the physical
vfio-ccw device does not have an IOMMU level translation or isolation.

Subchannel I/O instructions are all privileged instructions. When
handling the I/O instruction interception, vfio-ccw does the software
policing and translation of the channel program before it gets sent to
hardware.

Within this implementation, we have two drivers for two types of
devices:
- The vfio_ccw driver for the physical subchannel device.
  This is an I/O subchannel driver for the real subchannel device. It
  realizes a group of callbacks and registers to the mdev framework as a
  parent (physical) device. As a consequence, mdev provides vfio_ccw
  with a generic interface (sysfs) to create mdev devices. A vfio mdev
  can then be created by vfio_ccw and added to the mediated bus. It is
  this vfio device that is added to an IOMMU group and a vfio group.
  vfio_ccw also provides an I/O region to accept channel program
  requests from user space and to store the I/O interrupt result for
  user space to retrieve. To notify user space of an I/O completion, it
  offers an interface to set up an eventfd for asynchronous signaling.

- The vfio_mdev driver for the mediated vfio ccw device.
  This is provided by the mdev framework. It is a vfio device driver for
  the mdev that is created by vfio_ccw.
  It realizes a group of vfio device driver callbacks, adds itself to a
  vfio group, and registers itself to the mdev framework as a mdev
  driver.
  It uses a vfio iommu backend that uses the existing map and unmap
  ioctls, but rather than programming them into an IOMMU for a device,
  it simply stores the translations for use by later requests. This
  means that a device programmed in a VM with guest physical addresses
  can have the vfio kernel convert that address to a process virtual
  address, pin the page and program the hardware with the host physical
  address in one step.
  For a mdev, the vfio iommu backend will not pin the pages during the
  VFIO_IOMMU_MAP_DMA ioctl. The mdev framework only maintains a database
  of the iova<->vaddr mappings in this operation. The vfio_pin_pages
  and vfio_unpin_pages interfaces are exported from the vfio iommu
  backend so that physical devices can pin and unpin pages on demand.

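The "store now, translate on demand" idea described above can be
illustrated with a small user-space sketch. This is not the vfio iommu
backend code: struct dma_mapping, struct mapping_db, db_map() and
db_lookup() are hypothetical names chosen for illustration, and the
fixed-size table stands in for whatever structure the kernel really
uses. Page pinning is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one map request (illustrative names only). */
struct dma_mapping {
	uint64_t iova;   /* guest physical (I/O virtual) start address */
	uint64_t vaddr;  /* process virtual start address */
	uint64_t size;   /* length of the mapping in bytes */
};

#define MAX_MAPPINGS 16

struct mapping_db {
	struct dma_mapping entries[MAX_MAPPINGS];
	size_t count;
};

/* Map step: only remember the translation, nothing is pinned here. */
static int db_map(struct mapping_db *db, uint64_t iova, uint64_t vaddr,
		  uint64_t size)
{
	assert(db != NULL);
	if (db->count == MAX_MAPPINGS || size == 0)
		return -1;
	db->entries[db->count++] = (struct dma_mapping){ iova, vaddr, size };
	return 0;
}

/* On-demand step: resolve an iova; returns 0 on success, -1 if unmapped. */
static int db_lookup(const struct mapping_db *db, uint64_t iova,
		     uint64_t *vaddr)
{
	assert(db != NULL && vaddr != NULL);
	for (size_t i = 0; i < db->count; i++) {
		const struct dma_mapping *m = &db->entries[i];

		if (iova >= m->iova && iova - m->iova < m->size) {
			*vaddr = m->vaddr + (iova - m->iova);
			return 0;
		}
	}
	return -1;
}
```

In this sketch a later request for an address inside a stored range
resolves to the matching process virtual address, while an address
outside every stored range is rejected.
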
Below is a high-level block diagram.

 +-------------+
 |             |
 | +---------+ | mdev_register_driver() +--------------+
 | |  Mdev   | +<-----------------------+              |
 | |  bus    | |                        | vfio_mdev.ko |
 | | driver  | +----------------------->+              |<-> VFIO user
 | +---------+ |    probe()/remove()    +--------------+    APIs
 |             |
 |  MDEV CORE  |
 |   MODULE    |
 |   mdev.ko   |
 | +---------+ | mdev_register_device() +--------------+
 | |Physical | +<-----------------------+              |
 | | device  | |                        |  vfio_ccw.ko |<-> subchannel
 | |interface| +----------------------->+              |     device
 | +---------+ |       callback         +--------------+
 +-------------+

How these work together:
1. vfio_ccw.ko drives the physical I/O subchannel, and registers the
   physical device (with callbacks) to the mdev framework.
   When vfio_ccw probes the subchannel device, it registers the device
   pointer and callbacks to the mdev framework. Mdev related file nodes
   are then created under the device node in sysfs for the subchannel
   device, namely 'mdev_create', 'mdev_destroy' and
   'mdev_supported_types'.
2. Create a mediated vfio ccw device.
   Using the 'mdev_create' sysfs file, we need to manually create one
   (and only one for our case) mediated device.
3. vfio_mdev.ko drives the mediated ccw device.
   vfio_mdev is also the vfio device driver. It will probe the mdev and
   add it to an iommu_group and a vfio_group. Then we can pass the mdev
   through to a guest.

vfio-ccw I/O region
-------------------

An I/O region is used to accept channel program requests from user
space and to store the I/O interrupt result for user space to retrieve.
The definition of the region is:

struct ccw_io_region {
#define ORB_AREA_SIZE 12
	__u8	orb_area[ORB_AREA_SIZE];
#define SCSW_AREA_SIZE 12
	__u8	scsw_area[SCSW_AREA_SIZE];
#define IRB_AREA_SIZE 96
	__u8	irb_area[IRB_AREA_SIZE];
	__u32	ret_code;
} __packed;

When starting an I/O request, orb_area should be filled with the
guest ORB, and scsw_area should be filled with the SCSW of the Virtual
Subchannel.

irb_area stores the I/O result.

ret_code stores a return code for each access of the region.

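As a sanity check on the layout above, the region can be mirrored in a
user-space program. The sketch below assumes only the sizes given in the
definition; uint8_t and uint32_t stand in for __u8 and __u32, and
region_prepare() is a hypothetical helper, not part of any vfio-ccw
interface. Because the structure is packed, the field offsets follow
directly from the area sizes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ORB_AREA_SIZE  12
#define SCSW_AREA_SIZE 12
#define IRB_AREA_SIZE  96

/* User-space mirror of the region described in the text. */
struct ccw_io_region {
	uint8_t  orb_area[ORB_AREA_SIZE];
	uint8_t  scsw_area[SCSW_AREA_SIZE];
	uint8_t  irb_area[IRB_AREA_SIZE];
	uint32_t ret_code;
} __attribute__((packed));

/* The packed layout has no padding, so these hold at compile time. */
_Static_assert(offsetof(struct ccw_io_region, scsw_area) == 12, "scsw");
_Static_assert(offsetof(struct ccw_io_region, irb_area) == 24, "irb");
_Static_assert(offsetof(struct ccw_io_region, ret_code) == 120, "ret");
_Static_assert(sizeof(struct ccw_io_region) == 124, "region size");

/* Hypothetical helper: stage a request image before it is written out. */
static void region_prepare(struct ccw_io_region *r,
			   const uint8_t orb[ORB_AREA_SIZE],
			   const uint8_t scsw[SCSW_AREA_SIZE])
{
	memset(r, 0, sizeof(*r));
	memcpy(r->orb_area, orb, ORB_AREA_SIZE);
	memcpy(r->scsw_area, scsw, SCSW_AREA_SIZE);
}
```

A user space program would presumably stage the ORB and SCSW like this
and then write the image to the region; how the region is located and
written is not shown here.
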
vfio-ccw patches overview
-------------------------

For now, our patches are rebased on the latest mdev implementation.
vfio-ccw follows what vfio-pci did on the s390 platform and uses
vfio-iommu-type1 as the vfio iommu backend. It's a good start to launch
the code review for vfio-ccw. Note that the implementation is far from
complete yet; but we'd like to get feedback for the general
architecture.

* CCW translation APIs
- Description:
  These introduce a group of APIs (starting with 'cp_') to do CCW
  translation. The CCWs passed in by a user space program are
  organized with their guest physical memory addresses. These APIs
  will copy the CCWs into the kernel space, and assemble a runnable
  kernel channel program by updating the guest physical addresses with
  their corresponding host physical addresses.
- Patches:
  vfio: ccw: introduce channel program interfaces

* vfio_ccw device driver
- Description:
  The following patches utilize the CCW translation APIs and introduce
  vfio_ccw, which is the driver for the I/O subchannel devices you want
  to pass through.
  vfio_ccw implements the following vfio ioctls:
    VFIO_DEVICE_GET_INFO
    VFIO_DEVICE_GET_IRQ_INFO
    VFIO_DEVICE_GET_REGION_INFO
    VFIO_DEVICE_RESET
    VFIO_DEVICE_SET_IRQS
  This provides an I/O region, so that the user space program can pass a
  channel program to the kernel, to do further CCW translation before
  issuing them to a real device.
  This also provides the SET_IRQS ioctl to set up an event notifier to
  notify the user space program of the I/O completion in an
  asynchronous way.
- Patches:
  vfio: ccw: basic implementation for vfio_ccw driver
  vfio: ccw: introduce ccw_io_region
  vfio: ccw: realize VFIO_DEVICE_GET_REGION_INFO ioctl
  vfio: ccw: realize VFIO_DEVICE_RESET ioctl
  vfio: ccw: realize VFIO_DEVICE_G(S)ET_IRQ_INFO ioctls

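The copy-and-rewrite step that the CCW translation APIs are described as
performing can be sketched as follows. This is only an illustration and
not the 'cp_' code: chain_translate(), translate_fn and toy_translate()
are hypothetical names, the translator simply adds a fixed offset, and
special cases such as TIC or indirect data addressing are ignored. The
sketch assumes format-1 CCWs, where the chain ends at the first CCW with
neither the command chaining nor the data chaining flag set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Format-1 CCW: command, flags, byte count, 31-bit data address. */
struct ccw1 {
	uint8_t  cmd_code;
	uint8_t  flags;
	uint16_t count;
	uint32_t cda;
} __attribute__((packed, aligned(8)));

#define CCW_FLAG_DC 0x80  /* data chaining */
#define CCW_FLAG_CC 0x40  /* command chaining */

/* Hypothetical translation callback: guest physical -> host physical. */
typedef int (*translate_fn)(uint32_t guest_addr, uint32_t *host_addr);

/* Toy translator for the example: a fixed offset, purely an assumption. */
#define TOY_OFFSET 0x10000000u
static int toy_translate(uint32_t guest_addr, uint32_t *host_addr)
{
	*host_addr = guest_addr + TOY_OFFSET;
	return 0;
}

/*
 * Copy a guest chain into a separate buffer and rewrite each data
 * address. Returns the number of CCWs copied, or -1 on failure.
 */
static int chain_translate(const struct ccw1 *guest, struct ccw1 *host,
			   size_t max_len, translate_fn translate)
{
	for (size_t i = 0; i < max_len; i++) {
		uint32_t haddr;

		host[i] = guest[i];
		if (translate(guest[i].cda, &haddr) != 0)
			return -1;
		host[i].cda = haddr;
		/* The chain ends at the first CCW with no chaining flag. */
		if (!(guest[i].flags & (CCW_FLAG_CC | CCW_FLAG_DC)))
			return (int)(i + 1);
	}
	return -1;  /* chain did not end within max_len entries */
}
```

The guest copy is left untouched in this sketch; only the separate
buffer carries the rewritten addresses, which mirrors the description of
assembling a new runnable channel program rather than patching the
guest's one in place.
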
The user of vfio-ccw is not limited to Qemu, while Qemu is definitely a
good example to understand how these patches work. Here is a little
bit more detail on how an I/O request triggered by the Qemu guest will
be handled (without error handling).

Explanation:
Q1-Q7: Qemu side process.
K1-K6: Kernel side process.

Q1. Get I/O region info during initialization.
Q2. Set up an event notifier and handler to handle I/O completion.

... ...

Q3. Intercept a ssch instruction.
Q4. Write the guest channel program and ORB to the I/O region.
    K1. Copy from guest to kernel.
    K2. Translate the guest channel program to a host kernel space
        channel program, which becomes runnable for a real device.
    K3. With the necessary information contained in the orb passed in
        by Qemu, issue the ccwchain to the device.
    K4. Return the ssch CC code.
Q5. Return the CC code to the guest.

... ...

    K5. Interrupt handler gets the I/O result and writes the result to
        the I/O region.
    K6. Signal Qemu to retrieve the result.
Q6. Get the signal; the event handler reads the result out of the I/O
    region.
Q7. Update the irb for the guest.

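The asynchronous notification in steps Q2, K6 and Q6 relies on an
eventfd. The following Linux user-space sketch shows only that
signaling pattern, under the assumption that a plain write to the
eventfd can stand in for the signal the kernel raises in step K6.
notifier_create(), notifier_signal() and notifier_wait() are
hypothetical names, and registering the eventfd with the driver is not
shown.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Q2: create the notifier that would be handed to the driver. */
static int notifier_create(void)
{
	return eventfd(0, EFD_CLOEXEC);
}

/* Stand-in for K6: in reality the signal is raised inside the kernel. */
static int notifier_signal(int fd)
{
	uint64_t one = 1;

	return write(fd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/* Q6: block until signaled; returns the number of pending events. */
static int64_t notifier_wait(int fd)
{
	uint64_t events;

	if (read(fd, &events, sizeof(events)) != (ssize_t)sizeof(events))
		return -1;
	return (int64_t)events;
}
```

After notifier_wait() returns, the event handler would read the result
out of the I/O region as in step Q6. An eventfd accumulates signals, so
one wakeup may cover more than one completion.
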
Limitations
-----------

The current vfio-ccw implementation focuses on supporting basic commands
needed to implement block device functionality (read/write) of DASD/ECKD
devices only. Some commands may need special handling in the future, for
example, anything related to path grouping.

DASD is a kind of storage device, while ECKD is a data recording format.
More information on DASD and ECKD can be found here:
https://en.wikipedia.org/wiki/Direct-access_storage_device
https://en.wikipedia.org/wiki/Count_key_data

Together with the corresponding work in Qemu, we can now bring the
passed through DASD/ECKD device online in a guest and use it as a block
device.

Reference
---------
1. ESA/s390 Principles of Operation manual (IBM Form. No. SA22-7832)
2. ESA/390 Common I/O Device Commands manual (IBM Form. No. SA22-7204)
3. https://en.wikipedia.org/wiki/Channel_I/O
4. Documentation/s390/cds.txt
5. Documentation/vfio.txt
6. Documentation/vfio-mediated-device.txt