KVM/arm fixes for 5.3
Merge tag 'kvmarm-fixes-for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

- A bunch of switch/case fall-through annotations, fixing one actual bug
- Fix PMU reset bug
- Add missing exception class debug strings
commit 0e1c438c44

.gitignore (vendored, 1 change)
@@ -30,6 +30,7 @@
 *.lz4
 *.lzma
 *.lzo
+*.mod
 *.mod.c
 *.o
 *.o.*

CREDITS (3 changes)
@@ -1770,7 +1770,6 @@ S: USA

 N: Dave Jones
 E: davej@codemonkey.org.uk
-W: http://www.codemonkey.org.uk
 D: Assorted VIA x86 support.
 D: 2.5 AGPGART overhaul.
 D: CPUFREQ maintenance.
@@ -3120,7 +3119,7 @@ S: France
 N: Rik van Riel
 E: riel@redhat.com
 W: http://www.surriel.com/
-D: Linux-MM site, Documentation/sysctl/*, swap/mm readaround
+D: Linux-MM site, Documentation/admin-guide/sysctl/*, swap/mm readaround
 D: kswapd fixes, random kernel hacker, rmap VM,
 D: nl.linux.org administrator, minor scheduler additions
 S: Red Hat Boston

@@ -11,7 +11,7 @@ Description:
 		Kernel code may export it for complete or partial access.

 		GPIOs are identified as they are inside the kernel, using integers in
-		the range 0..INT_MAX. See Documentation/gpio for more information.
+		the range 0..INT_MAX. See Documentation/admin-guide/gpio for more information.

 	/sys/class/gpio
 		/export ... asks the kernel to export a GPIO to userspace

@@ -1,6 +1,6 @@
 rfkill - radio frequency (RF) connector kill switch support

-For details to this subsystem look at Documentation/rfkill.txt.
+For details to this subsystem look at Documentation/driver-api/rfkill.rst.

 What:		/sys/class/rfkill/rfkill[0-9]+/claim
 Date:		09-Jul-2007

@@ -423,23 +423,6 @@ Description:
 		(e.g. driver restart on the VM which owns the VF).

-
-sysfs interface for NetEffect RNIC Low-Level iWARP driver (nes)
----------------------------------------------------------------
-
-What:		/sys/class/infiniband/nesX/hw_rev
-What:		/sys/class/infiniband/nesX/hca_type
-What:		/sys/class/infiniband/nesX/board_id
-Date:		Feb, 2008
-KernelVersion:	v2.6.25
-Contact:	linux-rdma@vger.kernel.org
-Description:
-		hw_rev:		(RO) Hardware revision number
-
-		hca_type:	(RO) Host Channel Adapter type (NEX020)
-
-		board_id:	(RO) Manufacturing board id
-
 sysfs interface for Chelsio T4/T5 RDMA driver (cxgb4)
 -----------------------------------------------------

@@ -1,6 +1,6 @@
 rfkill - radio frequency (RF) connector kill switch support

-For details to this subsystem look at Documentation/rfkill.txt.
+For details to this subsystem look at Documentation/driver-api/rfkill.rst.

 For the deprecated /sys/class/rfkill/*/claim knobs of this interface look in
 Documentation/ABI/removed/sysfs-class-rfkill.

@@ -61,7 +61,7 @@ Date: October 2002
 Contact:	Linux Memory Management list <linux-mm@kvack.org>
 Description:
 		The node's hit/miss statistics, in units of pages.
-		See Documentation/numastat.txt
+		See Documentation/admin-guide/numastat.rst

 What:		/sys/devices/system/node/nodeX/distance
 Date:		October 2002

@@ -120,3 +120,23 @@ Description: These files show the system reset cause, as following: ComEx
 		the last reset cause.

 		The files are read only.
+
+What:		/sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_comex_thermal
+What:		/sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_comex_wd
+What:		/sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_from_asic
+What:		/sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_reload_bios
+What:		/sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_sff_wd
+What:		/sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_swb_wd
+Date:		June 2019
+KernelVersion:	5.3
+Contact:	Vadim Pasternak <vadimp@mellanox.com>
+Description:	These files show the system reset cause, as following:
+		COMEX thermal shutdown; watchdog power off or reset was derived
+		by one of the next components: COMEX, switch board or by Small Form
+		Factor mezzanine, reset requested from ASIC, reset caused by BIOS
+		reload. Value 1 in file means this is reset cause, 0 - otherwise.
+		Only one of the above causes could be 1 at the same time, representing
+		only last reset cause.
+
+		The files are read only.

@@ -29,4 +29,4 @@ Description:
 		17 - sectors discarded
 		18 - time spent discarding

-		For more details refer to Documentation/iostats.txt
+		For more details refer to Documentation/admin-guide/iostats.rst

@@ -15,7 +15,7 @@ Description:
 		9 - I/Os currently in progress
 		10 - time spent doing I/Os (ms)
 		11 - weighted time spent doing I/Os (ms)
-		For more details refer Documentation/iostats.txt
+		For more details refer Documentation/admin-guide/iostats.rst


 What:		/sys/block/<disk>/<part>/stat

@@ -45,7 +45,7 @@ Description:
 		- Values below -2 are rejected with -EINVAL

 		For more information, see
-		Documentation/laptops/disk-shock-protection.txt
+		Documentation/admin-guide/laptops/disk-shock-protection.rst


 What:		/sys/block/*/device/ncq_prio_enable

@@ -376,10 +376,42 @@ Description:
 		supply. Normally this is configured based on the type of
 		connection made (e.g. A configured SDP should output a maximum
 		of 500mA so the input current limit is set to the same value).
+		Use preferably input_power_limit, and for problems that can be
+		solved using power limit use input_current_limit.

 		Access: Read, Write
 		Valid values: Represented in microamps

+What:		/sys/class/power_supply/<supply_name>/input_voltage_limit
+Date:		May 2019
+Contact:	linux-pm@vger.kernel.org
+Description:
+		This entry configures the incoming VBUS voltage limit currently
+		set in the supply. Normally this is configured based on
+		system-level knowledge or user input (e.g. This is part of the
+		Pixel C's thermal management strategy to effectively limit the
+		input power to 5V when the screen is on to meet Google's skin
+		temperature targets). Note that this feature should not be
+		used for safety critical things.
+		Use preferably input_power_limit, and for problems that can be
+		solved using power limit use input_voltage_limit.
+
+		Access: Read, Write
+		Valid values: Represented in microvolts
+
+What:		/sys/class/power_supply/<supply_name>/input_power_limit
+Date:		May 2019
+Contact:	linux-pm@vger.kernel.org
+Description:
+		This entry configures the incoming power limit currently set
+		in the supply. Normally this is configured based on
+		system-level knowledge or user input. Use preferably this
+		feature to limit the incoming power and use current/voltage
+		limit only for problems that can be solved using power limit.
+
+		Access: Read, Write
+		Valid values: Represented in microwatts
+
 What:		/sys/class/power_supply/<supply_name>/online,
 Date:		May 2007
 Contact:	linux-pm@vger.kernel.org

Documentation/ABI/testing/sysfs-class-power-wilco (new file, 30 lines)
@@ -0,0 +1,30 @@
+What:		/sys/class/power_supply/wilco-charger/charge_type
+Date:		April 2019
+KernelVersion:	5.2
+Description:
+		What charging algorithm to use:
+
+		Standard: Fully charges battery at a standard rate.
+		Adaptive: Battery settings adaptively optimized based on
+		typical battery usage pattern.
+		Fast: Battery charges over a shorter period.
+		Trickle: Extends battery lifespan, intended for users who
+		primarily use their Chromebook while connected to AC.
+		Custom: A low and high threshold percentage is specified.
+		Charging begins when level drops below
+		charge_control_start_threshold, and ceases when
+		level is above charge_control_end_threshold.
+
+What:		/sys/class/power_supply/wilco-charger/charge_control_start_threshold
+Date:		April 2019
+KernelVersion:	5.2
+Description:
+		Used when charge_type="Custom", as described above. Measured in
+		percentages. The valid range is [50, 95].
+
+What:		/sys/class/power_supply/wilco-charger/charge_control_end_threshold
+Date:		April 2019
+KernelVersion:	5.2
+Description:
+		Used when charge_type="Custom", as described above. Measured in
+		percentages. The valid range is [55, 100].

@@ -5,7 +5,7 @@ Contact: linux-pm@vger.kernel.org
 Description:
 		The powercap/ class sub directory belongs to the power cap
 		subsystem. Refer to
-		Documentation/power/powercap/powercap.txt for details.
+		Documentation/power/powercap/powercap.rst for details.

 What:		/sys/class/powercap/<control type>
 Date:		September 2013

@@ -1,6 +1,6 @@
 switchtec - Microsemi Switchtec PCI Switch Management Endpoint

-For details on this subsystem look at Documentation/switchtec.txt.
+For details on this subsystem look at Documentation/driver-api/switchtec.rst.

 What:		/sys/class/switchtec
 Date:		05-Jan-2017

@@ -34,7 +34,7 @@ Description: CPU topology files that describe kernel limits related to
 		present: cpus that have been identified as being present in
 		the system.

-		See Documentation/cputopology.txt for more information.
+		See Documentation/admin-guide/cputopology.rst for more information.


 What:		/sys/devices/system/cpu/probe
@@ -103,7 +103,7 @@ Description: CPU topology files that describe a logical CPU's relationship
 		thread_siblings_list: human-readable list of cpu#'s hardware
 		threads within the same core as cpu#

-		See Documentation/cputopology.txt for more information.
+		See Documentation/admin-guide/cputopology.rst for more information.


 What:		/sys/devices/system/cpu/cpuidle/current_driver

@@ -31,7 +31,7 @@ Description:
 		To control the LED display, use the following :
 		echo 0x0T000DDD > /sys/devices/platform/asus_laptop/
 		where T control the 3 letters display, and DDD the 3 digits display.
-		The DDD table can be found in Documentation/laptops/asus-laptop.txt
+		The DDD table can be found in Documentation/admin-guide/laptops/asus-laptop.rst

 What:		/sys/devices/platform/asus_laptop/bluetooth
 Date:		January 2007

@@ -36,3 +36,13 @@ KernelVersion: 3.5
 Contact:	"AceLan Kao" <acelan.kao@canonical.com>
 Description:
 		Resume on lid open. 1 means on, 0 means off.
+
+What:		/sys/devices/platform/<platform>/fan_boost_mode
+Date:		Sep 2019
+KernelVersion:	5.3
+Contact:	"Yurii Pavlovskyi" <yurii.pavlovskyi@gmail.com>
+Description:
+		Fan boost mode:
+		* 0 - normal,
+		* 1 - overboost,
+		* 2 - silent

@@ -1,7 +1,7 @@
 What:		/sys/devices/platform/<i2c-demux-name>/available_masters
 Date:		January 2016
 KernelVersion:	4.6
-Contact:	Wolfram Sang <wsa@the-dreams.de>
+Contact:	Wolfram Sang <wsa+renesas@sang-engineering.com>
 Description:
 		Reading the file will give you a list of masters which can be
 		selected for a demultiplexed bus. The format is
@@ -12,7 +12,7 @@ Description:
 What:		/sys/devices/platform/<i2c-demux-name>/current_master
 Date:		January 2016
 KernelVersion:	4.6
-Contact:	Wolfram Sang <wsa@the-dreams.de>
+Contact:	Wolfram Sang <wsa+renesas@sang-engineering.com>
 Description:
 		This file selects/shows the active I2C master for a demultiplexed
 		bus. It uses the <index> value from the file 'available_masters'.

@@ -212,7 +212,7 @@ The standard 64-bit addressing device would do something like this::

 If the device only supports 32-bit addressing for descriptors in the
 coherent allocations, but supports full 64-bits for streaming mappings
-it would look like this:
+it would look like this::

 	if (dma_set_mask(dev, DMA_BIT_MASK(64))) {
 		dev_warn(dev, "mydev: No suitable DMA available\n");

@@ -1,58 +0,0 @@
-:orphan:
-
-====
-EDID
-====
-
-In the good old days when graphics parameters were configured explicitly
-in a file called xorg.conf, even broken hardware could be managed.
-
-Today, with the advent of Kernel Mode Setting, a graphics board is
-either correctly working because all components follow the standards -
-or the computer is unusable, because the screen remains dark after
-booting or it displays the wrong area. Cases when this happens are:
-- The graphics board does not recognize the monitor.
-- The graphics board is unable to detect any EDID data.
-- The graphics board incorrectly forwards EDID data to the driver.
-- The monitor sends no or bogus EDID data.
-- A KVM sends its own EDID data instead of querying the connected monitor.
-Adding the kernel parameter "nomodeset" helps in most cases, but causes
-restrictions later on.
-
-As a remedy for such situations, the kernel configuration item
-CONFIG_DRM_LOAD_EDID_FIRMWARE was introduced. It allows to provide an
-individually prepared or corrected EDID data set in the /lib/firmware
-directory from where it is loaded via the firmware interface. The code
-(see drivers/gpu/drm/drm_edid_load.c) contains built-in data sets for
-commonly used screen resolutions (800x600, 1024x768, 1280x1024, 1600x1200,
-1680x1050, 1920x1080) as binary blobs, but the kernel source tree does
-not contain code to create these data. In order to elucidate the origin
-of the built-in binary EDID blobs and to facilitate the creation of
-individual data for a specific misbehaving monitor, commented sources
-and a Makefile environment are given here.
-
-To create binary EDID and C source code files from the existing data
-material, simply type "make".
-
-If you want to create your own EDID file, copy the file 1024x768.S,
-replace the settings with your own data and add a new target to the
-Makefile. Please note that the EDID data structure expects the timing
-values in a different way as compared to the standard X11 format.
-
-X11:
-HTimings:
-  hdisp hsyncstart hsyncend htotal
-VTimings:
-  vdisp vsyncstart vsyncend vtotal
-
-EDID::
-
-#define XPIX hdisp
-#define XBLANK htotal-hdisp
-#define XOFFSET hsyncstart-hdisp
-#define XPULSE hsyncend-hsyncstart
-
-#define YPIX vdisp
-#define YBLANK vtotal-vdisp
-#define YOFFSET vsyncstart-vdisp
-#define YPULSE vsyncend-vsyncstart

@ -1,270 +0,0 @@
|
|||||||
The MSI Driver Guide HOWTO
|
|
||||||
Tom L Nguyen tom.l.nguyen@intel.com
|
|
||||||
10/03/2003
|
|
||||||
Revised Feb 12, 2004 by Martine Silbermann
|
|
||||||
email: Martine.Silbermann@hp.com
|
|
||||||
Revised Jun 25, 2004 by Tom L Nguyen
|
|
||||||
Revised Jul 9, 2008 by Matthew Wilcox <willy@linux.intel.com>
|
|
||||||
Copyright 2003, 2008 Intel Corporation
|
|
||||||
|
|
||||||
1. About this guide
|
|
||||||
|
|
||||||
This guide describes the basics of Message Signaled Interrupts (MSIs),
|
|
||||||
the advantages of using MSI over traditional interrupt mechanisms, how
|
|
||||||
to change your driver to use MSI or MSI-X and some basic diagnostics to
|
|
||||||
try if a device doesn't support MSIs.
|
|
||||||
|
|
||||||
|
|
||||||
2. What are MSIs?
|
|
||||||
|
|
||||||
A Message Signaled Interrupt is a write from the device to a special
|
|
||||||
address which causes an interrupt to be received by the CPU.
|
|
||||||
|
|
||||||
The MSI capability was first specified in PCI 2.2 and was later enhanced
|
|
||||||
in PCI 3.0 to allow each interrupt to be masked individually. The MSI-X
|
|
||||||
capability was also introduced with PCI 3.0. It supports more interrupts
|
|
||||||
per device than MSI and allows interrupts to be independently configured.
|
|
||||||
|
|
||||||
Devices may support both MSI and MSI-X, but only one can be enabled at
|
|
||||||
a time.
|
|
||||||
|
|
||||||
|
|
||||||
3. Why use MSIs?
|
|
||||||
|
|
||||||
There are three reasons why using MSIs can give an advantage over
|
|
||||||
traditional pin-based interrupts.
|
|
||||||
|
|
||||||
Pin-based PCI interrupts are often shared amongst several devices.
|
|
||||||
To support this, the kernel must call each interrupt handler associated
|
|
||||||
with an interrupt, which leads to reduced performance for the system as
|
|
||||||
a whole. MSIs are never shared, so this problem cannot arise.
|
|
||||||
|
|
||||||
When a device writes data to memory, then raises a pin-based interrupt,
|
|
||||||
it is possible that the interrupt may arrive before all the data has
|
|
||||||
arrived in memory (this becomes more likely with devices behind PCI-PCI
|
|
||||||
bridges). In order to ensure that all the data has arrived in memory,
|
|
||||||
the interrupt handler must read a register on the device which raised
|
|
||||||
the interrupt. PCI transaction ordering rules require that all the data
|
|
||||||
arrive in memory before the value may be returned from the register.
|
|
||||||
Using MSIs avoids this problem as the interrupt-generating write cannot
|
|
||||||
pass the data writes, so by the time the interrupt is raised, the driver
|
|
||||||
knows that all the data has arrived in memory.
|
|
||||||
|
|
||||||
PCI devices can only support a single pin-based interrupt per function.
|
|
||||||
Often drivers have to query the device to find out what event has
|
|
||||||
occurred, slowing down interrupt handling for the common case. With
|
|
||||||
MSIs, a device can support more interrupts, allowing each interrupt
|
|
||||||
to be specialised to a different purpose. One possible design gives
|
|
||||||
infrequent conditions (such as errors) their own interrupt which allows
|
|
||||||
the driver to handle the normal interrupt handling path more efficiently.
|
|
||||||
Other possible designs include giving one interrupt to each packet queue
|
|
||||||
in a network card or each port in a storage controller.
|
|
||||||
|
|
||||||
|
|
||||||
4. How to use MSIs
|
|
||||||
|
|
||||||
PCI devices are initialised to use pin-based interrupts. The device
|
|
||||||
driver has to set up the device to use MSI or MSI-X. Not all machines
|
|
||||||
support MSIs correctly, and for those machines, the APIs described below
|
|
||||||
will simply fail and the device will continue to use pin-based interrupts.
|
|
||||||
|
|
||||||
4.1 Include kernel support for MSIs
|
|
||||||
|
|
||||||
To support MSI or MSI-X, the kernel must be built with the CONFIG_PCI_MSI
|
|
||||||
option enabled. This option is only available on some architectures,
|
|
||||||
and it may depend on some other options also being set. For example,
|
|
||||||
on x86, you must also enable X86_UP_APIC or SMP in order to see the
|
|
||||||
CONFIG_PCI_MSI option.
|
|
||||||
|
|
||||||
4.2 Using MSI
|
|
||||||
|
|
||||||
Most of the hard work is done for the driver in the PCI layer. The driver
|
|
||||||
simply has to request that the PCI layer set up the MSI capability for this
|
|
||||||
device.
|
|
||||||
|
|
||||||
To automatically use MSI or MSI-X interrupt vectors, use the following
|
|
||||||
function:
|
|
||||||
|
|
||||||
int pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
|
|
||||||
unsigned int max_vecs, unsigned int flags);
|
|
||||||
|
|
||||||
which allocates up to max_vecs interrupt vectors for a PCI device. It
|
|
||||||
returns the number of vectors allocated or a negative error. If the device
|
|
||||||
has a requirements for a minimum number of vectors the driver can pass a
|
|
||||||
min_vecs argument set to this limit, and the PCI core will return -ENOSPC
|
|
||||||
if it can't meet the minimum number of vectors.
|
|
||||||
|
|
||||||
The flags argument is used to specify which type of interrupt can be used
|
|
||||||
by the device and the driver (PCI_IRQ_LEGACY, PCI_IRQ_MSI, PCI_IRQ_MSIX).
|
|
||||||
A convenient short-hand (PCI_IRQ_ALL_TYPES) is also available to ask for
|
|
||||||
any possible kind of interrupt. If the PCI_IRQ_AFFINITY flag is set,
|
|
||||||
pci_alloc_irq_vectors() will spread the interrupts around the available CPUs.
|
|
||||||
|
|
||||||
To get the Linux IRQ numbers passed to request_irq() and free_irq() and the
|
|
||||||
vectors, use the following function:
|
|
||||||
|
|
||||||
int pci_irq_vector(struct pci_dev *dev, unsigned int nr);
|
|
||||||
|
|
||||||
Any allocated resources should be freed before removing the device using
|
|
||||||
the following function:
|
|
||||||
|
|
||||||
void pci_free_irq_vectors(struct pci_dev *dev);
|
|
||||||
|
|
||||||
If a device supports both MSI-X and MSI capabilities, this API will use the
|
|
||||||
MSI-X facilities in preference to the MSI facilities. MSI-X supports any
|
|
||||||
number of interrupts between 1 and 2048. In contrast, MSI is restricted to
|
|
||||||
a maximum of 32 interrupts (and must be a power of two). In addition, the
|
|
||||||
MSI interrupt vectors must be allocated consecutively, so the system might
|
|
||||||
not be able to allocate as many vectors for MSI as it could for MSI-X. On
|
|
||||||
some platforms, MSI interrupts must all be targeted at the same set of CPUs
|
|
||||||
whereas MSI-X interrupts can all be targeted at different CPUs.
|
|
||||||
|
|
||||||
If a device supports neither MSI-X or MSI it will fall back to a single
|
|
||||||
legacy IRQ vector.
|
|
||||||
|
|
||||||
The typical usage of MSI or MSI-X interrupts is to allocate as many vectors
|
|
||||||
as possible, likely up to the limit supported by the device. If nvec is
|
|
||||||
larger than the number supported by the device it will automatically be
|
|
||||||
capped to the supported limit, so there is no need to query the number of
|
|
||||||
vectors supported beforehand:
|
|
||||||
|
|
||||||
nvec = pci_alloc_irq_vectors(pdev, 1, nvec, PCI_IRQ_ALL_TYPES)
|
|
||||||
if (nvec < 0)
|
|
||||||
goto out_err;
|
|
||||||
|
|
||||||
If a driver is unable or unwilling to deal with a variable number of MSI
|
|
||||||
interrupts it can request a particular number of interrupts by passing that
|
|
||||||
number to pci_alloc_irq_vectors() function as both 'min_vecs' and
|
|
||||||
'max_vecs' parameters:
|
|
||||||
|
|
||||||
ret = pci_alloc_irq_vectors(pdev, nvec, nvec, PCI_IRQ_ALL_TYPES);
|
|
||||||
if (ret < 0)
|
|
||||||
goto out_err;
|
|
||||||
|
|
||||||
The most notorious example of the request type described above is enabling
|
|
||||||
the single MSI mode for a device. It could be done by passing two 1s as
|
|
||||||
'min_vecs' and 'max_vecs':
|
|
||||||
|
|
||||||
ret = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_ALL_TYPES);
|
|
||||||
if (ret < 0)
|
|
||||||
goto out_err;
|
|
||||||
|
|
||||||
Some devices might not support using legacy line interrupts, in which case
|
|
||||||
the driver can specify that only MSI or MSI-X is acceptable:
|
|
||||||
|
|
||||||
nvec = pci_alloc_irq_vectors(pdev, 1, nvec, PCI_IRQ_MSI | PCI_IRQ_MSIX);
|
|
||||||
if (nvec < 0)
|
|
||||||
goto out_err;
|
|
||||||
|
|
||||||
4.3 Legacy APIs
|
|
||||||
|
|
||||||
The following old APIs to enable and disable MSI or MSI-X interrupts should
|
|
||||||
not be used in new code:
|
|
||||||
|
|
||||||
pci_enable_msi() /* deprecated */
|
|
||||||
pci_disable_msi() /* deprecated */
|
|
||||||
pci_enable_msix_range() /* deprecated */
|
|
||||||
pci_enable_msix_exact() /* deprecated */
|
|
||||||
pci_disable_msix() /* deprecated */
|
|
||||||
|
|
||||||
Additionally there are APIs to provide the number of supported MSI or MSI-X
|
|
||||||
vectors: pci_msi_vec_count() and pci_msix_vec_count(). In general these
|
|
||||||
should be avoided in favor of letting pci_alloc_irq_vectors() cap the
|
|
||||||
number of vectors. If you have a legitimate special use case for the count
|
|
||||||
of vectors we might have to revisit that decision and add a
|
|
||||||
pci_nr_irq_vectors() helper that handles MSI and MSI-X transparently.
|
|
||||||
|
|
||||||
4.4 Considerations when using MSIs
|
|
||||||
|
|
||||||
4.4.1 Spinlocks
|
|
||||||
|
|
||||||
Most device drivers have a per-device spinlock which is taken in the
|
|
||||||
interrupt handler. With pin-based interrupts or a single MSI, it is not
|
|
||||||
necessary to disable interrupts (Linux guarantees the same interrupt will
|
|
||||||
not be re-entered). If a device uses multiple interrupts, the driver
|
|
||||||
must disable interrupts while the lock is held. If the device sends
|
|
||||||
a different interrupt, the driver will deadlock trying to recursively
|
|
||||||
acquire the spinlock. Such deadlocks can be avoided by using
|
|
||||||
spin_lock_irqsave() or spin_lock_irq() which disable local interrupts
|
|
||||||
and acquire the lock (see Documentation/kernel-hacking/locking.rst).
|
|
||||||
|
|
||||||
4.5 How to tell whether MSI/MSI-X is enabled on a device
|
|
||||||
|
|
||||||
Using 'lspci -v' (as root) may show some devices with "MSI", "Message
|
|
||||||
Signalled Interrupts" or "MSI-X" capabilities. Each of these capabilities
|
|
||||||
has an 'Enable' flag which is followed with either "+" (enabled)
|
|
||||||
or "-" (disabled).
|
|
||||||
|
|
||||||
|
|
||||||
5. MSI quirks

Several PCI chipsets or devices are known not to support MSIs.
The PCI stack provides three ways to disable MSIs:

1. globally
2. on all devices behind a specific bridge
3. on a single device

5.1. Disabling MSIs globally

Some host chipsets simply don't support MSIs properly.  If we're
lucky, the manufacturer knows this and has indicated it in the ACPI
FADT table.  In this case, Linux automatically disables MSIs.
Some boards don't include this information in the table and so we have
to detect them ourselves.  The complete list of these is found near the
quirk_disable_all_msi() function in drivers/pci/quirks.c.

If you have a board which has problems with MSIs, you can pass pci=nomsi
on the kernel command line to disable MSIs on all devices.  It would be
in your best interests to report the problem to linux-pci@vger.kernel.org
including a full 'lspci -v' so we can add the quirks to the kernel.

5.2. Disabling MSIs below a bridge

Some PCI bridges are not able to route MSIs between busses properly.
In this case, MSIs must be disabled on all devices behind the bridge.

Some bridges allow you to enable MSIs by changing some bits in their
PCI configuration space (especially the Hypertransport chipsets such
as the nVidia nForce and Serverworks HT2000).  As with host chipsets,
Linux mostly knows about them and automatically enables MSIs if it can.
If you have a bridge unknown to Linux, you can enable
MSIs in configuration space using whatever method you know works, then
enable MSIs on that bridge by doing:

	echo 1 > /sys/bus/pci/devices/$bridge/msi_bus

where $bridge is the PCI address of the bridge you've enabled (e.g.
0000:00:0e.0).

To disable MSIs, echo 0 instead of 1.  Changing this value should be
done with caution as it could break interrupt handling for all devices
below this bridge.

Again, please notify linux-pci@vger.kernel.org of any bridges that need
special handling.

5.3. Disabling MSIs on a single device

Some devices are known to have faulty MSI implementations.  Usually this
is handled in the individual device driver, but occasionally it's necessary
to handle this with a quirk.  Some drivers have an option to disable use
of MSI.  While this is a convenient workaround for the driver author,
it is not good practice, and should not be emulated.

5.4. Finding why MSIs are disabled on a device

From the above three sections, you can see that there are many reasons
why MSIs may not be enabled for a given device.  Your first step should
be to examine your dmesg carefully to determine whether MSIs are enabled
for your machine.  You should also check your .config to be sure you
have enabled CONFIG_PCI_MSI.

Then, 'lspci -t' gives the list of bridges above a device.  Reading
/sys/bus/pci/devices/*/msi_bus will tell you whether MSIs are enabled (1)
or disabled (0).  If 0 is found in any of the msi_bus files belonging
to bridges between the PCI root and the device, MSIs are disabled.

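This walk over the msi_bus files can be scripted.  The sketch below builds
a throwaway mock of the sysfs layout so it runs anywhere; on a real system
you would iterate over /sys/bus/pci/devices instead, and the bridge
addresses used here are hypothetical.

```shell
# Print the msi_bus flag of every bridge in a (mock) sysfs tree.
root=$(mktemp -d)                       # stand-in for /sys/bus/pci/devices
mkdir -p "$root/0000:00:1c.0" "$root/0000:00:1c.4"
echo 1 > "$root/0000:00:1c.0/msi_bus"   # MSIs allowed below this bridge
echo 0 > "$root/0000:00:1c.4/msi_bus"   # MSIs disabled below this one
for f in "$root"/*/msi_bus; do
	printf '%s: %s\n' "$(basename "$(dirname "$f")")" "$(cat "$f")"
done
```

A 0 printed for any bridge on the path between the PCI root and the device
means MSIs are disabled for everything below that bridge.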
It is also worth checking the device driver to see whether it supports MSIs.
For example, it may contain calls to pci_alloc_irq_vectors() with the
PCI_IRQ_MSI or PCI_IRQ_MSIX flags.
@@ -1,198 +0,0 @@
The PCI Express Port Bus Driver Guide HOWTO
	Tom L Nguyen tom.l.nguyen@intel.com
	11/03/2004

1. About this guide

This guide describes the basics of the PCI Express Port Bus driver
and provides information on how to enable the service drivers to
register/unregister with the PCI Express Port Bus Driver.

2. Copyright 2004 Intel Corporation

3. What is the PCI Express Port Bus Driver

A PCI Express Port is a logical PCI-PCI Bridge structure.  There
are two types of PCI Express Port: the Root Port and the Switch
Port.  The Root Port originates a PCI Express link from a PCI Express
Root Complex and the Switch Port connects PCI Express links to
internal logical PCI buses.  The Switch Port, which has its secondary
bus representing the switch's internal routing logic, is called the
switch's Upstream Port.  A switch's Downstream Port bridges from the
switch's internal routing bus to a bus representing the downstream
PCI Express link from the PCI Express Switch.

A PCI Express Port can provide up to four distinct functions,
referred to in this document as services, depending on its port type.
A PCI Express Port's services include native hotplug support (HP),
power management event support (PME), advanced error reporting
support (AER), and virtual channel support (VC).  These services may
be handled by a single complex driver or be individually distributed
and handled by corresponding service drivers.

4. Why use the PCI Express Port Bus Driver?

In existing Linux kernels, the Linux Device Driver Model allows a
physical device to be handled by only a single driver.  The PCI
Express Port is a PCI-PCI Bridge device with multiple distinct
services.  To maintain a clean and simple solution, each service
may have its own software service driver.  In this case several
service drivers will compete for a single PCI-PCI Bridge device.
For example, if the PCI Express Root Port native hotplug service
driver is loaded first, it claims a PCI-PCI Bridge Root Port.  The
kernel therefore does not load other service drivers for that Root
Port.  In other words, it is impossible to have multiple service
drivers load and run on a PCI-PCI Bridge device simultaneously
using the current driver model.

Enabling multiple service drivers to run simultaneously requires
a PCI Express Port Bus driver, which manages all populated
PCI Express Ports and distributes all provided service requests
to the corresponding service drivers as required.  Some key
advantages of using the PCI Express Port Bus driver are listed below:

- Allow multiple service drivers to run simultaneously on
  a PCI-PCI Bridge Port device.

- Allow service drivers to be implemented in an independent,
  staged approach.

- Allow one service driver to run on multiple PCI-PCI Bridge
  Port devices.

- Manage and distribute resources of a PCI-PCI Bridge Port
  device to requested service drivers.

5. Configuring the PCI Express Port Bus Driver vs. Service Drivers

5.1 Including the PCI Express Port Bus Driver Support into the Kernel

Including the PCI Express Port Bus driver depends on whether PCI
Express support is included in the kernel config.  The kernel will
automatically include the PCI Express Port Bus driver as a kernel
driver when PCI Express support is enabled in the kernel.

5.2 Enabling Service Driver Support

PCI device drivers are implemented based on the Linux Device Driver Model.
All service drivers are PCI device drivers.  As discussed above, it is
impossible to load any service driver once the kernel has loaded the
PCI Express Port Bus Driver.  Meeting the PCI Express Port Bus Driver
Model requires some minimal changes to existing service drivers; these
changes have no impact on the functionality of existing service drivers.

A service driver is required to use the two APIs shown below to
register its service with the PCI Express Port Bus driver (see
sections 5.2.1 & 5.2.2).  It is important that a service driver
initializes the pcie_port_service_driver data structure, included in
header file /include/linux/pcieport_if.h, before calling these APIs.
Failure to do so will result in an identity mismatch, which prevents
the PCI Express Port Bus driver from loading a service driver.

5.2.1 pcie_port_service_register

int pcie_port_service_register(struct pcie_port_service_driver *new)

This API replaces the Linux Driver Model's pci_register_driver API.  A
service driver should always call pcie_port_service_register at
module init.  Note that after a service driver is loaded, calls
such as pci_enable_device(dev) and pci_set_master(dev) are no longer
necessary since these calls are executed by the PCI Port Bus driver.

5.2.2 pcie_port_service_unregister

void pcie_port_service_unregister(struct pcie_port_service_driver *new)

pcie_port_service_unregister replaces the Linux Driver Model's
pci_unregister_driver.  It is always called by a service driver when a
module exits.

5.2.3 Sample Code

Below is sample service driver code to initialize the port service
driver data structure.

	static struct pcie_port_service_id service_id[] = { {
		.vendor = PCI_ANY_ID,
		.device = PCI_ANY_ID,
		.port_type = PCIE_RC_PORT,
		.service_type = PCIE_PORT_SERVICE_AER,
		}, { /* end: all zeroes */ }
	};

	static struct pcie_port_service_driver root_aerdrv = {
		.name		= (char *)device_name,
		.id_table	= &service_id[0],

		.probe		= aerdrv_load,
		.remove		= aerdrv_unload,

		.suspend	= aerdrv_suspend,
		.resume		= aerdrv_resume,
	};

Below is sample code for registering/unregistering a service
driver.

	static int __init aerdrv_service_init(void)
	{
		int retval = 0;

		retval = pcie_port_service_register(&root_aerdrv);
		if (!retval) {
			/*
			 * FIX ME
			 */
		}
		return retval;
	}

	static void __exit aerdrv_service_exit(void)
	{
		pcie_port_service_unregister(&root_aerdrv);
	}

	module_init(aerdrv_service_init);
	module_exit(aerdrv_service_exit);

6. Possible Resource Conflicts

Since all service drivers of a PCI-PCI Bridge Port device are
allowed to run simultaneously, a few possible resource conflicts and
proposed solutions are listed below.

6.1 MSI and MSI-X Vector Resource

Once MSI or MSI-X interrupts are enabled on a device, it stays in this
mode until they are disabled again.  Since service drivers of the same
PCI-PCI Bridge port share the same physical device, if an individual
service driver enables or disables MSI/MSI-X mode it may result in
unpredictable behavior.

To avoid this situation, no service driver is permitted to switch
the interrupt mode on its device.  The PCI Express Port Bus driver
is responsible for determining the interrupt mode and this should be
transparent to service drivers.  Service drivers need to know only
the vector IRQ assigned to the field irq of struct pcie_device, which
is passed in when the PCI Express Port Bus driver probes each service
driver.  Service drivers should use (struct pcie_device*)dev->irq to
call request_irq/free_irq.  In addition, the interrupt mode is stored
in the field interrupt_mode of struct pcie_device.

6.3 PCI Memory/IO Mapped Regions

Service drivers for PCI Express Power Management (PME), Advanced
Error Reporting (AER), Hot-Plug (HP) and Virtual Channel (VC) access
PCI configuration space on the PCI Express port.  In all cases the
registers accessed are independent of each other.  This patch assumes
that all service drivers will be well behaved and not overwrite
other service drivers' configuration settings.

6.4 PCI Config Registers

Each service driver runs its PCI config operations on its own
capability structure except the PCI Express capability structure, in
which the Root Control register and Device Control register are shared
between PME and AER.  This patch assumes that all service drivers
will be well behaved and not overwrite other service drivers'
configuration settings.
Documentation/PCI/acpi-info.rst (new file, 192 lines)
@@ -0,0 +1,192 @@
.. SPDX-License-Identifier: GPL-2.0

========================================
ACPI considerations for PCI host bridges
========================================

The general rule is that the ACPI namespace should describe everything the
OS might use unless there's another way for the OS to find it [1, 2].

For example, there's no standard hardware mechanism for enumerating PCI
host bridges, so the ACPI namespace must describe each host bridge, the
method for accessing PCI config space below it, the address space windows
the host bridge forwards to PCI (using _CRS), and the routing of legacy
INTx interrupts (using _PRT).

PCI devices, which are below the host bridge, generally do not need to be
described via ACPI.  The OS can discover them via the standard PCI
enumeration mechanism, using config accesses to discover and identify
devices and read and size their BARs.  However, ACPI may describe PCI
devices if it provides power management or hotplug functionality for them
or if the device has INTx interrupts connected by platform interrupt
controllers and a _PRT is needed to describe those connections.

ACPI resource description is done via _CRS objects of devices in the ACPI
namespace [2].  The _CRS is like a generalized PCI BAR: the OS can read
_CRS and figure out what resource is being consumed even if it doesn't have
a driver for the device [3].  That's important because it means an old OS
can work correctly even on a system with new devices unknown to the OS.
The new devices might not do anything, but the OS can at least make sure no
resources conflict with them.

Static tables like MCFG, HPET, ECDT, etc., are *not* mechanisms for
reserving address space.  The static tables are for things the OS needs to
know early in boot, before it can parse the ACPI namespace.  If a new table
is defined, an old OS needs to operate correctly even though it ignores the
table.  _CRS allows that because it is generic and understood by the old
OS; a static table does not.

If the OS is expected to manage a non-discoverable device described via
ACPI, that device will have a specific _HID/_CID that tells the OS what
driver to bind to it, and the _CRS tells the OS and the driver where the
device's registers are.

PCI host bridges are PNP0A03 or PNP0A08 devices.  Their _CRS should
describe all the address space they consume.  This includes all the windows
they forward down to the PCI bus, as well as registers of the host bridge
itself that are not forwarded to PCI.  The host bridge registers include
things like secondary/subordinate bus registers that determine the bus
range below the bridge, window registers that describe the apertures, etc.
These are all device-specific, non-architected things, so the only way a
PNP0A03/PNP0A08 driver can manage them is via _PRS/_CRS/_SRS, which contain
the device-specific details.  The host bridge registers also include ECAM
space, since it is consumed by the host bridge.

ACPI defines a Consumer/Producer bit to distinguish the bridge registers
("Consumer") from the bridge apertures ("Producer") [4, 5], but early
BIOSes didn't use that bit correctly.  The result is that the current ACPI
spec defines Consumer/Producer only for the Extended Address Space
descriptors; the bit should be ignored in the older QWord/DWord/Word
Address Space descriptors.  Consequently, OSes have to assume all
QWord/DWord/Word descriptors are windows.

Prior to the addition of Extended Address Space descriptors, the failure of
Consumer/Producer meant there was no way to describe bridge registers in
the PNP0A03/PNP0A08 device itself.  The workaround was to describe the
bridge registers (including ECAM space) in PNP0C02 catch-all devices [6].
With the exception of ECAM, the bridge register space is device-specific
anyway, so the generic PNP0A03/PNP0A08 driver (pci_root.c) has no need to
know about it.

New architectures should be able to use "Consumer" Extended Address Space
descriptors in the PNP0A03 device for bridge registers, including ECAM,
although a strict interpretation of [6] might prohibit this.  Old x86 and
ia64 kernels assume all address space descriptors, including "Consumer"
Extended Address Space ones, are windows, so it would not be safe to
describe bridge registers this way on those architectures.

PNP0C02 "motherboard" devices are basically a catch-all.  There's no
programming model for them other than "don't use these resources for
anything else."  So a PNP0C02 _CRS should claim any address space that is
(1) not claimed by _CRS under any other device object in the ACPI namespace
and (2) should not be assigned by the OS to something else.

The PCIe spec requires the Enhanced Configuration Access Method (ECAM)
unless there's a standard firmware interface for config access, e.g., the
ia64 SAL interface [7].  A host bridge consumes ECAM memory address space
and converts memory accesses into PCI configuration accesses.  The spec
defines the ECAM address space layout and functionality; only the base of
the address space is device-specific.  An ACPI OS learns the base address
from either the static MCFG table or a _CBA method in the PNP0A03 device.

The MCFG table must describe the ECAM space of non-hot pluggable host
bridges [8].  Since MCFG is a static table and can't be updated by hotplug,
a _CBA method in the PNP0A03 device describes the ECAM space of a
hot-pluggable host bridge [9].  Note that for both MCFG and _CBA, the base
address always corresponds to bus 0, even if the bus range below the bridge
(which is reported via _CRS) doesn't start at 0.

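As an aside, the two discovery paths can be distinguished on a running
Linux system by checking whether firmware exported a static MCFG table at
all.  The sketch below uses a mock directory so it runs anywhere; on a real
system the exported tables appear under /sys/firmware/acpi/tables.  This is
an illustrative check, not part of the spec text above.

```shell
# Decide which ECAM discovery path applies, using a mock tables dir.
tables=$(mktemp -d)                   # stand-in for /sys/firmware/acpi/tables
touch "$tables/FACP" "$tables/MCFG"   # pretend firmware exported these
if [ -e "$tables/MCFG" ]; then
	echo "MCFG present: static ECAM base addresses"
else
	echo "no MCFG: look for _CBA on PNP0A03 devices"
fi
```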

[1] ACPI 6.2, sec 6.1:
    For any device that is on a non-enumerable type of bus (for example, an
    ISA bus), OSPM enumerates the devices' identifier(s) and the ACPI
    system firmware must supply an _HID object ... for each device to
    enable OSPM to do that.

[2] ACPI 6.2, sec 3.7:
    The OS enumerates motherboard devices simply by reading through the
    ACPI Namespace looking for devices with hardware IDs.

    Each device enumerated by ACPI includes ACPI-defined objects in the
    ACPI Namespace that report the hardware resources the device could
    occupy [_PRS], an object that reports the resources that are currently
    used by the device [_CRS], and objects for configuring those resources
    [_SRS].  The information is used by the Plug and Play OS (OSPM) to
    configure the devices.

[3] ACPI 6.2, sec 6.2:
    OSPM uses device configuration objects to configure hardware resources
    for devices enumerated via ACPI.  Device configuration objects provide
    information about current and possible resource requirements, the
    relationship between shared resources, and methods for configuring
    hardware resources.

    When OSPM enumerates a device, it calls _PRS to determine the resource
    requirements of the device.  It may also call _CRS to find the current
    resource settings for the device.  Using this information, the Plug and
    Play system determines what resources the device should consume and
    sets those resources by calling the device's _SRS control method.

    In ACPI, devices can consume resources (for example, legacy keyboards),
    provide resources (for example, a proprietary PCI bridge), or do both.
    Unless otherwise specified, resources for a device are assumed to be
    taken from the nearest matching resource above the device in the device
    hierarchy.

[4] ACPI 6.2, sec 6.4.3.5.1, 2, 3, 4:
    QWord/DWord/Word Address Space Descriptor (.1, .2, .3)
      General Flags: Bit [0] Ignored

    Extended Address Space Descriptor (.4)
      General Flags: Bit [0] Consumer/Producer:

        * 1 – This device consumes this resource
        * 0 – This device produces and consumes this resource

[5] ACPI 6.2, sec 19.6.43:
    ResourceUsage specifies whether the Memory range is consumed by
    this device (ResourceConsumer) or passed on to child devices
    (ResourceProducer).  If nothing is specified, then
    ResourceConsumer is assumed.

[6] PCI Firmware 3.2, sec 4.1.2:
    If the operating system does not natively comprehend reserving the
    MMCFG region, the MMCFG region must be reserved by firmware.  The
    address range reported in the MCFG table or by _CBA method (see Section
    4.1.3) must be reserved by declaring a motherboard resource.  For most
    systems, the motherboard resource would appear at the root of the ACPI
    namespace (under \_SB) in a node with a _HID of EISAID (PNP0C02), and
    the resources in this case should not be claimed in the root PCI bus's
    _CRS.  The resources can optionally be returned in Int15 E820 or
    EFIGetMemoryMap as reserved memory but must always be reported through
    ACPI as a motherboard resource.

[7] PCI Express 4.0, sec 7.2.2:
    For systems that are PC-compatible, or that do not implement a
    processor-architecture-specific firmware interface standard that allows
    access to the Configuration Space, the ECAM is required as defined in
    this section.

[8] PCI Firmware 3.2, sec 4.1.2:
    The MCFG table is an ACPI table that is used to communicate the base
    addresses corresponding to the non-hot removable PCI Segment Groups
    range within a PCI Segment Group available to the operating system at
    boot.  This is required for the PC-compatible systems.

    The MCFG table is only used to communicate the base addresses
    corresponding to the PCI Segment Groups available to the system at
    boot.

[9] PCI Firmware 3.2, sec 4.1.3:
    The _CBA (Memory mapped Configuration Base Address) control method is
    an optional ACPI object that returns the 64-bit memory mapped
    configuration base address for the hot plug capable host bridge.  The
    base address returned by _CBA is processor-relative address.  The _CBA
    control method evaluates to an Integer.

    This control method appears under a host bridge object.  When the _CBA
    method appears under an active host bridge object, the operating system
    evaluates this structure to identify the memory mapped configuration
    base address corresponding to the PCI Segment Group for the bus number
    range specified in _CRS method.  An ACPI name space object that contains
    the _CBA method must also contain a corresponding _SEG method.
@ -1,187 +0,0 @@
|
|||||||
ACPI considerations for PCI host bridges
|
|
||||||
|
|
||||||
The general rule is that the ACPI namespace should describe everything the
|
|
||||||
OS might use unless there's another way for the OS to find it [1, 2].
|
|
||||||
|
|
||||||
For example, there's no standard hardware mechanism for enumerating PCI
|
|
||||||
host bridges, so the ACPI namespace must describe each host bridge, the
|
|
||||||
method for accessing PCI config space below it, the address space windows
|
|
||||||
the host bridge forwards to PCI (using _CRS), and the routing of legacy
|
|
||||||
INTx interrupts (using _PRT).
|
|
||||||
|
|
||||||
PCI devices, which are below the host bridge, generally do not need to be
|
|
||||||
described via ACPI. The OS can discover them via the standard PCI
|
|
||||||
enumeration mechanism, using config accesses to discover and identify
|
|
||||||
devices and read and size their BARs. However, ACPI may describe PCI
|
|
||||||
devices if it provides power management or hotplug functionality for them
|
|
||||||
or if the device has INTx interrupts connected by platform interrupt
|
|
||||||
controllers and a _PRT is needed to describe those connections.
|
|
||||||
|
|
||||||
ACPI resource description is done via _CRS objects of devices in the ACPI
|
|
||||||
namespace [2]. The _CRS is like a generalized PCI BAR: the OS can read
|
|
||||||
_CRS and figure out what resource is being consumed even if it doesn't have
|
|
||||||
a driver for the device [3]. That's important because it means an old OS
|
|
||||||
can work correctly even on a system with new devices unknown to the OS.
|
|
||||||
The new devices might not do anything, but the OS can at least make sure no
|
|
||||||
resources conflict with them.
|
|
||||||
|
|
||||||
Static tables like MCFG, HPET, ECDT, etc., are *not* mechanisms for
|
|
||||||
reserving address space. The static tables are for things the OS needs to
|
|
||||||
know early in boot, before it can parse the ACPI namespace. If a new table
|
|
||||||
is defined, an old OS needs to operate correctly even though it ignores the
|
|
||||||
table. _CRS allows that because it is generic and understood by the old
|
|
||||||
OS; a static table does not.
|
|
||||||
|
|
||||||
If the OS is expected to manage a non-discoverable device described via
|
|
||||||
ACPI, that device will have a specific _HID/_CID that tells the OS what
|
|
||||||
driver to bind to it, and the _CRS tells the OS and the driver where the
|
|
||||||
device's registers are.
|
|
||||||
|
|
||||||
PCI host bridges are PNP0A03 or PNP0A08 devices. Their _CRS should
|
|
||||||
describe all the address space they consume. This includes all the windows
|
|
||||||
they forward down to the PCI bus, as well as registers of the host bridge
|
|
||||||
itself that are not forwarded to PCI. The host bridge registers include
|
|
||||||
things like secondary/subordinate bus registers that determine the bus
|
|
||||||
range below the bridge, window registers that describe the apertures, etc.
|
|
||||||
These are all device-specific, non-architected things, so the only way a
|
|
||||||
PNP0A03/PNP0A08 driver can manage them is via _PRS/_CRS/_SRS, which contain
|
|
||||||
the device-specific details. The host bridge registers also include ECAM
|
|
||||||
space, since it is consumed by the host bridge.
|
|
||||||
|
|
||||||
ACPI defines a Consumer/Producer bit to distinguish the bridge registers
|
|
||||||
("Consumer") from the bridge apertures ("Producer") [4, 5], but early
|
|
||||||
BIOSes didn't use that bit correctly. The result is that the current ACPI
|
|
||||||
spec defines Consumer/Producer only for the Extended Address Space
|
|
||||||
descriptors; the bit should be ignored in the older QWord/DWord/Word
|
|
||||||
Address Space descriptors. Consequently, OSes have to assume all
|
|
||||||
QWord/DWord/Word descriptors are windows.
|
|
||||||
|
|
||||||
Prior to the addition of Extended Address Space descriptors, the failure of
|
|
||||||
Consumer/Producer meant there was no way to describe bridge registers in
|
|
||||||
the PNP0A03/PNP0A08 device itself. The workaround was to describe the
|
|
||||||
bridge registers (including ECAM space) in PNP0C02 catch-all devices [6].
|
|
||||||
With the exception of ECAM, the bridge register space is device-specific
anyway, so the generic PNP0A03/PNP0A08 driver (pci_root.c) has no need to
know about it.

New architectures should be able to use "Consumer" Extended Address Space
descriptors in the PNP0A03 device for bridge registers, including ECAM,
although a strict interpretation of [6] might prohibit this. Old x86 and
ia64 kernels assume all address space descriptors, including "Consumer"
Extended Address Space ones, are windows, so it would not be safe to
describe bridge registers this way on those architectures.

PNP0C02 "motherboard" devices are basically a catch-all. There's no
programming model for them other than "don't use these resources for
anything else." So a PNP0C02 _CRS should claim any address space that is
(1) not claimed by _CRS under any other device object in the ACPI namespace
and (2) should not be assigned by the OS to something else.

The PCIe spec requires the Enhanced Configuration Access Method (ECAM)
unless there's a standard firmware interface for config access, e.g., the
ia64 SAL interface [7]. A host bridge consumes ECAM memory address space
and converts memory accesses into PCI configuration accesses. The spec
defines the ECAM address space layout and functionality; only the base of
the address space is device-specific. An ACPI OS learns the base address
from either the static MCFG table or a _CBA method in the PNP0A03 device.

The MCFG table must describe the ECAM space of non-hot pluggable host
bridges [8]. Since MCFG is a static table and can't be updated by hotplug,
a _CBA method in the PNP0A03 device describes the ECAM space of a
hot-pluggable host bridge [9]. Note that for both MCFG and _CBA, the base
address always corresponds to bus 0, even if the bus range below the bridge
(which is reported via _CRS) doesn't start at 0.
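Because the ECAM layout is fixed, only the base address needs to be
discovered (via MCFG or _CBA); the config address of any function is then
pure arithmetic. A minimal sketch, using a helper name of our own (the bit
positions come from the PCIe spec's ECAM definition):

```c
#include <stdint.h>

/* ECAM gives every PCI function a fixed 4 KiB configuration window at
 * base + offset, where offset packs bus[27:20], device[19:15],
 * function[14:12] and register[11:0]. Only `base` is device-specific. */
uint64_t ecam_addr(uint64_t base, uint8_t bus, uint8_t dev,
                   uint8_t fn, uint16_t reg)
{
	uint64_t offset = ((uint64_t)bus << 20) |
			  ((uint64_t)(dev & 0x1f) << 15) |
			  ((uint64_t)(fn & 0x7) << 12) |
			  (reg & 0xfff);
	return base + offset;
}
```

One consequence of this layout is that each bus occupies exactly 1 MiB of
ECAM space (32 devices x 8 functions x 4 KiB).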
[1] ACPI 6.2, sec 6.1:
    For any device that is on a non-enumerable type of bus (for example, an
    ISA bus), OSPM enumerates the devices' identifier(s) and the ACPI
    system firmware must supply an _HID object ... for each device to
    enable OSPM to do that.

[2] ACPI 6.2, sec 3.7:
    The OS enumerates motherboard devices simply by reading through the
    ACPI Namespace looking for devices with hardware IDs.

    Each device enumerated by ACPI includes ACPI-defined objects in the
    ACPI Namespace that report the hardware resources the device could
    occupy [_PRS], an object that reports the resources that are currently
    used by the device [_CRS], and objects for configuring those resources
    [_SRS]. The information is used by the Plug and Play OS (OSPM) to
    configure the devices.

[3] ACPI 6.2, sec 6.2:
    OSPM uses device configuration objects to configure hardware resources
    for devices enumerated via ACPI. Device configuration objects provide
    information about current and possible resource requirements, the
    relationship between shared resources, and methods for configuring
    hardware resources.

    When OSPM enumerates a device, it calls _PRS to determine the resource
    requirements of the device. It may also call _CRS to find the current
    resource settings for the device. Using this information, the Plug and
    Play system determines what resources the device should consume and
    sets those resources by calling the device's _SRS control method.

    In ACPI, devices can consume resources (for example, legacy keyboards),
    provide resources (for example, a proprietary PCI bridge), or do both.
    Unless otherwise specified, resources for a device are assumed to be
    taken from the nearest matching resource above the device in the device
    hierarchy.

[4] ACPI 6.2, sec 6.4.3.5.1, 2, 3, 4:
    QWord/DWord/Word Address Space Descriptor (.1, .2, .3)
      General Flags: Bit [0] Ignored

    Extended Address Space Descriptor (.4)
      General Flags: Bit [0] Consumer/Producer:
        1 - This device consumes this resource
        0 - This device produces and consumes this resource

[5] ACPI 6.2, sec 19.6.43:
    ResourceUsage specifies whether the Memory range is consumed by
    this device (ResourceConsumer) or passed on to child devices
    (ResourceProducer). If nothing is specified, then
    ResourceConsumer is assumed.

[6] PCI Firmware 3.2, sec 4.1.2:
    If the operating system does not natively comprehend reserving the
    MMCFG region, the MMCFG region must be reserved by firmware. The
    address range reported in the MCFG table or by _CBA method (see Section
    4.1.3) must be reserved by declaring a motherboard resource. For most
    systems, the motherboard resource would appear at the root of the ACPI
    namespace (under \_SB) in a node with a _HID of EISAID (PNP0C02), and
    the resources in this case should not be claimed in the root PCI bus's
    _CRS. The resources can optionally be returned in Int15 E820 or
    EFIGetMemoryMap as reserved memory but must always be reported through
    ACPI as a motherboard resource.

[7] PCI Express 4.0, sec 7.2.2:
    For systems that are PC-compatible, or that do not implement a
    processor-architecture-specific firmware interface standard that allows
    access to the Configuration Space, the ECAM is required as defined in
    this section.

[8] PCI Firmware 3.2, sec 4.1.2:
    The MCFG table is an ACPI table that is used to communicate the base
    addresses corresponding to the non-hot removable PCI Segment Groups
    range within a PCI Segment Group available to the operating system at
    boot. This is required for the PC-compatible systems.

    The MCFG table is only used to communicate the base addresses
    corresponding to the PCI Segment Groups available to the system at
    boot.

[9] PCI Firmware 3.2, sec 4.1.3:
    The _CBA (Memory mapped Configuration Base Address) control method is
    an optional ACPI object that returns the 64-bit memory mapped
    configuration base address for the hot plug capable host bridge. The
    base address returned by _CBA is processor-relative address. The _CBA
    control method evaluates to an Integer.

    This control method appears under a host bridge object. When the _CBA
    method appears under an active host bridge object, the operating system
    evaluates this structure to identify the memory mapped configuration
    base address corresponding to the PCI Segment Group for the bus number
    range specified in _CRS method. An ACPI name space object that contains
    the _CBA method must also contain a corresponding _SEG method.
13  Documentation/PCI/endpoint/index.rst  (new file)
@@ -0,0 +1,13 @@
.. SPDX-License-Identifier: GPL-2.0

======================
PCI Endpoint Framework
======================

.. toctree::
   :maxdepth: 2

   pci-endpoint
   pci-endpoint-cfs
   pci-test-function
   pci-test-howto
118  Documentation/PCI/endpoint/pci-endpoint-cfs.rst  (new file)
@@ -0,0 +1,118 @@
.. SPDX-License-Identifier: GPL-2.0

=======================================
Configuring PCI Endpoint Using CONFIGFS
=======================================

:Author: Kishon Vijay Abraham I <kishon@ti.com>

The PCI Endpoint Core exposes a configfs entry (pci_ep) to configure the
PCI endpoint function and to bind the endpoint function with the endpoint
controller. (For other mechanisms to configure the PCI Endpoint Function,
refer to [1].)

Mounting configfs
=================

The PCI Endpoint Core layer creates the pci_ep directory in the mounted
configfs directory. configfs can be mounted using the following command::

	mount -t configfs none /sys/kernel/config

Directory Structure
===================

The pci_ep configfs has two directories at its root: controllers and
functions. Every EPC device present in the system will have an entry in
the *controllers* directory and every EPF driver present in the system
will have an entry in the *functions* directory.

::

	/sys/kernel/config/pci_ep/
		.. controllers/
		.. functions/

Creating EPF Device
===================

Every registered EPF driver will be listed in the *functions* directory. The
entries corresponding to the EPF driver will be created by the EPF core.

::

	/sys/kernel/config/pci_ep/functions/
	.. <EPF Driver1>/
		... <EPF Device 11>/
		... <EPF Device 21>/
	.. <EPF Driver2>/
		... <EPF Device 12>/
		... <EPF Device 22>/

In order to create a <EPF device> of the type probed by <EPF Driver>, the
user has to create a directory inside <EPF DriverN>.

Every <EPF device> directory consists of the following entries that can be
used to configure the standard configuration header of the endpoint function.
(These entries are created by the framework when any new <EPF Device> is
created.)

::

	.. <EPF Driver1>/
		... <EPF Device 11>/
			... vendorid
			... deviceid
			... revid
			... progif_code
			... subclass_code
			... baseclass_code
			... cache_line_size
			... subsys_vendor_id
			... subsys_id
			... interrupt_pin
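Writing to these entries ultimately fills in the fields of the function's
standard configuration header. As an illustration only, here is a stand-in
struct defined for this example (not the kernel's real definition) whose
field names mirror the configfs entries above, populated with example
values:

```c
#include <stdint.h>

/* Stand-in mirroring the configfs header entries above; field widths
 * follow the standard PCI configuration header layout. */
struct epf_header_example {
	uint16_t vendorid;
	uint16_t deviceid;
	uint8_t  revid;
	uint8_t  progif_code;
	uint8_t  subclass_code;
	uint8_t  baseclass_code;
	uint8_t  cache_line_size;
	uint16_t subsys_vendor_id;
	uint16_t subsys_id;
	uint8_t  interrupt_pin;
};

struct epf_header_example make_example_header(void)
{
	struct epf_header_example hdr = {
		.vendorid       = 0x104c, /* example vendor ID (TI) */
		.deviceid       = 0xb500, /* example device ID */
		.baseclass_code = 0xff,   /* "unassigned" class, example only */
		.interrupt_pin  = 1,      /* INTA */
	};
	return hdr;
}
```

Echoing the corresponding values into vendorid, deviceid, etc. under the
<EPF Device> directory achieves the same effect from user space.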
EPC Device
==========

Every registered EPC device will be listed in the *controllers* directory.
The entries corresponding to the EPC device will be created by the EPC core.

::

	/sys/kernel/config/pci_ep/controllers/
	.. <EPC Device1>/
		... <Symlink EPF Device11>/
		... <Symlink EPF Device12>/
		... start
	.. <EPC Device2>/
		... <Symlink EPF Device21>/
		... <Symlink EPF Device22>/
		... start

The <EPC Device> directory will have a list of symbolic links to
<EPF Device>. These symbolic links should be created by the user to
represent the functions present in the endpoint device.

The <EPC Device> directory will also have a *start* field. Once
"1" is written to this field, the endpoint device will be ready to
establish the link with the host. This is usually done after
all the EPF devices are created and linked with the EPC device.

::

	| controllers/
	|	<Directory: EPC name>/
	|		<Symbolic Link: Function>
	|		start
	| functions/
	|	<Directory: EPF driver>/
	|		<Directory: EPF device>/
	|			vendorid
	|			deviceid
	|			revid
	|			progif_code
	|			subclass_code
	|			baseclass_code
	|			cache_line_size
	|			subsys_vendor_id
	|			subsys_id
	|			interrupt_pin
	|			function

[1] :doc:`pci-endpoint`
@@ -1,105 +0,0 @@
CONFIGURING PCI ENDPOINT USING CONFIGFS
		Kishon Vijay Abraham I <kishon@ti.com>
231  Documentation/PCI/endpoint/pci-endpoint.rst  (new file)
@@ -0,0 +1,231 @@
.. SPDX-License-Identifier: GPL-2.0

======================
PCI Endpoint Framework
======================

:Author: Kishon Vijay Abraham I <kishon@ti.com>

This document is a guide to using the PCI Endpoint Framework to create an
endpoint controller driver and an endpoint function driver, and to using the
configfs interface to bind the function driver to the controller driver.

Introduction
============

Linux has a comprehensive PCI subsystem to support PCI controllers that
operate in Root Complex mode. The subsystem can scan the PCI bus, assign
memory and IRQ resources, load PCI drivers (based on vendor ID and device
ID), and support other services like hot-plug, power management, advanced
error reporting and virtual channels.

However, the PCI controller IP integrated in some SoCs is capable of
operating either in Root Complex mode or in Endpoint mode. The PCI Endpoint
Framework adds endpoint mode support to Linux. This helps to run Linux in an
EP system, which can have a wide variety of use cases, from testing and
validation to co-processor accelerators.

PCI Endpoint Core
=================

The PCI Endpoint Core layer comprises 3 components: the Endpoint Controller
library, the Endpoint Function library, and the configfs layer to bind the
endpoint function with the endpoint controller.

PCI Endpoint Controller (EPC) Library
-------------------------------------

The EPC library provides APIs to be used by a controller that can operate
in endpoint mode. It also provides APIs to be used by the function
driver/library in order to implement a particular endpoint function.

APIs for the PCI Controller Driver
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This section lists the APIs that the PCI Endpoint core provides to be used
by the PCI controller driver.

* devm_pci_epc_create()/pci_epc_create()

  The PCI controller driver should implement the following ops:

	* write_header: ops to populate the configuration space header
	* set_bar: ops to configure the BAR
	* clear_bar: ops to reset the BAR
	* alloc_addr_space: ops to allocate in PCI controller address space
	* free_addr_space: ops to free the allocated address space
	* raise_irq: ops to raise a legacy, MSI or MSI-X interrupt
	* start: ops to start the PCI link
	* stop: ops to stop the PCI link

  The PCI controller driver can then create a new EPC device by invoking
  devm_pci_epc_create()/pci_epc_create().
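The ops above amount to a table of callbacks that the controller driver
fills in and hands to the core at create time. A simplified stand-in sketch
of that shape (all type and function names here are ours, not the kernel's
real definitions):

```c
#include <stddef.h>

/* Simplified stand-ins for the EPC ops table described above. */
struct epc;

struct epc_ops {
	int  (*write_header)(struct epc *epc);
	int  (*set_bar)(struct epc *epc, int bar, size_t size);
	void (*clear_bar)(struct epc *epc, int bar);
	int  (*raise_irq)(struct epc *epc, int irq_type, int irq_num);
	int  (*start)(struct epc *epc);
	void (*stop)(struct epc *epc);
};

/* A hypothetical controller driver implements the ops for its hardware... */
static int demo_start(struct epc *epc) { (void)epc; return 0; }
static void demo_stop(struct epc *epc) { (void)epc; }

/* ...and provides one ops table describing the controller; in the real
 * framework this table is passed to devm_pci_epc_create(). */
const struct epc_ops demo_epc_ops = {
	.start = demo_start,
	.stop  = demo_stop,
	/* write_header, set_bar, etc. omitted in this sketch */
};
```

The core then dispatches function-driver requests (write header, set BAR,
raise IRQ) through whichever callbacks the controller filled in.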
* devm_pci_epc_destroy()/pci_epc_destroy()

  The PCI controller driver can destroy the EPC device created by either
  devm_pci_epc_create() or pci_epc_create() using devm_pci_epc_destroy() or
  pci_epc_destroy().

* pci_epc_linkup()

  In order to notify all the function devices that the EPC device to which
  they are linked has established a link with the host, the PCI controller
  driver should invoke pci_epc_linkup().

* pci_epc_mem_init()

  Initialize the pci_epc_mem structure used for allocating EPC addr space.

* pci_epc_mem_exit()

  Clean up the pci_epc_mem structure allocated during pci_epc_mem_init().

APIs for the PCI Endpoint Function Driver
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This section lists the APIs that the PCI Endpoint core provides to be used
by the PCI endpoint function driver.

* pci_epc_write_header()

  The PCI endpoint function driver should use pci_epc_write_header() to
  write the standard configuration header to the endpoint controller.

* pci_epc_set_bar()

  The PCI endpoint function driver should use pci_epc_set_bar() to configure
  the Base Address Register in order for the host to assign PCI addr space.
  Register space of the function driver is usually configured
  using this API.

* pci_epc_clear_bar()

  The PCI endpoint function driver should use pci_epc_clear_bar() to reset
  the BAR.

* pci_epc_raise_irq()

  The PCI endpoint function driver should use pci_epc_raise_irq() to raise
  a legacy interrupt, MSI or MSI-X interrupt.

* pci_epc_mem_alloc_addr()

  The PCI endpoint function driver should use pci_epc_mem_alloc_addr() to
  allocate a memory address from EPC addr space, which is required to access
  the RC's buffer.

* pci_epc_mem_free_addr()

  The PCI endpoint function driver should use pci_epc_mem_free_addr() to
  free the memory space allocated using pci_epc_mem_alloc_addr().

Other APIs
~~~~~~~~~~

There are other APIs provided by the EPC library. These are used for binding
the EPF device with the EPC device. pci-ep-cfs.c can be used as a reference
for using these APIs.

* pci_epc_get()

  Get a reference to the PCI endpoint controller based on the device name of
  the controller.

* pci_epc_put()

  Release the reference to the PCI endpoint controller obtained using
  pci_epc_get().

* pci_epc_add_epf()

  Add a PCI endpoint function to a PCI endpoint controller. A PCIe device
  can have up to 8 functions according to the specification.

* pci_epc_remove_epf()

  Remove the PCI endpoint function from the PCI endpoint controller.

* pci_epc_start()

  The PCI endpoint function driver should invoke pci_epc_start() once it
  has configured the endpoint function and wants to start the PCI link.

* pci_epc_stop()

  The PCI endpoint function driver should invoke pci_epc_stop() to stop
  the PCI link.

PCI Endpoint Function (EPF) Library
-----------------------------------

The EPF library provides APIs to be used by the function driver and the EPC
library to provide endpoint mode functionality.

APIs for the PCI Endpoint Function Driver
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This section lists the APIs that the PCI Endpoint core provides to be used
by the PCI endpoint function driver.

* pci_epf_register_driver()

  The PCI Endpoint Function driver should implement the following ops:

	* bind: ops to perform when an EPC device has been bound to the
	  EPF device
	* unbind: ops to perform when the binding between the EPC device
	  and the EPF device is lost
	* linkup: ops to perform when the EPC device has established a
	  connection with a host system

  The PCI Function driver can then register the PCI EPF driver by using
  pci_epf_register_driver().

* pci_epf_unregister_driver()

  The PCI Function driver can unregister the PCI EPF driver by using
  pci_epf_unregister_driver().

* pci_epf_alloc_space()

  The PCI Function driver can allocate space for a particular BAR using
  pci_epf_alloc_space().

* pci_epf_free_space()

  The PCI Function driver can free the allocated space
  (using pci_epf_alloc_space) by invoking pci_epf_free_space().

APIs for the PCI Endpoint Controller Library
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This section lists the APIs that the PCI Endpoint core provides to be used
by the PCI endpoint controller library.

* pci_epf_linkup()

  The PCI endpoint controller library invokes pci_epf_linkup() when the
  EPC device has established the connection to the host.

Other APIs
~~~~~~~~~~

There are other APIs provided by the EPF library. These are used to notify
the function driver when the EPF device is bound to the EPC device.
pci-ep-cfs.c can be used as a reference for using these APIs.

* pci_epf_create()

  Create a new PCI EPF device by passing the name of the PCI EPF device.
  This name will be used to bind the EPF device to an EPF driver.

* pci_epf_destroy()

  Destroy the created PCI EPF device.

* pci_epf_bind()

  pci_epf_bind() should be invoked when the EPF device has been bound to
  an EPC device.

* pci_epf_unbind()

  pci_epf_unbind() should be invoked when the binding between the EPC device
  and the EPF device is lost.
@ -1,215 +0,0 @@
|
|||||||
PCI ENDPOINT FRAMEWORK
|
|
||||||
Kishon Vijay Abraham I <kishon@ti.com>
|
|
||||||
|
|
||||||
This document is a guide to use the PCI Endpoint Framework in order to create
|
|
||||||
endpoint controller driver, endpoint function driver, and using configfs
|
|
||||||
interface to bind the function driver to the controller driver.
|
|
||||||
|
|
||||||
1. Introduction
|
|
||||||
|
|
||||||
Linux has a comprehensive PCI subsystem to support PCI controllers that
|
|
||||||
operates in Root Complex mode. The subsystem has capability to scan PCI bus,
|
|
||||||
assign memory resources and IRQ resources, load PCI driver (based on
|
|
||||||
vendor ID, device ID), support other services like hot-plug, power management,
|
|
||||||
advanced error reporting and virtual channels.
|
|
||||||
|
|
||||||
However the PCI controller IP integrated in some SoCs is capable of operating
|
|
||||||
either in Root Complex mode or Endpoint mode. PCI Endpoint Framework will
|
|
||||||
add endpoint mode support in Linux. This will help to run Linux in an
|
|
||||||
EP system which can have a wide variety of use cases from testing or
|
|
||||||
validation, co-processor accelerator, etc.
|
|
||||||
|
|
||||||
2. PCI Endpoint Core
|
|
||||||
|
|
||||||
The PCI Endpoint Core layer comprises 3 components: the Endpoint Controller
|
|
||||||
library, the Endpoint Function library, and the configfs layer to bind the
|
|
||||||
endpoint function with the endpoint controller.
|
|
||||||
|
|
||||||
2.1 PCI Endpoint Controller(EPC) Library
|
|
||||||
|
|
||||||
The EPC library provides APIs to be used by the controller that can operate
|
|
||||||
in endpoint mode. It also provides APIs to be used by function driver/library
|
|
||||||
in order to implement a particular endpoint function.
|
|
||||||
|
|
||||||
2.1.1 APIs for the PCI controller Driver
|
|
||||||
|
|
||||||
This section lists the APIs that the PCI Endpoint core provides to be used
|
|
||||||
by the PCI controller driver.
|
|
||||||
|
|
||||||
*) devm_pci_epc_create()/pci_epc_create()
|
|
||||||
|
|
||||||
The PCI controller driver should implement the following ops:
|
|
||||||
* write_header: ops to populate configuration space header
|
|
||||||
* set_bar: ops to configure the BAR
|
|
||||||
* clear_bar: ops to reset the BAR
|
|
||||||
* alloc_addr_space: ops to allocate in PCI controller address space
|
|
||||||
* free_addr_space: ops to free the allocated address space
|
|
||||||
* raise_irq: ops to raise a legacy, MSI or MSI-X interrupt
|
|
||||||
* start: ops to start the PCI link
|
|
||||||
* stop: ops to stop the PCI link
|
|
||||||
|
|
||||||
The PCI controller driver can then create a new EPC device by invoking
|
|
||||||
devm_pci_epc_create()/pci_epc_create().
|
|
||||||
|
|
||||||
*) devm_pci_epc_destroy()/pci_epc_destroy()
|
|
||||||
|
|
||||||
The PCI controller driver can destroy the EPC device created by either
|
|
||||||
devm_pci_epc_create() or pci_epc_create() using devm_pci_epc_destroy() or
|
|
||||||
pci_epc_destroy().
|
|
||||||
|
|
||||||
*) pci_epc_linkup()
|
|
||||||
|
|
||||||
In order to notify all the function devices that the EPC device to which
|
|
||||||
they are linked has established a link with the host, the PCI controller
|
|
||||||
driver should invoke pci_epc_linkup().
|
|
||||||
|
|
||||||
*) pci_epc_mem_init()
|
|
||||||
|
|
||||||
Initialize the pci_epc_mem structure used for allocating EPC addr space.
|
|
||||||
|
|
||||||
*) pci_epc_mem_exit()
|
|
||||||
|
|
||||||
Cleanup the pci_epc_mem structure allocated during pci_epc_mem_init().
|
|
||||||
|
|
||||||
2.1.2 APIs for the PCI Endpoint Function Driver
|
|
||||||
|
|
||||||
This section lists the APIs that the PCI Endpoint core provides to be used
|
|
||||||
by the PCI endpoint function driver.
|
|
||||||
|
|
||||||
*) pci_epc_write_header()
|
|
||||||
|
|
||||||
The PCI endpoint function driver should use pci_epc_write_header() to
|
|
||||||
write the standard configuration header to the endpoint controller.
|
|
||||||
|
|
||||||
*) pci_epc_set_bar()
|
|
||||||
|
|
||||||
The PCI endpoint function driver should use pci_epc_set_bar() to configure
|
|
||||||
the Base Address Register in order for the host to assign PCI addr space.
|
|
||||||
Register space of the function driver is usually configured
|
|
||||||
using this API.
|
|
||||||
|
|
||||||
*) pci_epc_clear_bar()
|
|
||||||
|
|
||||||
The PCI endpoint function driver should use pci_epc_clear_bar() to reset
|
|
||||||
the BAR.
|
|
||||||
|
|
||||||
*) pci_epc_raise_irq()
|
|
||||||
|
|
||||||
The PCI endpoint function driver should use pci_epc_raise_irq() to raise
|
|
||||||
Legacy Interrupt, MSI or MSI-X Interrupt.
|
|
||||||
|
|
||||||
*) pci_epc_mem_alloc_addr()
|
|
||||||
|
|
||||||
The PCI endpoint function driver should use pci_epc_mem_alloc_addr(), to
|
|
||||||
allocate memory address from EPC addr space which is required to access
|
|
||||||
RC's buffer
|
|
||||||
|
|
||||||
*) pci_epc_mem_free_addr()
|
|
||||||
|
|
||||||
The PCI endpoint function driver should use pci_epc_mem_free_addr() to
|
|
||||||
free the memory space allocated using pci_epc_mem_alloc_addr().
|
|
||||||
|
|
||||||
2.1.3 Other APIs
|
|
||||||
|
|
||||||
There are other APIs provided by the EPC library. These are used for binding
|
|
||||||
the EPF device with EPC device. pci-ep-cfs.c can be used as reference for
|
|
||||||
using these APIs.
|
|
||||||
|
|
||||||
*) pci_epc_get()
|
|
||||||
|
|
||||||
Get a reference to the PCI endpoint controller based on the device name of
|
|
||||||
the controller.
|
|
||||||
|
|
||||||
*) pci_epc_put()
|
|
||||||
|
|
||||||
Release the reference to the PCI endpoint controller obtained using
|
|
||||||
pci_epc_get()
|
|
||||||
|
|
||||||
*) pci_epc_add_epf()
|
|
||||||
|
|
||||||
Add a PCI endpoint function to a PCI endpoint controller. A PCIe device
|
|
||||||
can have up to 8 functions according to the specification.
|
|
||||||
|
|
||||||
*) pci_epc_remove_epf()
|
|
||||||
|
|
||||||
Remove the PCI endpoint function from PCI endpoint controller.
|
|
||||||
|
|
||||||
*) pci_epc_start()
|
|
||||||
|
|
||||||
The PCI endpoint function driver should invoke pci_epc_start() once it
|
|
||||||
has configured the endpoint function and wants to start the PCI link.
|
|
||||||
|
|
||||||
*) pci_epc_stop()
|
|
||||||
|
|
||||||
The PCI endpoint function driver should invoke pci_epc_stop() to stop
|
|
||||||
the PCI LINK.
|
|
||||||
|
|
||||||
2.2 PCI Endpoint Function(EPF) Library

The EPF library provides APIs to be used by the function driver and the EPC
library to provide endpoint mode functionality.

2.2.1 APIs for the PCI Endpoint Function Driver

This section lists the APIs that the PCI Endpoint core provides to be used
by the PCI endpoint function driver.

*) pci_epf_register_driver()

   The PCI Endpoint Function driver should implement the following ops:
	 * bind: ops to perform when an EPC device has been bound to an EPF
	   device
	 * unbind: ops to perform when a binding has been lost between an EPC
	   device and an EPF device
	 * linkup: ops to perform when the EPC device has established a
	   connection with a host system

   The PCI Function driver can then register the PCI EPF driver by using
   pci_epf_register_driver().

*) pci_epf_unregister_driver()

   The PCI Function driver can unregister the PCI EPF driver by using
   pci_epf_unregister_driver().

*) pci_epf_alloc_space()

   The PCI Function driver can allocate space for a particular BAR using
   pci_epf_alloc_space().

*) pci_epf_free_space()

   The PCI Function driver can free the allocated space
   (using pci_epf_alloc_space) by invoking pci_epf_free_space().
2.2.2 APIs for the PCI Endpoint Controller Library

This section lists the APIs that the PCI Endpoint core provides to be used
by the PCI endpoint controller library.

*) pci_epf_linkup()

   The PCI endpoint controller library invokes pci_epf_linkup() when the
   EPC device has established the connection to the host.

2.2.3 Other APIs

There are other APIs provided by the EPF library. These are used to notify
the function driver when the EPF device is bound to the EPC device.
pci-ep-cfs.c can be used as a reference for using these APIs.

*) pci_epf_create()

   Create a new PCI EPF device by passing the name of the PCI EPF device.
   This name will be used to bind the EPF device to an EPF driver.

*) pci_epf_destroy()

   Destroy the created PCI EPF device.

*) pci_epf_bind()

   pci_epf_bind() should be invoked when the EPF device has been bound to
   an EPC device.

*) pci_epf_unbind()

   pci_epf_unbind() should be invoked when the binding between the EPC
   device and the EPF device is lost.
Documentation/PCI/endpoint/pci-test-function.rst (new file, +103 lines)
.. SPDX-License-Identifier: GPL-2.0

=================
PCI Test Function
=================

:Author: Kishon Vijay Abraham I <kishon@ti.com>

Traditionally, a PCI RC (Root Complex) has been validated by using standard
PCI cards such as Ethernet, USB, or SATA cards. However, with the addition
of EP-core in the Linux kernel, it is possible to configure a PCI controller
that can operate in EP mode to work as a test device.

The PCI endpoint test device is a virtual device (defined in software)
used to test the endpoint functionality and serve as a sample driver
for other PCI endpoint devices (to use the EP framework).

The PCI endpoint test device has the following registers:

1) PCI_ENDPOINT_TEST_MAGIC
2) PCI_ENDPOINT_TEST_COMMAND
3) PCI_ENDPOINT_TEST_STATUS
4) PCI_ENDPOINT_TEST_SRC_ADDR
5) PCI_ENDPOINT_TEST_DST_ADDR
6) PCI_ENDPOINT_TEST_SIZE
7) PCI_ENDPOINT_TEST_CHECKSUM
8) PCI_ENDPOINT_TEST_IRQ_TYPE
9) PCI_ENDPOINT_TEST_IRQ_NUMBER

* PCI_ENDPOINT_TEST_MAGIC

This register will be used to test BAR0. A known pattern will be written
and read back from the MAGIC register to verify BAR0.

* PCI_ENDPOINT_TEST_COMMAND

This register will be used by the host driver to indicate the function
that the endpoint device must perform.
======== ================================================================
Bitfield Description
======== ================================================================
Bit 0    raise legacy IRQ
Bit 1    raise MSI IRQ
Bit 2    raise MSI-X IRQ
Bit 3    read command (read data from RC buffer)
Bit 4    write command (write data to RC buffer)
Bit 5    copy command (copy data from one RC buffer to another RC buffer)
======== ================================================================
* PCI_ENDPOINT_TEST_STATUS

This register reflects the status of the PCI endpoint device.

======== ==============================
Bitfield Description
======== ==============================
Bit 0    read success
Bit 1    read fail
Bit 2    write success
Bit 3    write fail
Bit 4    copy success
Bit 5    copy fail
Bit 6    IRQ raised
Bit 7    source address is invalid
Bit 8    destination address is invalid
======== ==============================
* PCI_ENDPOINT_TEST_SRC_ADDR

This register contains the source address (RC buffer address) for the
COPY/READ command.

* PCI_ENDPOINT_TEST_DST_ADDR

This register contains the destination address (RC buffer address) for
the COPY/WRITE command.

* PCI_ENDPOINT_TEST_IRQ_TYPE

This register contains the interrupt type (Legacy/MSI/MSI-X) triggered
for the READ/WRITE/COPY and raise IRQ (Legacy/MSI/MSI-X) commands.

Possible types:

====== ==
Legacy 0
MSI    1
MSI-X  2
====== ==

* PCI_ENDPOINT_TEST_IRQ_NUMBER

This register contains the ID of the interrupt to be triggered.

Admissible values:

====== ===========
Legacy 0
MSI    [1 .. 32]
MSI-X  [1 .. 2048]
====== ===========
Documentation/PCI/endpoint/pci-test-howto.rst (new file, +235 lines)
.. SPDX-License-Identifier: GPL-2.0

===================
PCI Test User Guide
===================

:Author: Kishon Vijay Abraham I <kishon@ti.com>

This document is a guide to help users use the pci-epf-test function driver
and the pci_endpoint_test host driver for testing PCI. The lists of steps to
be followed on the host side and the EP side are given below.

Endpoint Device
===============

Endpoint Controller Devices
---------------------------

To find the list of endpoint controller devices in the system::

    # ls /sys/class/pci_epc/
      51000000.pcie_ep

If PCI_ENDPOINT_CONFIGFS is enabled::

    # ls /sys/kernel/config/pci_ep/controllers
      51000000.pcie_ep


Endpoint Function Drivers
-------------------------

To find the list of endpoint function drivers in the system::

    # ls /sys/bus/pci-epf/drivers
      pci_epf_test

If PCI_ENDPOINT_CONFIGFS is enabled::

    # ls /sys/kernel/config/pci_ep/functions
      pci_epf_test
Creating pci-epf-test Device
----------------------------

A PCI endpoint function device can be created using configfs. To create the
pci-epf-test device, the following commands can be used::

    # mount -t configfs none /sys/kernel/config
    # cd /sys/kernel/config/pci_ep/
    # mkdir functions/pci_epf_test/func1

The "mkdir func1" above creates the pci-epf-test function device that will
be probed by the pci_epf_test driver.

The PCI endpoint framework populates the directory with the following
configurable fields::

    # ls functions/pci_epf_test/func1
      baseclass_code    interrupt_pin   progif_code     subsys_id
      cache_line_size   msi_interrupts  revid           subsys_vendorid
      deviceid          msix_interrupts subclass_code   vendorid

The PCI endpoint function driver populates these entries with default values
when the device is bound to the driver. The pci-epf-test driver populates
vendorid with 0xffff and interrupt_pin with 0x0001::

    # cat functions/pci_epf_test/func1/vendorid
    0xffff
    # cat functions/pci_epf_test/func1/interrupt_pin
    0x0001


Configuring pci-epf-test Device
-------------------------------

The user can configure the pci-epf-test device using its configfs entries. In
order to change the vendorid and the number of MSI interrupts used by the
function device, the following commands can be used::

    # echo 0x104c > functions/pci_epf_test/func1/vendorid
    # echo 0xb500 > functions/pci_epf_test/func1/deviceid
    # echo 16 > functions/pci_epf_test/func1/msi_interrupts
    # echo 8 > functions/pci_epf_test/func1/msix_interrupts
Binding pci-epf-test Device to EP Controller
--------------------------------------------

In order for the endpoint function device to be useful, it has to be bound to
a PCI endpoint controller driver. Use configfs to bind the function device to
one of the controller drivers present in the system::

    # ln -s functions/pci_epf_test/func1 controllers/51000000.pcie_ep/

Once the above step is completed, the PCI endpoint is ready to establish a
link with the host.


Start the Link
--------------

In order for the endpoint device to establish a link with the host, the
*start* field should be populated with '1'::

    # echo 1 > controllers/51000000.pcie_ep/start
RootComplex Device
==================

lspci Output
------------

Note that the devices listed here correspond to the values populated in the
"Configuring pci-epf-test Device" section above::

    00:00.0 PCI bridge: Texas Instruments Device 8888 (rev 01)
    01:00.0 Unassigned class [ff00]: Texas Instruments Device b500


Using Endpoint Test function Device
-----------------------------------

pcitest.sh added in tools/pci/ can be used to run all the default PCI endpoint
tests. To compile this tool, the following commands should be used::

    # cd <kernel-dir>
    # make -C tools/pci

or, if you desire to compile and install it in your system::

    # cd <kernel-dir>
    # make -C tools/pci install

The tool and script will be located in <rootfs>/usr/bin/


pcitest.sh Output
~~~~~~~~~~~~~~~~~
::
    # pcitest.sh
    BAR tests

    BAR0: OKAY
    BAR1: OKAY
    BAR2: OKAY
    BAR3: OKAY
    BAR4: NOT OKAY
    BAR5: NOT OKAY

    Interrupt tests

    SET IRQ TYPE TO LEGACY: OKAY
    LEGACY IRQ: NOT OKAY
    SET IRQ TYPE TO MSI: OKAY
    MSI1: OKAY
    MSI2: OKAY
    MSI3: OKAY
    MSI4: OKAY
    MSI5: OKAY
    MSI6: OKAY
    MSI7: OKAY
    MSI8: OKAY
    MSI9: OKAY
    MSI10: OKAY
    MSI11: OKAY
    MSI12: OKAY
    MSI13: OKAY
    MSI14: OKAY
    MSI15: OKAY
    MSI16: OKAY
    MSI17: NOT OKAY
    MSI18: NOT OKAY
    MSI19: NOT OKAY
    MSI20: NOT OKAY
    MSI21: NOT OKAY
    MSI22: NOT OKAY
    MSI23: NOT OKAY
    MSI24: NOT OKAY
    MSI25: NOT OKAY
    MSI26: NOT OKAY
    MSI27: NOT OKAY
    MSI28: NOT OKAY
    MSI29: NOT OKAY
    MSI30: NOT OKAY
    MSI31: NOT OKAY
    MSI32: NOT OKAY
    SET IRQ TYPE TO MSI-X: OKAY
    MSI-X1: OKAY
    MSI-X2: OKAY
    MSI-X3: OKAY
    MSI-X4: OKAY
    MSI-X5: OKAY
    MSI-X6: OKAY
    MSI-X7: OKAY
    MSI-X8: OKAY
    MSI-X9: NOT OKAY
    MSI-X10: NOT OKAY
    MSI-X11: NOT OKAY
    MSI-X12: NOT OKAY
    MSI-X13: NOT OKAY
    MSI-X14: NOT OKAY
    MSI-X15: NOT OKAY
    MSI-X16: NOT OKAY
    [...]
    MSI-X2047: NOT OKAY
    MSI-X2048: NOT OKAY

    Read Tests

    SET IRQ TYPE TO MSI: OKAY
    READ ( 1 bytes): OKAY
    READ ( 1024 bytes): OKAY
    READ ( 1025 bytes): OKAY
    READ (1024000 bytes): OKAY
    READ (1024001 bytes): OKAY

    Write Tests

    WRITE ( 1 bytes): OKAY
    WRITE ( 1024 bytes): OKAY
    WRITE ( 1025 bytes): OKAY
    WRITE (1024000 bytes): OKAY
    WRITE (1024001 bytes): OKAY

    Copy Tests

    COPY ( 1 bytes): OKAY
    COPY ( 1024 bytes): OKAY
    COPY ( 1025 bytes): OKAY
    COPY (1024000 bytes): OKAY
    COPY (1024001 bytes): OKAY
Documentation/PCI/index.rst (new file, +18 lines)
.. SPDX-License-Identifier: GPL-2.0

=======================
Linux PCI Bus Subsystem
=======================

.. toctree::
   :maxdepth: 2
   :numbered:

   pci
   picebus-howto
   pci-iov-howto
   msi-howto
   acpi-info
   pci-error-recovery
   pcieaer-howto
   endpoint/index
Documentation/PCI/msi-howto.rst (new file, +287 lines)
.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

==========================
The MSI Driver Guide HOWTO
==========================

:Authors: Tom L Nguyen; Martine Silbermann; Matthew Wilcox

:Copyright: 2003, 2008 Intel Corporation

About this guide
================

This guide describes the basics of Message Signaled Interrupts (MSIs),
the advantages of using MSI over traditional interrupt mechanisms, how
to change your driver to use MSI or MSI-X and some basic diagnostics to
try if a device doesn't support MSIs.


What are MSIs?
==============

A Message Signaled Interrupt is a write from the device to a special
address which causes an interrupt to be received by the CPU.

The MSI capability was first specified in PCI 2.2 and was later enhanced
in PCI 3.0 to allow each interrupt to be masked individually. The MSI-X
capability was also introduced with PCI 3.0. It supports more interrupts
per device than MSI and allows interrupts to be independently configured.

Devices may support both MSI and MSI-X, but only one can be enabled at
a time.
Why use MSIs?
=============

There are three reasons why using MSIs can give an advantage over
traditional pin-based interrupts.

Pin-based PCI interrupts are often shared amongst several devices.
To support this, the kernel must call each interrupt handler associated
with an interrupt, which leads to reduced performance for the system as
a whole. MSIs are never shared, so this problem cannot arise.

When a device writes data to memory, then raises a pin-based interrupt,
it is possible that the interrupt may arrive before all the data has
arrived in memory (this becomes more likely with devices behind PCI-PCI
bridges). In order to ensure that all the data has arrived in memory,
the interrupt handler must read a register on the device which raised
the interrupt. PCI transaction ordering rules require that all the data
arrive in memory before the value may be returned from the register.
Using MSIs avoids this problem as the interrupt-generating write cannot
pass the data writes, so by the time the interrupt is raised, the driver
knows that all the data has arrived in memory.

PCI devices can only support a single pin-based interrupt per function.
Often drivers have to query the device to find out what event has
occurred, slowing down interrupt handling for the common case. With
MSIs, a device can support more interrupts, allowing each interrupt
to be specialised to a different purpose. One possible design gives
infrequent conditions (such as errors) their own interrupt which allows
the driver to handle the normal interrupt handling path more efficiently.
Other possible designs include giving one interrupt to each packet queue
in a network card or each port in a storage controller.
How to use MSIs
===============

PCI devices are initialised to use pin-based interrupts. The device
driver has to set up the device to use MSI or MSI-X. Not all machines
support MSIs correctly, and for those machines, the APIs described below
will simply fail and the device will continue to use pin-based interrupts.

Include kernel support for MSIs
-------------------------------

To support MSI or MSI-X, the kernel must be built with the CONFIG_PCI_MSI
option enabled. This option is only available on some architectures,
and it may depend on some other options also being set. For example,
on x86, you must also enable X86_UP_APIC or SMP in order to see the
CONFIG_PCI_MSI option.
Using MSI
---------

Most of the hard work is done for the driver in the PCI layer. The driver
simply has to request that the PCI layer set up the MSI capability for this
device.

To automatically use MSI or MSI-X interrupt vectors, use the following
function::

  int pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
		unsigned int max_vecs, unsigned int flags);

which allocates up to max_vecs interrupt vectors for a PCI device. It
returns the number of vectors allocated or a negative error. If the device
has a requirement for a minimum number of vectors the driver can pass a
min_vecs argument set to this limit, and the PCI core will return -ENOSPC
if it can't meet the minimum number of vectors.
The flags argument is used to specify which type of interrupt can be used
by the device and the driver (PCI_IRQ_LEGACY, PCI_IRQ_MSI, PCI_IRQ_MSIX).
A convenient short-hand (PCI_IRQ_ALL_TYPES) is also available to ask for
any possible kind of interrupt. If the PCI_IRQ_AFFINITY flag is set,
pci_alloc_irq_vectors() will spread the interrupts around the available CPUs.

To get the Linux IRQ numbers passed to request_irq() and free_irq() and the
vectors, use the following function::

  int pci_irq_vector(struct pci_dev *dev, unsigned int nr);

Any allocated resources should be freed before removing the device using
the following function::

  void pci_free_irq_vectors(struct pci_dev *dev);

If a device supports both MSI-X and MSI capabilities, this API will use the
MSI-X facilities in preference to the MSI facilities. MSI-X supports any
number of interrupts between 1 and 2048. In contrast, MSI is restricted to
a maximum of 32 interrupts (and must be a power of two). In addition, the
MSI interrupt vectors must be allocated consecutively, so the system might
not be able to allocate as many vectors for MSI as it could for MSI-X. On
some platforms, MSI interrupts must all be targeted at the same set of CPUs
whereas MSI-X interrupts can all be targeted at different CPUs.

If a device supports neither MSI-X nor MSI it will fall back to a single
legacy IRQ vector.
The typical usage of MSI or MSI-X interrupts is to allocate as many vectors
|
||||||
|
as possible, likely up to the limit supported by the device. If nvec is
|
||||||
|
larger than the number supported by the device it will automatically be
|
||||||
|
capped to the supported limit, so there is no need to query the number of
|
||||||
|
vectors supported beforehand::
|
||||||
|
|
||||||
|
nvec = pci_alloc_irq_vectors(pdev, 1, nvec, PCI_IRQ_ALL_TYPES)
|
||||||
|
if (nvec < 0)
|
||||||
|
goto out_err;
|
||||||
|
|
||||||
|
If a driver is unable or unwilling to deal with a variable number of MSI
|
||||||
|
interrupts it can request a particular number of interrupts by passing that
|
||||||
|
number to pci_alloc_irq_vectors() function as both 'min_vecs' and
|
||||||
|
'max_vecs' parameters::
|
||||||
|
|
||||||
|
ret = pci_alloc_irq_vectors(pdev, nvec, nvec, PCI_IRQ_ALL_TYPES);
|
||||||
|
if (ret < 0)
|
||||||
|
goto out_err;
|
||||||
|
|
||||||
|
The most notorious example of the request type described above is enabling
|
||||||
|
the single MSI mode for a device. It could be done by passing two 1s as
|
||||||
|
'min_vecs' and 'max_vecs'::
|
||||||
|
|
||||||
|
ret = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_ALL_TYPES);
|
||||||
|
if (ret < 0)
|
||||||
|
goto out_err;
|
||||||
|
|
||||||
|
Some devices might not support using legacy line interrupts, in which case
|
||||||
|
the driver can specify that only MSI or MSI-X is acceptable::
|
||||||
|
|
||||||
|
nvec = pci_alloc_irq_vectors(pdev, 1, nvec, PCI_IRQ_MSI | PCI_IRQ_MSIX);
|
||||||
|
if (nvec < 0)
|
||||||
|
goto out_err;
|
||||||
|
|
||||||
|
Legacy APIs
-----------

The following old APIs to enable and disable MSI or MSI-X interrupts should
not be used in new code::

  pci_enable_msi()              /* deprecated */
  pci_disable_msi()             /* deprecated */
  pci_enable_msix_range()       /* deprecated */
  pci_enable_msix_exact()       /* deprecated */
  pci_disable_msix()            /* deprecated */

Additionally there are APIs to provide the number of supported MSI or MSI-X
vectors: pci_msi_vec_count() and pci_msix_vec_count().  In general these
should be avoided in favor of letting pci_alloc_irq_vectors() cap the
number of vectors.  If you have a legitimate special use case for the count
of vectors we might have to revisit that decision and add a
pci_nr_irq_vectors() helper that handles MSI and MSI-X transparently.
Considerations when using MSIs
------------------------------

Spinlocks
~~~~~~~~~

Most device drivers have a per-device spinlock which is taken in the
interrupt handler.  With pin-based interrupts or a single MSI, it is not
necessary to disable interrupts (Linux guarantees the same interrupt will
not be re-entered).  If a device uses multiple interrupts, the driver
must disable interrupts while the lock is held.  If the device sends
a different interrupt, the driver will deadlock trying to recursively
acquire the spinlock.  Such deadlocks can be avoided by using
spin_lock_irqsave() or spin_lock_irq() which disable local interrupts
and acquire the lock (see Documentation/kernel-hacking/locking.rst).
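The deadlock scenario above can be sketched as follows; ``struct foo_dev``
and both functions are hypothetical, and the sketch only shows the locking
discipline, not a complete driver:

```c
/*
 * Hedged sketch: one per-device lock shared between an interrupt
 * handler and process context in a driver using multiple MSI vectors.
 */
static irqreturn_t foo_irq(int irq, void *data)
{
	struct foo_dev *foo = data;

	spin_lock(&foo->lock);		/* interrupts already off here */
	/* ... acknowledge and process the event ... */
	spin_unlock(&foo->lock);
	return IRQ_HANDLED;
}

static void foo_submit(struct foo_dev *foo)
{
	unsigned long flags;

	/*
	 * Process context must disable local interrupts while holding
	 * the lock; otherwise a second vector's handler could interrupt
	 * this section and deadlock trying to take the same lock.
	 */
	spin_lock_irqsave(&foo->lock, flags);
	/* ... queue work to the device ... */
	spin_unlock_irqrestore(&foo->lock, flags);
}
```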
How to tell whether MSI/MSI-X is enabled on a device
----------------------------------------------------

Using 'lspci -v' (as root) may show some devices with "MSI", "Message
Signalled Interrupts" or "MSI-X" capabilities.  Each of these capabilities
has an 'Enable' flag which is followed with either "+" (enabled)
or "-" (disabled).

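The "+"/"-" convention is easy to check mechanically.  The sketch below
parses capability lines out of captured ``lspci -v`` text; the embedded
sample output is illustrative, not from a specific device:

```python
import re

def msi_capabilities(lspci_text):
    """Map each MSI-family capability found in captured `lspci -v`
    output to whether its Enable flag shows '+' (on) or '-' (off)."""
    caps = {}
    for m in re.finditer(r'(MSI-X|MSI): Enable([+-])', lspci_text):
        caps[m.group(1)] = m.group(2) == '+'
    return caps

# Illustrative capture, not from a specific device:
sample = """\
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [70] MSI-X: Enable+ Count=10 Masked-
"""
caps = msi_capabilities(sample)
```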
MSI quirks
==========

Several PCI chipsets or devices are known not to support MSIs.
The PCI stack provides three ways to disable MSIs:

1. globally
2. on all devices behind a specific bridge
3. on a single device

Disabling MSIs globally
-----------------------

Some host chipsets simply don't support MSIs properly.  If we're
lucky, the manufacturer knows this and has indicated it in the ACPI
FADT table.  In this case, Linux automatically disables MSIs.
Some boards don't include this information in the table and so we have
to detect them ourselves.  The complete list of these is found near the
quirk_disable_all_msi() function in drivers/pci/quirks.c.

If you have a board which has problems with MSIs, you can pass pci=nomsi
on the kernel command line to disable MSIs on all devices.  It would be
in your best interests to report the problem to linux-pci@vger.kernel.org
including a full 'lspci -v' so we can add the quirks to the kernel.

Disabling MSIs below a bridge
-----------------------------

Some PCI bridges are not able to route MSIs between busses properly.
In this case, MSIs must be disabled on all devices behind the bridge.

Some bridges allow you to enable MSIs by changing some bits in their
PCI configuration space (especially the Hypertransport chipsets such
as the nVidia nForce and Serverworks HT2000).  As with host chipsets,
Linux mostly knows about them and automatically enables MSIs if it can.
If you have a bridge unknown to Linux, you can enable
MSIs in configuration space using whatever method you know works, then
enable MSIs on that bridge by doing::

  echo 1 > /sys/bus/pci/devices/$bridge/msi_bus

where $bridge is the PCI address of the bridge you've enabled (e.g.
0000:00:0e.0).

To disable MSIs, echo 0 instead of 1.  Changing this value should be
done with caution as it could break interrupt handling for all devices
below this bridge.

Again, please notify linux-pci@vger.kernel.org of any bridges that need
special handling.

Disabling MSIs on a single device
---------------------------------

Some devices are known to have faulty MSI implementations.  Usually this
is handled in the individual device driver, but occasionally it's necessary
to handle this with a quirk.  Some drivers have an option to disable use
of MSI.  While this is a convenient workaround for the driver author,
it is not good practice, and should not be emulated.

Finding why MSIs are disabled on a device
-----------------------------------------

From the above three sections, you can see that there are many reasons
why MSIs may not be enabled for a given device.  Your first step should
be to examine your dmesg carefully to determine whether MSIs are enabled
for your machine.  You should also check your .config to be sure you
have enabled CONFIG_PCI_MSI.

Then, 'lspci -t' gives the list of bridges above a device.  Reading
`/sys/bus/pci/devices/*/msi_bus` will tell you whether MSIs are enabled (1)
or disabled (0).  If 0 is found in any of the msi_bus files belonging
to bridges between the PCI root and the device, MSIs are disabled.
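Expressed as a predicate, the rule above is simply a conjunction over the
path.  The helper below takes the msi_bus values already read from each
bridge between the root and the device (the sysfs reading itself is left
out); it is an illustrative sketch, not a kernel interface:

```python
def msi_path_allows_msi(msi_bus_values):
    """Return True when every bridge between the PCI root and the
    device has msi_bus set to 1; a single 0 anywhere on the path
    disables MSIs for the device."""
    return all(int(v) == 1 for v in msi_bus_values)

# Values as read from each bridge's msi_bus file, root first:
ok = msi_path_allows_msi(["1", "1", "1"])       # all bridges forward MSIs
blocked = msi_path_allows_msi(["1", "0", "1"])  # one bridge blocks them
```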

It is also worth checking the device driver to see whether it supports MSIs.
For example, it may contain calls to pci_alloc_irq_vectors() with the
PCI_IRQ_MSI or PCI_IRQ_MSIX flags.
Documentation/PCI/pci-error-recovery.rst

.. SPDX-License-Identifier: GPL-2.0

==================
PCI Error Recovery
==================


:Authors: - Linas Vepstas <linasvepstas@gmail.com>
          - Richard Lary <rlary@us.ibm.com>
          - Mike Mason <mmlnx@us.ibm.com>


Many PCI bus controllers are able to detect a variety of hardware
PCI errors on the bus, such as parity errors on the data and address
buses, as well as SERR and PERR errors.  Some of the more advanced
chipsets are able to deal with these errors; these include PCI-E chipsets,
and the PCI-host bridges found on IBM Power4, Power5 and Power6-based
pSeries boxes.  A typical action taken is to disconnect the affected device,
halting all I/O to it.  The goal of a disconnection is to avoid system
corruption; for example, to halt system memory corruption due to DMA's
to "wild" addresses.  Typically, a reconnection mechanism is also
offered, so that the affected PCI device(s) are reset and put back
into working condition.  The reset phase requires coordination
between the affected device drivers and the PCI controller chip.
This document describes a generic API for notifying device drivers
of a bus disconnection, and then performing error recovery.
This API is currently implemented in the 2.6.16 and later kernels.

Reporting and recovery is performed in several steps.  First, when
a PCI hardware error has resulted in a bus disconnect, that event
is reported as soon as possible to all affected device drivers,
including multiple instances of a device driver on multi-function
cards.  This allows device drivers to avoid deadlocking in spinloops,
waiting for some i/o-space register to change, when it never will.
It also gives the drivers a chance to defer incoming I/O as
needed.

Next, recovery is performed in several stages.  Most of the complexity
is forced by the need to handle multi-function devices, that is,
devices that have multiple device drivers associated with them.
In the first stage, each driver is allowed to indicate what type
of reset it desires, the choices being a simple re-enabling of I/O
or requesting a slot reset.

If any driver requests a slot reset, that is what will be done.

After a reset and/or a re-enabling of I/O, all drivers are
again notified, so that they may then perform any device setup/config
that may be required.  After these have all completed, a final
"resume normal operations" event is sent out.

The biggest reason for choosing a kernel-based implementation rather
than a user-space implementation was the need to deal with bus
disconnects of PCI devices attached to storage media, and, in particular,
disconnects from devices holding the root file system.  If the root
file system is disconnected, a user-space mechanism would have to go
through a large number of contortions to complete recovery.  Almost all
of the current Linux file systems are not tolerant of disconnection
from/reconnection to their underlying block device.  By contrast,
bus errors are easy to manage in the device driver.  Indeed, most
device drivers already handle very similar recovery procedures;
for example, the SCSI-generic layer already provides significant
mechanisms for dealing with SCSI bus errors and SCSI bus resets.


Detailed Design
===============

Design and implementation details below, based on a chain of
public email discussions with Ben Herrenschmidt, circa 5 April 2005.

The error recovery API support is exposed to the driver in the form of
a structure of function pointers pointed to by a new field in struct
pci_driver.  A driver that fails to provide the structure is "non-aware",
and the actual recovery steps taken are platform dependent.  The
arch/powerpc implementation will simulate a PCI hotplug remove/add.

This structure has the form::

  struct pci_error_handlers
  {
          int (*error_detected)(struct pci_dev *dev, enum pci_channel_state);
          int (*mmio_enabled)(struct pci_dev *dev);
          int (*slot_reset)(struct pci_dev *dev);
          void (*resume)(struct pci_dev *dev);
  };

The possible channel states are::

  enum pci_channel_state {
          pci_channel_io_normal,          /* I/O channel is in normal state */
          pci_channel_io_frozen,          /* I/O to channel is blocked */
          pci_channel_io_perm_failure,    /* PCI card is dead */
  };

Possible return values are::

  enum pci_ers_result {
          PCI_ERS_RESULT_NONE,        /* no result/none/not supported in device driver */
          PCI_ERS_RESULT_CAN_RECOVER, /* Device driver can recover without slot reset */
          PCI_ERS_RESULT_NEED_RESET,  /* Device driver wants slot to be reset. */
          PCI_ERS_RESULT_DISCONNECT,  /* Device has completely failed, is unrecoverable */
          PCI_ERS_RESULT_RECOVERED,   /* Device driver is fully recovered and operational */
  };
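A driver opts in by pointing the err_handler field of its struct pci_driver
at such a structure.  The sketch below shows only the wiring; all of the
"foo" names and the particular set of callbacks chosen are illustrative
assumptions, not taken from a real driver:

```c
/* Hedged sketch: hypothetical "foo" driver wiring up recovery callbacks. */
static const struct pci_error_handlers foo_err_handlers = {
	.error_detected	= foo_error_detected,
	.mmio_enabled	= foo_mmio_enabled,
	.slot_reset	= foo_slot_reset,
	.resume		= foo_resume,
};

static struct pci_driver foo_pci_driver = {
	.name		= "foo",
	.id_table	= foo_pci_ids,
	.probe		= foo_probe,
	.remove		= foo_remove,
	.err_handler	= &foo_err_handlers,
};
```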
A driver does not have to implement all of these callbacks; however,
if it implements any, it must implement error_detected().  If a callback
is not implemented, the corresponding feature is considered unsupported.
For example, if mmio_enabled() and resume() aren't there, then it
is assumed that the driver is not doing any direct recovery and requires
a slot reset.  Typically a driver will want to know about
a slot_reset().

The actual steps taken by a platform to recover from a PCI error
event will be platform-dependent, but will follow the general
sequence described below.

STEP 0: Error Event
-------------------
A PCI bus error is detected by the PCI hardware.  On powerpc, the slot
is isolated, in that all I/O is blocked: all reads return 0xffffffff,
all writes are ignored.


STEP 1: Notification
--------------------
Platform calls the error_detected() callback on every instance of
every driver affected by the error.

At this point, the device might not be accessible anymore, depending on
the platform (the slot will be isolated on powerpc).  The driver may
already have "noticed" the error because of a failing I/O, but this
is the proper "synchronization point", that is, it gives the driver
a chance to cleanup, waiting for pending stuff (timers, whatever, etc...)
to complete; it can take semaphores, schedule, etc... everything but
touch the device.  Within this function and after it returns, the driver
shouldn't do any new IOs.  Called in task context.  This is sort of a
"quiesce" point.  See note about interrupts at the end of this doc.

All drivers participating in this system must implement this call.
The driver must return one of the following result codes:

- PCI_ERS_RESULT_CAN_RECOVER
    Driver returns this if it thinks it might be able to recover
    the HW by just banging IOs or if it wants to be given
    a chance to extract some diagnostic information (see
    mmio_enable, below).
- PCI_ERS_RESULT_NEED_RESET
    Driver returns this if it can't recover without a
    slot reset.
- PCI_ERS_RESULT_DISCONNECT
    Driver returns this if it doesn't want to recover at all.
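A minimal error_detected() following the rules above might look like the
sketch below.  Everything named "foo" is hypothetical, and the quiescing
details are entirely driver-specific; the sketch simply asks for a slot
reset rather than attempting early recovery:

```c
/* Hedged sketch: quiesce without touching the isolated device. */
static pci_ers_result_t foo_error_detected(struct pci_dev *pdev,
					   pci_channel_state_t state)
{
	struct foo_dev *foo = pci_get_drvdata(pdev);

	if (state == pci_channel_io_perm_failure)
		return PCI_ERS_RESULT_DISCONNECT;

	/* Stop queueing new I/O; do not read or write the device. */
	foo_stop_submitting_io(foo);

	/* Ask for a slot reset rather than early (MMIO) recovery. */
	return PCI_ERS_RESULT_NEED_RESET;
}
```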
The next step taken will depend on the result codes returned by the
drivers.

If all drivers on the segment/slot return PCI_ERS_RESULT_CAN_RECOVER,
then the platform should re-enable IOs on the slot (or do nothing in
particular, if the platform doesn't isolate slots), and recovery
proceeds to STEP 2 (MMIO Enable).

If any driver requested a slot reset (by returning PCI_ERS_RESULT_NEED_RESET),
then recovery proceeds to STEP 4 (Slot Reset).

If the platform is unable to recover the slot, the next step
is STEP 6 (Permanent Failure).

.. note::

   The current powerpc implementation assumes that a device driver will
   *not* schedule or semaphore in this routine; the current powerpc
   implementation uses one kernel thread to notify all devices;
   thus, if one device sleeps/schedules, all devices are affected.
   Doing better requires complex multi-threaded logic in the error
   recovery implementation (e.g. waiting for all notification threads
   to "join" before proceeding with recovery.)  This seems excessively
   complex and not worth implementing.

   The current powerpc implementation doesn't much care if the device
   attempts I/O at this point, or not.  I/O's will fail, returning
   a value of 0xff on read, and writes will be dropped.  If more than
   EEH_MAX_FAILS I/O's are attempted to a frozen adapter, EEH
   assumes that the device driver has gone into an infinite loop
   and prints an error to syslog.  A reboot is then required to
   get the device working again.

STEP 2: MMIO Enabled
--------------------
The platform re-enables MMIO to the device (but typically not the
DMA), and then calls the mmio_enabled() callback on all affected
device drivers.

This is the "early recovery" call.  IOs are allowed again, but DMA is
not, with some restrictions.  This is NOT a callback for the driver to
start operations again, only to peek/poke at the device, extract diagnostic
information, if any, and eventually do things like trigger a device local
reset or some such, but not restart operations.  This callback is made if
all drivers on a segment agree that they can try to recover and if no automatic
link reset was performed by the HW.  If the platform can't just re-enable IOs
without a slot reset or a link reset, it will not call this callback, and
instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset).

.. note::

   The following is proposed; no platform implements this yet:
   Proposal: All I/O's should be done _synchronously_ from within
   this callback, errors triggered by them will be returned via
   the normal pci_check_whatever() API, no new error_detected()
   callback will be issued due to an error happening here.  However,
   such an error might cause IOs to be re-blocked for the whole
   segment, and thus invalidate the recovery that other devices
   on the same segment might have done, forcing the whole segment
   into one of the next states, that is, link reset or slot reset.

The driver should return one of the following result codes:

- PCI_ERS_RESULT_RECOVERED
    Driver returns this if it thinks the device is fully
    functional and thinks it is ready to start
    normal driver operations again.  There is no
    guarantee that the driver will actually be
    allowed to proceed, as another driver on the
    same segment might have failed and thus triggered a
    slot reset on platforms that support it.

- PCI_ERS_RESULT_NEED_RESET
    Driver returns this if it thinks the device is not
    recoverable in its current state and it needs a slot
    reset to proceed.

- PCI_ERS_RESULT_DISCONNECT
    Same as above.  Total failure, no recovery is expected even
    after reset; the driver should consider the device dead.
    (To be defined more precisely.)

The next step taken depends on the results returned by the drivers.
If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform
proceeds to either STEP 3 (Link Reset) or to STEP 5 (Resume Operations).

If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform
proceeds to STEP 4 (Slot Reset).

STEP 3: Link Reset
------------------
The platform resets the link.  This is a PCI-Express specific step
and is done whenever a fatal error has been detected that can be
"solved" by resetting the link.

STEP 4: Slot Reset
------------------

In response to a return value of PCI_ERS_RESULT_NEED_RESET, the
platform will perform a slot reset on the requesting PCI device(s).
The actual steps taken by a platform to perform a slot reset
will be platform-dependent.  Upon completion of slot reset, the
platform will call the device slot_reset() callback.

Powerpc platforms implement two levels of slot reset:
soft reset (default) and fundamental (optional) reset.

Powerpc soft reset consists of asserting the adapter #RST line and then
restoring the PCI BAR's and PCI configuration header to a state
that is equivalent to what it would be after a fresh system
power-on followed by power-on BIOS/system firmware initialization.
Soft reset is also known as hot-reset.

Powerpc fundamental reset is supported by PCI Express cards only
and resets the device's state machines, hardware logic, port states and
configuration registers to their default conditions.

For most PCI devices, a soft reset will be sufficient for recovery.
Optional fundamental reset is provided to support a limited number
of PCI Express devices for which a soft reset is not sufficient
for recovery.

If the platform supports PCI hotplug, then the reset might be
performed by toggling the slot electrical power off/on.

It is important for the platform to restore the PCI config space
to the "fresh poweron" state, rather than the "last state".  After
a slot reset, the device driver will almost always use its standard
device initialization routines, and an unusual config space setup
may result in hung devices, kernel panics, or silent data corruption.

This call gives drivers the chance to re-initialize the hardware
(re-download firmware, etc.).  At this point, the driver may assume
that the card is in a fresh state and is fully functional.  The slot
is unfrozen and the driver has full access to PCI config space,
memory mapped I/O space and DMA.  Interrupts (Legacy, MSI, or MSI-X)
will also be available.

Drivers should not restart normal I/O processing operations
at this point.  If all device drivers report success on this
callback, the platform will call resume() to complete the sequence,
and let the driver restart normal I/O processing.

A driver can still return a critical failure for this function if
it can't get the device operational after reset.  If the platform
previously tried a soft reset, it might now try a hard reset (power
cycle) and then call slot_reset() again.  If the device still can't
be recovered, there is nothing more that can be done; the platform
will typically report a "permanent failure" in such a case.  The
device will be considered "dead" in this case.

Drivers for multi-function cards will need to coordinate among
themselves as to which driver instance will perform any "one-shot"
or global device initialization.  For example, the Symbios sym53cxx2
driver performs device init only from PCI function 0::

  +       if (PCI_FUNC(pdev->devfn) == 0)
  +               sym_reset_scsi_bus(np, 0);

Result codes:
  - PCI_ERS_RESULT_DISCONNECT
    Same as above.

Drivers for PCI Express cards that require a fundamental reset must
set the needs_freset bit in the pci_dev structure in their probe function.
For example, the QLogic qla2xxx driver sets the needs_freset bit for certain
PCI card types::

  +       /* Set EEH reset type to fundamental if required by hba */
  +       if (IS_QLA24XX(ha) || IS_QLA25XX(ha) || IS_QLA81XX(ha))
  +               pdev->needs_freset = 1;
  +

Platform proceeds either to STEP 5 (Resume Operations) or STEP 6 (Permanent
Failure).

.. note::

   The current powerpc implementation does not try a power-cycle
   reset if the driver returned PCI_ERS_RESULT_DISCONNECT.
   However, it probably should.


STEP 5: Resume Operations
-------------------------
The platform will call the resume() callback on all affected device
drivers if all drivers on the segment have returned
PCI_ERS_RESULT_RECOVERED from one of the 3 previous callbacks.
The goal of this callback is to tell the driver to restart activity,
that everything is back and running.  This callback does not return
a result code.

At this point, if a new error happens, the platform will restart
a new error recovery sequence.

STEP 6: Permanent Failure
-------------------------
A "permanent failure" has occurred, and the platform cannot recover
the device.  The platform will call error_detected() with a
pci_channel_state value of pci_channel_io_perm_failure.

The device driver should, at this point, assume the worst.  It should
cancel all pending I/O, refuse all new I/O, returning -EIO to
higher layers.  The device driver should then clean up all of its
memory and remove itself from kernel operations, much as it would
during system shutdown.

The platform will typically notify the system operator of the
permanent failure in some way.  If the device is hotplug-capable,
the operator will probably want to remove and replace the device.
Note, however, not all failures are truly "permanent".  Some are
caused by over-heating, some by a poorly seated card.  Many
PCI error events are caused by software bugs, e.g. DMA's to
wild addresses or bogus split transactions due to programming
errors.  See the discussion in powerpc/eeh-pci-error-recovery.txt
for additional detail on real-life experience of the causes of
software errors.


Conclusion; General Remarks
---------------------------
The way the callbacks are called is platform policy.  A platform with
no slot reset capability may want to just "ignore" drivers that can't
recover (disconnect them) and try to let other cards on the same segment
recover.  Keep in mind that in most real life cases, though, there will
be only one driver per segment.

Now, a note about interrupts.  If you get an interrupt and your
device is dead or has been isolated, there is a problem :)
The current policy is to turn this into a platform policy.
That is, the recovery API only requires that:

- There is no guarantee that interrupt delivery can proceed from any
  device on the segment starting from the error detection and until the
  slot_reset callback is called, at which point interrupts are expected
  to be fully operational.

- There is no guarantee that interrupt delivery is stopped, that is,
  a driver that gets an interrupt after detecting an error, or that detects
  an error within the interrupt handler such that it prevents proper
  ack'ing of the interrupt (and thus removal of the source) should just
  return IRQ_NONE.  It's up to the platform to deal with that
  condition, typically by masking the IRQ source during the duration of
  the error handling.  It is expected that the platform "knows" which
  interrupts are routed to error-management capable slots and can deal
  with temporarily disabling that IRQ number during error processing (this
  isn't terribly complex).  That means some IRQ latency for other devices
  sharing the interrupt, but there is simply no other way.  High end
  platforms aren't supposed to share interrupts between many devices
  anyway :)

.. note::

   Implementation details for the powerpc platform are discussed in
   the file Documentation/powerpc/eeh-pci-error-recovery.txt

As of this writing, there is a growing list of device drivers with
patches implementing error recovery.  Not all of these patches are in
mainline yet.  These may be used as "examples":

- drivers/scsi/ipr
- drivers/scsi/sym53c8xx_2
- drivers/scsi/qla2xxx
- drivers/scsi/lpfc
- drivers/net/bnx2.c
- drivers/net/e100.c
- drivers/net/e1000
- drivers/net/e1000e
- drivers/net/ixgb
- drivers/net/ixgbe
- drivers/net/cxgb3
- drivers/net/s2io.c
- drivers/net/qlge
@ -1,413 +0,0 @@
PCI Error Recovery
------------------
February 2, 2006

Current document maintainer:
Linas Vepstas <linasvepstas@gmail.com>
updated by Richard Lary <rlary@us.ibm.com>
and Mike Mason <mmlnx@us.ibm.com> on 27-Jul-2009


Many PCI bus controllers are able to detect a variety of hardware
PCI errors on the bus, such as parity errors on the data and address
buses, as well as SERR and PERR errors. Some of the more advanced
chipsets are able to deal with these errors; these include PCI-E chipsets,
and the PCI-host bridges found on IBM Power4, Power5 and Power6-based
pSeries boxes. A typical action taken is to disconnect the affected device,
halting all I/O to it. The goal of a disconnection is to avoid system
corruption; for example, to halt system memory corruption due to DMA's
to "wild" addresses. Typically, a reconnection mechanism is also
offered, so that the affected PCI device(s) are reset and put back
into working condition. The reset phase requires coordination
between the affected device drivers and the PCI controller chip.
This document describes a generic API for notifying device drivers
of a bus disconnection, and then performing error recovery.
This API is currently implemented in the 2.6.16 and later kernels.

Reporting and recovery is performed in several steps. First, when
a PCI hardware error has resulted in a bus disconnect, that event
is reported as soon as possible to all affected device drivers,
including multiple instances of a device driver on multi-function
cards. This allows device drivers to avoid deadlocking in spinloops,
waiting for some i/o-space register to change, when it never will.
It also gives the drivers a chance to defer incoming I/O as
needed.

Next, recovery is performed in several stages. Most of the complexity
is forced by the need to handle multi-function devices, that is,
devices that have multiple device drivers associated with them.
In the first stage, each driver is allowed to indicate what type
of reset it desires, the choices being a simple re-enabling of I/O
or requesting a slot reset.

If any driver requests a slot reset, that is what will be done.

After a reset and/or a re-enabling of I/O, all drivers are
again notified, so that they may then perform any device setup/config
that may be required. After these have all completed, a final
"resume normal operations" event is sent out.

The biggest reason for choosing a kernel-based implementation rather
than a user-space implementation was the need to deal with bus
disconnects of PCI devices attached to storage media, and, in particular,
disconnects from devices holding the root file system. If the root
file system is disconnected, a user-space mechanism would have to go
through a large number of contortions to complete recovery. Almost all
of the current Linux file systems are not tolerant of disconnection
from/reconnection to their underlying block device. By contrast,
bus errors are easy to manage in the device driver. Indeed, most
device drivers already handle very similar recovery procedures;
for example, the SCSI-generic layer already provides significant
mechanisms for dealing with SCSI bus errors and SCSI bus resets.


Detailed Design
---------------
Design and implementation details below, based on a chain of
public email discussions with Ben Herrenschmidt, circa 5 April 2005.

The error recovery API support is exposed to the driver in the form of
a structure of function pointers pointed to by a new field in struct
pci_driver. A driver that fails to provide the structure is "non-aware",
and the actual recovery steps taken are platform dependent. The
arch/powerpc implementation will simulate a PCI hotplug remove/add.

This structure has the form:

	struct pci_error_handlers
	{
		int (*error_detected)(struct pci_dev *dev, enum pci_channel_state);
		int (*mmio_enabled)(struct pci_dev *dev);
		int (*slot_reset)(struct pci_dev *dev);
		void (*resume)(struct pci_dev *dev);
	};

The possible channel states are:

	enum pci_channel_state {
		pci_channel_io_normal,       /* I/O channel is in normal state */
		pci_channel_io_frozen,       /* I/O to channel is blocked */
		pci_channel_io_perm_failure, /* PCI card is dead */
	};

Possible return values are:

	enum pci_ers_result {
		PCI_ERS_RESULT_NONE,        /* no result/none/not supported in device driver */
		PCI_ERS_RESULT_CAN_RECOVER, /* Device driver can recover without slot reset */
		PCI_ERS_RESULT_NEED_RESET,  /* Device driver wants slot to be reset. */
		PCI_ERS_RESULT_DISCONNECT,  /* Device has completely failed, is unrecoverable */
		PCI_ERS_RESULT_RECOVERED,   /* Device driver is fully recovered and operational */
	};

A driver does not have to implement all of these callbacks; however,
if it implements any, it must implement error_detected(). If a callback
is not implemented, the corresponding feature is considered unsupported.
For example, if mmio_enabled() and resume() aren't there, then it
is assumed that the driver is not doing any direct recovery and requires
a slot reset. Typically a driver will want to know about
a slot_reset().
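
The callback table above can be exercised outside the kernel with a small
simulation. The sketch below is illustrative only: the `struct pci_dev`,
enums, and the `demo_*` driver are stand-ins defined locally (the real
definitions live in <linux/pci.h>), and the return type is shown as
`enum pci_ers_result` rather than the bare `int` in the listing above.

	#include <assert.h>
	#include <stdio.h>

	/* Stand-in types; NOT the real kernel definitions. */
	struct pci_dev { int dead; };
	enum pci_channel_state { pci_channel_io_normal, pci_channel_io_frozen,
	                         pci_channel_io_perm_failure };
	enum pci_ers_result { PCI_ERS_RESULT_NONE, PCI_ERS_RESULT_CAN_RECOVER,
	                      PCI_ERS_RESULT_NEED_RESET, PCI_ERS_RESULT_DISCONNECT,
	                      PCI_ERS_RESULT_RECOVERED };

	struct pci_error_handlers {
		enum pci_ers_result (*error_detected)(struct pci_dev *dev,
		                                      enum pci_channel_state state);
		enum pci_ers_result (*mmio_enabled)(struct pci_dev *dev);
		enum pci_ers_result (*slot_reset)(struct pci_dev *dev);
		void (*resume)(struct pci_dev *dev);
	};

	/* A hypothetical driver that always asks for a slot reset. */
	static enum pci_ers_result demo_error_detected(struct pci_dev *dev,
	                                               enum pci_channel_state state)
	{
		(void)dev;
		if (state == pci_channel_io_perm_failure)
			return PCI_ERS_RESULT_DISCONNECT;   /* STEP 6: give up */
		/* Quiesce here: stop new I/O, cancel timers, then ask for a reset. */
		return PCI_ERS_RESULT_NEED_RESET;
	}

	static enum pci_ers_result demo_slot_reset(struct pci_dev *dev)
	{
		/* Re-download firmware, re-init rings, then report the outcome. */
		return dev->dead ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_RECOVERED;
	}

	static void demo_resume(struct pci_dev *dev) { (void)dev; /* restart I/O */ }

	static const struct pci_error_handlers demo_err_handler = {
		.error_detected = demo_error_detected,
		.slot_reset     = demo_slot_reset, /* mmio_enabled omitted: no early recovery */
		.resume         = demo_resume,
	};

	int main(void)
	{
		struct pci_dev dev = { .dead = 0 };
		assert(demo_err_handler.error_detected(&dev, pci_channel_io_frozen)
		       == PCI_ERS_RESULT_NEED_RESET);
		assert(demo_err_handler.slot_reset(&dev) == PCI_ERS_RESULT_RECOVERED);
		printf("recovered\n");
		return 0;
	}

Because mmio_enabled() is left NULL, a platform following the rules above
would treat this driver as requiring a slot reset rather than early recovery.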

The actual steps taken by a platform to recover from a PCI error
event will be platform-dependent, but will follow the general
sequence described below.

STEP 0: Error Event
-------------------
A PCI bus error is detected by the PCI hardware. On powerpc, the slot
is isolated, in that all I/O is blocked: all reads return 0xffffffff,
all writes are ignored.


STEP 1: Notification
--------------------
Platform calls the error_detected() callback on every instance of
every driver affected by the error.

At this point, the device might not be accessible anymore, depending on
the platform (the slot will be isolated on powerpc). The driver may
already have "noticed" the error because of a failing I/O, but this
is the proper "synchronization point", that is, it gives the driver
a chance to cleanup, waiting for pending stuff (timers, whatever, etc...)
to complete; it can take semaphores, schedule, etc... everything but
touch the device. Within this function and after it returns, the driver
shouldn't do any new IOs. Called in task context. This is sort of a
"quiesce" point. See note about interrupts at the end of this doc.

All drivers participating in this system must implement this call.
The driver must return one of the following result codes:

 - PCI_ERS_RESULT_CAN_RECOVER:
   Driver returns this if it thinks it might be able to recover
   the HW by just banging IOs or if it wants to be given
   a chance to extract some diagnostic information (see
   mmio_enabled(), below).
 - PCI_ERS_RESULT_NEED_RESET:
   Driver returns this if it can't recover without a
   slot reset.
 - PCI_ERS_RESULT_DISCONNECT:
   Driver returns this if it doesn't want to recover at all.

The next step taken will depend on the result codes returned by the
drivers.

If all drivers on the segment/slot return PCI_ERS_RESULT_CAN_RECOVER,
then the platform should re-enable IOs on the slot (or do nothing in
particular, if the platform doesn't isolate slots), and recovery
proceeds to STEP 2 (MMIO Enable).

If any driver requested a slot reset (by returning PCI_ERS_RESULT_NEED_RESET),
then recovery proceeds to STEP 4 (Slot Reset).

If the platform is unable to recover the slot, the next step
is STEP 6 (Permanent Failure).
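
The branching policy just described can be condensed into a small, testable
merge function. This is an illustrative sketch of the policy, not the
platform's actual code; the enum is a local stand-in for the kernel's.

	#include <assert.h>
	#include <stdio.h>

	enum pci_ers_result { PCI_ERS_RESULT_NONE, PCI_ERS_RESULT_CAN_RECOVER,
	                      PCI_ERS_RESULT_NEED_RESET, PCI_ERS_RESULT_DISCONNECT };

	/* Merge per-driver answers as described above: any NEED_RESET wins
	 * over CAN_RECOVER, and DISCONNECT marks the slot unrecoverable. */
	static enum pci_ers_result merge_results(const enum pci_ers_result *r, int n)
	{
		enum pci_ers_result final = PCI_ERS_RESULT_CAN_RECOVER;
		for (int i = 0; i < n; i++) {
			if (r[i] == PCI_ERS_RESULT_DISCONNECT)
				return PCI_ERS_RESULT_DISCONNECT;   /* -> STEP 6 */
			if (r[i] == PCI_ERS_RESULT_NEED_RESET)
				final = PCI_ERS_RESULT_NEED_RESET;  /* -> STEP 4 */
		}
		return final;                                   /* -> STEP 2 */
	}

	int main(void)
	{
		enum pci_ers_result multi[] = { PCI_ERS_RESULT_CAN_RECOVER,
		                                PCI_ERS_RESULT_NEED_RESET };
		enum pci_ers_result ok[]    = { PCI_ERS_RESULT_CAN_RECOVER,
		                                PCI_ERS_RESULT_CAN_RECOVER };
		/* One NEED_RESET on a multi-function card forces a slot reset. */
		assert(merge_results(multi, 2) == PCI_ERS_RESULT_NEED_RESET);
		assert(merge_results(ok, 2) == PCI_ERS_RESULT_CAN_RECOVER);
		printf("policy ok\n");
		return 0;
	}

This is why multi-function devices dominate the design complexity: one
function's answer can override the recovery path of its siblings.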

>>> The current powerpc implementation assumes that a device driver will
>>> *not* schedule or semaphore in this routine; the current powerpc
>>> implementation uses one kernel thread to notify all devices;
>>> thus, if one device sleeps/schedules, all devices are affected.
>>> Doing better requires complex multi-threaded logic in the error
>>> recovery implementation (e.g. waiting for all notification threads
>>> to "join" before proceeding with recovery.) This seems excessively
>>> complex and not worth implementing.

>>> The current powerpc implementation doesn't much care if the device
>>> attempts I/O at this point, or not. I/O's will fail, returning
>>> a value of 0xff on read, and writes will be dropped. If more than
>>> EEH_MAX_FAILS I/O's are attempted to a frozen adapter, EEH
>>> assumes that the device driver has gone into an infinite loop
>>> and prints an error to syslog. A reboot is then required to
>>> get the device working again.

STEP 2: MMIO Enabled
--------------------
The platform re-enables MMIO to the device (but typically not the
DMA), and then calls the mmio_enabled() callback on all affected
device drivers.

This is the "early recovery" call. IOs are allowed again, but DMA is
not, with some restrictions. This is NOT a callback for the driver to
start operations again, only to peek/poke at the device, extract diagnostic
information, if any, and eventually do things like trigger a device local
reset or some such, but not restart operations. This callback is made if
all drivers on a segment agree that they can try to recover and if no automatic
link reset was performed by the HW. If the platform can't just re-enable IOs
without a slot reset or a link reset, it will not call this callback, and
instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset).

>>> The following is proposed; no platform implements this yet:
>>> Proposal: All I/O's should be done _synchronously_ from within
>>> this callback, errors triggered by them will be returned via
>>> the normal pci_check_whatever() API, no new error_detected()
>>> callback will be issued due to an error happening here. However,
>>> such an error might cause IOs to be re-blocked for the whole
>>> segment, and thus invalidate the recovery that other devices
>>> on the same segment might have done, forcing the whole segment
>>> into one of the next states, that is, link reset or slot reset.

The driver should return one of the following result codes:

 - PCI_ERS_RESULT_RECOVERED
   Driver returns this if it thinks the device is fully
   functional and thinks it is ready to start
   normal driver operations again. There is no
   guarantee that the driver will actually be
   allowed to proceed, as another driver on the
   same segment might have failed and thus triggered a
   slot reset on platforms that support it.
 - PCI_ERS_RESULT_NEED_RESET
   Driver returns this if it thinks the device is not
   recoverable in its current state and it needs a slot
   reset to proceed.
 - PCI_ERS_RESULT_DISCONNECT
   Same as above. Total failure, no recovery even after
   reset; driver dead. (To be defined more precisely.)

The next step taken depends on the results returned by the drivers.
If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform
proceeds to either STEP 3 (Link Reset) or to STEP 5 (Resume Operations).

If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform
proceeds to STEP 4 (Slot Reset).

STEP 3: Link Reset
------------------
The platform resets the link. This is a PCI-Express specific step
and is done whenever a fatal error has been detected that can be
"solved" by resetting the link.

STEP 4: Slot Reset
------------------

In response to a return value of PCI_ERS_RESULT_NEED_RESET, the
platform will perform a slot reset on the requesting PCI device(s).
The actual steps taken by a platform to perform a slot reset
will be platform-dependent. Upon completion of slot reset, the
platform will call the device slot_reset() callback.

Powerpc platforms implement two levels of slot reset:
soft reset (default) and fundamental (optional) reset.

Powerpc soft reset consists of asserting the adapter #RST line and then
restoring the PCI BAR's and PCI configuration header to a state
that is equivalent to what it would be after a fresh system
power-on followed by power-on BIOS/system firmware initialization.
Soft reset is also known as hot-reset.

Powerpc fundamental reset is supported by PCI Express cards only
and results in the device's state machines, hardware logic, port states and
configuration registers initializing to their default conditions.

For most PCI devices, a soft reset will be sufficient for recovery.
Optional fundamental reset is provided to support a limited number
of PCI Express devices for which a soft reset is not sufficient
for recovery.

If the platform supports PCI hotplug, then the reset might be
performed by toggling the slot electrical power off/on.

It is important for the platform to restore the PCI config space
to the "fresh poweron" state, rather than the "last state". After
a slot reset, the device driver will almost always use its standard
device initialization routines, and an unusual config space setup
may result in hung devices, kernel panics, or silent data corruption.

This call gives drivers the chance to re-initialize the hardware
(re-download firmware, etc.). At this point, the driver may assume
that the card is in a fresh state and is fully functional. The slot
is unfrozen and the driver has full access to PCI config space,
memory mapped I/O space and DMA. Interrupts (Legacy, MSI, or MSI-X)
will also be available.

Drivers should not restart normal I/O processing operations
at this point. If all device drivers report success on this
callback, the platform will call resume() to complete the sequence,
and let the driver restart normal I/O processing.

A driver can still return a critical failure for this function if
it can't get the device operational after reset. If the platform
previously tried a soft reset, it might now try a hard reset (power
cycle) and then call slot_reset() again. If the device still can't
be recovered, there is nothing more that can be done; the platform
will typically report a "permanent failure" in such a case. The
device will be considered "dead" in this case.

Drivers for multi-function cards will need to coordinate among
themselves as to which driver instance will perform any "one-shot"
or global device initialization. For example, the Symbios sym53cxx2
driver performs device init only from PCI function 0:

	+	if (PCI_FUNC(pdev->devfn) == 0)
	+		sym_reset_scsi_bus(np, 0);
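
The `PCI_FUNC(pdev->devfn) == 0` test above works because PCI packs the
device (slot) and function numbers into one byte. The sketch below mirrors
the kernel's PCI_SLOT()/PCI_FUNC() macros in a self-contained program; the
slot/function values are made up for illustration.

	#include <assert.h>
	#include <stdio.h>

	/* devfn = slot << 3 | func, as in the kernel's <linux/pci.h> macros. */
	#define PCI_SLOT(devfn) (((devfn) >> 3) & 0x1f)
	#define PCI_FUNC(devfn) ((devfn) & 0x07)

	int main(void)
	{
		unsigned int devfn = (4u << 3) | 2u;   /* slot 4, function 2 */
		assert(PCI_SLOT(devfn) == 4);
		assert(PCI_FUNC(devfn) == 2);
		/* A non-zero function skips the one-shot init in the example above. */
		if (PCI_FUNC(devfn) != 0)
			printf("func=%d skips one-shot init\n", PCI_FUNC(devfn));
		return 0;
	}

Each function of a multi-function card gets its own pci_dev and its own
callbacks, so exactly one instance must volunteer for global initialization.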

Result codes:
 - PCI_ERS_RESULT_DISCONNECT
   Same as above.

Drivers for PCI Express cards that require a fundamental reset must
set the needs_freset bit in the pci_dev structure in their probe function.
For example, the QLogic qla2xxx driver sets the needs_freset bit for certain
PCI card types:

	+	/* Set EEH reset type to fundamental if required by hba  */
	+	if (IS_QLA24XX(ha) || IS_QLA25XX(ha) || IS_QLA81XX(ha))
	+		pdev->needs_freset = 1;
	+

Platform proceeds either to STEP 5 (Resume Operations) or STEP 6 (Permanent
Failure).

>>> The current powerpc implementation does not try a power-cycle
>>> reset if the driver returned PCI_ERS_RESULT_DISCONNECT.
>>> However, it probably should.


STEP 5: Resume Operations
-------------------------
The platform will call the resume() callback on all affected device
drivers if all drivers on the segment have returned
PCI_ERS_RESULT_RECOVERED from one of the 3 previous callbacks.
The goal of this callback is to tell the driver to restart activity,
that everything is back and running. This callback does not return
a result code.

At this point, if a new error happens, the platform will restart
a new error recovery sequence.

STEP 6: Permanent Failure
-------------------------
A "permanent failure" has occurred, and the platform cannot recover
the device. The platform will call error_detected() with a
pci_channel_state value of pci_channel_io_perm_failure.

The device driver should, at this point, assume the worst. It should
cancel all pending I/O, refuse all new I/O, returning -EIO to
higher layers. The device driver should then clean up all of its
memory and remove itself from kernel operations, much as it would
during system shutdown.

The platform will typically notify the system operator of the
permanent failure in some way. If the device is hotplug-capable,
the operator will probably want to remove and replace the device.
Note, however, not all failures are truly "permanent". Some are
caused by over-heating, some by a poorly seated card. Many
PCI error events are caused by software bugs, e.g. DMA's to
wild addresses or bogus split transactions due to programming
errors. See the discussion in powerpc/eeh-pci-error-recovery.txt
for additional detail on real-life experience of the causes of
software errors.


Conclusion; General Remarks
---------------------------
The way the callbacks are called is platform policy. A platform with
no slot reset capability may want to just "ignore" drivers that can't
recover (disconnect them) and try to let other cards on the same segment
recover. Keep in mind that in most real life cases, though, there will
be only one driver per segment.

Now, a note about interrupts. If you get an interrupt and your
device is dead or has been isolated, there is a problem :)
The current policy is to turn this into a platform policy.
That is, the recovery API only requires that:

 - There is no guarantee that interrupt delivery can proceed from any
   device on the segment starting from the error detection and until the
   slot_reset callback is called, at which point interrupts are expected
   to be fully operational.

 - There is no guarantee that interrupt delivery is stopped, that is,
   a driver that gets an interrupt after detecting an error, or that detects
   an error within the interrupt handler such that it prevents proper
   ack'ing of the interrupt (and thus removal of the source) should just
   return IRQ_NOTHANDLED. It's up to the platform to deal with that
   condition, typically by masking the IRQ source during the duration of
   the error handling. It is expected that the platform "knows" which
   interrupts are routed to error-management capable slots and can deal
   with temporarily disabling that IRQ number during error processing (this
   isn't terribly complex). That means some IRQ latency for other devices
   sharing the interrupt, but there is simply no other way. High end
   platforms aren't supposed to share interrupts between many devices
   anyway :)

>>> Implementation details for the powerpc platform are discussed in
>>> the file Documentation/powerpc/eeh-pci-error-recovery.txt

>>> As of this writing, there is a growing list of device drivers with
>>> patches implementing error recovery. Not all of these patches are in
>>> mainline yet. These may be used as "examples":
>>>
>>> drivers/scsi/ipr
>>> drivers/scsi/sym53c8xx_2
>>> drivers/scsi/qla2xxx
>>> drivers/scsi/lpfc
>>> drivers/net/bnx2.c
>>> drivers/net/e100.c
>>> drivers/net/e1000
>>> drivers/net/e1000e
>>> drivers/net/ixgb
>>> drivers/net/ixgbe
>>> drivers/net/cxgb3
>>> drivers/net/s2io.c
>>> drivers/net/qlge

The End
-------
172	Documentation/PCI/pci-iov-howto.rst	Normal file
@ -0,0 +1,172 @@
.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

====================================
PCI Express I/O Virtualization Howto
====================================

:Copyright: |copy| 2009 Intel Corporation
:Authors: - Yu Zhao <yu.zhao@intel.com>
          - Donald Dutile <ddutile@redhat.com>

Overview
========

What is SR-IOV
--------------

Single Root I/O Virtualization (SR-IOV) is a PCI Express Extended
capability which makes one physical device appear as multiple virtual
devices. The physical device is referred to as a Physical Function (PF)
while the virtual devices are referred to as Virtual Functions (VFs).
Allocation of VFs can be dynamically controlled by the PF via
registers encapsulated in the capability. By default, this feature is
not enabled and the PF behaves as a traditional PCIe device. Once it's
turned on, each VF's PCI configuration space can be accessed by its own
Bus, Device and Function Number (Routing ID), and each VF also has PCI
Memory Space, which is used to map its register set. The VF device driver
operates on the register set so the VF can be functional and appear as a
real existing PCI device.
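
The Routing ID mentioned above is arithmetic: per the SR-IOV capability,
VF n (1-based) sits at the PF's Routing ID plus First VF Offset plus
(n - 1) times VF Stride. The sketch below illustrates this; the offset,
stride, and bus/device values are hypothetical (real ones are read from
the device's SR-IOV capability registers).

	#include <assert.h>
	#include <stdio.h>

	/* VF n's Routing ID = RID(PF) + FirstVFOffset + (n - 1) * VFStride. */
	static unsigned int vf_rid(unsigned int pf_rid, unsigned int first_off,
	                           unsigned int stride, unsigned int n)
	{
		return pf_rid + first_off + (n - 1) * stride;
	}

	int main(void)
	{
		/* Hypothetical PF at 0000:03:00.0 -> RID = bus << 8 | devfn. */
		unsigned int pf = (3u << 8) | 0u;
		assert(vf_rid(pf, 1, 1, 1) == 0x0301); /* VF1 at 03:00.1 */
		assert(vf_rid(pf, 1, 1, 8) == 0x0308); /* VF8 at 03:01.0 */
		printf("vf1 rid=0x%04x\n", vf_rid(pf, 1, 1, 1));
		return 0;
	}

Because the stride can carry a VF past function 7, VFs may land on bus/device
numbers where no physical device exists, which is why each VF needs its own
full Routing ID rather than just a function number.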

User Guide
==========

How can I enable SR-IOV capability
----------------------------------

Multiple methods are available for SR-IOV enablement.
In the first method, the device driver (PF driver) will control the
enabling and disabling of the capability via API provided by SR-IOV core.
If the hardware has SR-IOV capability, loading its PF driver would
enable it and all VFs associated with the PF. Some PF drivers require
a module parameter to be set to determine the number of VFs to enable.
In the second method, a write to the sysfs file sriov_numvfs will
enable and disable the VFs associated with a PCIe PF. This method
enables per-PF, VF enable/disable values versus the first method,
which applies to all PFs of the same device. Additionally, the
PCI SRIOV core support ensures that enable/disable operations are
valid to reduce duplication in multiple drivers for the same
checks, e.g., check numvfs == 0 if enabling VFs, ensure
numvfs <= totalvfs.
The second method is the recommended method for new/future VF devices.

How can I use the Virtual Functions
-----------------------------------

VFs are treated as hot-plugged PCI devices in the kernel, so they
should be able to work in the same way as real PCI devices. A VF
requires a device driver, the same as a normal PCI device does.

Developer Guide
===============

SR-IOV API
----------

To enable SR-IOV capability:

(a) For the first method, in the driver::

	int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);

'nr_virtfn' is number of VFs to be enabled.

(b) For the second method, from sysfs::

	echo 'nr_virtfn' > \
	        /sys/bus/pci/devices/<DOMAIN:BUS:DEVICE.FUNCTION>/sriov_numvfs

To disable SR-IOV capability:

(a) For the first method, in the driver::

	void pci_disable_sriov(struct pci_dev *dev);

(b) For the second method, from sysfs::

	echo 0 > \
	        /sys/bus/pci/devices/<DOMAIN:BUS:DEVICE.FUNCTION>/sriov_numvfs

To enable auto probing VFs by a compatible driver on the host, run
command below before enabling SR-IOV capabilities. This is the
default behavior.
::

	echo 1 > \
	        /sys/bus/pci/devices/<DOMAIN:BUS:DEVICE.FUNCTION>/sriov_drivers_autoprobe

To disable auto probing VFs by a compatible driver on the host, run
command below before enabling SR-IOV capabilities. Updating this
entry will not affect VFs which are already probed.
::

	echo 0 > \
	        /sys/bus/pci/devices/<DOMAIN:BUS:DEVICE.FUNCTION>/sriov_drivers_autoprobe

Usage example
-------------

The following piece of code illustrates the usage of the SR-IOV API.
::

	static int dev_probe(struct pci_dev *dev, const struct pci_device_id *id)
	{
		pci_enable_sriov(dev, NR_VIRTFN);

		...

		return 0;
	}

	static void dev_remove(struct pci_dev *dev)
	{
		pci_disable_sriov(dev);

		...
	}

	static int dev_suspend(struct pci_dev *dev, pm_message_t state)
	{
		...

		return 0;
	}

	static int dev_resume(struct pci_dev *dev)
	{
		...

		return 0;
	}

	static void dev_shutdown(struct pci_dev *dev)
	{
		...
	}

	static int dev_sriov_configure(struct pci_dev *dev, int numvfs)
	{
		if (numvfs > 0) {
			...
			pci_enable_sriov(dev, numvfs);
			...
			return numvfs;
		}
		if (numvfs == 0) {
			....
			pci_disable_sriov(dev);
			...
			return 0;
		}
	}

	static struct pci_driver dev_driver = {
		.name			= "SR-IOV Physical Function driver",
		.id_table		= dev_id_table,
		.probe			= dev_probe,
		.remove			= dev_remove,
		.suspend		= dev_suspend,
		.resume			= dev_resume,
		.shutdown		= dev_shutdown,
		.sriov_configure	= dev_sriov_configure,
	};
|
|
||||||
};
|
|
Documentation/PCI/pci.rst (new file, 578 lines)
.. SPDX-License-Identifier: GPL-2.0

==============================
How To Write Linux PCI Drivers
==============================

:Authors: - Martin Mares <mj@ucw.cz>
          - Grant Grundler <grundler@parisc-linux.org>

The world of PCI is vast and full of (mostly unpleasant) surprises.
Since each CPU architecture implements different chip-sets and PCI devices
have different requirements (erm, "features"), the result is the PCI support
in the Linux kernel is not as trivial as one would wish. This short paper
tries to introduce all potential driver authors to Linux APIs for
PCI device drivers.

A more complete resource is the third edition of "Linux Device Drivers"
by Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman.
LDD3 is available for free (under Creative Commons License) from:
http://lwn.net/Kernel/LDD3/.

However, keep in mind that all documents are subject to "bit rot".
Refer to the source code if things are not working as described here.

Please send questions/comments/patches about Linux PCI API to the
"Linux PCI" <linux-pci@atrey.karlin.mff.cuni.cz> mailing list.


Structure of PCI drivers
========================
PCI drivers "discover" PCI devices in a system via pci_register_driver().
Actually, it's the other way around. When the PCI generic code discovers
a new device, the driver with a matching "description" will be notified.
Details on this below.

pci_register_driver() leaves most of the probing for devices to
the PCI layer and supports online insertion/removal of devices [thus
supporting hot-pluggable PCI, CardBus, and Express-Card in a single driver].
The pci_register_driver() call requires passing in a table of function
pointers and thus dictates the high level structure of a driver.

Once the driver knows about a PCI device and takes ownership, the
driver generally needs to perform the following initialization:

  - Enable the device
  - Request MMIO/IOP resources
  - Set the DMA mask size (for both coherent and streaming DMA)
  - Allocate and initialize shared control data (pci_allocate_coherent())
  - Access device configuration space (if needed)
  - Register IRQ handler (request_irq())
  - Initialize non-PCI (i.e. LAN/SCSI/etc parts of the chip)
  - Enable DMA/processing engines

When done using the device, and perhaps the module needs to be unloaded,
the driver needs to take the following steps:

  - Disable the device from generating IRQs
  - Release the IRQ (free_irq())
  - Stop all DMA activity
  - Release DMA buffers (both streaming and coherent)
  - Unregister from other subsystems (e.g. scsi or netdev)
  - Release MMIO/IOP resources
  - Disable the device

Most of these topics are covered in the following sections.
For the rest look at LDD3 or <linux/pci.h>.

If the PCI subsystem is not configured (CONFIG_PCI is not set), most of
the PCI functions described below are defined as inline functions either
completely empty or just returning an appropriate error code to avoid
lots of ifdefs in the drivers.
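The initialization and teardown sequences above can be sketched as a minimal
driver skeleton. This is only an illustration: the driver name, the
``MY_VENDOR_ID``/``MY_DEVICE_ID`` macros, and the abbreviated error handling
are placeholders, not a real driver::

	/* Hypothetical minimal driver skeleton; MY_VENDOR_ID and
	 * MY_DEVICE_ID are placeholders, not real IDs. */
	#include <linux/module.h>
	#include <linux/pci.h>

	static const struct pci_device_id my_ids[] = {
		{ PCI_DEVICE(MY_VENDOR_ID, MY_DEVICE_ID) },
		{ 0, }
	};
	MODULE_DEVICE_TABLE(pci, my_ids);

	static int my_probe(struct pci_dev *pdev, const struct pci_device_id *id)
	{
		int rc;

		rc = pci_enable_device(pdev);		/* wake up the device */
		if (rc)
			return rc;
		rc = pci_request_regions(pdev, "my_drv");	/* claim the BARs */
		if (rc)
			goto err_disable;
		pci_set_master(pdev);			/* allow the device to DMA */
		/* ... set DMA mask, allocate control data, request_irq() ... */
		return 0;

	err_disable:
		pci_disable_device(pdev);
		return rc;
	}

	static void my_remove(struct pci_dev *pdev)
	{
		/* ... free_irq(), stop DMA, free buffers ... */
		pci_release_regions(pdev);
		pci_disable_device(pdev);
	}

	static struct pci_driver my_driver = {
		.name		= "my_drv",
		.id_table	= my_ids,
		.probe		= my_probe,
		.remove		= my_remove,
	};
	module_pci_driver(my_driver);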

pci_register_driver() call
==========================

PCI device drivers call ``pci_register_driver()`` during their
initialization with a pointer to a structure describing the driver
(``struct pci_driver``):

.. kernel-doc:: include/linux/pci.h
   :functions: pci_driver

The ID table is an array of ``struct pci_device_id`` entries ending with an
all-zero entry.  Definitions with static const are generally preferred.

.. kernel-doc:: include/linux/mod_devicetable.h
   :functions: pci_device_id

Most drivers only need ``PCI_DEVICE()`` or ``PCI_DEVICE_CLASS()`` to set up
a pci_device_id table.

New PCI IDs may be added to a device driver pci_ids table at runtime
as shown below::

  echo "vendor device subvendor subdevice class class_mask driver_data" > \
  /sys/bus/pci/drivers/{driver}/new_id

All fields are passed in as hexadecimal values (no leading 0x).
The vendor and device fields are mandatory, the others are optional.  Users
need to pass only as many optional fields as necessary:

  - subvendor and subdevice fields default to PCI_ANY_ID (FFFFFFFF)
  - class and classmask fields default to 0
  - driver_data defaults to 0UL.

Note that driver_data must match the value used by any of the pci_device_id
entries defined in the driver.  This makes the driver_data field mandatory
if all the pci_device_id entries have a non-zero driver_data value.

Once added, the driver probe routine will be invoked for any unclaimed
PCI devices listed in its (newly updated) pci_ids list.

When the driver exits, it just calls pci_unregister_driver() and the PCI layer
automatically calls the remove hook for all devices handled by the driver.
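As a sketch, an ID table matching one specific device plus every device of a
given class might look like the following (the vendor/device IDs shown are
made up for illustration)::

	/* Hypothetical ID table: one explicit vendor/device pair plus every
	 * Ethernet-class device; driver_data distinguishes the matches. */
	static const struct pci_device_id example_ids[] = {
		{ PCI_DEVICE(0x1234, 0x5678), .driver_data = 1 },
		{ PCI_DEVICE_CLASS(PCI_CLASS_NETWORK_ETHERNET << 8, 0xffff00),
		  .driver_data = 2 },
		{ 0, }	/* all-zero terminator */
	};
	MODULE_DEVICE_TABLE(pci, example_ids);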

"Attributes" for driver functions/data
--------------------------------------

Please mark the initialization and cleanup functions where appropriate
(the corresponding macros are defined in <linux/init.h>):

	====== =================================================
	__init Initialization code. Thrown away after the driver
	       initializes.
	__exit Exit code. Ignored for non-modular drivers.
	====== =================================================

Tips on when/where to use the above attributes:

  - The module_init()/module_exit() functions (and all
    initialization functions called _only_ from these)
    should be marked __init/__exit.

  - Do not mark the struct pci_driver.

  - Do NOT mark a function if you are not sure which mark to use.
    Better to not mark the function than mark the function wrong.

How to find PCI devices manually
================================

PCI drivers should have a really good reason for not using the
pci_register_driver() interface to search for PCI devices.
The main reason PCI devices are controlled by multiple drivers
is because one PCI device implements several different HW services.
E.g. combined serial/parallel port/floppy controller.

A manual search may be performed using the following constructs:

Searching by vendor and device ID::

	struct pci_dev *dev = NULL;
	while ((dev = pci_get_device(VENDOR_ID, DEVICE_ID, dev)))
		configure_device(dev);

Searching by class ID (iterate in a similar way)::

	pci_get_class(CLASS_ID, dev)

Searching by both vendor/device and subsystem vendor/device ID::

	pci_get_subsys(VENDOR_ID, DEVICE_ID, SUBSYS_VENDOR_ID, SUBSYS_DEVICE_ID, dev).

You can use the constant PCI_ANY_ID as a wildcard replacement for
VENDOR_ID or DEVICE_ID.  This allows searching for any device from a
specific vendor, for example.

These functions are hotplug-safe.  They increment the reference count on
the pci_dev that they return.  You must eventually (possibly at module unload)
decrement the reference count on these devices by calling pci_dev_put().

Device Initialization Steps
===========================

As noted in the introduction, most PCI drivers need the following steps
for device initialization:

  - Enable the device
  - Request MMIO/IOP resources
  - Set the DMA mask size (for both coherent and streaming DMA)
  - Allocate and initialize shared control data (pci_allocate_coherent())
  - Access device configuration space (if needed)
  - Register IRQ handler (request_irq())
  - Initialize non-PCI (i.e. LAN/SCSI/etc parts of the chip)
  - Enable DMA/processing engines.

The driver can access PCI config space registers at any time.
(Well, almost. When running BIST, config space can go away...but
that will just result in a PCI Bus Master Abort and config reads
will return garbage).

Enable the PCI device
---------------------
Before touching any device registers, the driver needs to enable
the PCI device by calling pci_enable_device(). This will:

  - wake up the device if it was in suspended state,
  - allocate I/O and memory regions of the device (if BIOS did not),
  - allocate an IRQ (if BIOS did not).

.. note::
   pci_enable_device() can fail! Check the return value.

.. warning::
   OS BUG: we don't check resource allocations before enabling those
   resources. The sequence would make more sense if we called
   pci_request_resources() before calling pci_enable_device().
   Currently, the device drivers can't detect the bug when two
   devices have been allocated the same range. This is not a common
   problem and unlikely to get fixed soon.

   This has been discussed before but not changed as of 2.6.19:
   http://lkml.org/lkml/2006/3/2/194

pci_set_master() will enable DMA by setting the bus master bit
in the PCI_COMMAND register. It also fixes the latency timer value if
it's set to something bogus by the BIOS. pci_clear_master() will
disable DMA by clearing the bus master bit.

If the PCI device can use the PCI Memory-Write-Invalidate transaction,
call pci_set_mwi(). This enables the PCI_COMMAND bit for Mem-Wr-Inval
and also ensures that the cache line size register is set correctly.
Check the return value of pci_set_mwi() as not all architectures
or chip-sets may support Memory-Write-Invalidate. Alternatively,
if Mem-Wr-Inval would be nice to have but is not required, call
pci_try_set_mwi() to have the system do its best effort at enabling
Mem-Wr-Inval.
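A sketch of the enable sequence, with the return values checked as the notes
above require (``pdev`` is assumed to be the device passed to the probe
callback)::

	/* Sketch: enable the device and bus mastering, checking for failure. */
	int rc = pci_enable_device(pdev);
	if (rc)
		return rc;		/* device could not be enabled */

	pci_set_master(pdev);		/* set the bus master bit in PCI_COMMAND */

	if (pci_try_set_mwi(pdev))	/* Mem-Wr-Inval is optional here */
		dev_info(&pdev->dev, "Memory-Write-Invalidate not available\n");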

Request MMIO/IOP resources
--------------------------
Memory (MMIO) and I/O port addresses should NOT be read directly
from the PCI device config space. Use the values in the pci_dev structure
as the PCI "bus address" might have been remapped to a "host physical"
address by the arch/chip-set specific kernel support.

See Documentation/io-mapping.txt for how to access device registers
or device memory.

The device driver needs to call pci_request_region() to verify
no other device is already using the same address resource.
Conversely, drivers should call pci_release_region() AFTER
calling pci_disable_device().
The idea is to prevent two devices colliding on the same address range.

.. tip::
   See OS BUG comment above. Currently (2.6.19), the driver can only
   determine MMIO and IO Port resource availability _after_ calling
   pci_enable_device().

Generic flavors of pci_request_region() are request_mem_region()
(for MMIO ranges) and request_region() (for IO Port ranges).
Use these for address resources that are not described by "normal" PCI
BARs.

Also see pci_request_selected_regions() below.
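For instance, claiming and mapping BAR 0 might be sketched as follows (the
BAR number and driver name are assumptions for illustration)::

	/* Sketch: claim BAR 0 and map it, taking the length from the
	 * pci_dev resource values rather than reading config space. */
	void __iomem *regs;
	int rc;

	rc = pci_request_region(pdev, 0, "my_drv");	/* reserve BAR 0 */
	if (rc)
		return rc;

	regs = pci_iomap(pdev, 0, pci_resource_len(pdev, 0));
	if (!regs) {
		pci_release_region(pdev, 0);
		return -ENOMEM;
	}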

Set the DMA mask size
---------------------
.. note::
   If anything below doesn't make sense, please refer to
   Documentation/DMA-API.txt. This section is just a reminder that
   drivers need to indicate DMA capabilities of the device and is not
   an authoritative source for DMA interfaces.

While all drivers should explicitly indicate the DMA capability
(e.g. 32 or 64 bit) of the PCI bus master, devices with more than
32-bit bus master capability for streaming data need the driver
to "register" this capability by calling pci_set_dma_mask() with
appropriate parameters.  In general this allows more efficient DMA
on systems where System RAM exists above 4G _physical_ address.

Drivers for all PCI-X and PCIe compliant devices must call
pci_set_dma_mask() as they are 64-bit DMA devices.

Similarly, drivers must also "register" this capability if the device
can directly address "consistent memory" in System RAM above 4G physical
address by calling pci_set_consistent_dma_mask().
Again, this includes drivers for all PCI-X and PCIe compliant devices.
Many 64-bit "PCI" devices (before PCI-X) and some PCI-X devices are
64-bit DMA capable for payload ("streaming") data but not control
("consistent") data.
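A common pattern, sketched here rather than prescribed, is to try a 64-bit
mask first and fall back to 32-bit::

	/* Sketch: prefer 64-bit DMA, fall back to 32-bit if unsupported. */
	if (pci_set_dma_mask(pdev, DMA_BIT_MASK(64)) == 0) {
		pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(64));
	} else if (pci_set_dma_mask(pdev, DMA_BIT_MASK(32)) == 0) {
		pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(32));
	} else {
		dev_err(&pdev->dev, "no usable DMA configuration\n");
		return -EIO;
	}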

Setup shared control data
-------------------------
Once the DMA masks are set, the driver can allocate "consistent" (a.k.a. shared)
memory.  See Documentation/DMA-API.txt for a full description of
the DMA APIs. This section is just a reminder that it needs to be done
before enabling DMA on the device.

Initialize device registers
---------------------------
Some drivers will need specific "capability" fields programmed
or other "vendor specific" registers initialized or reset.
E.g. clearing pending interrupts.

Register IRQ handler
--------------------
While calling request_irq() is the last step described here,
this is often just another intermediate step to initialize a device.
This step can often be deferred until the device is opened for use.

All interrupt handlers for IRQ lines should be registered with IRQF_SHARED
and use the devid to map IRQs to devices (remember that all PCI IRQ lines
can be shared).

request_irq() will associate an interrupt handler and device handle
with an interrupt number. Historically interrupt numbers represent
IRQ lines which run from the PCI device to the Interrupt controller.
With MSI and MSI-X (more below) the interrupt number is a CPU "vector".

request_irq() also enables the interrupt. Make sure the device is
quiesced and does not have any interrupts pending before registering
the interrupt handler.

MSI and MSI-X are PCI capabilities. Both are "Message Signaled Interrupts"
which deliver interrupts to the CPU via a DMA write to a Local APIC.
The fundamental difference between MSI and MSI-X is how multiple
"vectors" get allocated. MSI requires contiguous blocks of vectors
while MSI-X can allocate several individual ones.

MSI capability can be enabled by calling pci_alloc_irq_vectors() with the
PCI_IRQ_MSI and/or PCI_IRQ_MSIX flags before calling request_irq(). This
causes the PCI support to program CPU vector data into the PCI device
capability registers. Many architectures, chip-sets, or BIOSes do NOT
support MSI or MSI-X and a call to pci_alloc_irq_vectors with just
the PCI_IRQ_MSI and PCI_IRQ_MSIX flags will fail, so try to always
specify PCI_IRQ_LEGACY as well.

Drivers that have different interrupt handlers for MSI/MSI-X and
legacy INTx should choose the right one based on the msi_enabled
and msix_enabled flags in the pci_dev structure after calling
pci_alloc_irq_vectors.

There are (at least) two really good reasons for using MSI:

1) MSI is an exclusive interrupt vector by definition.
   This means the interrupt handler doesn't have to verify
   its device caused the interrupt.

2) MSI avoids DMA/IRQ race conditions. DMA to host memory is guaranteed
   to be visible to the host CPU(s) when the MSI is delivered. This
   is important for both data coherency and avoiding stale control data.
   This guarantee allows the driver to omit MMIO reads to flush
   the DMA stream.

See drivers/infiniband/hw/mthca/ or drivers/net/tg3.c for examples
of MSI/MSI-X usage.
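The allocation described above can be sketched as follows (the handler name,
driver name, and single-vector count are illustrative assumptions)::

	/* Sketch: ask for one vector, preferring MSI-X, then MSI, then
	 * legacy INTx; IRQF_SHARED covers the legacy INTx case. */
	int rc;
	int nvec = pci_alloc_irq_vectors(pdev, 1, 1,
			PCI_IRQ_MSIX | PCI_IRQ_MSI | PCI_IRQ_LEGACY);
	if (nvec < 0)
		return nvec;

	rc = request_irq(pci_irq_vector(pdev, 0), my_irq_handler,
			 IRQF_SHARED, "my_drv", pdev);
	if (rc)
		pci_free_irq_vectors(pdev);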

PCI device shutdown
===================

When a PCI device driver is being unloaded, most of the following
steps need to be performed:

  - Disable the device from generating IRQs
  - Release the IRQ (free_irq())
  - Stop all DMA activity
  - Release DMA buffers (both streaming and consistent)
  - Unregister from other subsystems (e.g. scsi or netdev)
  - Disable device from responding to MMIO/IO Port addresses
  - Release MMIO/IO Port resource(s)


Stop IRQs on the device
-----------------------
How to do this is chip/device specific. If it's not done, it opens
the possibility of a "screaming interrupt" if (and only if)
the IRQ is shared with another device.

When the shared IRQ handler is "unhooked", the remaining devices
using the same IRQ line will still need the IRQ enabled. Thus if the
"unhooked" device asserts the IRQ line, the system will respond assuming
it was one of the remaining devices that asserted the IRQ line. Since none
of the other devices will handle the IRQ, the system will "hang" until
it decides the IRQ isn't going to get handled and masks the IRQ (100,000
iterations later). Once the shared IRQ is masked, the remaining devices
will stop functioning properly. Not a nice situation.

This is another reason to use MSI or MSI-X if it's available.
MSI and MSI-X are defined to be exclusive interrupts and thus
are not susceptible to the "screaming interrupt" problem.


Release the IRQ
---------------
Once the device is quiesced (no more IRQs), one can call free_irq().
This function will return control once any pending IRQs are handled,
"unhook" the driver's IRQ handler from that IRQ, and finally release
the IRQ if no one else is using it.


Stop all DMA activity
---------------------
It's extremely important to stop all DMA operations BEFORE attempting
to deallocate DMA control data. Failure to do so can result in memory
corruption, hangs, and on some chip-sets a hard crash.

Stopping DMA after stopping the IRQs can avoid races where the
IRQ handler might restart DMA engines.

While this step sounds obvious and trivial, several "mature" drivers
didn't get this step right in the past.


Release DMA buffers
-------------------
Once DMA is stopped, clean up streaming DMA first.
I.e. unmap data buffers and return buffers to "upstream"
owners if there is one.

Then clean up "consistent" buffers which contain the control data.

See Documentation/DMA-API.txt for details on unmapping interfaces.


Unregister from other subsystems
--------------------------------
Most low level PCI device drivers support some other subsystem
like USB, ALSA, SCSI, NetDev, Infiniband, etc. Make sure your
driver isn't losing resources from that other subsystem.
If this happens, typically the symptom is an Oops (panic) when
the subsystem attempts to call into a driver that has been unloaded.


Disable Device from responding to MMIO/IO Port addresses
--------------------------------------------------------
iounmap() MMIO or IO Port resources and then call pci_disable_device().
This is the symmetric opposite of pci_enable_device().
Do not access device registers after calling pci_disable_device().


Release MMIO/IO Port Resource(s)
--------------------------------
Call pci_release_region() to mark the MMIO or IO Port range as available.
Failure to do so usually results in the inability to reload the driver.
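Put together, a remove callback following this ordering might be sketched as
below (the names are hypothetical; ``regs`` is assumed to be the mapping
saved at probe time, and the device-specific steps are only comments)::

	static void my_remove(struct pci_dev *pdev)
	{
		/* 1. device-specific: mask device IRQs, stop DMA engines */
		/* 2. release the IRQ and its vectors */
		free_irq(pci_irq_vector(pdev, 0), pdev);
		pci_free_irq_vectors(pdev);
		/* 3-4. unmap streaming buffers first, then free the
		 *      consistent control data */
		/* 5. unregister from netdev/scsi/etc. if applicable */
		/* 6. stop responding to MMIO/IO Port addresses */
		pci_iounmap(pdev, regs);
		pci_disable_device(pdev);
		/* 7. make the address ranges available again */
		pci_release_region(pdev, 0);
	}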

How to access PCI config space
==============================

You can use `pci_(read|write)_config_(byte|word|dword)` to access the config
space of a device represented by `struct pci_dev *`. All these functions return
0 when successful or an error code (`PCIBIOS_...`) which can be translated to a
text string by pcibios_strerror. Most drivers expect that accesses to valid PCI
devices don't fail.

If you don't have a struct pci_dev available, you can call
`pci_bus_(read|write)_config_(byte|word|dword)` to access a given device
and function on that bus.

If you access fields in the standard portion of the config header, please
use symbolic names of locations and bits declared in <linux/pci.h>.

If you need to access Extended PCI Capability registers, just call
pci_find_capability() for the particular capability and it will find the
corresponding register block for you.
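For example, reading the revision ID from the standard config header, with
the return value checked, could be sketched as::

	/* Sketch: read one byte from the config header, using the symbolic
	 * offset from <linux/pci.h> rather than a raw number. */
	u8 revision;
	int rc = pci_read_config_byte(pdev, PCI_REVISION_ID, &revision);
	if (rc)
		dev_warn(&pdev->dev, "config read failed: %d\n", rc);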

Other interesting functions
===========================

============================= ================================================
pci_get_domain_bus_and_slot() Find pci_dev corresponding to given domain,
                              bus, and slot number. If the device is
                              found, its reference count is increased.
pci_set_power_state()         Set PCI Power Management state (0=D0 ... 3=D3)
pci_find_capability()         Find specified capability in device's capability
                              list.
pci_resource_start()          Returns bus start address for a given PCI region
pci_resource_end()            Returns bus end address for a given PCI region
pci_resource_len()            Returns the byte length of a PCI region
pci_set_drvdata()             Set private driver data pointer for a pci_dev
pci_get_drvdata()             Return private driver data pointer for a pci_dev
pci_set_mwi()                 Enable Memory-Write-Invalidate transactions.
pci_clear_mwi()               Disable Memory-Write-Invalidate transactions.
============================= ================================================

Miscellaneous hints
===================

When displaying PCI device names to the user (for example when a driver wants
to tell the user what card it has found), please use pci_name(pci_dev).

Always refer to the PCI devices by a pointer to the pci_dev structure.
All PCI layer functions use this identification and it's the only
reasonable one. Don't use bus/slot/function numbers except for very
special purposes -- on systems with multiple primary buses their semantics
can be pretty complex.

Don't try to turn on Fast Back to Back writes in your driver. All devices
on the bus need to be capable of doing it, so this is something which needs
to be handled by platform and generic code, not individual drivers.

Vendor and device identifications
=================================

Do not add new device or vendor IDs to include/linux/pci_ids.h unless they
are shared across multiple drivers. You can add private definitions in
your driver if they're helpful, or just use plain hex constants.

The device IDs are arbitrary hex numbers (vendor controlled) and normally used
only in a single location, the pci_device_id table.

Please DO submit new vendor/device IDs to http://pci-ids.ucw.cz/.
There are mirrors of the pci.ids file at http://pciids.sourceforge.net/
and https://github.com/pciutils/pciids.


Obsolete functions
==================

There are several functions which you might come across when trying to
port an old driver to the new PCI interface. They are no longer present
in the kernel as they aren't compatible with hotplug or PCI domains or
having sane locking.

=================  ===========================================
pci_find_device()  Superseded by pci_get_device()
pci_find_subsys()  Superseded by pci_get_subsys()
pci_find_slot()    Superseded by pci_get_domain_bus_and_slot()
pci_get_slot()     Superseded by pci_get_domain_bus_and_slot()
=================  ===========================================

The alternative is the traditional PCI device driver that walks PCI
device lists. This is still possible but discouraged.

MMIO Space and "Write Posting"
==============================

Converting a driver from using I/O Port space to using MMIO space
often requires some additional changes. Specifically, "write posting"
needs to be handled. Many drivers (e.g. tg3, acenic, sym53c8xx_2)
already do this. I/O Port space guarantees write transactions reach the PCI
device before the CPU can continue. Writes to MMIO space allow the CPU
to continue before the transaction reaches the PCI device. HW weenies
call this "Write Posting" because the write completion is "posted" to
the CPU before the transaction has reached its destination.

Thus, timing sensitive code should add readl() where the CPU is
expected to wait before doing other work. The classic "bit banging"
sequence works fine for I/O Port space::

	for (i = 8; --i; val >>= 1) {
		outb(val & 1, ioport_reg); /* write bit */
		udelay(10);
	}

The same sequence for MMIO space should be::

	for (i = 8; --i; val >>= 1) {
		writeb(val & 1, mmio_reg); /* write bit */
		readb(safe_mmio_reg); /* flush posted write */
		udelay(10);
	}

It is important that "safe_mmio_reg" not have any side effects that
interfere with the correct operation of the device.

Another case to watch out for is when resetting a PCI device. Use PCI
Configuration space reads to flush the writel(). This will gracefully
handle the PCI master abort on all platforms if the PCI device is
expected to not respond to a readl(). Most x86 platforms will allow
MMIO reads to master abort (a.k.a. "Soft Fail") and return garbage
(e.g. ~0). But many RISC platforms will crash (a.k.a. "Hard Fail").
@ -1,636 +0,0 @@
|
|||||||
|
|
||||||
How To Write Linux PCI Drivers

	by Martin Mares <mj@ucw.cz> on 07-Feb-2000
	updated by Grant Grundler <grundler@parisc-linux.org> on 23-Dec-2006

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The world of PCI is vast and full of (mostly unpleasant) surprises.
Since each CPU architecture implements different chip-sets and PCI devices
have different requirements (erm, "features"), the result is that PCI support
in the Linux kernel is not as trivial as one would wish.  This short paper
tries to introduce all potential driver authors to Linux APIs for
PCI device drivers.

A more complete resource is the third edition of "Linux Device Drivers"
by Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman.
LDD3 is available for free (under Creative Commons License) from:

	http://lwn.net/Kernel/LDD3/

However, keep in mind that all documents are subject to "bit rot".
Refer to the source code if things are not working as described here.

Please send questions/comments/patches about Linux PCI API to the
"Linux PCI" <linux-pci@atrey.karlin.mff.cuni.cz> mailing list.
0. Structure of PCI drivers
~~~~~~~~~~~~~~~~~~~~~~~~~~~
PCI drivers "discover" PCI devices in a system via pci_register_driver().
Actually, it's the other way around.  When the PCI generic code discovers
a new device, the driver with a matching "description" will be notified.
Details on this below.

pci_register_driver() leaves most of the probing for devices to
the PCI layer and supports online insertion/removal of devices [thus
supporting hot-pluggable PCI, CardBus, and Express-Card in a single driver].
The pci_register_driver() call requires passing in a table of function
pointers and thus dictates the high level structure of a driver.

Once the driver knows about a PCI device and takes ownership, the
driver generally needs to perform the following initialization:

	Enable the device
	Request MMIO/IOP resources
	Set the DMA mask size (for both coherent and streaming DMA)
	Allocate and initialize shared control data (pci_allocate_coherent())
	Access device configuration space (if needed)
	Register IRQ handler (request_irq())
	Initialize non-PCI (i.e. LAN/SCSI/etc parts of the chip)
	Enable DMA/processing engines

When done using the device, and perhaps the module needs to be unloaded,
the driver needs to take the following steps:

	Disable the device from generating IRQs
	Release the IRQ (free_irq())
	Stop all DMA activity
	Release DMA buffers (both streaming and coherent)
	Unregister from other subsystems (e.g. scsi or netdev)
	Release MMIO/IOP resources
	Disable the device

Most of these topics are covered in the following sections.
For the rest look at LDD3 or <linux/pci.h>.

If the PCI subsystem is not configured (CONFIG_PCI is not set), most of
the PCI functions described below are defined as inline functions, either
completely empty or just returning an appropriate error code, to avoid
lots of ifdefs in the drivers.
1. pci_register_driver() call
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

PCI device drivers call pci_register_driver() during their
initialization with a pointer to a structure describing the driver
(struct pci_driver):

	field name	Description
	----------	------------------------------------------------------
	id_table	Pointer to table of device IDs the driver is
			interested in.  Most drivers should export this
			table using MODULE_DEVICE_TABLE(pci,...).

	probe		This probing function gets called (during execution
			of pci_register_driver() for already existing
			devices or later if a new device gets inserted) for
			all PCI devices which match the ID table and are not
			"owned" by other drivers yet.  This function gets
			passed a "struct pci_dev *" for each device whose
			entry in the ID table matches the device.  The probe
			function returns zero when the driver chooses to
			take "ownership" of the device or an error code
			(negative number) otherwise.
			The probe function always gets called from process
			context, so it can sleep.

	remove		The remove() function gets called whenever a device
			being handled by this driver is removed (either during
			deregistration of the driver or when it's manually
			pulled out of a hot-pluggable slot).
			The remove function always gets called from process
			context, so it can sleep.

	suspend		Put device into low power state.
	suspend_late	Put device into low power state.

	resume_early	Wake device from low power state.
	resume		Wake device from low power state.

		(Please see Documentation/power/pci.txt for descriptions
		of PCI Power Management and the related functions.)

	shutdown	Hook into reboot_notifier_list (kernel/sys.c).
			Intended to stop any idling DMA operations.
			Useful for enabling wake-on-lan (NIC) or changing
			the power state of a device before reboot.
			e.g. drivers/net/e100.c.

	err_handler	See Documentation/PCI/pci-error-recovery.txt


The ID table is an array of struct pci_device_id entries ending with an
all-zero entry.  Definitions with static const are generally preferred.

Each entry consists of:

	vendor, device	Vendor and device ID to match (or PCI_ANY_ID)

	subvendor,	Subsystem vendor and device ID to match (or PCI_ANY_ID)
	subdevice,

	class		Device class, subclass, and "interface" to match.
			See Appendix D of the PCI Local Bus Spec or
			include/linux/pci_ids.h for a full list of classes.
			Most drivers do not need to specify class/class_mask
			as vendor/device is normally sufficient.

	class_mask	Limit which sub-fields of the class field are compared.
			See drivers/scsi/sym53c8xx_2/ for an example of usage.

	driver_data	Data private to the driver.
			Most drivers don't need to use the driver_data field.
			Best practice is to use driver_data as an index
			into a static list of equivalent device types,
			instead of using it as a pointer.


Most drivers only need PCI_DEVICE() or PCI_DEVICE_CLASS() to set up
a pci_device_id table.

New PCI IDs may be added to a device driver pci_ids table at runtime
as shown below:

	echo "vendor device subvendor subdevice class class_mask driver_data" > \
	/sys/bus/pci/drivers/{driver}/new_id

All fields are passed in as hexadecimal values (no leading 0x).
The vendor and device fields are mandatory, the others are optional.  Users
need only pass as many optional fields as necessary:

	o subvendor and subdevice fields default to PCI_ANY_ID (FFFFFFFF)
	o class and class_mask fields default to 0
	o driver_data defaults to 0UL.

Note that driver_data must match the value used by any of the pci_device_id
entries defined in the driver.  This makes the driver_data field mandatory
if all the pci_device_id entries have a non-zero driver_data value.

Once added, the driver probe routine will be invoked for any unclaimed
PCI devices listed in its (newly updated) pci_ids list.

When the driver exits, it just calls pci_unregister_driver() and the PCI layer
automatically calls the remove hook for all devices handled by the driver.
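Putting the pieces above together, a minimal registration skeleton might look
like the sketch below.  The vendor/device IDs, the "ex_" names, and the
callback bodies are purely illustrative placeholders, not from this document:

```c
#include <linux/module.h>
#include <linux/pci.h>

/* Hypothetical IDs -- replace with your hardware's actual values. */
#define EX_VENDOR_ID	0x1234
#define EX_DEVICE_ID	0x5678

static const struct pci_device_id ex_ids[] = {
	{ PCI_DEVICE(EX_VENDOR_ID, EX_DEVICE_ID) },
	{ }	/* all-zero terminating entry */
};
MODULE_DEVICE_TABLE(pci, ex_ids);

static int ex_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	/* Take "ownership": enable the device, map resources, etc.
	 * Return 0 to claim the device, a negative errno otherwise. */
	return pci_enable_device(pdev);
}

static void ex_remove(struct pci_dev *pdev)
{
	pci_disable_device(pdev);
}

static struct pci_driver ex_driver = {
	.name		= "ex_pci",
	.id_table	= ex_ids,
	.probe		= ex_probe,
	.remove		= ex_remove,
};
/* Expands to module_init/module_exit calling
 * pci_register_driver()/pci_unregister_driver(). */
module_pci_driver(ex_driver);
MODULE_LICENSE("GPL");
```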
1.1 "Attributes" for driver functions/data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Please mark the initialization and cleanup functions where appropriate
(the corresponding macros are defined in <linux/init.h>):

	__init		Initialization code.  Thrown away after the driver
			initializes.
	__exit		Exit code.  Ignored for non-modular drivers.

Tips on when/where to use the above attributes:
	o The module_init()/module_exit() functions (and all
	  initialization functions called _only_ from these)
	  should be marked __init/__exit.

	o Do not mark the struct pci_driver.

	o Do NOT mark a function if you are not sure which mark to use.
	  Better to not mark the function than mark the function wrong.
2. How to find PCI devices manually
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

PCI drivers should have a really good reason for not using the
pci_register_driver() interface to search for PCI devices.
The main reason PCI devices are controlled by multiple drivers
is because one PCI device implements several different HW services.
E.g. combined serial/parallel port/floppy controller.

A manual search may be performed using the following constructs:

Searching by vendor and device ID:

	struct pci_dev *dev = NULL;
	while (dev = pci_get_device(VENDOR_ID, DEVICE_ID, dev))
		configure_device(dev);

Searching by class ID (iterate in a similar way):

	pci_get_class(CLASS_ID, dev)

Searching by both vendor/device and subsystem vendor/device ID:

	pci_get_subsys(VENDOR_ID, DEVICE_ID, SUBSYS_VENDOR_ID, SUBSYS_DEVICE_ID, dev).

You can use the constant PCI_ANY_ID as a wildcard replacement for
VENDOR_ID or DEVICE_ID.  This allows searching for any device from a
specific vendor, for example.

These functions are hotplug-safe.  They increment the reference count on
the pci_dev that they return.  You must eventually (possibly at module unload)
decrement the reference count on these devices by calling pci_dev_put().
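The reference-counting rule above matters most when a search loop exits
early.  A sketch (VENDOR_ID, DEVICE_ID, and configure_device() are
placeholders); the early-exit pci_dev_put() is the point of the example:

```c
	struct pci_dev *dev = NULL;

	/* pci_get_device() drops the reference on the device passed in
	 * and takes one on the device returned, so a loop that runs to
	 * completion is refcount-balanced on its own. */
	while ((dev = pci_get_device(VENDOR_ID, DEVICE_ID, dev)) != NULL) {
		if (configure_device(dev) < 0) {
			/* Bailing out early: release the reference
			 * we still hold on the current device. */
			pci_dev_put(dev);
			break;
		}
	}
```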
3. Device Initialization Steps
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As noted in the introduction, most PCI drivers need the following steps
for device initialization:

	Enable the device
	Request MMIO/IOP resources
	Set the DMA mask size (for both coherent and streaming DMA)
	Allocate and initialize shared control data (pci_allocate_coherent())
	Access device configuration space (if needed)
	Register IRQ handler (request_irq())
	Initialize non-PCI (i.e. LAN/SCSI/etc parts of the chip)
	Enable DMA/processing engines.

The driver can access PCI config space registers at any time.
(Well, almost.  When running BIST, config space can go away...but
that will just result in a PCI Bus Master Abort and config reads
will return garbage).
3.1 Enable the PCI device
~~~~~~~~~~~~~~~~~~~~~~~~~
Before touching any device registers, the driver needs to enable
the PCI device by calling pci_enable_device().  This will:

	o wake up the device if it was in suspended state,
	o allocate I/O and memory regions of the device (if BIOS did not),
	o allocate an IRQ (if BIOS did not).

NOTE: pci_enable_device() can fail!  Check the return value.

[ OS BUG: we don't check resource allocations before enabling those
  resources.  The sequence would make more sense if we called
  pci_request_resources() before calling pci_enable_device().
  Currently, the device drivers can't detect the bug when two
  devices have been allocated the same range.  This is not a common
  problem and unlikely to get fixed soon.

  This has been discussed before but not changed as of 2.6.19:
  http://lkml.org/lkml/2006/3/2/194
]

pci_set_master() will enable DMA by setting the bus master bit
in the PCI_COMMAND register.  It also fixes the latency timer value if
it's set to something bogus by the BIOS.  pci_clear_master() will
disable DMA by clearing the bus master bit.

If the PCI device can use the PCI Memory-Write-Invalidate transaction,
call pci_set_mwi().  This enables the PCI_COMMAND bit for Mem-Wr-Inval
and also ensures that the cache line size register is set correctly.
Check the return value of pci_set_mwi() as not all architectures
or chip-sets may support Memory-Write-Invalidate.  Alternatively,
if Mem-Wr-Inval would be nice to have but is not required, call
pci_try_set_mwi() to have the system do its best effort at enabling
Mem-Wr-Inval.
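A minimal sketch of the enable sequence in a probe routine, using the
hypothetical "ex_" naming; note the mandatory return-value check on
pci_enable_device():

```c
static int ex_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	int err;

	err = pci_enable_device(pdev);	/* can fail -- always check */
	if (err)
		return err;

	pci_set_master(pdev);		/* enable DMA (bus mastering) */

	/* Mem-Wr-Inval is nice to have but not required here, so use
	 * the best-effort variant and only report failure. */
	if (pci_try_set_mwi(pdev))
		dev_info(&pdev->dev, "Mem-Wr-Inval not enabled\n");

	return 0;
}
```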
3.2 Request MMIO/IOP resources
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Memory (MMIO) and I/O port addresses should NOT be read directly
from the PCI device config space.  Use the values in the pci_dev structure
as the PCI "bus address" might have been remapped to a "host physical"
address by the arch/chip-set specific kernel support.

See Documentation/io-mapping.txt for how to access device registers
or device memory.

The device driver needs to call pci_request_region() to verify
no other device is already using the same address resource.
Conversely, drivers should call pci_release_region() AFTER
calling pci_disable_device().
The idea is to prevent two devices colliding on the same address range.

[ See OS BUG comment above.  Currently (2.6.19), the driver can only
  determine MMIO and IO Port resource availability _after_ calling
  pci_enable_device(). ]

Generic flavors of pci_request_region() are request_mem_region()
(for MMIO ranges) and request_region() (for IO Port ranges).
Use these for address resources that are not described by "normal" PCI
BARs.

Also see pci_request_selected_regions() below.
3.3 Set the DMA mask size
~~~~~~~~~~~~~~~~~~~~~~~~~
[ If anything below doesn't make sense, please refer to
  Documentation/DMA-API.txt.  This section is just a reminder that
  drivers need to indicate DMA capabilities of the device and is not
  an authoritative source for DMA interfaces. ]

While all drivers should explicitly indicate the DMA capability
(e.g. 32 or 64 bit) of the PCI bus master, devices with more than
32-bit bus master capability for streaming data need the driver
to "register" this capability by calling pci_set_dma_mask() with
appropriate parameters.  In general this allows more efficient DMA
on systems where System RAM exists above 4G _physical_ address.

Drivers for all PCI-X and PCIe compliant devices must call
pci_set_dma_mask() as they are 64-bit DMA devices.

Similarly, drivers must also "register" this capability if the device
can directly address "consistent memory" in System RAM above 4G physical
address by calling pci_set_consistent_dma_mask().
Again, this includes drivers for all PCI-X and PCIe compliant devices.
Many 64-bit "PCI" devices (before PCI-X) and some PCI-X devices are
64-bit DMA capable for payload ("streaming") data but not control
("consistent") data.
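A common pattern for registering both masks, sketched with the interfaces
this document names (the 64-then-32 fallback and the "ex_" name are
illustrative, not mandated):

```c
static int ex_set_dma_masks(struct pci_dev *pdev)
{
	/* Try full 64-bit streaming DMA first, fall back to 32-bit. */
	if (pci_set_dma_mask(pdev, DMA_BIT_MASK(64)) &&
	    pci_set_dma_mask(pdev, DMA_BIT_MASK(32))) {
		dev_err(&pdev->dev, "no usable DMA configuration\n");
		return -EIO;
	}

	/* The "consistent" (control data) mask is registered
	 * separately and may be narrower than the streaming mask. */
	if (pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(64)))
		pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(32));

	return 0;
}
```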
3.4 Setup shared control data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Once the DMA masks are set, the driver can allocate "consistent" (a.k.a. shared)
memory.  See Documentation/DMA-API.txt for a full description of
the DMA APIs.  This section is just a reminder that it needs to be done
before enabling DMA on the device.


3.5 Initialize device registers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Some drivers will need specific "capability" fields programmed
or other "vendor specific" registers initialized or reset.
E.g. clearing pending interrupts.
3.6 Register IRQ handler
~~~~~~~~~~~~~~~~~~~~~~~~
While calling request_irq() is the last step described here,
this is often just another intermediate step to initialize a device.
This step can often be deferred until the device is opened for use.

All interrupt handlers for IRQ lines should be registered with IRQF_SHARED
and use the devid to map IRQs to devices (remember that all PCI IRQ lines
can be shared).

request_irq() will associate an interrupt handler and device handle
with an interrupt number.  Historically interrupt numbers represent
IRQ lines which run from the PCI device to the interrupt controller.
With MSI and MSI-X (more below) the interrupt number is a CPU "vector".

request_irq() also enables the interrupt.  Make sure the device is
quiesced and does not have any interrupts pending before registering
the interrupt handler.

MSI and MSI-X are PCI capabilities.  Both are "Message Signaled Interrupts"
which deliver interrupts to the CPU via a DMA write to a Local APIC.
The fundamental difference between MSI and MSI-X is how multiple
"vectors" get allocated.  MSI requires contiguous blocks of vectors
while MSI-X can allocate several individual ones.

MSI capability can be enabled by calling pci_alloc_irq_vectors() with the
PCI_IRQ_MSI and/or PCI_IRQ_MSIX flags before calling request_irq().  This
causes the PCI support to program CPU vector data into the PCI device
capability registers.  Many architectures, chip-sets, or BIOSes do NOT
support MSI or MSI-X and a call to pci_alloc_irq_vectors with just
the PCI_IRQ_MSI and PCI_IRQ_MSIX flags will fail, so try to always
specify PCI_IRQ_LEGACY as well.

Drivers that have different interrupt handlers for MSI/MSI-X and
legacy INTx should choose the right one based on the msi_enabled
and msix_enabled flags in the pci_dev structure after calling
pci_alloc_irq_vectors().

There are (at least) two really good reasons for using MSI:
1) MSI is an exclusive interrupt vector by definition.
   This means the interrupt handler doesn't have to verify
   its device caused the interrupt.

2) MSI avoids DMA/IRQ race conditions.  DMA to host memory is guaranteed
   to be visible to the host CPU(s) when the MSI is delivered.  This
   is important for both data coherency and avoiding stale control data.
   This guarantee allows the driver to omit MMIO reads to flush
   the DMA stream.

See drivers/infiniband/hw/mthca/ or drivers/net/tg3.c for examples
of MSI/MSI-X usage.
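The vector allocation advice above can be sketched as follows
(ex_irq_handler and the "ex_pci" name are hypothetical):

```c
static int ex_setup_irq(struct pci_dev *pdev)
{
	int nvec, err;

	/* Prefer MSI-X/MSI, but allow fallback to legacy INTx so the
	 * call cannot fail merely because the platform lacks MSI. */
	nvec = pci_alloc_irq_vectors(pdev, 1, 1,
			PCI_IRQ_MSIX | PCI_IRQ_MSI | PCI_IRQ_LEGACY);
	if (nvec < 0)
		return nvec;

	/* pci_irq_vector() maps a vector index to a Linux IRQ number.
	 * IRQF_SHARED because the legacy INTx fallback may be shared. */
	err = request_irq(pci_irq_vector(pdev, 0), ex_irq_handler,
			  IRQF_SHARED, "ex_pci", pdev);
	if (err)
		pci_free_irq_vectors(pdev);
	return err;
}
```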
4. PCI device shutdown
~~~~~~~~~~~~~~~~~~~~~~

When a PCI device driver is being unloaded, most of the following
steps need to be performed:

	Disable the device from generating IRQs
	Release the IRQ (free_irq())
	Stop all DMA activity
	Release DMA buffers (both streaming and consistent)
	Unregister from other subsystems (e.g. scsi or netdev)
	Disable device from responding to MMIO/IO Port addresses
	Release MMIO/IO Port resource(s)
4.1 Stop IRQs on the device
~~~~~~~~~~~~~~~~~~~~~~~~~~~
How to do this is chip/device specific.  If it's not done, it opens
the possibility of a "screaming interrupt" if (and only if)
the IRQ is shared with another device.

When the shared IRQ handler is "unhooked", the remaining devices
using the same IRQ line will still need the IRQ enabled.  Thus if the
"unhooked" device asserts the IRQ line, the system will respond assuming
it was one of the remaining devices that asserted the IRQ line.  Since none
of the other devices will handle the IRQ, the system will "hang" until
it decides the IRQ isn't going to get handled and masks the IRQ (100,000
iterations later).  Once the shared IRQ is masked, the remaining devices
will stop functioning properly.  Not a nice situation.

This is another reason to use MSI or MSI-X if it's available.
MSI and MSI-X are defined to be exclusive interrupts and thus
are not susceptible to the "screaming interrupt" problem.
4.2 Release the IRQ
~~~~~~~~~~~~~~~~~~~
Once the device is quiesced (no more IRQs), one can call free_irq().
This function will return control once any pending IRQs are handled,
"unhook" the driver's IRQ handler from that IRQ, and finally release
the IRQ if no one else is using it.


4.3 Stop all DMA activity
~~~~~~~~~~~~~~~~~~~~~~~~~
It's extremely important to stop all DMA operations BEFORE attempting
to deallocate DMA control data.  Failure to do so can result in memory
corruption, hangs, and on some chip-sets a hard crash.

Stopping DMA after stopping the IRQs can avoid races where the
IRQ handler might restart DMA engines.

While this step sounds obvious and trivial, several "mature" drivers
didn't get this step right in the past.


4.4 Release DMA buffers
~~~~~~~~~~~~~~~~~~~~~~~
Once DMA is stopped, clean up streaming DMA first.
I.e. unmap data buffers and return buffers to "upstream"
owners if there is one.

Then clean up "consistent" buffers which contain the control data.

See Documentation/DMA-API.txt for details on unmapping interfaces.
4.5 Unregister from other subsystems
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Most low level PCI device drivers support some other subsystem
like USB, ALSA, SCSI, NetDev, Infiniband, etc.  Make sure your
driver isn't losing resources from that other subsystem.
If this happens, typically the symptom is an Oops (panic) when
the subsystem attempts to call into a driver that has been unloaded.


4.6 Disable Device from responding to MMIO/IO Port addresses
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
iounmap() MMIO or IO Port resources and then call pci_disable_device().
This is the symmetric opposite of pci_enable_device().
Do not access device registers after calling pci_disable_device().


4.7 Release MMIO/IO Port Resource(s)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Call pci_release_region() to mark the MMIO or IO Port range as available.
Failure to do so usually results in the inability to reload the driver.
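The shutdown steps 4.1-4.7 can be sketched as a remove() routine.  All
"ex_" helpers and the priv structure are hypothetical stand-ins for
chip-specific code; only the ordering is the point:

```c
static void ex_remove(struct pci_dev *pdev)
{
	struct ex_priv *priv = pci_get_drvdata(pdev);

	ex_hw_disable_irqs(priv);		 /* 4.1: quiesce the device */
	free_irq(pci_irq_vector(pdev, 0), pdev); /* 4.2: release the IRQ */
	ex_hw_stop_dma(priv);			 /* 4.3: stop DMA engines */
	ex_free_dma_buffers(priv);		 /* 4.4: streaming first,
						  *	 then consistent */
	unregister_netdev(priv->netdev);	 /* 4.5: other subsystems */
	iounmap(priv->regs);			 /* 4.6: unmap ...          */
	pci_disable_device(pdev);		 /*      ... and disable    */
	pci_release_regions(pdev);		 /* 4.7: release resources  */
}
```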
5. How to access PCI config space
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can use pci_(read|write)_config_(byte|word|dword) to access the config
space of a device represented by struct pci_dev *.  All these functions
return 0 when successful or an error code (PCIBIOS_...) which can be
translated to a text string by pcibios_strerror.  Most drivers expect that
accesses to valid PCI devices don't fail.

If you don't have a struct pci_dev available, you can call
pci_bus_(read|write)_config_(byte|word|dword) to access a given device
and function on that bus.

If you access fields in the standard portion of the config header, please
use symbolic names of locations and bits declared in <linux/pci.h>.

If you need to access Extended PCI Capability registers, just call
pci_find_capability() for the particular capability and it will find the
corresponding register block for you.
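A short sketch of the config space accessors, using the symbolic offsets
from <linux/pci.h> (the "ex_" function name is a placeholder):

```c
static void ex_dump_ids(struct pci_dev *pdev)
{
	u16 vendor;
	u8 revision;
	int pm_off;

	/* Symbolic offsets from <linux/pci.h>, never magic numbers. */
	pci_read_config_word(pdev, PCI_VENDOR_ID, &vendor);
	pci_read_config_byte(pdev, PCI_REVISION_ID, &revision);

	/* Returns the capability's offset in config space, 0 if the
	 * device does not implement that capability. */
	pm_off = pci_find_capability(pdev, PCI_CAP_ID_PM);

	dev_info(&pdev->dev, "vendor %04x rev %02x PM cap @%d\n",
		 vendor, revision, pm_off);
}
```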
6. Other interesting functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

pci_get_domain_bus_and_slot()	Find pci_dev corresponding to given domain,
				bus, and slot number.  If the device is
				found, its reference count is increased.
pci_set_power_state()		Set PCI Power Management state (0=D0 ... 3=D3)
pci_find_capability()		Find specified capability in device's capability
				list.
pci_resource_start()		Returns bus start address for a given PCI region
pci_resource_end()		Returns bus end address for a given PCI region
pci_resource_len()		Returns the byte length of a PCI region
pci_set_drvdata()		Set private driver data pointer for a pci_dev
pci_get_drvdata()		Return private driver data pointer for a pci_dev
pci_set_mwi()			Enable Memory-Write-Invalidate transactions.
pci_clear_mwi()			Disable Memory-Write-Invalidate transactions.
7. Miscellaneous hints
~~~~~~~~~~~~~~~~~~~~~~

When displaying PCI device names to the user (for example when a driver wants
to tell the user what card it has found), please use pci_name(pci_dev).

Always refer to the PCI devices by a pointer to the pci_dev structure.
All PCI layer functions use this identification and it's the only
reasonable one.  Don't use bus/slot/function numbers except for very
special purposes -- on systems with multiple primary buses their semantics
can be pretty complex.

Don't try to turn on Fast Back to Back writes in your driver.  All devices
on the bus need to be capable of doing it, so this is something which needs
to be handled by platform and generic code, not individual drivers.
8. Vendor and device identifications
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Do not add new device or vendor IDs to include/linux/pci_ids.h unless they
are shared across multiple drivers.  You can add private definitions in
your driver if they're helpful, or just use plain hex constants.

The device IDs are arbitrary hex numbers (vendor controlled) and normally used
only in a single location, the pci_device_id table.

Please DO submit new vendor/device IDs to http://pci-ids.ucw.cz/.
There are mirrors of the pci.ids file at http://pciids.sourceforge.net/
and https://github.com/pciutils/pciids.
9. Obsolete functions
~~~~~~~~~~~~~~~~~~~~~

There are several functions which you might come across when trying to
port an old driver to the new PCI interface.  They are no longer present
in the kernel as they aren't compatible with hotplug or PCI domains or
having sane locking.

	pci_find_device()	Superseded by pci_get_device()
	pci_find_subsys()	Superseded by pci_get_subsys()
	pci_find_slot()		Superseded by pci_get_domain_bus_and_slot()
	pci_get_slot()		Superseded by pci_get_domain_bus_and_slot()

The alternative is the traditional PCI device driver that walks PCI
device lists.  This is still possible but discouraged.
10. MMIO Space and "Write Posting"
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Converting a driver from using I/O Port space to using MMIO space
often requires some additional changes.  Specifically, "write posting"
needs to be handled.  Many drivers (e.g. tg3, acenic, sym53c8xx_2)
already do this.  I/O Port space guarantees write transactions reach the PCI
device before the CPU can continue.  Writes to MMIO space allow the CPU
to continue before the transaction reaches the PCI device.  HW weenies
call this "Write Posting" because the write completion is "posted" to
the CPU before the transaction has reached its destination.

Thus, timing sensitive code should add readl() where the CPU is
expected to wait before doing other work.  The classic "bit banging"
sequence works fine for I/O Port space:

	for (i = 8; --i; val >>= 1) {
		outb(val & 1, ioport_reg);	/* write bit */
		udelay(10);
	}

The same sequence for MMIO space should be:

	for (i = 8; --i; val >>= 1) {
		writeb(val & 1, mmio_reg);	/* write bit */
		readb(safe_mmio_reg);		/* flush posted write */
		udelay(10);
	}

It is important that "safe_mmio_reg" not have any side effects that
interfere with the correct operation of the device.

Another case to watch out for is when resetting a PCI device.  Use PCI
Configuration space reads to flush the writel().  This will gracefully
handle the PCI master abort on all platforms if the PCI device is
expected to not respond to a readl().  Most x86 platforms will allow
MMIO reads to master abort (a.k.a. "Soft Fail") and return garbage
(e.g. ~0).  But many RISC platforms will crash (a.k.a. "Hard Fail").
311	Documentation/PCI/pcieaer-howto.rst	Normal file
@ -0,0 +1,311 @@
.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

===========================================================
The PCI Express Advanced Error Reporting Driver Guide HOWTO
===========================================================

:Authors: - T. Long Nguyen <tom.l.nguyen@intel.com>
          - Yanmin Zhang <yanmin.zhang@intel.com>

:Copyright: |copy| 2006 Intel Corporation

Overview
========

About this guide
----------------

This guide describes the basics of the PCI Express Advanced Error
Reporting (AER) driver and provides information on how to use it, as
well as how to enable the drivers of endpoint devices to conform with
the PCI Express AER driver.


What is the PCI Express AER Driver?
-----------------------------------

PCI Express error signaling can occur on the PCI Express link itself
or on behalf of transactions initiated on the link. PCI Express
defines two error reporting paradigms: the baseline capability and
the Advanced Error Reporting capability. The baseline capability is
required of all PCI Express components and provides a minimum defined
set of error reporting requirements. The Advanced Error Reporting
capability is implemented with a PCI Express advanced error reporting
extended capability structure and provides more robust error reporting.

The PCI Express AER driver provides the infrastructure to support the
PCI Express Advanced Error Reporting capability. The PCI Express AER
driver provides three basic functions:

  - Gathers comprehensive error information when errors occur.
  - Reports errors to the users.
  - Performs error recovery actions.

The AER driver only attaches to Root Ports that support the PCI
Express AER capability.

User Guide
==========

Include the PCI Express AER Root Driver into the Linux Kernel
-------------------------------------------------------------

The PCI Express AER Root driver is a Root Port service driver attached
to the PCI Express Port Bus driver. If a user wants to use it, the
driver has to be compiled. It is enabled by the CONFIG_PCIEAER option,
which depends on CONFIG_PCIEPORTBUS, so please set
CONFIG_PCIEPORTBUS=y and CONFIG_PCIEAER=y.

Load PCI Express AER Root Driver
--------------------------------

Some systems have AER support in firmware. Enabling Linux AER support at
the same time the firmware handles AER may result in unpredictable
behavior. Therefore, Linux does not handle AER events unless the firmware
grants AER control to the OS via the ACPI _OSC method. See the PCI FW 3.0
Specification for details regarding _OSC usage.

AER error output
----------------

When a PCIe AER error is captured, an error message is output to the
console. A correctable error is printed as a warning; otherwise it is
printed as an error, so users can choose a different log level to
filter out correctable error messages.

Below shows an example::

  0000:50:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0500(Requester ID)
  0000:50:00.0: device [8086:0329] error status/mask=00100000/00000000
  0000:50:00.0: [20] Unsupported Request (First)
  0000:50:00.0: TLP Header: 04000001 00200a03 05010000 00050100

In the example, 'Requester ID' is the ID of the device that sent the
error message to the Root Port. Please refer to the PCI Express
specifications for the other fields.

AER Statistics / Counters
-------------------------

When PCIe AER errors are captured, the counters / statistics are also
exposed in the form of sysfs attributes, which are documented at
Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats.

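For a quick look at these counters from user space, the attributes can
be read directly (the BDF below reuses the example device above;
attribute names follow sysfs-bus-pci-devices-aer_stats)::

	cat /sys/bus/pci/devices/0000:50:00.0/aer_dev_correctable
	cat /sys/bus/pci/devices/0000:50:00.0/aer_dev_nonfatal
	cat /sys/bus/pci/devices/0000:50:00.0/aer_dev_fatal
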
Developer Guide
===============

Enabling AER aware support requires a software driver to configure
the AER capability structure within its device and to provide callbacks.

To support AER well, developers first need to understand how AER
works.

PCI Express errors are classified into two types: correctable errors
and uncorrectable errors. This classification is based on the impact
of those errors, which may result in degraded performance or function
failure.

Correctable errors pose no impact on the functionality of the
interface. The PCI Express protocol can recover without any software
intervention or any loss of data. These errors are detected and
corrected by hardware. Unlike correctable errors, uncorrectable
errors impact functionality of the interface. Uncorrectable errors
can cause a particular transaction or a particular PCI Express link
to be unreliable. Depending on those error conditions, uncorrectable
errors are further classified into non-fatal errors and fatal errors.
Non-fatal errors cause the particular transaction to be unreliable,
but the PCI Express link itself is fully functional. Fatal errors, on
the other hand, cause the link to be unreliable.

When AER is enabled, a PCI Express device will automatically send an
error message to the PCIe Root Port above it when the device captures
an error. The Root Port, upon receiving an error reporting message,
internally processes and logs the error message in its PCI Express
capability structure. Error information being logged includes storing
the error reporting agent's requester ID into the Error Source
Identification Registers and setting the error bits of the Root Error
Status Register accordingly. If AER error reporting is enabled in the
Root Error Command Register, the Root Port generates an interrupt when
an error is detected.

Note that the errors as described above are related to the PCI Express
hierarchy and links. These errors do not include any device specific
errors because device specific errors are still sent directly to
the device driver.

Configure the AER capability structure
--------------------------------------

AER aware drivers of PCI Express components need to change the device
control registers to enable AER. They may also change AER registers,
including the mask and severity registers. The helper function
pci_enable_pcie_error_reporting() can be used to enable AER. See the
helper functions section below.

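A minimal sketch of enabling AER from a driver's probe routine (the
driver name and error handling are illustrative, not an in-tree
driver)::

	static int my_probe(struct pci_dev *pdev, const struct pci_device_id *id)
	{
		int err = pci_enable_device(pdev);

		if (err)
			return err;

		/* Ask the device to send AER error messages to the Root Port. */
		if (pci_enable_pcie_error_reporting(pdev))
			dev_info(&pdev->dev, "AER reporting not enabled\n");

		return 0;
	}
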
Provide callbacks
-----------------

callback reset_link to reset PCI Express link
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This callback is used to reset the PCI Express physical link when a
fatal error happens. The Root Port AER service driver provides a
default reset_link function, but different upstream ports might
have different requirements for resetting the PCI Express link, so
upstream ports should provide their own reset_link functions.

In struct pcie_port_service_driver, a new pointer, reset_link, is
added.
::

	pci_ers_result_t (*reset_link) (struct pci_dev *dev);

The section on non-correctable errors below provides more detailed
info on when to call reset_link.

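As a sketch, an upstream port AER service driver supplying its own
reset_link might look like this (names are hypothetical, and other
required fields of the structure are omitted)::

	static pci_ers_result_t my_reset_link(struct pci_dev *dev)
	{
		/* Device-specific link reset sequence would go here. */
		return PCI_ERS_RESULT_RECOVERED;
	}

	static struct pcie_port_service_driver my_aer_service = {
		.name		= "my_aer",
		.reset_link	= my_reset_link,
		/* port type, service id, probe, ... */
	};
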
PCI error-recovery callbacks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The PCI Express AER Root driver uses error callbacks to coordinate
with downstream device drivers associated with the hierarchy in
question when performing error recovery actions.

The data structure pci_driver has a pointer, err_handler, which points
to pci_error_handlers, a structure consisting of a couple of callback
function pointers. The AER driver follows the rules defined in
pci-error-recovery.txt except for PCI Express specific parts (e.g.
reset_link). Please refer to pci-error-recovery.txt for detailed
definitions of the callbacks.

The sections below specify when to call the error callback functions.

Correctable errors
~~~~~~~~~~~~~~~~~~

Correctable errors pose no impact on the functionality of
the interface. The PCI Express protocol can recover without any
software intervention or any loss of data. These errors do not
require any recovery actions. The AER driver clears the device's
correctable error status register accordingly and logs these errors.

Non-correctable (non-fatal and fatal) errors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If an error message indicates a non-fatal error, performing a link
reset at upstream is not required. The AER driver calls
error_detected(dev, pci_channel_io_normal) for all drivers associated
with the hierarchy in question. For example::

  EndPoint<==>DownstreamPort B<==>UpstreamPort A<==>RootPort

If Upstream Port A captures an AER error, the hierarchy consists of
Downstream Port B and the EndPoint.

A driver may return PCI_ERS_RESULT_CAN_RECOVER,
PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on
whether it can recover on its own or needs the AER driver to call
mmio_enabled next.

If an error message indicates a fatal error, the kernel broadcasts
error_detected(dev, pci_channel_io_frozen) to all drivers within
the hierarchy in question. Performing a link reset at upstream is then
necessary. As different kinds of devices might use different approaches
to reset a link, the AER port service driver is required to provide the
function to reset the link. First, the kernel checks whether the
upstream component has an AER driver. If it does, the kernel uses the
reset_link callback of that AER driver. If the upstream component has
no AER driver and the port is a Downstream Port, a hot reset is
performed as the default by setting the Secondary Bus Reset bit of the
Bridge Control register associated with the Downstream Port. Upstream
Ports, in contrast, should provide their own AER service drivers with a
reset_link function. If error_detected returns
PCI_ERS_RESULT_CAN_RECOVER and reset_link returns
PCI_ERS_RESULT_RECOVERED, the error handling goes to mmio_enabled.

helper functions
----------------
::

	int pci_enable_pcie_error_reporting(struct pci_dev *dev);

pci_enable_pcie_error_reporting enables the device to send error
messages to the Root Port when an error is detected. Note that devices
do not enable error reporting by default, so device drivers need to
call this function to enable it.

::

	int pci_disable_pcie_error_reporting(struct pci_dev *dev);

pci_disable_pcie_error_reporting stops the device from sending error
messages to the Root Port when an error is detected.

::

	int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev);

pci_cleanup_aer_uncorrect_error_status clears the device's
uncorrectable error status register.

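The helpers above are typically used together with the error-recovery
callbacks, for example clearing stale uncorrectable status after a
reset (a hypothetical driver fragment, not an in-tree driver)::

	static pci_ers_result_t my_slot_reset(struct pci_dev *pdev)
	{
		/* After the reset, clear stale uncorrectable error status. */
		pci_cleanup_aer_uncorrect_error_status(pdev);
		return PCI_ERS_RESULT_RECOVERED;
	}

	static const struct pci_error_handlers my_err_handler = {
		.slot_reset = my_slot_reset,
	};

	static struct pci_driver my_driver = {
		.name		= "my_driver",
		.err_handler	= &my_err_handler,
		/* id_table, probe, remove, ... */
	};
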
Frequently Asked Questions
--------------------------

Q:
  What happens if a PCI Express device driver does not provide an
  error recovery handler (pci_driver->err_handler is equal to NULL)?

A:
  The devices bound to that driver won't be recovered. If the
  error is fatal, the kernel will print warning messages. Please
  refer to the Developer Guide above for more information.

Q:
  What happens if an upstream port service driver does not provide
  the reset_link callback?

A:
  Fatal error recovery will fail if the errors are reported by
  upstream ports to which the service driver is attached.

Q:
  How does this infrastructure deal with a driver that is not PCI
  Express aware?

A:
  This infrastructure calls the error callback functions of the
  driver when an error happens. But if the driver is not aware of
  PCI Express, the device might not report its own errors to the
  Root Port.

Q:
  What modifications will such a driver need to make it compatible
  with the PCI Express AER Root driver?

A:
  It could call the helper functions to enable AER in devices and
  clear the uncorrectable status register. Please refer to the
  helper functions section above.


Software error injection
========================

Debugging PCIe AER error recovery code is quite difficult because it
is hard to trigger real hardware errors. Software based error
injection can be used to fake various kinds of PCIe errors.

First you should enable PCIe AER software error injection in the
kernel configuration, that is, the following item should be in your
.config:

  CONFIG_PCIEAER_INJECT=y or CONFIG_PCIEAER_INJECT=m

After rebooting with the new kernel or inserting the module, a device
file named /dev/aer_inject should be created.

Then, you need a user space tool named aer-inject, which can be
obtained from:

  https://git.kernel.org/cgit/linux/kernel/git/gong.chen/aer-inject.git/

More information about aer-inject can be found in the documentation
that comes with its source code.
@@ -1,267 +0,0 @@
220	Documentation/PCI/picebus-howto.rst	Normal file
@@ -0,0 +1,220 @@
.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

===========================================
The PCI Express Port Bus Driver Guide HOWTO
===========================================

:Author: Tom L Nguyen tom.l.nguyen@intel.com 11/03/2004
:Copyright: |copy| 2004 Intel Corporation

About this guide
================

This guide describes the basics of the PCI Express Port Bus driver
and provides information on how to enable the service drivers to
register/unregister with the PCI Express Port Bus Driver.


What is the PCI Express Port Bus Driver
=======================================

A PCI Express Port is a logical PCI-PCI Bridge structure. There
are two types of PCI Express Port: the Root Port and the Switch
Port. The Root Port originates a PCI Express link from a PCI Express
Root Complex and the Switch Port connects PCI Express links to
internal logical PCI buses. The Switch Port, which has its secondary
bus representing the switch's internal routing logic, is called the
switch's Upstream Port. The switch's Downstream Port bridges from the
switch's internal routing bus to a bus representing the downstream
PCI Express link from the PCI Express Switch.

A PCI Express Port can provide up to four distinct functions,
referred to in this document as services, depending on its port type.
A PCI Express Port's services include native hotplug support (HP),
power management event support (PME), advanced error reporting
support (AER), and virtual channel support (VC). These services may
be handled by a single complex driver or be individually distributed
and handled by corresponding service drivers.

Why use the PCI Express Port Bus Driver?
========================================

In existing Linux kernels, the Linux Device Driver Model allows a
physical device to be handled by only a single driver. The PCI
Express Port is a PCI-PCI Bridge device with multiple distinct
services. To maintain a clean and simple solution each service
may have its own software service driver. In this case several
service drivers will compete for a single PCI-PCI Bridge device.
For example, if the PCI Express Root Port native hotplug service
driver is loaded first, it claims a PCI-PCI Bridge Root Port. The
kernel therefore does not load other service drivers for that Root
Port. In other words, it is impossible to have multiple service
drivers load and run on a PCI-PCI Bridge device simultaneously
using the current driver model.

Enabling multiple service drivers to run simultaneously requires
a PCI Express Port Bus driver, which manages all populated
PCI Express Ports and distributes all provided service requests
to the corresponding service drivers as required. Some key
advantages of using the PCI Express Port Bus driver are listed below:

  - Allow multiple service drivers to run simultaneously on
    a PCI-PCI Bridge Port device.

  - Allow service drivers to be implemented in an independent
    staged approach.

  - Allow one service driver to run on multiple PCI-PCI Bridge
    Port devices.

  - Manage and distribute resources of a PCI-PCI Bridge Port
    device to requested service drivers.

Configuring the PCI Express Port Bus Driver vs. Service Drivers
===============================================================

Including the PCI Express Port Bus Driver Support into the Kernel
-----------------------------------------------------------------

Including the PCI Express Port Bus driver depends on whether PCI
Express support is included in the kernel config. The kernel will
automatically include the PCI Express Port Bus driver as a kernel
driver when PCI Express support is enabled in the kernel.

Enabling Service Driver Support
-------------------------------

PCI device drivers are implemented based on the Linux Device Driver
Model. All service drivers are PCI device drivers. As discussed
above, it is impossible to load any service driver once the kernel
has loaded the PCI Express Port Bus Driver. Meeting the PCI Express
Port Bus Driver Model requires some minimal changes to existing
service drivers; these changes have no impact on the functionality
of existing service drivers.

A service driver is required to use the two APIs shown below to
register its service with the PCI Express Port Bus driver (see
section 5.2.1 & 5.2.2). It is important that a service driver
initializes the pcie_port_service_driver data structure, included in
header file /include/linux/pcieport_if.h, before calling these APIs.
Failure to do so will result in an identity mismatch, which prevents
the PCI Express Port Bus driver from loading a service driver.

pcie_port_service_register
~~~~~~~~~~~~~~~~~~~~~~~~~~
::

  int pcie_port_service_register(struct pcie_port_service_driver *new)

This API replaces the Linux Driver Model's pci_register_driver API. A
service driver should always call pcie_port_service_register at
module init. Note that after a service driver is loaded, calls
such as pci_enable_device(dev) and pci_set_master(dev) are no longer
necessary since these calls are executed by the PCI Port Bus driver.

pcie_port_service_unregister
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
::

  void pcie_port_service_unregister(struct pcie_port_service_driver *new)

pcie_port_service_unregister replaces the Linux Driver Model's
pci_unregister_driver. It is always called by a service driver when a
module exits.

Sample Code
~~~~~~~~~~~

Below is sample service driver code to initialize the port service
driver data structure.
::

  static struct pcie_port_service_id service_id[] = { {
    .vendor = PCI_ANY_ID,
    .device = PCI_ANY_ID,
    .port_type = PCIE_RC_PORT,
    .service_type = PCIE_PORT_SERVICE_AER,
    }, { /* end: all zeroes */ }
  };

  static struct pcie_port_service_driver root_aerdrv = {
    .name = (char *)device_name,
    .id_table = &service_id[0],

    .probe = aerdrv_load,
    .remove = aerdrv_unload,

    .suspend = aerdrv_suspend,
    .resume = aerdrv_resume,
  };

Below is sample code for registering/unregistering a service
driver.
::

  static int __init aerdrv_service_init(void)
  {
    int retval = 0;

    retval = pcie_port_service_register(&root_aerdrv);
    if (!retval) {
      /*
       * FIX ME
       */
    }
    return retval;
  }

  static void __exit aerdrv_service_exit(void)
  {
    pcie_port_service_unregister(&root_aerdrv);
  }

  module_init(aerdrv_service_init);
  module_exit(aerdrv_service_exit);

Possible Resource Conflicts
===========================

Since all service drivers of a PCI-PCI Bridge Port device are
allowed to run simultaneously, a few possible resource conflicts are
listed below, together with proposed solutions.

MSI and MSI-X Vector Resource
-----------------------------

Once MSI or MSI-X interrupts are enabled on a device, it stays in this
mode until they are disabled again. Since service drivers of the same
PCI-PCI Bridge port share the same physical device, if an individual
service driver enables or disables MSI/MSI-X mode it may result in
unpredictable behavior.

To avoid this situation, service drivers are not permitted to
switch interrupt mode on their device. The PCI Express Port Bus driver
is responsible for determining the interrupt mode and this should be
transparent to service drivers. Service drivers need to know only
the vector IRQ assigned to the field irq of struct pcie_device, which
is passed in when the PCI Express Port Bus driver probes each service
driver. Service drivers should use (struct pcie_device*)dev->irq to
call request_irq/free_irq. In addition, the interrupt mode is stored
in the field interrupt_mode of struct pcie_device.

PCI Memory/IO Mapped Regions
----------------------------

Service drivers for PCI Express Power Management (PME), Advanced
Error Reporting (AER), Hot-Plug (HP) and Virtual Channel (VC) access
PCI configuration space on the PCI Express port. In all cases the
registers accessed are independent of each other. This patch assumes
that all service drivers will be well behaved and not overwrite
other service drivers' configuration settings.

PCI Config Registers
--------------------

Each service driver runs its PCI config operations on its own
capability structure except the PCI Express capability structure, in
which the Root Control register and Device Control register are shared
between PME and AER. This patch assumes that all service drivers
will be well behaved and not overwrite other service drivers'
configuration settings.

:orphan:

========================================================
OpenCAPI (Open Coherent Accelerator Processor Interface)
========================================================

OpenCAPI is an interface between processors and accelerators. It aims
at being low-latency and high-bandwidth. The specification is
developed by the `OpenCAPI Consortium <http://opencapi.org/>`_.

It allows an accelerator (which could be an FPGA, ASICs, ...) to access
the host memory coherently, using virtual addresses. An OpenCAPI
device can also host its own memory, which can be accessed from the
host.

OpenCAPI is known in Linux as 'ocxl', as the open, processor-agnostic
evolution of 'cxl' (the driver for the IBM CAPI interface for
powerpc), which was named that way to avoid confusion with the ISDN
CAPI subsystem.

High-level view
===============

OpenCAPI defines a Data Link Layer (DL) and Transaction Layer (TL), to
be implemented on top of a physical link. Any processor or device
implementing the DL and TL can start sharing memory.

::

    +-----------+                         +-------------+
    |           |                         |             |
    |           |                         | Accelerated |
    | Processor |                         |  Function   |
    |           |  +--------+             |    Unit     |  +--------+
    |           |--| Memory |             |    (AFU)    |--| Memory |
    |           |  +--------+             |             |  +--------+
    +-----------+                         +-------------+
         |                                       |
    +-----------+                         +-------------+
    |    TL     |                         |    TLX      |
    +-----------+                         +-------------+
         |                                       |
    +-----------+                         +-------------+
    |    DL     |                         |    DLX      |
    +-----------+                         +-------------+
         |                                       |
         |                   PHY                 |
         +---------------------------------------+

Device discovery
================

OpenCAPI relies on a PCI-like configuration space, implemented on the
device, so the host can discover AFUs by querying the config space.

OpenCAPI devices in Linux are treated like PCI devices (with a few
caveats). The firmware is expected to abstract the hardware as if it
were a PCI link. A lot of the existing PCI infrastructure is reused:
devices are scanned and BARs are assigned during the standard PCI
enumeration. Commands like 'lspci' can therefore be used to see what
devices are available.

The configuration space defines the AFU(s) that can be found on the
physical adapter, such as its name, how many memory contexts it can
work with, the size of its MMIO areas, ...

MMIO
====

OpenCAPI defines two MMIO areas for each AFU:

* the global MMIO area, with registers pertinent to the whole AFU.
* a per-process MMIO area, which has a fixed size for each context.

AFU interrupts
==============

OpenCAPI includes the possibility for an AFU to send an interrupt to a
host process. It is done through an 'intrp_req' defined in the
Transaction Layer, specifying a 64-bit object handle which defines the
interrupt.

The driver allows a process to allocate an interrupt and obtain its
64-bit object handle, which can be passed to the AFU.

char devices
============

The driver creates one char device per AFU found on the physical
device. A physical device may have multiple functions and each
function can have multiple AFUs. At the time of this writing though,
it has only been tested with devices exporting only one AFU.

Char devices can be found in /dev/ocxl/ and are named as:
/dev/ocxl/<AFU name>.<location>.<index>

where <AFU name> is a max 20-character long name, as found in the
config space of the AFU.
<location> is added by the driver and can help distinguish devices
when a system has more than one instance of the same OpenCAPI device.
<index> is also to help distinguish AFUs in the unlikely case where a
device carries multiple copies of the same AFU.
Sysfs class
===========

An ocxl class is added for the devices representing the AFUs. See
/sys/class/ocxl. The layout is described in
Documentation/ABI/testing/sysfs-class-ocxl

User API
========

open
----

Based on the AFU definition found in the config space, an AFU may
support working with more than one memory context, in which case the
associated char device may be opened multiple times by different
processes.

ioctl
-----

OCXL_IOCTL_ATTACH:

  Attach the memory context of the calling process to the AFU so that
  the AFU can access its memory.

OCXL_IOCTL_IRQ_ALLOC:

  Allocate an AFU interrupt and return an identifier.

OCXL_IOCTL_IRQ_FREE:

  Free a previously allocated AFU interrupt.

OCXL_IOCTL_IRQ_SET_FD:

  Associate an event fd to an AFU interrupt so that the user process
  can be notified when the AFU sends an interrupt.

OCXL_IOCTL_GET_METADATA:

  Obtains configuration information from the card, such as the size of
  MMIO areas, the AFU version, and the PASID for the current context.

OCXL_IOCTL_ENABLE_P9_WAIT:

  Allows the AFU to wake a userspace thread executing 'wait'. Returns
  information to userspace to allow it to configure the AFU. Note that
  this is only available on POWER9.

OCXL_IOCTL_GET_FEATURES:

  Reports which of the CPU features that affect OpenCAPI are usable
  from userspace.

mmap
----

A process can mmap the per-process MMIO area for interactions with the
AFU.

==================
Control Groupstats
==================

Control Groupstats is inspired by the discussion at
http://lkml.org/lkml/2007/4/11/187 and implements per-cgroup statistics as
suggested by Andrew Morton in http://lkml.org/lkml/2007/4/11/263.

The per-cgroup statistics infrastructure re-uses code from the taskstats
interface. A new set of cgroup operations is registered with commands
and attributes specific to cgroups. It should be very easy to
extend per-cgroup statistics, by adding members to the cgroupstats
structure.

The current model for cgroupstats is a pull; a push model (to post
statistics on interesting events) should be very easy to add. Currently
user space requests statistics by passing the cgroup path.
Statistics about the state of all the tasks in the cgroup are returned
to user space.

NOTE: We currently rely on delay accounting for extracting information
about tasks blocked on I/O. If CONFIG_TASK_DELAY_ACCT is disabled, this
information will not be available.

To extract cgroup statistics a utility very similar to getdelays.c
has been developed; the sample output of the utility is shown below::

	~/balbir/cgroupstats # ./getdelays -C "/sys/fs/cgroup/a"
	sleeping 1, blocked 0, running 1, stopped 0, uninterruptible 0
	~/balbir/cgroupstats # ./getdelays -C "/sys/fs/cgroup"
	sleeping 155, blocked 0, running 1, stopped 0, uninterruptible 2

================
Delay accounting
================

Tasks encounter delays in execution when they wait
for some kernel resource to become available, e.g. a
runnable task may wait for a free CPU to run on.

The per-task delay accounting functionality measures
the delays experienced by a task while

a) waiting for a CPU (while being runnable)
b) completion of synchronous block I/O initiated by the task
c) swapping in pages
d) memory reclaim

and makes these statistics available to userspace through
the taskstats interface.

Such delays provide feedback for setting a task's cpu priority,
io priority and rss limit values appropriately. Long delays for
important tasks could be a trigger for raising their corresponding
priority.

The functionality, through its use of the taskstats interface, also provides
delay statistics aggregated for all tasks (or threads) belonging to a
thread group (corresponding to a traditional Unix process). This is a commonly
needed aggregation that is more efficiently done by the kernel.

Userspace utilities, particularly resource management applications, can also
aggregate delay statistics into arbitrary groups. To enable this, delay
statistics of a task are available both during its lifetime as well as on its
exit, ensuring continuous and complete monitoring can be done.


Interface
---------

Delay accounting uses the taskstats interface which is described
in detail in a separate document in this directory. Taskstats returns a
generic data structure to userspace corresponding to per-pid and per-tgid
statistics. The delay accounting functionality populates specific fields of
this structure. See

     include/linux/taskstats.h

for a description of the fields pertaining to delay accounting.
They will generally be in the form of counters returning the cumulative
delay seen for cpu, sync block I/O, swapin, memory reclaim etc.

Taking the difference of two successive readings of a given
counter (say cpu_delay_total) for a task will give the delay
experienced by the task waiting for the corresponding resource
in that interval.

When a task exits, records containing the per-task statistics
are sent to userspace without requiring a command. If it is the last exiting
task of a thread group, the per-tgid statistics are also sent. More details
are given in the taskstats interface description.

The getdelays.c userspace utility in the tools/accounting directory allows
simple commands to be run and the corresponding delay statistics to be
displayed. It also serves as an example of using the taskstats interface.

Usage
-----

Compile the kernel with::

	CONFIG_TASK_DELAY_ACCT=y
	CONFIG_TASKSTATS=y

Delay accounting is enabled by default at boot up.
To disable, add::

	nodelayacct

to the kernel boot options. The rest of the instructions
below assume this has not been done.

After the system has booted up, use a utility
similar to getdelays.c to access the delays
seen by a given task or a task group (tgid).
The utility also allows a given command to be
executed and the corresponding delays to be
seen.

General format of the getdelays command::

	getdelays [-t tgid] [-p pid] [-c cmd...]


Get delays, since system boot, for pid 10::

	# ./getdelays -p 10
	(output similar to next case)

Get sum of delays, since system boot, for all pids with tgid 5::

	# ./getdelays -t 5


	CPU      count   real total   virtual total   delay total
	         7876    92005750     100000000       24001500
	IO       count   delay total
	         0       0
	SWAP     count   delay total
	         0       0
	RECLAIM  count   delay total
	         0       0

Get delays seen in executing a given simple command::

	# ./getdelays -c ls /

	bin   data1  data3  data5  dev  home  media  opt   root  srv        sys  usr
	boot  data2  data4  data6  etc  lib   mnt    proc  sbin  subdomain  tmp  var


	CPU      count   real total   virtual total   delay total
	         6       4000250      4000000         0
	IO       count   delay total
	         0       0
	SWAP     count   delay total
	         0       0
	RECLAIM  count   delay total
	         0       0

.. SPDX-License-Identifier: GPL-2.0

==========
Accounting
==========

.. toctree::
   :maxdepth: 1

   cgroupstats
   delay-accounting
   psi
   taskstats
   taskstats-struct

================================
PSI - Pressure Stall Information
================================

:Date: April, 2018
:Author: Johannes Weiner <hannes@cmpxchg.org>

When CPU, memory or IO devices are contended, workloads experience
latency spikes, throughput losses, and run the risk of OOM kills.

Without an accurate measure of such contention, users are forced to
either play it safe and under-utilize their hardware resources, or
roll the dice and frequently suffer the disruptions resulting from
excessive overcommit.

The psi feature identifies and quantifies the disruptions caused by
such resource crunches and the time impact they have on complex workloads
or even entire systems.

Having an accurate measure of productivity losses caused by resource
scarcity aids users in sizing workloads to hardware--or provisioning
hardware according to workload demand.

As psi aggregates this information in realtime, systems can be managed
dynamically using techniques such as load shedding, migrating jobs to
other systems or data centers, or strategically pausing or killing low
priority or restartable batch jobs.

This allows maximizing hardware utilization without sacrificing
workload health or risking major disruptions such as OOM kills.

Pressure interface
==================

Pressure information for each resource is exported through the
respective file in /proc/pressure/ -- cpu, memory, and io.

The format for CPU is as such::

	some avg10=0.00 avg60=0.00 avg300=0.00 total=0

and for memory and IO::

	some avg10=0.00 avg60=0.00 avg300=0.00 total=0
	full avg10=0.00 avg60=0.00 avg300=0.00 total=0

The "some" line indicates the share of time in which at least some
tasks are stalled on a given resource.

The "full" line indicates the share of time in which all non-idle
tasks are stalled on a given resource simultaneously. In this state
actual CPU cycles are going to waste, and a workload that spends
extended time in this state is considered to be thrashing. This has
severe impact on performance, and it's useful to distinguish this
situation from a state where some tasks are stalled but the CPU is
still doing productive work. As such, time spent in this subset of the
stall state is tracked separately and exported in the "full" averages.

The ratios (in %) are tracked as recent trends over ten-, sixty-, and
three-hundred-second windows, which gives insight into short term events
as well as medium and long term trends. The total absolute stall time
(in us) is tracked and exported as well, to allow detection of latency
spikes which wouldn't necessarily make a dent in the time averages,
or to average trends over custom time frames.

Monitoring for pressure thresholds
==================================

Users can register triggers and use poll() to be woken up when resource
pressure exceeds certain thresholds.

A trigger describes the maximum cumulative stall time over a specific
time window, e.g. 100ms of total stall time within any 500ms window to
generate a wakeup event.

To register a trigger, a user has to open the psi interface file under
/proc/pressure/ representing the resource to be monitored and write the
desired threshold and time window. The open file descriptor should be
used to wait for trigger events using select(), poll() or epoll().
The following format is used::

	<some|full> <stall amount in us> <time window in us>

For example writing "some 150000 1000000" into /proc/pressure/memory
would add a 150ms threshold for partial memory stall measured within
a 1sec time window. Writing "full 50000 1000000" into /proc/pressure/io
would add a 50ms threshold for full io stall measured within a 1sec time
window.

Triggers can be set on more than one psi metric and more than one trigger
for the same psi metric can be specified. However, for each trigger a separate
file descriptor is required to be able to poll it separately from others;
therefore for each trigger a separate open() syscall should be made even
when opening the same psi interface file.
|
||||||
|
|
||||||
|
Monitors activate only when system enters stall state for the monitored
|
||||||
|
psi metric and deactivates upon exit from the stall state. While system is
|
||||||
|
in the stall state psi signal growth is monitored at a rate of 10 times per
|
||||||
|
tracking window.
|
||||||
|
|
||||||
|
The kernel accepts window sizes ranging from 500ms to 10s, therefore min
|
||||||
|
monitoring update interval is 50ms and max is 1s. Min limit is set to
|
||||||
|
prevent overly frequent polling. Max limit is chosen as a high enough number
|
||||||
|
after which monitors are most likely not needed and psi averages can be used
|
||||||
|
instead.
|
||||||
|
|
||||||
|
When activated, psi monitor stays active for at least the duration of one
|
||||||
|
tracking window to avoid repeated activations/deactivations when system is
|
||||||
|
bouncing in and out of the stall state.
|
||||||
|
|
||||||
|
Notifications to the userspace are rate-limited to one per tracking window.
|
||||||
|
|
||||||
|
The trigger will de-register when the file descriptor used to define the
|
||||||
|
trigger is closed.
|
||||||
|
|
||||||
|
Userspace monitor usage example
===============================

::

  #include <errno.h>
  #include <fcntl.h>
  #include <stdio.h>
  #include <poll.h>
  #include <string.h>
  #include <unistd.h>

  /*
   * Monitor memory partial stall with 1s tracking window size
   * and 150ms threshold.
   */
  int main() {
	const char trig[] = "some 150000 1000000";
	struct pollfd fds;
	int n;

	fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
	if (fds.fd < 0) {
		printf("/proc/pressure/memory open error: %s\n",
			strerror(errno));
		return 1;
	}
	fds.events = POLLPRI;

	if (write(fds.fd, trig, strlen(trig) + 1) < 0) {
		printf("/proc/pressure/memory write error: %s\n",
			strerror(errno));
		return 1;
	}

	printf("waiting for events...\n");
	while (1) {
		n = poll(&fds, 1, -1);
		if (n < 0) {
			printf("poll error: %s\n", strerror(errno));
			return 1;
		}
		if (fds.revents & POLLERR) {
			printf("got POLLERR, event source is gone\n");
			return 0;
		}
		if (fds.revents & POLLPRI) {
			printf("event triggered!\n");
		} else {
			printf("unknown event received: 0x%x\n", fds.revents);
			return 1;
		}
	}

	return 0;
  }
Cgroup2 interface
=================

In a system with a CONFIG_CGROUP=y kernel and the cgroup2 filesystem
mounted, pressure stall information is also tracked for tasks grouped
into cgroups. Each subdirectory in the cgroupfs mountpoint contains
cpu.pressure, memory.pressure, and io.pressure files; the format is
the same as the /proc/pressure/ files.

Per-cgroup psi monitors can be specified and used the same way as
system-wide ones.
Documentation/accounting/taskstats-struct.rst (new file, 199 lines)
====================
The struct taskstats
====================

This document contains an explanation of the struct taskstats fields.

There are several different groups of fields in the struct taskstats:

1) Common and basic accounting fields
	If CONFIG_TASKSTATS is set, the taskstats interface is enabled and
	the common fields and basic accounting fields are collected for
	delivery at do_exit() of a task.
2) Delay accounting fields
	These fields are placed between::

		/* Delay accounting fields start */

	and::

		/* Delay accounting fields end */

	Their values are collected if CONFIG_TASK_DELAY_ACCT is set.
3) Extended accounting fields
	These fields are placed between::

		/* Extended accounting fields start */

	and::

		/* Extended accounting fields end */

	Their values are collected if CONFIG_TASK_XACCT is set.

4) Per-task and per-thread context switch count statistics

5) Time accounting for SMT machines

6) Extended delay accounting fields for memory reclaim

Future extensions should add fields to the end of the taskstats struct, and
should not change the relative position of each field within the struct.

::

  struct taskstats {

1) Common and basic accounting fields::

	/* The version number of this struct. This field is always set to
	 * TASKSTATS_VERSION, which is defined in <linux/taskstats.h>.
	 * Each time the struct is changed, the value should be incremented.
	 */
	__u16	version;

	/* The exit code of a task. */
	__u32	ac_exitcode;		/* Exit status */

	/* The accounting flags of a task as defined in <linux/acct.h>
	 * Defined values are AFORK, ASU, ACOMPAT, ACORE, and AXSIG.
	 */
	__u8	ac_flag;		/* Record flags */

	/* The value of task_nice() of a task. */
	__u8	ac_nice;		/* task_nice */

	/* The name of the command that started this task. */
	char	ac_comm[TS_COMM_LEN];	/* Command name */

	/* The scheduling discipline as set in task->policy field. */
	__u8	ac_sched;		/* Scheduling discipline */

	__u8	ac_pad[3];
	__u32	ac_uid;			/* User ID */
	__u32	ac_gid;			/* Group ID */
	__u32	ac_pid;			/* Process ID */
	__u32	ac_ppid;		/* Parent process ID */

	/* The time when a task begins, in [secs] since 1970. */
	__u32	ac_btime;		/* Begin time [sec since 1970] */

	/* The elapsed time of a task, in [usec]. */
	__u64	ac_etime;		/* Elapsed time [usec] */

	/* The user CPU time of a task, in [usec]. */
	__u64	ac_utime;		/* User CPU time [usec] */

	/* The system CPU time of a task, in [usec]. */
	__u64	ac_stime;		/* System CPU time [usec] */

	/* The minor page fault count of a task, as set in task->min_flt. */
	__u64	ac_minflt;		/* Minor Page Fault Count */

	/* The major page fault count of a task, as set in task->maj_flt. */
	__u64	ac_majflt;		/* Major Page Fault Count */


2) Delay accounting fields::

	/* Delay accounting fields start
	 *
	 * All values, until the comment "Delay accounting fields end" are
	 * available only if delay accounting is enabled, even though the last
	 * few fields are not delays
	 *
	 * xxx_count is the number of delay values recorded
	 * xxx_delay_total is the corresponding cumulative delay in nanoseconds
	 *
	 * xxx_delay_total wraps around to zero on overflow
	 * xxx_count incremented regardless of overflow
	 */

	/* Delay waiting for cpu, while runnable
	 * count, delay_total NOT updated atomically
	 */
	__u64	cpu_count;
	__u64	cpu_delay_total;

	/* Following four fields atomically updated using task->delays->lock */

	/* Delay waiting for synchronous block I/O to complete
	 * does not account for delays in I/O submission
	 */
	__u64	blkio_count;
	__u64	blkio_delay_total;

	/* Delay waiting for page fault I/O (swap in only) */
	__u64	swapin_count;
	__u64	swapin_delay_total;

	/* cpu "wall-clock" running time
	 * On some architectures, value will adjust for cpu time stolen
	 * from the kernel in involuntary waits due to virtualization.
	 * Value is cumulative, in nanoseconds, without a corresponding count
	 * and wraps around to zero silently on overflow
	 */
	__u64	cpu_run_real_total;

	/* cpu "virtual" running time
	 * Uses time intervals seen by the kernel i.e. no adjustment
	 * for kernel's involuntary waits due to virtualization.
	 * Value is cumulative, in nanoseconds, without a corresponding count
	 * and wraps around to zero silently on overflow
	 */
	__u64	cpu_run_virtual_total;
	/* Delay accounting fields end */
	/* version 1 ends here */


3) Extended accounting fields::

	/* Extended accounting fields start */

	/* Accumulated RSS usage in duration of a task, in MBytes-usecs.
	 * The current rss usage is added to this counter every time
	 * a tick is charged to a task's system time. So, at the end we
	 * will have memory usage multiplied by system time. Thus an
	 * average usage per system time unit can be calculated.
	 */
	__u64	coremem;		/* accumulated RSS usage in MB-usec */

	/* Accumulated virtual memory usage in duration of a task.
	 * Same as acct_rss_mem1 above except that we keep track of VM usage.
	 */
	__u64	virtmem;		/* accumulated VM usage in MB-usec */

	/* High watermark of RSS usage in duration of a task, in KBytes. */
	__u64	hiwater_rss;		/* High-watermark of RSS usage */

	/* High watermark of VM usage in duration of a task, in KBytes. */
	__u64	hiwater_vm;		/* High-water virtual memory usage */

	/* The following four fields are I/O statistics of a task. */
	__u64	read_char;		/* bytes read */
	__u64	write_char;		/* bytes written */
	__u64	read_syscalls;		/* read syscalls */
	__u64	write_syscalls;		/* write syscalls */

	/* Extended accounting fields end */

4) Per-task and per-thread statistics::

	__u64	nvcsw;			/* Context voluntary switch counter */
	__u64	nivcsw;			/* Context involuntary switch counter */

5) Time accounting for SMT machines::

	__u64	ac_utimescaled;		/* utime scaled on frequency etc */
	__u64	ac_stimescaled;		/* stime scaled on frequency etc */
	__u64	cpu_scaled_run_real_total; /* scaled cpu_run_real_total */

6) Extended delay accounting fields for memory reclaim::

	/* Delay waiting for memory reclaim */
	__u64	freepages_count;
	__u64	freepages_delay_total;

::

  }
Documentation/accounting/taskstats.rst (new file, 180 lines)
=============================
Per-task statistics interface
=============================


Taskstats is a netlink-based interface for sending per-task and
per-process statistics from the kernel to userspace.

Taskstats was designed for the following benefits:

- efficiently provide statistics during the lifetime of a task and on its exit
- unified interface for multiple accounting subsystems
- extensibility for use by future accounting patches

Terminology
-----------

"pid", "tid" and "task" are used interchangeably and refer to the standard
Linux task defined by struct task_struct. Per-pid stats are the same as
per-task stats.

"tgid", "process" and "thread group" are used interchangeably and refer to the
tasks that share an mm_struct, i.e. the traditional Unix process. Despite the
use of tgid, there is no special treatment for the task that is thread group
leader - a process is deemed alive as long as it has any task belonging to it.

Usage
-----

To get statistics during a task's lifetime, userspace opens a unicast netlink
socket (NETLINK_GENERIC family) and sends commands specifying a pid or a tgid.
The response contains statistics for a task (if pid is specified) or the sum of
statistics for all tasks of the process (if tgid is specified).

To obtain statistics for tasks which are exiting, the userspace listener
sends a register command and specifies a cpumask. Whenever a task exits on
one of the cpus in the cpumask, its per-pid statistics are sent to the
registered listener. Using cpumasks limits the data received by a single
listener and assists in flow control over the netlink interface; this is
explained in more detail below.

If the exiting task is the last thread exiting its thread group,
an additional record containing the per-tgid stats is also sent to userspace.
The latter contains the sum of per-pid stats for all threads in the thread
group, both past and present.

getdelays.c is a simple utility demonstrating usage of the taskstats interface
for reporting delay accounting statistics. Users can register cpumasks,
send commands and process responses, listen for per-tid/tgid exit data,
write the data received to a file and do basic flow control by increasing
receive buffer sizes.
Interface
|
||||||
|
---------
|
||||||
|
|
||||||
|
The user-kernel interface is encapsulated in include/linux/taskstats.h
|
||||||
|
|
||||||
|
To avoid this documentation becoming obsolete as the interface evolves, only
|
||||||
|
an outline of the current version is given. taskstats.h always overrides the
|
||||||
|
description here.
|
||||||
|
|
||||||
|
struct taskstats is the common accounting structure for both per-pid and
|
||||||
|
per-tgid data. It is versioned and can be extended by each accounting subsystem
|
||||||
|
that is added to the kernel. The fields and their semantics are defined in the
|
||||||
|
taskstats.h file.
|
||||||
|
|
||||||
|
The data exchanged between user and kernel space is a netlink message belonging
|
||||||
|
to the NETLINK_GENERIC family and using the netlink attributes interface.
|
||||||
|
The messages are in the format::
|
||||||
|
|
||||||
|
+----------+- - -+-------------+-------------------+
|
||||||
|
| nlmsghdr | Pad | genlmsghdr | taskstats payload |
|
||||||
|
+----------+- - -+-------------+-------------------+
|
||||||
|
|
||||||
|
|
||||||
|
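As a purely illustrative sketch of this layout, the two headers can be packed in front of an attribute payload as below. The field order follows struct nlmsghdr and struct genlmsghdr from the uapi headers, but the family id, command number, and payload bytes here are placeholders, not real taskstats values:

```python
import struct

NLMSG_HDRLEN = 16   # struct nlmsghdr: u32 len, u16 type, u16 flags, u32 seq, u32 pid
GENL_HDRLEN = 4     # struct genlmsghdr: u8 cmd, u8 version, u16 reserved

def build_genl_msg(family_id, cmd, version, payload):
    """Prepend a genlmsghdr and an nlmsghdr to an attribute payload."""
    genl = struct.pack("BBH", cmd, version, 0)           # genlmsghdr
    total = NLMSG_HDRLEN + len(genl) + len(payload)      # nlmsg_len counts everything
    nl = struct.pack("IHHII", total, family_id, 1, 0, 0) # nlmsghdr (flags=NLM_F_REQUEST)
    return nl + genl + payload

# Placeholder family id and an 8-byte dummy payload, just to show the framing.
msg = build_genl_msg(0x10, 1, 1, b"\x00" * 8)
assert len(msg) == NLMSG_HDRLEN + GENL_HDRLEN + 8
```

In a real listener the family id is not a constant; it must first be resolved by querying the generic netlink controller for the "TASKSTATS" family name.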
The taskstats payload is one of the following three kinds:

1. Commands: Sent from user to kernel. Commands to get data on
a pid/tgid consist of one attribute, of type TASKSTATS_CMD_ATTR_PID/TGID,
containing a u32 pid or tgid in the attribute payload. The pid/tgid denotes
the task/process for which userspace wants statistics.

Commands to register/deregister interest in exit data from a set of cpus
consist of one attribute, of type
TASKSTATS_CMD_ATTR_REGISTER/DEREGISTER_CPUMASK and contain a cpumask in the
attribute payload. The cpumask is specified as an ascii string of
comma-separated cpu ranges e.g. to listen to exit data from cpus 1,2,3,5,7,8
the cpumask would be "1-3,5,7-8". If userspace forgets to deregister interest
in cpus before closing the listening socket, the kernel cleans up its interest
set over time. However, for the sake of efficiency, an explicit deregistration
is advisable.

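The range syntax is easy to expand on the userspace side. A minimal sketch (the kernel does its own parsing; this helper is only for illustration):

```python
def parse_cpumask(spec):
    """Expand a cpumask string like "1-3,5,7-8" into a set of cpu numbers."""
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))  # ranges are inclusive
        else:
            cpus.add(int(part))
    return cpus

assert parse_cpumask("1-3,5,7-8") == {1, 2, 3, 5, 7, 8}
```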
2. Response for a command: sent from the kernel in response to a userspace
command. The payload is a series of three attributes of type:

a) TASKSTATS_TYPE_AGGR_PID/TGID: attribute containing no payload but indicating
that a pid/tgid will be followed by some stats.

b) TASKSTATS_TYPE_PID/TGID: attribute whose payload is the pid/tgid whose stats
are being returned.

c) TASKSTATS_TYPE_STATS: attribute with a struct taskstats as payload. The
same structure is used for both per-pid and per-tgid stats.

3. New message sent by kernel whenever a task exits. The payload consists of a
series of attributes of the following type:

a) TASKSTATS_TYPE_AGGR_PID: indicates next two attributes will be pid+stats
b) TASKSTATS_TYPE_PID: contains exiting task's pid
c) TASKSTATS_TYPE_STATS: contains the exiting task's per-pid stats
d) TASKSTATS_TYPE_AGGR_TGID: indicates next two attributes will be tgid+stats
e) TASKSTATS_TYPE_TGID: contains tgid of process to which task belongs
f) TASKSTATS_TYPE_STATS: contains the per-tgid stats for exiting task's process


per-tgid stats
--------------

Taskstats provides per-process stats, in addition to per-task stats, since
resource management is often done at a process granularity and aggregating task
stats in userspace alone is inefficient and potentially inaccurate (due to lack
of atomicity).

However, maintaining per-process, in addition to per-task, stats within the
kernel has space and time overheads. To address this, the taskstats code
accumulates each exiting task's statistics into a process-wide data structure.
When the last task of a process exits, the accumulated process-level data is
also sent to userspace (along with the per-task data).

When a user queries per-tgid data, the stats of all live threads in the group
are summed and added to the accumulated total for previously exited threads of
the same thread group.

Extending taskstats
-------------------

There are two ways to extend the taskstats interface to export more
per-task/process stats as patches to collect them get added to the kernel
in future:

1. Adding more fields to the end of the existing struct taskstats. Backward
compatibility is ensured by the version number within the
structure. Userspace will use only the fields of the struct that correspond
to the version it is using.

2. Defining separate statistic structs and using the netlink attributes
interface to return them. Since userspace processes each netlink attribute
independently, it can always ignore attributes whose type it does not
understand (because it is using an older version of the interface).

Choosing between 1. and 2. is a matter of trading off flexibility and
overhead. If only a few fields need to be added, then 1. is the preferable
path since the kernel and userspace don't need to incur the overhead of
processing new netlink attributes. But if the new fields expand the existing
struct too much, requiring disparate userspace accounting utilities to
unnecessarily receive large structures whose fields are of no interest, then
extending the attributes structure would be worthwhile.

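A toy sketch of why approach 1 stays backward compatible: an old reader simply stops at the fields it knows, so appending fields never breaks it. The field names and two layouts below are invented for the example; the real fields live in taskstats.h:

```python
import struct

# Invented layouts: v1 has (version, count, delay); v2 appends one more field.
V1_FORMAT = "HxxII"
V2_FORMAT = "HxxIII"

def read_stats(buf):
    """Unpack only the fields this reader's version knows about."""
    (version,) = struct.unpack_from("H", buf, 0)
    fmt = V2_FORMAT if version >= 2 else V1_FORMAT
    return struct.unpack_from(fmt, buf, 0)

v2_buf = struct.pack(V2_FORMAT, 2, 4, 100, 7)
assert read_stats(v2_buf) == (2, 4, 100, 7)
# A v1-only reader ignores the trailing bytes of a v2 buffer:
assert struct.unpack_from(V1_FORMAT, v2_buf, 0) == (2, 4, 100)
```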
Flow control for taskstats
--------------------------

When the rate of task exits becomes large, a listener may not be able to keep
up with the kernel's rate of sending per-tid/tgid exit data, leading to data
loss. This possibility gets compounded when the taskstats structure gets
extended and the number of cpus grows large.

To avoid losing statistics, userspace should do one or more of the following:

- increase the receive buffer sizes for the netlink sockets opened by
  listeners to receive exit data.

- create more listeners and reduce the number of cpus being listened to by
  each listener. In the extreme case, there could be one listener for each cpu.
  Users may also consider setting the cpu affinity of the listener to the subset
  of cpus to which it listens, especially if they are listening to just one cpu.

Despite these measures, if the userspace listener receives ENOBUFS error
messages indicating overflow of receive buffers, it should take measures to
handle the loss of data.
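The first measure can be sketched as follows (Linux only, illustrative). Note that the kernel may silently cap the granted size at net.core.rmem_max; SO_RCVBUFFORCE can exceed the cap but requires CAP_NET_ADMIN:

```python
import socket

NETLINK_GENERIC = 16  # netlink protocol number for NETLINK_GENERIC

granted = 0
if hasattr(socket, "AF_NETLINK"):  # netlink sockets exist only on Linux
    sock = socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, NETLINK_GENERIC)
    # Ask for a 1 MiB receive buffer before registering for exit data.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1 << 20)
    granted = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    sock.close()
```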
@ -1,181 +0,0 @@
150
Documentation/admin-guide/aoe/aoe.rst
Normal file
150
Documentation/admin-guide/aoe/aoe.rst
Normal file
@ -0,0 +1,150 @@
Introduction
============

ATA over Ethernet is a network protocol that provides simple access to
block storage on the LAN.

  http://support.coraid.com/documents/AoEr11.txt

The EtherDrive (R) HOWTO for 2.6 and 3.x kernels is found at ...

  http://support.coraid.com/support/linux/EtherDrive-2.6-HOWTO.html

It has many tips and hints! Please see, especially, recommended
tunings for virtual memory:

  http://support.coraid.com/support/linux/EtherDrive-2.6-HOWTO-5.html#ss5.19

The aoetools are userland programs that are designed to work with this
driver. The aoetools are on sourceforge.

  http://aoetools.sourceforge.net/

The scripts in this Documentation/admin-guide/aoe directory are intended to
document the use of the driver and are not necessary if you install
the aoetools.


Creating Device Nodes
=====================

Users of udev should find the block device nodes created
automatically, but to create all the necessary device nodes, use the
udev configuration rules provided in udev.txt (in this directory).

There is a udev-install.sh script that shows how to install these
rules on your system.

There is also an autoload script that shows how to edit
/etc/modprobe.d/aoe.conf to ensure that the aoe module is loaded when
necessary. Preloading the aoe module is preferable to autoloading,
however, because AoE discovery takes a few seconds. It can be
confusing when an AoE device is not present the first time a
command is run but appears a second later.

Using Device Nodes
==================

"cat /dev/etherd/err" blocks, waiting for error diagnostic output,
like any retransmitted packets.

"echo eth2 eth4 > /dev/etherd/interfaces" tells the aoe driver to
limit ATA over Ethernet traffic to eth2 and eth4. AoE traffic from
untrusted networks should be ignored as a matter of security. See
also the aoe_iflist driver option described below.

"echo > /dev/etherd/discover" tells the driver to find out what AoE
devices are available.

In the future these character devices may disappear and be replaced
by sysfs counterparts. Using the commands in aoetools insulates
users from these implementation details.

The block devices are named like this::

	e{shelf}.{slot}
	e{shelf}.{slot}p{part}

... so that "e0.2" is the third blade from the left (slot 2) in the
first shelf (shelf address zero). That's the whole disk. The first
partition on that disk would be "e0.2p1".
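A small hypothetical helper mirroring this naming scheme (the driver composes the names itself; this function only illustrates the convention):

```python
def aoe_blkdev_name(shelf, slot, part=0):
    """Compose an aoe block device name from shelf, slot, and optional partition."""
    name = "e%d.%d" % (shelf, slot)
    return name + ("p%d" % part if part else "")

assert aoe_blkdev_name(0, 2) == "e0.2"        # whole disk: slot 2, shelf 0
assert aoe_blkdev_name(0, 2, 1) == "e0.2p1"   # its first partition
```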
Using sysfs
===========

Each aoe block device in /sys/block has the extra attributes of
state, mac, and netif. The state attribute is "up" when the device
is ready for I/O and "down" if detected but unusable. The
"down,closewait" state shows that the device is still open and
cannot come up again until it has been closed.

The mac attribute is the ethernet address of the remote AoE device.
The netif attribute is the network interface on the localhost
through which we are communicating with the remote AoE device.

There is a script in this directory that formats this information in
a convenient way. Users with aoetools should use the aoe-stat
command::

	root@makki root# sh Documentation/admin-guide/aoe/status.sh
	   e10.0            eth3              up
	   e10.1            eth3              up
	   e10.2            eth3              up
	   e10.3            eth3              up
	   e10.4            eth3              up
	   e10.5            eth3              up
	   e10.6            eth3              up
	   e10.7            eth3              up
	   e10.8            eth3              up
	   e10.9            eth3              up
	    e4.0            eth1              up
	    e4.1            eth1              up
	    e4.2            eth1              up
	    e4.3            eth1              up
	    e4.4            eth1              up
	    e4.5            eth1              up
	    e4.6            eth1              up
	    e4.7            eth1              up
	    e4.8            eth1              up
	    e4.9            eth1              up

Use /sys/module/aoe/parameters/aoe_iflist (or better, the driver
option discussed below) instead of /dev/etherd/interfaces to limit
AoE traffic to the network interfaces in the given
whitespace-separated list. Unlike the old character device, the
sysfs entry can be read from as well as written to.

It's helpful to trigger discovery after setting the list of allowed
interfaces. The aoetools package provides an aoe-discover script
for this purpose. You can also directly use the
/dev/etherd/discover special file described above.

Driver Options
==============

There is a boot option for the built-in aoe driver and a
corresponding module parameter, aoe_iflist. Without this option,
all network interfaces may be used for ATA over Ethernet. Here is a
usage example for the module parameter::

	modprobe aoe aoe_iflist="eth1 eth3"

The aoe_deadsecs module parameter determines the maximum number of
seconds that the driver will wait for an AoE device to provide a
response to an AoE command. After aoe_deadsecs seconds have
elapsed, the AoE device will be marked as "down". A value of zero
is supported for testing purposes and makes the aoe driver keep
trying AoE commands forever.

The aoe_maxout module parameter has a default of 128. This is the
maximum number of unresponded packets that will be sent to an AoE
target at one time.

The aoe_dyndevs module parameter defaults to 1, meaning that the
driver will assign a block device minor number to a discovered AoE
target based on the order of its discovery. With dynamic minor
device numbers in use, a greater range of AoE shelf and slot
addresses can be supported. Users with udev will never have to
think about minor numbers. Using aoe_dyndevs=0 allows device nodes
to be pre-created using a static minor-number scheme with the
aoe-mkshelf script in the aoetools.
17
Documentation/admin-guide/aoe/index.rst
Normal file
17
Documentation/admin-guide/aoe/index.rst
Normal file
@ -0,0 +1,17 @@
=======================
ATA over Ethernet (AoE)
=======================

.. toctree::
   :maxdepth: 1

   aoe
   todo
   examples

.. only::  subproject and html

   Indices
   =======

   * :ref:`genindex`
26
Documentation/admin-guide/aoe/udev.txt
Normal file
26
Documentation/admin-guide/aoe/udev.txt
Normal file
@ -0,0 +1,26 @@
# These rules tell udev what device nodes to create for aoe support.
# They may be installed along the following lines. Check the section
# 8 udev manpage to see whether your udev supports SUBSYSTEM, and
# whether it uses one or two equal signs for SUBSYSTEM and KERNEL.
#
#   ecashin@makki ~$ su
#   Password:
#   bash# find /etc -type f -name udev.conf
#   /etc/udev/udev.conf
#   bash# grep udev_rules= /etc/udev/udev.conf
#   udev_rules="/etc/udev/rules.d/"
#   bash# ls /etc/udev/rules.d/
#   10-wacom.rules  50-udev.rules
#   bash# cp /path/to/linux/Documentation/admin-guide/aoe/udev.txt \
#           /etc/udev/rules.d/60-aoe.rules
#

# aoe char devices
SUBSYSTEM=="aoe", KERNEL=="discover",   NAME="etherd/%k", GROUP="disk", MODE="0220"
SUBSYSTEM=="aoe", KERNEL=="err",        NAME="etherd/%k", GROUP="disk", MODE="0440"
SUBSYSTEM=="aoe", KERNEL=="interfaces", NAME="etherd/%k", GROUP="disk", MODE="0220"
SUBSYSTEM=="aoe", KERNEL=="revalidate", NAME="etherd/%k", GROUP="disk", MODE="0220"
SUBSYSTEM=="aoe", KERNEL=="flush",      NAME="etherd/%k", GROUP="disk", MODE="0220"

# aoe block devices
KERNEL=="etherd*", GROUP="disk"
@ -0,0 +1,42 @@
================================
kernel data structure for DRBD-9
================================

This describes the in kernel data structure for DRBD-9. Starting with
Linux v3.14 we are reorganizing DRBD to use this data structure.

Basic Data Structure
====================

A node has a number of DRBD resources. Each such resource has a number of
devices (aka volumes) and connections to other nodes ("peer nodes"). Each DRBD
device is represented by a block device locally.

The DRBD objects are interconnected to form a matrix as depicted below; a
drbd_peer_device object sits at each intersection between a drbd_device and a
drbd_connection::

  /--------------+---------------+.....+---------------\
  |   resource   |    device     |     |    device     |
  +--------------+---------------+.....+---------------+
  |  connection  |  peer_device  |     |  peer_device  |
  +--------------+---------------+.....+---------------+
  :              :               :     :               :
  :              :               :     :               :
  +--------------+---------------+.....+---------------+
  |  connection  |  peer_device  |     |  peer_device  |
  \--------------+---------------+.....+---------------/

In this table, horizontally, devices can be accessed from resources by their
volume number. Likewise, peer_devices can be accessed from connections by
their volume number. Objects in the vertical direction are connected by double
linked lists. There are back pointers from peer_devices to their connections and
devices, and from connections and devices to their resource.

All resources are in the drbd_resources double-linked list. In addition, all
devices can be accessed by their minor device number via the drbd_devices idr.

The drbd_resource, drbd_connection, and drbd_device objects are reference
counted. The peer_device objects only serve to establish the links between
devices and connections; their lifetime is determined by the lifetime of the
device and connection which they reference.
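The matrix can be modelled in a few lines of plain Python. This is a toy sketch of the object relationships only (the kernel uses C structs, linked lists, and an idr, not dictionaries):

```python
class Resource:
    def __init__(self):
        self.devices = {}      # volume number -> Device (horizontal access)
        self.connections = []  # one Connection per peer node

class Device:
    def __init__(self, resource, vnr):
        self.resource, self.vnr = resource, vnr  # back pointer to resource
        resource.devices[vnr] = self

class Connection:
    def __init__(self, resource):
        self.resource = resource                 # back pointer to resource
        self.peer_devices = {}                   # volume number -> PeerDevice
        resource.connections.append(self)

class PeerDevice:
    """Sits at the intersection of one Device and one Connection."""
    def __init__(self, connection, device):
        self.connection, self.device = connection, device
        connection.peer_devices[device.vnr] = self

res = Resource()
dev = Device(res, 0)
conn = Connection(res)
pd = PeerDevice(conn, dev)
assert conn.peer_devices[0].device is dev   # same volume number in both rows
assert pd.connection.resource is res        # back pointers reach the resource
```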
30
Documentation/admin-guide/blockdev/drbd/figures.rst
Normal file
30
Documentation/admin-guide/blockdev/drbd/figures.rst
Normal file
@ -0,0 +1,30 @@
.. SPDX-License-Identifier: GPL-2.0

.. The files included here are intended to help understand the implementation

Data flows that relate some functions, and write packets
========================================================

.. kernel-figure:: DRBD-8.3-data-packets.svg
    :alt:   DRBD-8.3-data-packets.svg
    :align: center

.. kernel-figure:: DRBD-data-packets.svg
    :alt:   DRBD-data-packets.svg
    :align: center


Sub graphs of DRBD's state transitions
======================================

.. kernel-figure:: conn-states-8.dot
    :alt:   conn-states-8.dot
    :align: center

.. kernel-figure:: disk-states-8.dot
    :alt:   disk-states-8.dot
    :align: center

.. kernel-figure:: node-states-8.dot
    :alt:   node-states-8.dot
    :align: center
19
Documentation/admin-guide/blockdev/drbd/index.rst
Normal file
19
Documentation/admin-guide/blockdev/drbd/index.rst
Normal file
@ -0,0 +1,19 @@
==========================================
Distributed Replicated Block Device - DRBD
==========================================

Description
===========

DRBD is a shared-nothing, synchronously replicated block device. It
is designed to serve as a building block for high availability
clusters and, in this context, is a "drop-in" replacement for shared
storage. Simplistically, you could see it as a network RAID 1.

Please visit http://www.drbd.org to find out more.

.. toctree::
   :maxdepth: 1

   data-structure-v9
   figures
13
Documentation/admin-guide/blockdev/drbd/node-states-8.dot
Normal file
13
Documentation/admin-guide/blockdev/drbd/node-states-8.dot
Normal file
@ -0,0 +1,13 @@
digraph node_states {
	Secondary -> Primary   [ label = "ioctl_set_state()" ]
	Primary   -> Secondary [ label = "ioctl_set_state()" ]
}

digraph peer_states {
	Secondary -> Primary   [ label = "recv state packet" ]
	Primary   -> Secondary [ label = "recv state packet" ]
	Primary   -> Unknown   [ label = "connection lost" ]
	Secondary -> Unknown   [ label = "connection lost" ]
	Unknown   -> Primary   [ label = "connected" ]
	Unknown   -> Secondary [ label = "connected" ]
}
255
Documentation/admin-guide/blockdev/floppy.rst
Normal file
255
Documentation/admin-guide/blockdev/floppy.rst
Normal file
@ -0,0 +1,255 @@
|
|||||||
|
=============
Floppy Driver
=============

FAQ list:
=========

A FAQ list may be found in the fdutils package (see below), and also
at <http://fdutils.linux.lu/faq.html>.


LILO configuration options (Thinkpad users, read this)
======================================================

The floppy driver is configured using the 'floppy=' option in
lilo. This option can be typed at the boot prompt, or entered in the
lilo configuration file.

Example: If your kernel is called linux-2.6.9, type the following line
at the lilo boot prompt (if you have a thinkpad)::

    linux-2.6.9 floppy=thinkpad

You may also enter the following line in /etc/lilo.conf, in the description
of linux-2.6.9::

    append = "floppy=thinkpad"

Several floppy related options may be given, example::

    linux-2.6.9 floppy=daring floppy=two_fdc
    append = "floppy=daring floppy=two_fdc"

If you give options both in the lilo config file and on the boot
prompt, the option strings of both places are concatenated, the boot
prompt options coming last. That's why there are also options to
restore the default behavior.


Module configuration options
============================

If you use the floppy driver as a module, use the following syntax::

    modprobe floppy floppy="<options>"

Example::

    modprobe floppy floppy="omnibook messages"

If you need certain options enabled every time you load the floppy driver,
you can put::

    options floppy floppy="omnibook messages"

in a configuration file in /etc/modprobe.d/.


The floppy driver related options are:

floppy=asus_pci
    Sets the bit mask to allow only units 0 and 1. (default)

floppy=daring
    Tells the floppy driver that you have a well behaved floppy controller.
    This allows more efficient and smoother operation, but may fail on
    certain controllers. This may speed up certain operations.

floppy=0,daring
    Tells the floppy driver that your floppy controller should be used
    with caution.

floppy=one_fdc
    Tells the floppy driver that you have only one floppy controller.
    (default)

floppy=two_fdc / floppy=<address>,two_fdc
    Tells the floppy driver that you have two floppy controllers.
    The second floppy controller is assumed to be at <address>.
    This option is not needed if the second controller is at address
    0x370, and if you use the 'cmos' option.

floppy=thinkpad
    Tells the floppy driver that you have a Thinkpad. Thinkpads use an
    inverted convention for the disk change line.

floppy=0,thinkpad
    Tells the floppy driver that you don't have a Thinkpad.

floppy=omnibook / floppy=nodma
    Tells the floppy driver not to use DMA for data transfers.
    This is needed on HP Omnibooks, which don't have a workable
    DMA channel for the floppy driver. This option is also useful
    if you frequently get "Unable to allocate DMA memory" messages.
    Indeed, DMA memory needs to be contiguous in physical memory,
    and is thus harder to find, whereas non-DMA buffers may be
    allocated in virtual memory. However, I advise against this if
    you have an FDC without a FIFO (8272A or 82072). 82072A and
    later are OK. You also need at least a 486 to use nodma.
    If you use nodma mode, I suggest you also set the FIFO
    threshold to 10 or lower, in order to limit the number of data
    transfer interrupts.

    If you have a FIFO-able FDC, the floppy driver automatically
    falls back on non-DMA mode if no DMA-able memory can be found.
    If you want to avoid this, explicitly ask for 'yesdma'.

floppy=yesdma
    Tells the floppy driver that a workable DMA channel is available.
    (default)

floppy=nofifo
    Disables the FIFO entirely. This is needed if you get "Bus
    master arbitration error" messages from your Ethernet card (or
    from other devices) while accessing the floppy.

floppy=usefifo
    Enables the FIFO. (default)

floppy=<threshold>,fifo_depth
    Sets the FIFO threshold. This is mostly relevant in DMA
    mode. If this is higher, the floppy driver tolerates more
    interrupt latency, but it triggers more interrupts (i.e. it
    imposes more load on the rest of the system). If this is
    lower, the interrupt latency should be lower too (faster
    processor). The benefit of a lower threshold is fewer
    interrupts.

    To tune the fifo threshold, switch on over/underrun messages
    using 'floppycontrol --messages'. Then access a floppy
    disk. If you get a huge amount of "Over/Underrun - retrying"
    messages, then the fifo threshold is too low. Try with a
    higher value, until you only get an occasional Over/Underrun.
    It is a good idea to compile the floppy driver as a module
    when doing this tuning. Indeed, it allows trying different
    fifo values without rebooting the machine for each test. Note
    that you need to do 'floppycontrol --messages' every time you
    re-insert the module.

    Usually, tuning the fifo threshold should not be needed, as
    the default (0xa) is reasonable.
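A tuning session might look like this (a sketch: the threshold values and
the transfer command are illustrative, and assume the driver is built as a
module and fdutils is installed)::

    modprobe floppy floppy="10,fifo_depth"
    floppycontrol --messages
    dd if=/dev/fd0 of=/dev/null bs=512 count=2880
    dmesg | grep -c "Over/Underrun"
    rmmod floppy
    modprobe floppy floppy="12,fifo_depth"
    floppycontrol --messages

When the grep count drops to an occasional hit, the threshold is high enough.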

floppy=<drive>,<type>,cmos
    Sets the CMOS type of <drive> to <type>. This is mandatory if
    you have more than two floppy drives (only two can be
    described in the physical CMOS), or if your BIOS uses
    non-standard CMOS types. The CMOS types are:

    == ==================================
     0 Use the value of the physical CMOS
     1 5 1/4 DD
     2 5 1/4 HD
     3 3 1/2 DD
     4 3 1/2 HD
     5 3 1/2 ED
     6 3 1/2 ED
    16 unknown or not installed
    == ==================================

    (Note: there are two valid types for ED drives. This is because 5 was
    initially chosen to represent floppy *tapes*, and 6 for ED drives.
    AMI ignored this, and used 5 for ED drives. That's why the floppy
    driver handles both.)
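For example, a hypothetical third drive (unit 2) that is a 3 1/2 HD could be
declared either at the boot prompt or in /etc/lilo.conf (the unit and type
values here are illustrative)::

    linux-2.6.9 floppy=2,4,cmos
    append = "floppy=2,4,cmos"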

floppy=unexpected_interrupts
    Print a warning message when an unexpected interrupt is received.
    (default)

floppy=no_unexpected_interrupts / floppy=L40SX
    Don't print a message when an unexpected interrupt is received. This
    is needed on IBM L40SX laptops in certain video modes. (There seems
    to be an interaction between video and floppy. The unexpected
    interrupts affect only performance, and can be safely ignored.)

floppy=broken_dcl
    Don't use the disk change line, but assume that the disk was
    changed whenever the device node is reopened. Needed on some
    boxes where the disk change line is broken or unsupported.
    This should be regarded as a stopgap measure, indeed it makes
    floppy operation less efficient due to unneeded cache
    flushings, and slightly more unreliable. Please verify your
    cable, connection and jumper settings if you have any DCL
    problems. However, some older drives, and also some laptops
    are known not to have a DCL.

floppy=debug
    Print debugging messages.

floppy=messages
    Print informational messages for some operations (disk change
    notifications, warnings about over and underruns, and about
    autodetection).

floppy=silent_dcl_clear
    Uses a less noisy way to clear the disk change line (which
    doesn't involve seeks). Implied by 'daring' option.

floppy=<nr>,irq
    Sets the floppy IRQ to <nr> instead of 6.

floppy=<nr>,dma
    Sets the floppy DMA channel to <nr> instead of 2.

floppy=slow
    Use PS/2 stepping rate::

        PS/2 floppies have much slower step rates than regular floppies.
        It's been recommended to use about 1/4 of the default speed
        in some more extreme cases.


Supporting utilities and additional documentation:
==================================================

Additional parameters of the floppy driver can be configured at
runtime. Utilities which do this can be found in the fdutils package.
This package also contains a new version of mtools which allows one to
access high capacity disks (up to 1992K on a high density 3 1/2 disk!).
It also contains additional documentation about the floppy driver.

The latest version can be found at the fdutils homepage:

    http://fdutils.linux.lu

The fdutils releases can be found at:

    http://fdutils.linux.lu/download.html

    http://www.tux.org/pub/knaff/fdutils/

    ftp://metalab.unc.edu/pub/Linux/utils/disk-management/

Reporting problems about the floppy driver
==========================================

If you have a question or a bug report about the floppy driver, mail
me at Alain.Knaff@poboxes.com . If you post to Usenet, preferably use
comp.os.linux.hardware. As the volume in these groups is rather high,
be sure to include the word "floppy" (or "FLOPPY") in the subject
line. If the reported problem happens when mounting floppy disks, be
sure to also mention the type of the filesystem in the subject line.

Be sure to read the FAQ before mailing/posting any bug reports!

Alain

Changelog
=========

10-30-2004 :
    Cleanup, updating, add reference to module configuration.
    James Nelson <james4765@gmail.com>

6-3-2000 :
    Original Document
16
Documentation/admin-guide/blockdev/index.rst
Normal file
16
Documentation/admin-guide/blockdev/index.rst
Normal file
@ -0,0 +1,16 @@
|
|||||||
|
.. SPDX-License-Identifier: GPL-2.0

=============
Block Devices
=============

.. toctree::
   :maxdepth: 1

   floppy
   nbd
   paride
   ramdisk
   zram

   drbd/index

Documentation/admin-guide/blockdev/nbd.rst (new file, 31 lines)
@@ -0,0 +1,31 @@
==================================
Network Block Device (TCP version)
==================================

1) Overview
-----------

What is it: With this compiled in the kernel (or as a module), Linux
can use a remote server as one of its block devices. So every time
the client computer wants to read, e.g., /dev/nb0, it sends a
request over TCP to the server, which will reply with the data read.
This can be used for stations with low disk space (or even diskless)
to borrow disk space from another computer.
Unlike NFS, it is possible to put any filesystem on it, etc.

For more information, or to download the nbd-client and nbd-server
tools, go to http://nbd.sf.net/.

The nbd kernel module need only be installed on the client
system, as the nbd-server is completely in userspace. In fact,
the nbd-server has been successfully ported to other operating
systems, including Windows.

A) NBD parameters
-----------------

max_part
    Number of partitions per device (default: 0).

nbds_max
    Number of block devices that should be initialized (default: 16).
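For example, to load the module with eight devices, each allowing up to 15
partitions (the values here are illustrative), one might use::

    modprobe nbd nbds_max=8 max_part=15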

Documentation/admin-guide/blockdev/paride.rst (new file, 439 lines)
@@ -0,0 +1,439 @@
===================================
Linux and parallel port IDE devices
===================================

PARIDE v1.03 (c) 1997-8 Grant Guenther <grant@torque.net>

1. Introduction
===============

Owing to the simplicity and near universality of the parallel port interface
to personal computers, many external devices such as portable hard-disk,
CD-ROM, LS-120 and tape drives use the parallel port to connect to their
host computer. While some devices (notably scanners) use ad-hoc methods
to pass commands and data through the parallel port interface, most
external devices are actually identical to an internal model, but with
a parallel-port adapter chip added in. Some of the original parallel port
adapters were little more than mechanisms for multiplexing a SCSI bus.
(The Iomega PPA-3 adapter used in the ZIP drives is an example of this
approach). Most current designs, however, take a different approach.
The adapter chip reproduces a small ISA or IDE bus in the external device
and the communication protocol provides operations for reading and writing
device registers, as well as data block transfer functions. Sometimes,
the device being addressed via the parallel cable is a standard SCSI
controller like an NCR 5380. The "ditto" family of external tape
drives use the ISA replicator to interface a floppy disk controller,
which is then connected to a floppy-tape mechanism. The vast majority
of external parallel port devices, however, are now based on standard
IDE type devices, which require no intermediate controller. If one
were to open up a parallel port CD-ROM drive, for instance, one would
find a standard ATAPI CD-ROM drive, a power supply, and a single adapter
that interconnected a standard PC parallel port cable and a standard
IDE cable. It is usually possible to exchange the CD-ROM device with
any other device using the IDE interface.

This document describes the support in Linux for parallel port IDE
devices. It does not cover parallel port SCSI devices, "ditto" tape
drives or scanners. Many different devices are supported by the
parallel port IDE subsystem, including:

- MicroSolutions backpack CD-ROM
- MicroSolutions backpack PD/CD
- MicroSolutions backpack hard-drives
- MicroSolutions backpack 8000t tape drive
- SyQuest EZ-135, EZ-230 & SparQ drives
- Avatar Shark
- Imation Superdisk LS-120
- Maxell Superdisk LS-120
- FreeCom Power CD
- Hewlett-Packard 5GB and 8GB tape drives
- Hewlett-Packard 7100 and 7200 CD-RW drives

as well as most of the clone and no-name products on the market.

To support such a wide range of devices, PARIDE, the parallel port IDE
subsystem, is actually structured in three parts. There is a base
paride module which provides a registry and some common methods for
accessing the parallel ports. The second component is a set of
high-level drivers for each of the different types of supported devices:

    === =============
    pd  IDE disk
    pcd ATAPI CD-ROM
    pf  ATAPI disk
    pt  ATAPI tape
    pg  ATAPI generic
    === =============

(Currently, the pg driver is only used with CD-R drives).

The high-level drivers function according to the relevant standards.
The third component of PARIDE is a set of low-level protocol drivers
for each of the parallel port IDE adapter chips. Thanks to the interest
and encouragement of Linux users from many parts of the world,
support is available for almost all known adapter protocols:

    ==== ====================================== ====
    aten ATEN EH-100                            (HK)
    bpck MicroSolutions backpack                (US)
    comm DataStor (old-type) "commuter" adapter (TW)
    dstr DataStor EP-2000                       (TW)
    epat Shuttle EPAT                           (UK)
    epia Shuttle EPIA                           (UK)
    fit2 FIT TD-2000                            (US)
    fit3 FIT TD-3000                            (US)
    friq Freecom IQ cable                       (DE)
    frpw Freecom Power                          (DE)
    kbic KingByte KBIC-951A and KBIC-971A       (TW)
    ktti KT Technology PHd adapter              (SG)
    on20 OnSpec 90c20                           (US)
    on26 OnSpec 90c26                           (US)
    ==== ====================================== ====


2. Using the PARIDE subsystem
=============================

While configuring the Linux kernel, you may choose either to build
the PARIDE drivers into your kernel, or to build them as modules.

In either case, you will need to select "Parallel port IDE device support"
as well as at least one of the high-level drivers and at least one
of the parallel port communication protocols. If you do not know
what kind of parallel port adapter is used in your drive, you could
begin by checking the file names and any text files on your DOS
installation floppy. Alternatively, you can look at the markings on
the adapter chip itself. That's usually sufficient to identify the
correct device.

You can actually select all the protocol modules, and allow the PARIDE
subsystem to try them all for you.

For the "brand-name" products listed above, here are the protocol
and high-level drivers that you would use:

    ================ ============ ====== ========
    Manufacturer     Model        Driver Protocol
    ================ ============ ====== ========
    MicroSolutions   CD-ROM       pcd    bpck
    MicroSolutions   PD drive     pf     bpck
    MicroSolutions   hard-drive   pd     bpck
    MicroSolutions   8000t tape   pt     bpck
    SyQuest          EZ, SparQ    pd     epat
    Imation          Superdisk    pf     epat
    Maxell           Superdisk    pf     friq
    Avatar           Shark        pd     epat
    FreeCom          CD-ROM       pcd    frpw
    Hewlett-Packard  5GB Tape     pt     epat
    Hewlett-Packard  7200e (CD)   pcd    epat
    Hewlett-Packard  7200e (CD-R) pg     epat
    ================ ============ ====== ========

2.1 Configuring built-in drivers
--------------------------------

We recommend that you get to know how the drivers work and how to
configure them as loadable modules, before attempting to compile a
kernel with the drivers built-in.

If you built all of your PARIDE support directly into your kernel,
and you have just a single parallel port IDE device, your kernel should
locate it automatically for you. If you have more than one device,
you may need to give some command line options to your bootloader
(e.g. LILO); how to do that is beyond the scope of this document.

The high-level drivers accept a number of command line parameters, all
of which are documented in the source files in linux/drivers/block/paride.
By default, each driver will automatically try all parallel ports it
can find, and all protocol types that have been installed, until it finds
a parallel port IDE adapter. Once it finds one, the probe stops. So,
if you have more than one device, you will need to tell the drivers
how to identify them. This requires specifying the port address, the
protocol identification number and, for some devices, the drive's
chain ID. While your system is booting, a number of messages are
displayed on the console. Like all such messages, they can be
reviewed with the 'dmesg' command. Among those messages will be
some lines like::

    paride: bpck registered as protocol 0
    paride: epat registered as protocol 1

The numbers will always be the same until you build a new kernel with
different protocol selections. You should note these numbers as you
will need them to identify the devices.

If you happen to be using a MicroSolutions backpack device, you will
also need to know the unit ID number for each drive. This is usually
the last two digits of the drive's serial number (but read MicroSolutions'
documentation about this).

As an example, let's assume that you have a MicroSolutions PD/CD drive
with unit ID number 36 connected to the parallel port at 0x378, a SyQuest
EZ-135 connected to the chained port on the PD/CD drive and also an
Imation Superdisk connected to port 0x278. You could give the following
options on your boot command::

    pd.drive0=0x378,1 pf.drive0=0x278,1 pf.drive1=0x378,0,36

In the last option, pf.drive1 configures device /dev/pf1, the 0x378
is the parallel port base address, the 0 is the protocol registration
number and 36 is the chain ID.

Please note: while PARIDE will work both with and without the
PARPORT parallel port sharing system that is included by the
"Parallel port support" option, PARPORT must be included and enabled
if you want to use chains of devices on the same parallel port.

2.2 Loading and configuring PARIDE as modules
---------------------------------------------

It is much faster and simpler to get to understand the PARIDE drivers
if you use them as loadable kernel modules.

Note 1:
    using these drivers with the "kerneld" automatic module loading
    system is not recommended for beginners, and is not documented here.

Note 2:
    if you build PARPORT support as a loadable module, PARIDE must
    also be built as loadable modules, and PARPORT must be loaded before
    the PARIDE modules.

To use PARIDE, you must begin by::

    insmod paride

This loads a base module which provides a registry for the protocols,
among other tasks.

Then, load as many of the protocol modules as you think you might need.
As you load each module, it will register the protocols that it supports,
and print a log message to your kernel log file and your console. For
example::

    # insmod epat
    paride: epat registered as protocol 0
    # insmod kbic
    paride: k951 registered as protocol 1
    paride: k971 registered as protocol 2

Finally, you can load high-level drivers for each kind of device that
you have connected. By default, each driver will autoprobe for a single
device, but you can support up to four similar devices by giving their
individual co-ordinates when you load the driver.

For example, if you had two no-name CD-ROM drives both using the
KingByte KBIC-951A adapter, one on port 0x378 and the other on 0x3bc,
you could give the following command::

    # insmod pcd drive0=0x378,1 drive1=0x3bc,1

For most adapters, giving a port address and protocol number is sufficient,
but check the source files in linux/drivers/block/paride for more
information. (Hopefully someone will write some man pages one day!)

As another example, here's what happens when PARPORT is installed, and
a SyQuest EZ-135 is attached to port 0x378::

    # insmod paride
    paride: version 1.0 installed
    # insmod epat
    paride: epat registered as protocol 0
    # insmod pd
    pd: pd version 1.0, major 45, cluster 64, nice 0
    pda: Sharing parport1 at 0x378
    pda: epat 1.0, Shuttle EPAT chip c3 at 0x378, mode 5 (EPP-32), delay 1
    pda: SyQuest EZ135A, 262144 blocks [128M], (512/16/32), removable media
    pda: pda1

Note that the last line is the output from the generic partition table
scanner - in this case it reports that it has found a disk with one partition.

2.3 Using a PARIDE device
-------------------------

Once the drivers have been loaded, you can access PARIDE devices in the
same way as their traditional counterparts. You will probably need to
create the device "special files". Here is a simple script that you can
cut to a file and execute::

    #!/bin/bash
    #
    # mkd -- a script to create the device special files for the PARIDE subsystem
    #
    function mkdev {
        mknod $1 $2 $3 $4 ; chmod 0660 $1 ; chown root:disk $1
    }
    #
    function pd {
        D=$( printf \\$( printf "x%03x" $[ $1 + 97 ] ) )
        mkdev pd$D b 45 $[ $1 * 16 ]
        for P in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
        do mkdev pd$D$P b 45 $[ $1 * 16 + $P ]
        done
    }
    #
    cd /dev
    #
    for u in 0 1 2 3 ; do pd $u ; done
    for u in 0 1 2 3 ; do mkdev pcd$u b 46 $u ; done
    for u in 0 1 2 3 ; do mkdev pf$u b 47 $u ; done
    for u in 0 1 2 3 ; do mkdev pt$u c 96 $u ; done
    for u in 0 1 2 3 ; do mkdev npt$u c 96 $[ $u + 128 ] ; done
    for u in 0 1 2 3 ; do mkdev pg$u c 97 $u ; done
    #
    # end of mkd

With the device files and drivers in place, you can access PARIDE devices
like any other Linux device. For example, to mount a CD-ROM in pcd0, use::

    mount /dev/pcd0 /cdrom

If you have a fresh Avatar Shark cartridge, and the drive is pda, you
might do something like::

    fdisk /dev/pda      -- make a new partition table with
                           partition 1 of type 83

    mke2fs /dev/pda1    -- to build the file system

    mkdir /shark        -- make a place to mount the disk

    mount /dev/pda1 /shark

Devices like the Imation superdisk work in the same way, except that
they do not have a partition table. For example, to make a 120MB
floppy that you could share with a DOS system::

    mkdosfs /dev/pf0
    mount /dev/pf0 /mnt


2.4 The pf driver
-----------------

The pf driver is intended for use with parallel port ATAPI disk
devices. The most common devices in this category are PD drives
and LS-120 drives. Traditionally, media for these devices are not
partitioned. Consequently, the pf driver does not support partitioned
media. This may be changed in a future version of the driver.

2.5 Using the pt driver
-----------------------

The pt driver for parallel port ATAPI tape drives is a minimal driver.
It does not yet support many of the standard tape ioctl operations.
For best performance, a block size of 32KB should be used. You will
probably want to set the parallel port delay to 0, if you can.
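For instance, a backup to the first parallel-port tape using the recommended
32KB block size might look like this (the archived directory is illustrative)::

    tar -cf - /home | dd of=/dev/pt0 bs=32k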

2.6 Using the pg driver
-----------------------

The pg driver can be used in conjunction with the cdrecord program
to create CD-ROMs. Please get cdrecord version 1.6.1 or later
from ftp://ftp.fokus.gmd.de/pub/unix/cdrecord/ . To record CD-R media
your parallel port should ideally be set to EPP mode, and the "port delay"
should be set to 0. With those settings it is possible to record at 2x
speed without any buffer underruns. If you cannot get the driver to work
in EPP mode, try to use "bidirectional" or "PS/2" mode and 1x speeds only.


3. Troubleshooting
==================

3.1 Use EPP mode if you can
---------------------------

The most common problems that people report with the PARIDE drivers
concern the parallel port CMOS settings. At this time, none of the
PARIDE protocol modules support ECP mode, or any ECP combination modes.
If you are able to do so, please set your parallel port into EPP mode
using your CMOS setup procedure.

3.2 Check the port delay
------------------------

Some parallel ports cannot reliably transfer data at full speed. To
offset the errors, the PARIDE protocol modules introduce a "port
delay" between each access to the i/o ports. Each protocol sets
a default value for this delay. In most cases, the user can override
the default and set it to 0 - resulting in somewhat higher transfer
rates. In some rare cases (especially with older 486 systems) the
default delays are not long enough. If you experience corrupt data
transfers, or unexpected failures, you may wish to increase the
port delay. The delay can be programmed using the "driveN" parameters
to each of the high-level drivers. Please see the notes above, or
read the comments at the beginning of the driver source files in
linux/drivers/block/paride.
|
||||||
|
|
||||||
|
3.3 Some drives need a printer reset
|
||||||
|
-------------------------------------
|
||||||
|
|
||||||
|
There appear to be a number of "noname" external drives on the market
|
||||||
|
that do not always power up correctly. We have noticed this with some
|
||||||
|
drives based on OnSpec and older Freecom adapters. In these rare cases,
|
||||||
|
the adapter can often be reinitialised by issuing a "printer reset" on
|
||||||
|
the parallel port. As the reset operation is potentially disruptive in
|
||||||
|
multiple device environments, the PARIDE drivers will not do it
|
||||||
|
automatically. You can however, force a printer reset by doing::
|
||||||
|
|
||||||
|
insmod lp reset=1
|
||||||
|
rmmod lp
|
||||||
|
|
||||||
|
If you have one of these marginal cases, you should probably build
|
||||||
|
your paride drivers as modules, and arrange to do the printer reset
|
||||||
|
before loading the PARIDE drivers.
|
||||||
|
|
||||||
|
3.4 Use the verbose option and dmesg if you need help
|
||||||
|
------------------------------------------------------
|
||||||
|
|
||||||
|
While a lot of testing has gone into these drivers to make them work
|
||||||
|
as smoothly as possible, problems will arise. If you do have problems,
|
||||||
|
please check all the obvious things first: does the drive work in
|
||||||
|
DOS with the manufacturer's drivers ? If that doesn't yield any useful
|
||||||
|
clues, then please make sure that only one drive is hooked to your system,
|
||||||
|
and that either (a) PARPORT is enabled or (b) no other device driver
|
||||||
|
is using your parallel port (check in /proc/ioports). Then, load the
|
||||||
|
appropriate drivers (you can load several protocol modules if you want)
|
||||||
|
as in::
|
||||||
|
|
||||||
|
# insmod paride
|
||||||
|
# insmod epat
|
||||||
|
# insmod bpck
|
||||||
|
# insmod kbic
|
||||||
|
...
|
||||||
|
# insmod pd verbose=1
|
||||||
|
|
||||||
|
(using the correct driver for the type of device you have, of course).
|
||||||
|
The verbose=1 parameter will cause the drivers to log a trace of their
|
||||||
|
activity as they attempt to locate your drive.
|
||||||
|
|
||||||
|
Use 'dmesg' to capture a log of all the PARIDE messages (any messages
|
||||||
|
beginning with paride:, a protocol module's name or a driver's name) and
|
||||||
|
include that with your bug report. You can submit a bug report in one
|
||||||
|
of two ways. Either send it directly to the author of the PARIDE suite,
|
||||||
|
by e-mail to grant@torque.net, or join the linux-parport mailing list
|
||||||
|
and post your report there.
|
||||||
|
|
||||||
|
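For quickly prefiltering a saved log before attaching it to a report, a grep
over the message prefixes just described works. This is only a sketch: the
sample log lines and the exact module list are made up for illustration.

```shell
# Made-up sample of a mixed kernel log (not real driver output)
log='paride: version 1.06 installed
usb 1-1: new full-speed USB device
pd: pd version 1.05, major 45, cluster 64
epat: epat version 1.02'

# Keep only PARIDE-related lines: "paride:" plus driver/protocol prefixes
printf '%s\n' "$log" | grep -E '^(paride|pd|epat|bpck|kbic):'
```

Adjust the prefix list to whichever protocol and high-level driver modules
you actually loaded.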
3.5 For more information or help
--------------------------------

You can join the linux-parport mailing list by sending a mail message
to:

    linux-parport-request@torque.net

with the single word::

    subscribe

in the body of the mail message (not in the subject line). Please be
sure that your mail program is correctly set up when you do this, as
the list manager is a robot that will subscribe you using the reply
address in your mail headers. REMOVE any anti-spam gimmicks you may
have in your mail headers, when sending mail to the list server.

You might also find some useful information on the linux-parport
web pages (although they are not always up to date) at

    http://web.archive.org/web/%2E/http://www.torque.net/parport/

Documentation/admin-guide/blockdev/ramdisk.rst (new file, 177 lines)

==========================================
Using the RAM disk block device with Linux
==========================================

.. Contents:

    1) Overview
    2) Kernel Command Line Parameters
    3) Using "rdev -r"
    4) An Example of Creating a Compressed RAM Disk


1) Overview
-----------

The RAM disk driver is a way to use main system memory as a block device. It
is required for initrd, an initial filesystem used if you need to load modules
in order to access the root filesystem (see Documentation/admin-guide/initrd.rst). It can
also be used for a temporary filesystem for crypto work, since the contents
are erased on reboot.

The RAM disk dynamically grows as more space is required. It does this by using
RAM from the buffer cache. The driver marks the buffers it is using as dirty
so that the VM subsystem does not try to reclaim them later.

The RAM disk supports up to 16 RAM disks by default, and can be reconfigured
to support an unlimited number of RAM disks (at your own risk). Just change
the configuration symbol BLK_DEV_RAM_COUNT in the Block drivers config menu
and (re)build the kernel.

To use RAM disk support with your system, run './MAKEDEV ram' from the /dev
directory. RAM disks are all major number 1, and start with minor number 0
for /dev/ram0, etc. If used, modern kernels use /dev/ram0 for an initrd.

The new RAM disk also has the ability to load compressed RAM disk images,
allowing one to squeeze more programs onto an average installation or
rescue floppy disk.


2) Parameters
-------------

2a) Kernel Command Line Parameters

    ramdisk_size=N
        Size of the ramdisk.

This parameter tells the RAM disk driver to set up RAM disks of N k size. The
default is 4096 (4 MB).

2b) Module parameters

    rd_nr
        /dev/ramX devices created.
    max_part
        Maximum partition number.
    rd_size
        See ramdisk_size.

3) Using "rdev -r"
------------------

The usage of the word (two bytes) that "rdev -r" sets in the kernel image is
as follows. The low 11 bits (0 -> 10) specify an offset (in 1 k blocks) of up
to 2 MB (2^11) of where to find the RAM disk (this used to be the size). Bit
14 indicates that a RAM disk is to be loaded, and bit 15 indicates whether a
prompt/wait sequence is to be given before trying to read the RAM disk. Since
the RAM disk dynamically grows as data is being written into it, a size field
is not required. Bits 11 to 13 are not currently used and may as well be zero.
These numbers are no magical secrets, as seen below::

    ./arch/x86/kernel/setup.c:#define RAMDISK_IMAGE_START_MASK 0x07FF
    ./arch/x86/kernel/setup.c:#define RAMDISK_PROMPT_FLAG      0x8000
    ./arch/x86/kernel/setup.c:#define RAMDISK_LOAD_FLAG        0x4000

Consider a typical two floppy disk setup, where you will have the
kernel on disk one, and have already put a RAM disk image onto disk #2.

Hence you want to set bits 0 to 13 as 0, meaning that your RAM disk
starts at an offset of 0 kB from the beginning of the floppy.
The command line equivalent is: "ramdisk_start=0"

You want bit 14 as one, indicating that a RAM disk is to be loaded.
The command line equivalent is: "load_ramdisk=1"

You want bit 15 as one, indicating that you want a prompt/keypress
sequence so that you have a chance to switch floppy disks.
The command line equivalent is: "prompt_ramdisk=1"

Putting that together gives 2^15 + 2^14 + 0 = 49152 for an rdev word.
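The flag arithmetic can be reproduced with plain shell arithmetic. A small
sketch; the names PROMPT_FLAG and LOAD_FLAG are chosen here, mirroring the
RAMDISK_* defines quoted from setup.c:

```shell
PROMPT_FLAG=$(( 1 << 15 ))   # 0x8000: prompt before loading the RAM disk
LOAD_FLAG=$(( 1 << 14 ))     # 0x4000: a RAM disk is to be loaded
START=0                      # bits 0-10: offset in 1 kB blocks

word=$(( PROMPT_FLAG | LOAD_FLAG | START ))
echo "$word"   # 49152
```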
So to create disk one of the set, you would do::

    /usr/src/linux# cat arch/x86/boot/zImage > /dev/fd0
    /usr/src/linux# rdev /dev/fd0 /dev/fd0
    /usr/src/linux# rdev -r /dev/fd0 49152

If you make a boot disk that has LILO, then for the above, you would use::

    append = "ramdisk_start=0 load_ramdisk=1 prompt_ramdisk=1"

Since the default start = 0 and the default prompt = 1, you could use::

    append = "load_ramdisk=1"


4) An Example of Creating a Compressed RAM Disk
-----------------------------------------------

To create a RAM disk image, you will need a spare block device to
construct it on. This can be the RAM disk device itself, or an
unused disk partition (such as an unmounted swap partition). For this
example, we will use the RAM disk device, "/dev/ram0".

Note: This technique should not be done on a machine with less than 8 MB
of RAM. If using a spare disk partition instead of /dev/ram0, then this
restriction does not apply.

a) Decide on the RAM disk size that you want. Say 2 MB for this example.
   Create it by writing to the RAM disk device. (This step is not currently
   required, but may be in the future.) It is wise to zero out the
   area (esp. for disks) so that maximal compression is achieved for
   the unused blocks of the image that you are about to create::

    dd if=/dev/zero of=/dev/ram0 bs=1k count=2048

b) Make a filesystem on it. Say ext2fs for this example::

    mke2fs -vm0 /dev/ram0 2048

c) Mount it, copy the files you want to it (eg: /etc/* /dev/* ...)
   and unmount it again.

d) Compress the contents of the RAM disk. The level of compression
   will be approximately 50% of the space used by the files. Unused
   space on the RAM disk will compress to almost nothing::

    dd if=/dev/ram0 bs=1k count=2048 | gzip -v9 > /tmp/ram_image.gz

e) Put the kernel onto the floppy::

    dd if=zImage of=/dev/fd0 bs=1k

f) Put the RAM disk image onto the floppy, after the kernel. Use an offset
   that is slightly larger than the kernel, so that you can put another
   (possibly larger) kernel onto the same floppy later without overlapping
   the RAM disk image. An offset of 400 kB for kernels about 350 kB in
   size would be reasonable. Make sure offset+size of ram_image.gz is
   not larger than the total space on your floppy (usually 1440 kB)::

    dd if=/tmp/ram_image.gz of=/dev/fd0 bs=1k seek=400
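The capacity constraint in step (f) is simple arithmetic; a quick shell
sanity check using the numbers from the example (350 kB kernel, 400 kB
offset, 1440 kB floppy - all taken from the text above):

```shell
FLOPPY_KB=1440   # total space on the floppy
OFFSET_KB=400    # seek offset used for the RAM disk image
KERNEL_KB=350    # approximate kernel size from the example

# the kernel must fit below the offset...
[ "$KERNEL_KB" -le "$OFFSET_KB" ] && echo "kernel fits below the offset"

# ...and the compressed image in whatever remains
MAX_IMAGE_KB=$(( FLOPPY_KB - OFFSET_KB ))
echo "$MAX_IMAGE_KB"   # 1040 kB left for ram_image.gz
```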
g) Use "rdev" to set the boot device, RAM disk offset, prompt flag, etc.
   For prompt_ramdisk=1, load_ramdisk=1, ramdisk_start=400, one would
   have 2^15 + 2^14 + 400 = 49552::

    rdev /dev/fd0 /dev/fd0
    rdev -r /dev/fd0 49552

That is it. You now have your boot/root compressed RAM disk floppy. Some
users may wish to combine steps (d) and (f) by using a pipe.


Paul Gortmaker 12/95

Changelog:
----------

10-22-04 :
    Updated to reflect changes in command line options, remove
    obsolete references, general cleanup.
    James Nelson (james4765@gmail.com)

12-95 :
    Original Document

Documentation/admin-guide/blockdev/zram.rst (new file, 422 lines)

========================================
zram: Compressed RAM based block devices
========================================

Introduction
============

The zram module creates RAM based block devices named /dev/zram<id>
(<id> = 0, 1, ...). Pages written to these disks are compressed and stored
in memory itself. These disks allow very fast I/O and compression provides
good amounts of memory savings. Some of the usecases include /tmp storage,
use as swap disks, various caches under /var and maybe many more :)

Statistics for individual zram devices are exported through sysfs nodes at
/sys/block/zram<id>/

Usage
=====

There are several ways to configure and manage zram device(-s):

a) using zram and zram_control sysfs attributes
b) using zramctl utility, provided by util-linux (util-linux@vger.kernel.org).

In this document we will describe only 'manual' zram configuration steps,
IOW, zram and zram_control sysfs attributes.

In order to get a better idea about zramctl please consult util-linux
documentation, zramctl man-page or `zramctl --help`. Please be informed
that zram maintainers do not develop/maintain util-linux or zramctl; should
you have any questions please contact util-linux@vger.kernel.org

The following shows a typical sequence of steps for using zram.

WARNING
=======

For the sake of simplicity we skip error checking parts in most of the
examples below. However, it is your sole responsibility to handle errors.

zram sysfs attributes always return negative values in case of errors.
The list of possible return codes:

======== =============================================================
-EBUSY   an attempt to modify an attribute that cannot be changed once
         the device has been initialised. Please reset device first;
-ENOMEM  zram was not able to allocate enough memory to fulfil your
         needs;
-EINVAL  invalid input has been provided.
======== =============================================================

If you use 'echo', the return value is set by the 'echo' utility itself,
and, in the general case, something like::

    echo 3 > /sys/block/zram0/max_comp_streams
    if [ $? -ne 0 ]; then
        handle_error
    fi

should suffice.
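Since the examples below skip error checking, a tiny wrapper such as the
following can make failures visible; `set_attr` is a helper name invented
here for illustration, not part of zram or util-linux:

```shell
set_attr() {
    # write $2 into the sysfs attribute $1, reporting and failing on error
    if ! echo "$2" > "$1"; then
        echo "failed to write '$2' to $1" >&2
        return 1
    fi
}

# e.g.: set_attr /sys/block/zram0/max_comp_streams 3 || exit 1
```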
1) Load Module
==============

::

    modprobe zram num_devices=4

This creates 4 devices: /dev/zram{0,1,2,3}

The num_devices parameter is optional and tells zram how many devices should
be pre-created. Default: 1.

2) Set max number of compression streams
========================================

Regardless of the value passed to this attribute, ZRAM will always
allocate multiple compression streams - one per online CPU - thus
allowing several concurrent compression operations. The number of
allocated compression streams goes down when some of the CPUs
become offline. There is no single-compression-stream mode anymore,
unless you are running a UP system or have only 1 CPU online.

To find out how many streams are currently available::

    cat /sys/block/zram0/max_comp_streams

3) Select compression algorithm
===============================

Using the comp_algorithm device attribute one can see the available and
the currently selected (shown in square brackets) compression algorithms,
and change the selected compression algorithm (once the device is
initialised there is no way to change the compression algorithm).

Examples::

    #show supported compression algorithms
    cat /sys/block/zram0/comp_algorithm
    lzo [lz4]

    #select lzo compression algorithm
    echo lzo > /sys/block/zram0/comp_algorithm

For the time being, the `comp_algorithm` content does not necessarily
show every compression algorithm supported by the kernel. We keep this
list primarily to simplify device configuration, and one can configure
a new device with a compression algorithm that is not listed in
`comp_algorithm`. The thing is that, internally, ZRAM uses the Crypto API
and, if some of the algorithms were built as modules, it's impossible
to list all of them using, for instance, /proc/crypto or any other
method. This, however, has the advantage of permitting the usage of
custom crypto compression modules (implementing S/W or H/W compression).
4) Set Disksize
===============

Set disk size by writing the value to sysfs node 'disksize'.
The value can be either in bytes or you can use mem suffixes.
Examples::

    # Initialize /dev/zram0 with 50MB disksize
    echo $((50*1024*1024)) > /sys/block/zram0/disksize

    # Using mem suffixes
    echo 256K > /sys/block/zram0/disksize
    echo 512M > /sys/block/zram0/disksize
    echo 1G > /sys/block/zram0/disksize

Note:
There is little point creating a zram of greater than twice the size of memory
since we expect a 2:1 compression ratio. Note that zram uses about 0.1% of the
size of the disk when not in use so a huge zram is wasteful.
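The mem suffixes accepted by 'disksize' are the usual powers-of-two
multipliers; the byte values they expand to can be reproduced with shell
arithmetic (a sketch for illustration only):

```shell
K=1024
M=$(( K * 1024 ))
G=$(( M * 1024 ))

echo $(( 256 * K ))   # 262144, what "256K" denotes
echo $(( 512 * M ))   # 536870912, "512M"
echo "$G"             # 1073741824, "1G"
```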
5) Set memory limit: Optional
=============================

Set the memory limit by writing the value to sysfs node 'mem_limit'.
The value can be either in bytes or you can use mem suffixes.
In addition, you can change the value at runtime.
Examples::

    # limit /dev/zram0 with 50MB memory
    echo $((50*1024*1024)) > /sys/block/zram0/mem_limit

    # Using mem suffixes
    echo 256K > /sys/block/zram0/mem_limit
    echo 512M > /sys/block/zram0/mem_limit
    echo 1G > /sys/block/zram0/mem_limit

    # To disable memory limit
    echo 0 > /sys/block/zram0/mem_limit

6) Activate
===========

::

    mkswap /dev/zram0
    swapon /dev/zram0

    mkfs.ext4 /dev/zram1
    mount /dev/zram1 /tmp
7) Add/remove zram devices
==========================

zram provides a control interface, which enables dynamic (on-demand) device
addition and removal.

In order to add a new /dev/zramX device, perform a read operation on the
hot_add attribute. This will return either the new device's device id
(meaning that you can use /dev/zram<id>) or an error code.

Example::

    cat /sys/class/zram-control/hot_add
    1

To remove the existing /dev/zramX device (where X is a device id)
execute::

    echo X > /sys/class/zram-control/hot_remove
8) Stats
========

Per-device statistics are exported as various nodes under /sys/block/zram<id>/

A brief description of exported device attributes follows. For more details
please read Documentation/ABI/testing/sysfs-block-zram.

====================== ====== ===============================================
Name                   access description
====================== ====== ===============================================
disksize               RW     show and set the device's disk size
initstate              RO     shows the initialization state of the device
reset                  WO     trigger device reset
mem_used_max           WO     reset the `mem_used_max` counter (see later)
mem_limit              WO     specifies the maximum amount of memory ZRAM can
                              use to store the compressed data
writeback_limit        WO     specifies the maximum amount of write IO zram
                              can write out to backing device as 4KB unit
writeback_limit_enable RW     show and set writeback_limit feature
max_comp_streams       RW     the number of possible concurrent compress
                              operations
comp_algorithm         RW     show and change the compression algorithm
compact                WO     trigger memory compaction
debug_stat             RO     this file is used for zram debugging purposes
backing_dev            RW     set up backend storage for zram to write out
idle                   WO     mark allocated slot as idle
====================== ====== ===============================================
User space is advised to use the following files to read the device statistics.

File /sys/block/zram<id>/stat

Represents block layer statistics. Read Documentation/block/stat.rst for
details.

File /sys/block/zram<id>/io_stat

The stat file represents the device's I/O statistics not accounted by the
block layer and, thus, not available in the zram<id>/stat file. It consists
of a single line of text and contains the following stats separated by
whitespace:

============= =============================================================
failed_reads  The number of failed reads
failed_writes The number of failed writes
invalid_io    The number of non-page-size-aligned I/O requests
notify_free   Depending on device usage scenario it may account

              a) the number of pages freed because of swap slot free
                 notifications
              b) the number of pages freed because of
                 REQ_OP_DISCARD requests sent by bio. The former ones are
                 sent to a swap block device when a swap slot is freed,
                 which implies that this disk is being used as a swap disk.

              The latter ones are sent by a filesystem mounted with the
              discard option, whenever some data blocks are getting
              discarded.
============= =============================================================
File /sys/block/zram<id>/mm_stat

The stat file represents the device's mm statistics. It consists of a single
line of text and contains the following stats separated by whitespace:

================ =============================================================
orig_data_size   uncompressed size of data stored in this disk.
                 This excludes same-element-filled pages (same_pages) since
                 no memory is allocated for them.
                 Unit: bytes
compr_data_size  compressed size of data stored in this disk
mem_used_total   the amount of memory allocated for this disk. This
                 includes allocator fragmentation and metadata overhead,
                 allocated for this disk. So, allocator space efficiency
                 can be calculated using compr_data_size and this statistic.
                 Unit: bytes
mem_limit        the maximum amount of memory ZRAM can use to store
                 the compressed data
mem_used_max     the maximum amount of memory zram has consumed to
                 store the data
same_pages       the number of same element filled pages written to this disk.
                 No memory is allocated for such pages.
pages_compacted  the number of pages freed during compaction
huge_pages       the number of incompressible pages
================ =============================================================
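From a script, reading mm_stat is just a matter of splitting that single
line; a sketch (the sample line below is made-up data, not real device
output):

```shell
# Field order: orig_data_size compr_data_size mem_used_total mem_limit
#              mem_used_max same_pages pages_compacted huge_pages
mm_stat_line="8388608 2097152 2359296 0 2359296 10 0 1"
set -- $mm_stat_line      # intentionally unquoted: split on whitespace
orig=$1 compr=$2 used=$3

# integer compression ratio scaled by 100 (8 MB -> 2 MB gives 400)
echo $(( orig * 100 / compr ))
```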
File /sys/block/zram<id>/bd_stat

The stat file represents the device's backing device statistics. It consists
of a single line of text and contains the following stats separated by
whitespace:

============== =============================================================
bd_count       size of data written in backing device.
               Unit: 4K bytes
bd_reads       the number of reads from backing device
               Unit: 4K bytes
bd_writes      the number of writes to backing device
               Unit: 4K bytes
============== =============================================================
9) Deactivate
=============

::

    swapoff /dev/zram0
    umount /dev/zram1

10) Reset
=========

Write any positive value to the 'reset' sysfs node::

    echo 1 > /sys/block/zram0/reset
    echo 1 > /sys/block/zram1/reset

This frees all the memory allocated for the given device and
resets the disksize to zero. You must set the disksize again
before reusing the device.
Optional Feature
================

writeback
---------

With CONFIG_ZRAM_WRITEBACK, zram can write idle/incompressible pages
to backing storage rather than keeping them in memory.
To use the feature, admin should set up the backing device via::

    echo /dev/sda5 > /sys/block/zramX/backing_dev

before setting the disksize. Only a partition is supported at this moment.
If admin wants to use incompressible page writeback, they could do it via::

    echo huge > /sys/block/zramX/writeback

To use idle page writeback, first, the user needs to declare zram pages
as idle::

    echo all > /sys/block/zramX/idle

From now on, any pages on zram are idle pages. The idle mark
will be removed when someone requests access to the block.
IOW, unless there is an access request, those pages remain idle pages.

Admin can request writeback of those idle pages at the right timing via::

    echo idle > /sys/block/zramX/writeback

With the command, zram writes back idle pages from memory to the storage.

If there is lots of write IO with a flash device, it potentially has a
flash wearout problem, so admin needs to design a write limitation
to guarantee storage health for the entire product life.

To overcome the concern, zram supports the "writeback_limit" feature.
The default value of "writeback_limit_enable" is 0, so it doesn't limit
any writeback. IOW, if admin wants to apply a writeback budget, he should
enable writeback_limit_enable via::

    $ echo 1 > /sys/block/zramX/writeback_limit_enable

Once writeback_limit_enable is set, zram doesn't allow any writeback
until admin sets the budget via /sys/block/zramX/writeback_limit.

(If admin doesn't enable writeback_limit_enable, writeback_limit's value
assigned via /sys/block/zramX/writeback_limit is meaningless.)

If admin wants to limit writeback to 400M per day, he could do it
like below::

    $ MB_SHIFT=20
    $ FOURK_SHIFT=12
    $ echo $((400<<MB_SHIFT>>FOURK_SHIFT)) > \
        /sys/block/zram0/writeback_limit
    $ echo 1 > /sys/block/zram0/writeback_limit_enable

If admin wants to allow further writes again once the budget is exhausted,
he could do it like below::

    $ echo $((400<<MB_SHIFT>>FOURK_SHIFT)) > \
        /sys/block/zram0/writeback_limit

If admin wants to see the remaining writeback budget since he set it::

    $ cat /sys/block/zramX/writeback_limit

If admin wants to disable the writeback limit, he could do::

    $ echo 0 > /sys/block/zramX/writeback_limit_enable

The writeback_limit count will reset whenever you reset zram (e.g.,
system reboot, echo 1 > /sys/block/zramX/reset), so keeping track of how
much writeback has happened until you reset the zram, in order to allocate
an extra writeback budget at the next setting, is the user's job.

If admin wants to measure the writeback count in a certain period, he can
read it from the 3rd column of /sys/block/zram0/bd_stat.
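The shift arithmetic in the writeback_limit example converts megabytes into
4 KB units. A quick sketch verifying the 400 MB budget (variable names are
chosen here to be valid shell identifiers):

```shell
MB_SHIFT=20      # bytes per MB = 1 << 20
FOURK_SHIFT=12   # bytes per 4 KB page = 1 << 12

# 400 MB expressed in 4 KB units, the value written into writeback_limit
budget=$(( 400 << MB_SHIFT >> FOURK_SHIFT ))
echo "$budget"   # 102400
```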
memory tracking
===============

With CONFIG_ZRAM_MEMORY_TRACKING, the user can know information about each
zram block. It can be useful to catch cold or incompressible
pages of a process with *pagemap*.

If you enable the feature, you can see the block state via
/sys/kernel/debug/zram/zram0/block_state. The output is as follows::

      300    75.033841 .wh.
      301    63.806904 s...
      302    63.806919 ..hi

First column
    zram's block index.
Second column
    access time since the system was booted
Third column
    state of the block:

    s:
        same page
    w:
        written page to backing store
    h:
        huge page
    i:
        idle page

The first line of the above example says the 300th block was accessed at
75.033841 sec and the block's state is huge so it was written back to the
backing storage. It's a debugging feature so anyone shouldn't rely on it
to work properly.
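The four state letters can be decoded mechanically; a sketch that classifies
one line of the sample output above (the line is the made-up example data,
and the messages are invented here):

```shell
line="300 75.033841 .wh."
set -- $line              # intentionally unquoted: split into columns
index=$1 atime=$2 state=$3

case $state in
    *w*) echo "block $index was written back to the backing store" ;;
esac
case $state in
    *h*) echo "block $index holds a huge (incompressible) page" ;;
esac
```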

Nitin Gupta
ngupta@vflare.org
@@ -90,9 +90,9 @@ the disk is not available then you have three options:
     run a null modem to a second machine and capture the output there
     using your favourite communication program. Minicom works well.

-(3) Use Kdump (see Documentation/kdump/kdump.rst),
+(3) Use Kdump (see Documentation/admin-guide/kdump/kdump.rst),
     extract the kernel ring buffer from old memory with using dmesg
-    gdbmacro in Documentation/kdump/gdbmacros.txt.
+    gdbmacro in Documentation/admin-guide/kdump/gdbmacros.txt.

 Finding the bug's location
 --------------------------

Documentation/admin-guide/cgroup-v1/cgroups.rst (new file, 695 lines)

|
|||||||

==============
Control Groups
==============

Written by Paul Menage <menage@google.com> based on
Documentation/admin-guide/cgroup-v1/cpusets.rst

Original copyright statements from cpusets.txt:

Portions Copyright (C) 2004 BULL SA.

Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.

Modified by Paul Jackson <pj@sgi.com>

Modified by Christoph Lameter <cl@linux.com>

.. CONTENTS:

   1. Control Groups
     1.1 What are cgroups ?
     1.2 Why are cgroups needed ?
     1.3 How are cgroups implemented ?
     1.4 What does notify_on_release do ?
     1.5 What does clone_children do ?
     1.6 How do I use cgroups ?
   2. Usage Examples and Syntax
     2.1 Basic Usage
     2.2 Attaching processes
     2.3 Mounting hierarchies by name
   3. Kernel API
     3.1 Overview
     3.2 Synchronization
     3.3 Subsystem API
   4. Extended attribute usage
   5. Questions

1. Control Groups
=================

1.1 What are cgroups ?
----------------------

Control Groups provide a mechanism for aggregating/partitioning sets of
tasks, and all their future children, into hierarchical groups with
specialized behaviour.

Definitions:

A *cgroup* associates a set of tasks with a set of parameters for one
or more subsystems.

A *subsystem* is a module that makes use of the task grouping
facilities provided by cgroups to treat groups of tasks in
particular ways. A subsystem is typically a "resource controller" that
schedules a resource or applies per-cgroup limits, but it may be
anything that wants to act on a group of processes, e.g. a
virtualization subsystem.

A *hierarchy* is a set of cgroups arranged in a tree, such that
every task in the system is in exactly one of the cgroups in the
hierarchy, and a set of subsystems; each subsystem has system-specific
state attached to each cgroup in the hierarchy. Each hierarchy has
an instance of the cgroup virtual filesystem associated with it.

At any one time there may be multiple active hierarchies of task
cgroups. Each hierarchy is a partition of all tasks in the system.

User-level code may create and destroy cgroups by name in an
instance of the cgroup virtual file system, specify and query to
which cgroup a task is assigned, and list the task PIDs assigned to
a cgroup. Those creations and assignments only affect the hierarchy
associated with that instance of the cgroup file system.

On their own, the only use for cgroups is for simple job
tracking. The intention is that other subsystems hook into the generic
cgroup support to provide new attributes for cgroups, such as
accounting/limiting the resources which processes in a cgroup can
access. For example, cpusets (see Documentation/admin-guide/cgroup-v1/cpusets.rst) allow
you to associate a set of CPUs and a set of memory nodes with the
tasks in each cgroup.

1.2 Why are cgroups needed ?
----------------------------

There are multiple efforts to provide process aggregations in the
Linux kernel, mainly for resource-tracking purposes. Such efforts
include cpusets, CKRM/ResGroups, UserBeanCounters, and virtual server
namespaces. These all require the basic notion of a
grouping/partitioning of processes, with newly forked processes ending
up in the same group (cgroup) as their parent process.

The kernel cgroup patch provides the minimum essential kernel
mechanisms required to efficiently implement such groups. It has
minimal impact on the system fast paths, and provides hooks for
specific subsystems such as cpusets to provide additional behaviour as
desired.

Multiple hierarchy support is provided to allow for situations where
the division of tasks into cgroups is distinctly different for
different subsystems - having parallel hierarchies allows each
hierarchy to be a natural division of tasks, without having to handle
complex combinations of tasks that would be present if several
unrelated subsystems needed to be forced into the same tree of
cgroups.

At one extreme, each resource controller or subsystem could be in a
separate hierarchy; at the other extreme, all subsystems
would be attached to the same hierarchy.

As an example of a scenario (originally proposed by vatsa@in.ibm.com)
that can benefit from multiple hierarchies, consider a large
university server with various users - students, professors, system
tasks etc. The resource planning for this server could be along the
following lines::

   CPU :          "Top cpuset"
                  /          \
          CPUSet1            CPUSet2
             |                  |
       (Professors)        (Students)

   In addition (system tasks) are attached to topcpuset (so
   that they can run anywhere) with a limit of 20%

   Memory : Professors (50%), Students (30%), system (20%)

   Disk : Professors (50%), Students (30%), system (20%)

   Network : WWW browsing (20%), Network File System (60%), others (20%)
                   / \
   Professors (15%)   students (5%)

Browsers like Firefox/Lynx go into the WWW network class, while (k)nfsd goes
into the NFS network class.

At the same time Firefox/Lynx will share an appropriate CPU/Memory class
depending on who launched it (prof/student).

With the ability to classify tasks differently for different resources
(by putting those resource subsystems in different hierarchies),
the admin can easily set up a script which receives exec notifications
and, depending on who is launching the browser, can run::

   # echo browser_pid > /sys/fs/cgroup/<restype>/<userclass>/tasks

With only a single hierarchy, the admin would potentially have to create
a separate cgroup for every browser launched and associate it with
the appropriate network and other resource classes. This may lead to
proliferation of such cgroups.

Also, let's say that the administrator would like to give enhanced network
access temporarily to a student's browser (since it is night and the user
wants to do online gaming :)) OR give one of the student's simulation
apps enhanced CPU power.

With the ability to write PIDs directly to resource classes, it's just a
matter of::

   # echo pid > /sys/fs/cgroup/network/<new_class>/tasks
   (after some time)
   # echo pid > /sys/fs/cgroup/network/<orig_class>/tasks

Without this ability, the administrator would have to split the cgroup into
multiple separate ones and then associate the new cgroups with the
new resource classes.


1.3 How are cgroups implemented ?
---------------------------------

Control Groups extend the kernel as follows:

 - Each task in the system has a reference-counted pointer to a
   css_set.

 - A css_set contains a set of reference-counted pointers to
   cgroup_subsys_state objects, one for each cgroup subsystem
   registered in the system. There is no direct link from a task to
   the cgroup of which it's a member in each hierarchy, but this
   can be determined by following pointers through the
   cgroup_subsys_state objects. This is because accessing the
   subsystem state is something that's expected to happen frequently
   and in performance-critical code, whereas operations that require a
   task's actual cgroup assignments (in particular, moving between
   cgroups) are less common. A linked list runs through the cg_list
   field of each task_struct using the css_set, anchored at
   css_set->tasks.

 - A cgroup hierarchy filesystem can be mounted for browsing and
   manipulation from user space.

 - You can list all the tasks (by PID) attached to any cgroup.

The implementation of cgroups requires a few simple hooks
into the rest of the kernel, none in performance-critical paths:

 - in init/main.c, to initialize the root cgroups and initial
   css_set at system boot.

 - in fork and exit, to attach and detach a task from its css_set.

In addition, a new file system of type "cgroup" may be mounted, to
enable browsing and modifying the cgroups presently known to the
kernel. When mounting a cgroup hierarchy, you may specify a
comma-separated list of subsystems to mount as the filesystem mount
options. By default, mounting the cgroup filesystem attempts to
mount a hierarchy containing all registered subsystems.

If an active hierarchy with exactly the same set of subsystems already
exists, it will be reused for the new mount. If no existing hierarchy
matches, and any of the requested subsystems are in use in an existing
hierarchy, the mount will fail with -EBUSY. Otherwise, a new hierarchy
is activated, associated with the requested subsystems.

It's not currently possible to bind a new subsystem to an active
cgroup hierarchy, or to unbind a subsystem from an active cgroup
hierarchy. This may be possible in the future, but is fraught with nasty
error-recovery issues.

When a cgroup filesystem is unmounted, if there are any
child cgroups created below the top-level cgroup, that hierarchy
will remain active even though unmounted; if there are no
child cgroups then the hierarchy will be deactivated.

No new system calls are added for cgroups - all support for
querying and modifying cgroups is via this cgroup file system.

Each task under /proc has an added file named 'cgroup' displaying,
for each active hierarchy, the subsystem names and the cgroup name
as the path relative to the root of the cgroup file system.
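The colon-separated records in /proc/<pid>/cgroup can be split with plain shell. A minimal sketch, run against a sample line rather than a live file (the hierarchy ID, subsystem list, and path below are made up for illustration):

```shell
# Each /proc/<pid>/cgroup line has the form "hierarchy-id:subsystems:path".
line='4:cpuset:/Charlie'

# Split the three fields on ':' (IFS applies to this read only).
IFS=: read -r hier subsys path <<EOF
$line
EOF

echo "hierarchy=$hier subsystems=$subsys cgroup=$path"
```

On the sample line this prints `hierarchy=4 subsystems=cpuset cgroup=/Charlie`.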

Each cgroup is represented by a directory in the cgroup file system
containing the following files describing that cgroup:

 - tasks: list of tasks (by PID) attached to that cgroup. This list
   is not guaranteed to be sorted. Writing a thread ID into this file
   moves the thread into this cgroup.
 - cgroup.procs: list of thread group IDs in the cgroup. This list is
   not guaranteed to be sorted or free of duplicate TGIDs, and userspace
   should sort/uniquify the list if this property is required.
   Writing a thread group ID into this file moves all threads in that
   group into this cgroup.
 - notify_on_release flag: run the release agent on exit?
 - release_agent: the path to use for release notifications (this file
   exists in the top cgroup only)

Other subsystems such as cpusets may add additional files in each
cgroup dir.

New cgroups are created using the mkdir system call or shell
command. The properties of a cgroup, such as its flags, are
modified by writing to the appropriate file in that cgroup's
directory, as listed above.

The named hierarchical structure of nested cgroups allows partitioning
a large system into nested, dynamically changeable, "soft-partitions".

The attachment of each task, automatically inherited at fork by any
children of that task, to a cgroup allows organizing the work load
on a system into related sets of tasks. A task may be re-attached to
any other cgroup, if allowed by the permissions on the necessary
cgroup file system directories.

When a task is moved from one cgroup to another, it gets a new
css_set pointer - if there's an already existing css_set with the
desired collection of cgroups then that group is reused, otherwise a new
css_set is allocated. The appropriate existing css_set is located by
looking into a hash table.

To allow access from a cgroup to the css_sets (and hence tasks)
that comprise it, a set of cg_cgroup_link objects form a lattice;
each cg_cgroup_link is linked into a list of cg_cgroup_links for
a single cgroup on its cgrp_link_list field, and a list of
cg_cgroup_links for a single css_set on its cg_link_list.

Thus the set of tasks in a cgroup can be listed by iterating over
each css_set that references the cgroup, and sub-iterating over
each css_set's task set.

The use of a Linux virtual file system (vfs) to represent the
cgroup hierarchy provides for a familiar permission and name space
for cgroups, with a minimum of additional kernel code.

1.4 What does notify_on_release do ?
------------------------------------

If the notify_on_release flag is enabled (1) in a cgroup, then
whenever the last task in the cgroup leaves (exits or attaches to
some other cgroup) and the last child cgroup of that cgroup
is removed, then the kernel runs the command specified by the contents
of the "release_agent" file in that hierarchy's root directory,
supplying the pathname (relative to the mount point of the cgroup
file system) of the abandoned cgroup. This enables automatic
removal of abandoned cgroups. The default value of
notify_on_release in the root cgroup at system boot is disabled
(0). The default value of other cgroups at creation is the current
value of their parent's notify_on_release setting. The default value of
a cgroup hierarchy's release_agent path is empty.
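What such a release agent typically does can be sketched as a tiny shell function. The mount-point parameter is an assumption for illustration only: a real agent receives just the relative cgroup path as its single argument, with the mount point known in advance.

```shell
# Sketch of a release agent: the kernel invokes the configured binary
# with the abandoned cgroup's path relative to the hierarchy's mount
# point; the agent removes the now-empty cgroup directory.
release_agent() {
    mountpoint=$1 relpath=$2
    rmdir "$mountpoint$relpath"    # remove the abandoned cgroup dir
}

# Exercised here against a scratch tree standing in for a hierarchy:
demo=$(mktemp -d)
mkdir "$demo/Charlie"
release_agent "$demo" /Charlie
```

A real agent would be the single path installed in the hierarchy's release_agent file and would typically be a script wrapping exactly such an rmdir.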

1.5 What does clone_children do ?
---------------------------------

This flag only affects the cpuset controller. If the clone_children
flag is enabled (1) in a cgroup, a new cpuset cgroup will copy its
configuration from the parent during initialization.

1.6 How do I use cgroups ?
--------------------------

To start a new job that is to be contained within a cgroup, using
the "cpuset" cgroup subsystem, the steps are something like::

 1) mount -t tmpfs cgroup_root /sys/fs/cgroup
 2) mkdir /sys/fs/cgroup/cpuset
 3) mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
 4) Create the new cgroup by doing mkdir's and write's (or echo's) in
    the /sys/fs/cgroup/cpuset virtual file system.
 5) Start a task that will be the "founding father" of the new job.
 6) Attach that task to the new cgroup by writing its PID to the
    /sys/fs/cgroup/cpuset tasks file for that cgroup.
 7) fork, exec or clone the job tasks from this founding father task.

For example, the following sequence of commands will set up a cgroup
named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
and then start a subshell 'sh' in that cgroup::

  mount -t tmpfs cgroup_root /sys/fs/cgroup
  mkdir /sys/fs/cgroup/cpuset
  mount -t cgroup cpuset -ocpuset /sys/fs/cgroup/cpuset
  cd /sys/fs/cgroup/cpuset
  mkdir Charlie
  cd Charlie
  /bin/echo 2-3 > cpuset.cpus
  /bin/echo 1 > cpuset.mems
  /bin/echo $$ > tasks
  sh
  # The subshell 'sh' is now running in cgroup Charlie
  # The next line should display '/Charlie'
  cat /proc/self/cgroup

2. Usage Examples and Syntax
============================

2.1 Basic Usage
---------------

Creating, modifying, and using cgroups can be done through the cgroup
virtual filesystem.

To mount a cgroup hierarchy with all available subsystems, type::

  # mount -t cgroup xxx /sys/fs/cgroup

The "xxx" is not interpreted by the cgroup code, but will appear in
/proc/mounts so may be any useful identifying string that you like.
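Since both the identifying string and the subsystem options end up in /proc/mounts, mounted hierarchies can be listed by filtering on the filesystem-type field. A sketch against a sample /proc/mounts entry (the device name hier1 and mount point are assumed for illustration; on a live system the same awk filter would read /proc/mounts directly):

```shell
# /proc/mounts fields: device mountpoint fstype options dump pass.
sample='hier1 /sys/fs/cgroup/rg1 cgroup rw,cpuset,memory 0 0'

# Keep only cgroup mounts; show where each hierarchy is mounted and
# which subsystems it carries.
printf '%s\n' "$sample" | awk '$3 == "cgroup" {print $2 ": " $4}'
```

On the sample entry this prints `/sys/fs/cgroup/rg1: rw,cpuset,memory`.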

Note: Some subsystems do not work without some user input first. For instance,
if cpusets are enabled the user will have to populate the cpus and mems files
for each new cgroup created before that group can be used.

As explained in section `1.2 Why are cgroups needed?` you should create
different hierarchies of cgroups for each single resource or group of
resources you want to control. Therefore, you should mount a tmpfs on
/sys/fs/cgroup and create directories for each cgroup resource or resource
group::

  # mount -t tmpfs cgroup_root /sys/fs/cgroup
  # mkdir /sys/fs/cgroup/rg1

To mount a cgroup hierarchy with just the cpuset and memory
subsystems, type::

  # mount -t cgroup -o cpuset,memory hier1 /sys/fs/cgroup/rg1

While remounting cgroups is currently supported, it is not recommended
to use it. Remounting allows changing bound subsystems and
release_agent. Rebinding is hardly useful as it only works when the
hierarchy is empty, and release_agent itself should be replaced with
conventional fsnotify. The support for remounting will be removed in
the future.

To specify a hierarchy's release_agent::

  # mount -t cgroup -o cpuset,release_agent="/sbin/cpuset_release_agent" \
    xxx /sys/fs/cgroup/rg1

Note that specifying 'release_agent' more than once will return failure.

Note that changing the set of subsystems is currently only supported
when the hierarchy consists of a single (root) cgroup. Supporting
the ability to arbitrarily bind/unbind subsystems from an existing
cgroup hierarchy is intended to be implemented in the future.

Then under /sys/fs/cgroup/rg1 you can find a tree that corresponds to the
tree of the cgroups in the system. For instance, /sys/fs/cgroup/rg1
is the cgroup that holds the whole system.
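Because every cgroup is just a directory, the tree of cgroups under a mount point can be enumerated with find. Demonstrated here on a scratch directory laid out like a small hierarchy (the cgroup names Charlie, sub1, and Daemons are invented for the example):

```shell
# Lay out a scratch tree standing in for a mounted hierarchy; on a
# real system "root" would be the mount point, e.g. /sys/fs/cgroup/rg1.
root=$(mktemp -d)
mkdir -p "$root/Charlie/sub1" "$root/Daemons"

# Each directory under the mount point is one cgroup.
find "$root" -type d | sed "s|^$root|<root>|" | sort
```

This prints the four cgroups `<root>`, `<root>/Charlie`, `<root>/Charlie/sub1`, and `<root>/Daemons`.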

If you want to change the value of release_agent::

  # echo "/sbin/new_release_agent" > /sys/fs/cgroup/rg1/release_agent

It can also be changed via remount.

If you want to create a new cgroup under /sys/fs/cgroup/rg1::

  # cd /sys/fs/cgroup/rg1
  # mkdir my_cgroup

Now you want to do something with this cgroup::

  # cd my_cgroup

In this directory you can find several files::

  # ls
  cgroup.procs notify_on_release tasks
  (plus whatever files are added by the attached subsystems)

Now attach your shell to this cgroup::

  # /bin/echo $$ > tasks

You can also create cgroups inside your cgroup by using mkdir in this
directory::

  # mkdir my_sub_cs

To remove a cgroup, just use rmdir::

  # rmdir my_sub_cs

This will fail if the cgroup is in use (has cgroups inside, or
has processes attached, or is held alive by another subsystem-specific
reference).

2.2 Attaching processes
-----------------------

::

  # /bin/echo PID > tasks

Note that it is PID, not PIDs. You can only attach ONE task at a time.
If you have several tasks to attach, you have to do it one after another::

  # /bin/echo PID1 > tasks
  # /bin/echo PID2 > tasks
  ...
  # /bin/echo PIDn > tasks

You can attach the current shell task by echoing 0::

  # echo 0 > tasks

You can use the cgroup.procs file instead of the tasks file to move all
threads in a threadgroup at once. Echoing the PID of any task in a
threadgroup to cgroup.procs causes all tasks in that threadgroup to be
attached to the cgroup. Writing 0 to cgroup.procs moves all tasks
in the writing task's threadgroup.

Note: Since every task is always a member of exactly one cgroup in each
mounted hierarchy, to remove a task from its current cgroup you must
move it into a new cgroup (possibly the root cgroup) by writing to the
new cgroup's tasks file.

Note: Due to some restrictions enforced by some cgroup subsystems, moving
a process to another cgroup can fail.
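The one-PID-per-write rule can be wrapped in a small helper loop. The tasks-file path is a parameter here because a real tasks file only exists under a mounted hierarchy; the loop below is exercised against an ordinary file, which (unlike a real tasks file) is overwritten by each write rather than accumulating members.

```shell
# attach_all TASKS_FILE PID... : one write() per PID, as the tasks
# file requires, reporting each failed attach individually.
attach_all() {
    tasks=$1; shift
    for pid in "$@"; do
        /bin/echo "$pid" > "$tasks" || echo "failed to attach $pid" >&2
    done
}

# Exercised against a plain scratch file standing in for a tasks file:
scratch=$(mktemp -d)
: > "$scratch/tasks"
attach_all "$scratch/tasks" 101 102 103
```

Against a live hierarchy the first argument would be something like /sys/fs/cgroup/cpuset/Charlie/tasks, and each write would move one task into the cgroup.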
|
||||||
|
|
||||||
|
2.3 Mounting hierarchies by name
|
||||||
|
--------------------------------
|
||||||
|
|
||||||
|
Passing the name=<x> option when mounting a cgroups hierarchy
|
||||||
|
associates the given name with the hierarchy. This can be used when
|
||||||
|
mounting a pre-existing hierarchy, in order to refer to it by name
|
||||||
|
rather than by its set of active subsystems. Each hierarchy is either
|
||||||
|
nameless, or has a unique name.
|
||||||
|
|
||||||
|
The name should match [\w.-]+
|
||||||
|
|
||||||
|
When passing a name=<x> option for a new hierarchy, you need to
|
||||||
|
specify subsystems manually; the legacy behaviour of mounting all
|
||||||
|
subsystems when none are explicitly specified is not supported when
|
||||||
|
you give a subsystem a name.
|
||||||
|
|
||||||
|
The name of the subsystem appears as part of the hierarchy description
|
||||||
|
in /proc/mounts and /proc/<pid>/cgroups.
|
||||||
|
|
||||||
|
|
||||||
|
3. Kernel API
|
||||||
|
=============
|
||||||
|
|
||||||
|
3.1 Overview
|
||||||
|
------------
|
||||||
|
|
||||||
|
Each kernel subsystem that wants to hook into the generic cgroup
|
||||||
|
system needs to create a cgroup_subsys object. This contains
|
||||||
|
various methods, which are callbacks from the cgroup system, along
|
||||||
|
with a subsystem ID which will be assigned by the cgroup system.
|
||||||
|
|
||||||
|
Other fields in the cgroup_subsys object include:
|
||||||
|
|
||||||
|
- subsys_id: a unique array index for the subsystem, indicating which
|
||||||
|
entry in cgroup->subsys[] this subsystem should be managing.
|
||||||
|
|
||||||
|
- name: should be initialized to a unique subsystem name. Should be
|
||||||
|
no longer than MAX_CGROUP_TYPE_NAMELEN.
|
||||||
|
|
||||||
|
- early_init: indicate if the subsystem needs early initialization
|
||||||
|
at system boot.
|
||||||
|
|
||||||
|
Each cgroup object created by the system has an array of pointers,
|
||||||
|
indexed by subsystem ID; this pointer is entirely managed by the
|
||||||
|
subsystem; the generic cgroup code will never touch this pointer.
|
||||||
|
|
||||||
|
3.2 Synchronization
|
||||||
|
-------------------
|
||||||
|
|
||||||
|
There is a global mutex, cgroup_mutex, used by the cgroup
|
||||||
|
system. This should be taken by anything that wants to modify a
|
||||||
|
cgroup. It may also be taken to prevent cgroups from being
|
||||||
|
modified, but more specific locks may be more appropriate in that
|
||||||
|
situation.
|
||||||
|
|
||||||
|
See kernel/cgroup.c for more details.
|
||||||
|
|
||||||
|
Subsystems can take/release the cgroup_mutex via the functions
|
||||||
|
cgroup_lock()/cgroup_unlock().
|
||||||
|
|
||||||
|
Accessing a task's cgroup pointer may be done in the following ways:
|
||||||
|
- while holding cgroup_mutex
|
||||||
|
- while holding the task's alloc_lock (via task_lock())
|
||||||
|
- inside an rcu_read_lock() section via rcu_dereference()
|
||||||
|
|
||||||
|
3.3 Subsystem API
|
||||||
|
-----------------
|
||||||
|
|
||||||
|
Each subsystem should:
|
||||||
|
|
||||||
|
- add an entry in linux/cgroup_subsys.h
|
||||||
|
- define a cgroup_subsys object called <name>_cgrp_subsys
|
||||||
|
|
||||||
|
Each subsystem may export the following methods. The only mandatory
|
||||||
|
methods are css_alloc/free. Any others that are null are presumed to
|
||||||
|
be successful no-ops.
|
||||||
|
|
||||||
|
``struct cgroup_subsys_state *css_alloc(struct cgroup *cgrp)``
|
||||||
|
(cgroup_mutex held by caller)
|
||||||
|
|
||||||
|
Called to allocate a subsystem state object for a cgroup. The
|
||||||
|
subsystem should allocate its subsystem state object for the passed
|
||||||
|
cgroup, returning a pointer to the new object on success or a
|
||||||
|
ERR_PTR() value. On success, the subsystem pointer should point to
|
||||||
|
a structure of type cgroup_subsys_state (typically embedded in a
|
||||||
|
larger subsystem-specific object), which will be initialized by the
|
||||||
|
cgroup system. Note that this will be called at initialization to
|
||||||
|
create the root subsystem state for this subsystem; this case can be
|
||||||
|
identified by the passed cgroup object having a NULL parent (since
|
||||||
|
it's the root of the hierarchy) and may be an appropriate place for
|
||||||
|
initialization code.
|
||||||
|
|
||||||
|
``int css_online(struct cgroup *cgrp)``
|
||||||
|
(cgroup_mutex held by caller)
|
||||||
|
|
||||||
|
Called after @cgrp successfully completed all allocations and made
|
||||||
|
visible to cgroup_for_each_child/descendant_*() iterators. The
|
||||||
|
subsystem may choose to fail creation by returning -errno. This
|
||||||
|
callback can be used to implement reliable state sharing and
|
||||||
|
propagation along the hierarchy. See the comment on
|
||||||
|
cgroup_for_each_descendant_pre() for details.
|
||||||
|
|
||||||
|
``void css_offline(struct cgroup *cgrp);``
|
||||||
|
(cgroup_mutex held by caller)
|
||||||
|
|
||||||
|
This is the counterpart of css_online() and called iff css_online()
|
||||||
|
has succeeded on @cgrp. This signifies the beginning of the end of
|
||||||
|
@cgrp. @cgrp is being removed and the subsystem should start dropping
|
||||||
|
all references it's holding on @cgrp. When all references are dropped,
|
||||||
|
cgroup removal will proceed to the next step - css_free(). After this
|
||||||
|
callback, @cgrp should be considered dead to the subsystem.
|
||||||
|
|
||||||
|
``void css_free(struct cgroup *cgrp)``
|
||||||
|
(cgroup_mutex held by caller)
|
||||||
|
|
||||||
|
The cgroup system is about to free @cgrp; the subsystem should free
|
||||||
|
its subsystem state object. By the time this method is called, @cgrp
|
||||||
|
is completely unused; @cgrp->parent is still valid. (Note - can also
|
||||||
|
be called for a newly-created cgroup if an error occurs after this
|
||||||
|
subsystem's create() method has been called for the new cgroup).
|
||||||
|
|
||||||
|
``int can_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)``
|
||||||
|
(cgroup_mutex held by caller)
|
||||||
|
|
||||||
|
Called prior to moving one or more tasks into a cgroup; if the
|
||||||
|
subsystem returns an error, this will abort the attach operation.
|
||||||
|
@tset contains the tasks to be attached and is guaranteed to have at
|
||||||
|
least one task in it.
|
||||||
|
|
||||||
|
If there are multiple tasks in the taskset, then:
|
||||||
|
- it's guaranteed that all are from the same thread group
|
||||||
|
- @tset contains all tasks from the thread group whether or not
|
||||||
|
they're switching cgroups
|
||||||
|
- the first task is the leader
|
||||||
|
|
||||||
|
Each @tset entry also contains the task's old cgroup and tasks which
|
||||||
|
aren't switching cgroup can be skipped easily using the
|
||||||
|
cgroup_taskset_for_each() iterator. Note that this isn't called on a
|
||||||
|
fork. If this method returns 0 (success) then this should remain valid
|
||||||
|
while the caller holds cgroup_mutex and it is ensured that either
|
||||||
|
attach() or cancel_attach() will be called in future.
|
||||||
|
|
||||||
|
``void css_reset(struct cgroup_subsys_state *css)``
|
||||||
|
(cgroup_mutex held by caller)
|
||||||
|
|
||||||
|
An optional operation which should restore @css's configuration to the
|
||||||
|
initial state. This is currently only used on the unified hierarchy
|
||||||
|
when a subsystem is disabled on a cgroup through
|
||||||
|
"cgroup.subtree_control" but should remain enabled because other
|
||||||
|
subsystems depend on it. cgroup core makes such a css invisible by
|
||||||
|
removing the associated interface files and invokes this callback so
|
||||||
|
that the hidden subsystem can return to the initial neutral state.
|
||||||
|
This prevents unexpected resource control from a hidden css and
|
||||||
|
ensures that the configuration is in the initial state when it is made
|
||||||
|
visible again later.
|
||||||
|
|
||||||
|
``void cancel_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)``
|
||||||
|
(cgroup_mutex held by caller)
|
||||||
|
|
||||||
|
Called when a task attach operation has failed after can_attach() has succeeded.
|
||||||
|
A subsystem whose can_attach() has some side-effects should provide this
|
||||||
|
function, so that the subsystem can implement a rollback. If not, not necessary.
|
||||||
|
This will be called only about subsystems whose can_attach() operation have
|
||||||
|
succeeded. The parameters are identical to can_attach().
|
||||||
|
|
||||||
|
``void attach(struct cgroup *cgrp, struct cgroup_taskset *tset)``
|
||||||
|
(cgroup_mutex held by caller)

Called after the task has been attached to the cgroup, to allow any
post-attachment activity that requires memory allocations or blocking.
The parameters are identical to can_attach().

``void fork(struct task_struct *task)``

Called when a task is forked into a cgroup.

``void exit(struct task_struct *task)``

Called during task exit.

``void free(struct task_struct *task)``

Called when the task_struct is freed.

``void bind(struct cgroup *root)``

(cgroup_mutex held by caller)

Called when a cgroup subsystem is rebound to a different hierarchy
and root cgroup. Currently this will only involve movement between
the default hierarchy (which never has sub-cgroups) and a hierarchy
that is being created/destroyed (and hence has no sub-cgroups).

4. Extended attribute usage
===========================

The cgroup filesystem supports certain types of extended attributes in its
directories and files. The currently supported types are:

	- Trusted (XATTR_TRUSTED)
	- Security (XATTR_SECURITY)

Both require the CAP_SYS_ADMIN capability to set.

As in tmpfs, the extended attributes in the cgroup filesystem are stored
using kernel memory, and it is advised to keep their usage to a minimum.
This is why user-defined extended attributes are not supported: any user
could set them, and there is no limit on the value size.

The currently known users of this feature are SELinux, to limit cgroup
usage in containers, and systemd, for assorted metadata such as the main
PID in a cgroup (systemd creates a cgroup per service).

5. Questions
============

::

  Q: what's up with this '/bin/echo' ?
  A: bash's builtin 'echo' command does not check calls to write() against
     errors. If you use it in the cgroup file system, you won't be
     able to tell whether a command succeeded or failed.

  Q: When I attach processes, only the first of the line gets really attached !
  A: We can only return one error code per call to write(). So you should also
     put only ONE PID.
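The last answer can be followed mechanically. As a sketch (the helper name and the cgroup path are illustrative, not part of the kernel interface), issue one write(2) per PID, using /bin/echo so each attachment reports its own error:

```shell
# attach_tasks DIR PID...: attach each PID to the cgroup at DIR,
# one write() per PID so every failure is reported individually.
attach_tasks() {
    dir=$1
    shift
    for pid in "$@"; do
        /bin/echo "$pid" > "$dir/tasks" || echo "failed to attach $pid" >&2
    done
}
```

Typical use would be something like ``attach_tasks /sys/fs/cgroup/cpuset/myset 1234 1235`` (a hypothetical mount point).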
866
Documentation/admin-guide/cgroup-v1/cpusets.rst
Normal file
@@ -0,0 +1,866 @@
=======
CPUSETS
=======

Copyright (C) 2004 BULL SA.

Written by Simon.Derr@bull.net

- Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
- Modified by Paul Jackson <pj@sgi.com>
- Modified by Christoph Lameter <cl@linux.com>
- Modified by Paul Menage <menage@google.com>
- Modified by Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

.. CONTENTS:

   1. Cpusets
     1.1 What are cpusets ?
     1.2 Why are cpusets needed ?
     1.3 How are cpusets implemented ?
     1.4 What are exclusive cpusets ?
     1.5 What is memory_pressure ?
     1.6 What is memory spread ?
     1.7 What is sched_load_balance ?
     1.8 What is sched_relax_domain_level ?
     1.9 How do I use cpusets ?
   2. Usage Examples and Syntax
     2.1 Basic Usage
     2.2 Adding/removing cpus
     2.3 Setting flags
     2.4 Attaching processes
   3. Questions
   4. Contact
1. Cpusets
==========

1.1 What are cpusets ?
----------------------

Cpusets provide a mechanism for assigning a set of CPUs and Memory
Nodes to a set of tasks.  In this document "Memory Node" refers to
an on-line node that contains memory.

Cpusets constrain the CPU and Memory placement of tasks to only
the resources within a task's current cpuset.  They form a nested
hierarchy visible in a virtual file system.  These are the essential
hooks, beyond what is already present, required to manage dynamic
job placement on large systems.

Cpusets use the generic cgroup subsystem described in
Documentation/admin-guide/cgroup-v1/cgroups.rst.

Requests by a task, using the sched_setaffinity(2) system call to
include CPUs in its CPU affinity mask, and using the mbind(2) and
set_mempolicy(2) system calls to include Memory Nodes in its memory
policy, are both filtered through that task's cpuset, filtering out any
CPUs or Memory Nodes not in that cpuset.  The scheduler will not
schedule a task on a CPU that is not allowed in its cpus_allowed
vector, and the kernel page allocator will not allocate a page on a
node that is not allowed in the requesting task's mems_allowed vector.

User level code may create and destroy cpusets by name in the cgroup
virtual file system, manage the attributes and permissions of these
cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
specify and query to which cpuset a task is assigned, and list the
task pids assigned to a cpuset.
1.2 Why are cpusets needed ?
----------------------------

The management of large computer systems, with many processors (CPUs),
complex memory cache hierarchies and multiple Memory Nodes having
non-uniform access times (NUMA) presents additional challenges for
the efficient scheduling and memory placement of processes.

Frequently, more modest sized systems can be operated with adequate
efficiency just by letting the operating system automatically share
the available CPU and Memory resources amongst the requesting tasks.

But larger systems, which benefit more from careful processor and
memory placement to reduce memory access times and contention,
and which typically represent a larger investment for the customer,
can benefit from explicitly placing jobs on properly sized subsets of
the system.

This can be especially valuable on:

    * Web Servers running multiple instances of the same web application,
    * Servers running different applications (for instance, a web server
      and a database), or
    * NUMA systems running large HPC applications with demanding
      performance characteristics.

These subsets, or "soft partitions", must be able to be dynamically
adjusted, as the job mix changes, without impacting other concurrently
executing jobs.  The location of the running jobs' pages may also be
moved when the memory locations are changed.

The kernel cpuset patch provides the minimum essential kernel
mechanisms required to efficiently implement such subsets.  It
leverages existing CPU and Memory Placement facilities in the Linux
kernel to avoid any additional impact on the critical scheduler or
memory allocator code.
1.3 How are cpusets implemented ?
---------------------------------

Cpusets provide a Linux kernel mechanism to constrain which CPUs and
Memory Nodes are used by a process or set of processes.

The Linux kernel already has a pair of mechanisms to specify on which
CPUs a task may be scheduled (sched_setaffinity) and on which Memory
Nodes it may obtain memory (mbind, set_mempolicy).

Cpusets extend these two mechanisms as follows:

 - Cpusets are sets of allowed CPUs and Memory Nodes, known to the
   kernel.
 - Each task in the system is attached to a cpuset, via a pointer
   in the task structure to a reference counted cgroup structure.
 - Calls to sched_setaffinity are filtered to just those CPUs
   allowed in that task's cpuset.
 - Calls to mbind and set_mempolicy are filtered to just
   those Memory Nodes allowed in that task's cpuset.
 - The root cpuset contains all the system's CPUs and Memory
   Nodes.
 - For any cpuset, one can define child cpusets containing a subset
   of the parent's CPU and Memory Node resources.
 - The hierarchy of cpusets can be mounted at /dev/cpuset, for
   browsing and manipulation from user space.
 - A cpuset may be marked exclusive, which ensures that no other
   cpuset (except direct ancestors and descendants) may contain
   any overlapping CPUs or Memory Nodes.
 - You can list all the tasks (by pid) attached to any cpuset.

The implementation of cpusets requires a few, simple hooks
into the rest of the kernel, none in performance critical paths:

 - in init/main.c, to initialize the root cpuset at system boot.
 - in fork and exit, to attach and detach a task from its cpuset.
 - in sched_setaffinity, to mask the requested CPUs by what's
   allowed in that task's cpuset.
 - in sched.c migrate_live_tasks(), to keep migrating tasks within
   the CPUs allowed by their cpuset, if possible.
 - in the mbind and set_mempolicy system calls, to mask the requested
   Memory Nodes by what's allowed in that task's cpuset.
 - in page_alloc.c, to restrict memory to allowed nodes.
 - in vmscan.c, to restrict page recovery to the current cpuset.

You should mount the "cgroup" filesystem type in order to enable
browsing and modifying the cpusets presently known to the kernel.  No
new system calls are added for cpusets - all support for querying and
modifying cpusets is via this cpuset file system.
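A minimal administrative sketch of that file-system interface follows (the mount point /dev/cpuset is conventional rather than mandated, the cpuset name and resource lists are placeholders, and root privileges are required):

```
# Mount the cpuset hierarchy and create one child cpuset.
mkdir -p /dev/cpuset
mount -t cgroup -o cpuset cpuset /dev/cpuset

mkdir /dev/cpuset/my_cpuset
/bin/echo 2-3 > /dev/cpuset/my_cpuset/cpuset.cpus   # CPUs 2 and 3
/bin/echo 0   > /dev/cpuset/my_cpuset/cpuset.mems   # Memory Node 0
```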
The /proc/<pid>/status file for each task has four added lines,
displaying the task's cpus_allowed (on which CPUs it may be scheduled)
and mems_allowed (on which Memory Nodes it may obtain memory),
in the two formats seen in the following example::

  Cpus_allowed:   ffffffff,ffffffff,ffffffff,ffffffff
  Cpus_allowed_list:      0-127
  Mems_allowed:   ffffffff,ffffffff
  Mems_allowed_list:      0-63
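The hex masks above can be decoded without any kernel help.  As a sketch (the helper name is made up for illustration), the following counts the CPUs set in a Cpus_allowed-style mask of comma-separated hex words:

```shell
# count_cpus_in_mask MASK: print how many bits are set in a
# Cpus_allowed/Mems_allowed hex mask such as "ffffffff,000000ff".
count_cpus_in_mask() {
    bits=0
    for d in $(printf '%s' "$1" | tr -d ',' | tr 'A-F' 'a-f' | fold -w1); do
        case $d in                      # popcount of one hex digit
            0) n=0 ;;
            1|2|4|8) n=1 ;;
            3|5|6|9|a|c) n=2 ;;
            7|b|d|e) n=3 ;;
            f) n=4 ;;
        esac
        bits=$((bits + n))
    done
    echo "$bits"
}
```

For the example above, ``count_cpus_in_mask ffffffff,ffffffff,ffffffff,ffffffff`` agrees with the ``Cpus_allowed_list: 0-127`` line.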
Each cpuset is represented by a directory in the cgroup file system
containing (on top of the standard cgroup files) the following
files describing that cpuset:

 - cpuset.cpus: list of CPUs in that cpuset
 - cpuset.mems: list of Memory Nodes in that cpuset
 - cpuset.memory_migrate flag: if set, move pages to cpuset's nodes
 - cpuset.cpu_exclusive flag: is cpu placement exclusive?
 - cpuset.mem_exclusive flag: is memory placement exclusive?
 - cpuset.mem_hardwall flag: is memory allocation hardwalled
 - cpuset.memory_pressure: measure of how much paging pressure in cpuset
 - cpuset.memory_spread_page flag: if set, spread page cache evenly on allowed nodes
 - cpuset.memory_spread_slab flag: if set, spread slab cache evenly on allowed nodes
 - cpuset.sched_load_balance flag: if set, load balance within CPUs on that cpuset
 - cpuset.sched_relax_domain_level: the searching range when migrating tasks

In addition, only the root cpuset has the following file:

 - cpuset.memory_pressure_enabled flag: compute memory_pressure?

New cpusets are created using the mkdir system call or shell
command.  The properties of a cpuset, such as its flags, allowed
CPUs and Memory Nodes, and attached tasks, are modified by writing
to the appropriate file in that cpuset's directory, as listed above.

The named hierarchical structure of nested cpusets allows partitioning
a large system into nested, dynamically changeable, "soft-partitions".

The attachment of each task, automatically inherited at fork by any
children of that task, to a cpuset allows organizing the work load
on a system into related sets of tasks such that each set is constrained
to using the CPUs and Memory Nodes of a particular cpuset.  A task
may be re-attached to any other cpuset, if allowed by the permissions
on the necessary cpuset file system directories.

Such management of a system "in the large" integrates smoothly with
the detailed placement done on individual tasks and memory regions
using the sched_setaffinity, mbind and set_mempolicy system calls.

The following rules apply to each cpuset:

 - Its CPUs and Memory Nodes must be a subset of its parent's.
 - It can't be marked exclusive unless its parent is.
 - If its cpu or memory is exclusive, they may not overlap any sibling.

These rules, and the natural hierarchy of cpusets, enable efficient
enforcement of the exclusive guarantee, without having to scan all
cpusets every time any of them change to ensure nothing overlaps an
exclusive cpuset.  Also, the use of a Linux virtual file system (vfs)
to represent the cpuset hierarchy provides for a familiar permission
and name space for cpusets, with a minimum of additional kernel code.

The cpus and mems files in the root (top_cpuset) cpuset are
read-only.  The cpus file automatically tracks the value of
cpu_online_mask using a CPU hotplug notifier, and the mems file
automatically tracks the value of node_states[N_MEMORY]--i.e.,
nodes with memory--using the cpuset_track_online_nodes() hook.
1.4 What are exclusive cpusets ?
--------------------------------

If a cpuset is cpu or mem exclusive, no other cpuset, other than
a direct ancestor or descendant, may share any of the same CPUs or
Memory Nodes.

A cpuset that is cpuset.mem_exclusive *or* cpuset.mem_hardwall is "hardwalled",
i.e. it restricts kernel allocations for page, buffer and other data
commonly shared by the kernel across multiple users.  All cpusets,
whether hardwalled or not, restrict allocations of memory for user
space.  This enables configuring a system so that several independent
jobs can share common kernel data, such as file system pages, while
isolating each job's user allocation in its own cpuset.  To do this,
construct a large mem_exclusive cpuset to hold all the jobs, and
construct child, non-mem_exclusive cpusets for each individual job.
Only a small amount of typical kernel memory, such as requests from
interrupt handlers, is allowed to be taken outside even a
mem_exclusive cpuset.
1.5 What is memory_pressure ?
-----------------------------

The memory_pressure of a cpuset provides a simple per-cpuset metric
of the rate that the tasks in a cpuset are attempting to free up in
use memory on the nodes of the cpuset to satisfy additional memory
requests.

This enables batch managers monitoring jobs running in dedicated
cpusets to efficiently detect what level of memory pressure that job
is causing.

This is useful both on tightly managed systems running a wide mix of
submitted jobs, which may choose to terminate or re-prioritize jobs that
are trying to use more memory than allowed on the nodes assigned to them,
and with tightly coupled, long running, massively parallel scientific
computing jobs that will dramatically fail to meet required performance
goals if they start to use more memory than allowed to them.

This mechanism provides a very economical way for the batch manager
to monitor a cpuset for signs of memory pressure.  It's up to the
batch manager or other user code to decide what to do about it and
take action.

==>
    Unless this feature is enabled by writing "1" to the special file
    /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
    code of __alloc_pages() for this metric reduces to simply noticing
    that the cpuset_memory_pressure_enabled flag is zero.  So only
    systems that enable this feature will compute the metric.

Why a per-cpuset, running average:

    Because this meter is per-cpuset, rather than per-task or mm,
    the system load imposed by a batch scheduler monitoring this
    metric is sharply reduced on large systems, because a scan of
    the tasklist can be avoided on each set of queries.

    Because this meter is a running average, instead of an accumulating
    counter, a batch scheduler can detect memory pressure with a
    single read, instead of having to read and accumulate results
    for a period of time.

    Because this meter is per-cpuset rather than per-task or mm,
    the batch scheduler can obtain the key information, memory
    pressure in a cpuset, with a single read, rather than having to
    query and accumulate results over all the (dynamically changing)
    set of tasks in the cpuset.

A per-cpuset simple digital filter (requires a spinlock and 3 words
of data per-cpuset) is kept, and updated by any task attached to that
cpuset, if it enters the synchronous (direct) page reclaim code.

A per-cpuset file provides an integer number representing the recent
(half-life of 10 seconds) rate of direct page reclaims caused by
the tasks in the cpuset, in units of reclaims attempted per second,
times 1000.
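The scaling above can be reversed when reading the file.  As a sketch (the helper name is illustrative), a monitoring script can reduce a raw cpuset.memory_pressure reading to whole reclaims per second:

```shell
# pressure_per_sec VALUE: cpuset.memory_pressure reports attempted
# direct reclaims per second, times 1000; print the integer part
# of the rate in reclaims per second.
pressure_per_sec() {
    echo $(( $1 / 1000 ))
}
```

For example, a reading of 2500 corresponds to roughly 2.5 attempted reclaims per second.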
1.6 What is memory spread ?
---------------------------

There are two boolean flag files per cpuset that control where the
kernel allocates pages for the file system buffers and related in
kernel data structures.  They are called 'cpuset.memory_spread_page' and
'cpuset.memory_spread_slab'.

If the per-cpuset boolean flag file 'cpuset.memory_spread_page' is set, then
the kernel will spread the file system buffers (page cache) evenly
over all the nodes that the faulting task is allowed to use, instead
of preferring to put those pages on the node where the task is running.

If the per-cpuset boolean flag file 'cpuset.memory_spread_slab' is set,
then the kernel will spread some file system related slab caches,
such as for inodes and dentries, evenly over all the nodes that the
faulting task is allowed to use, instead of preferring to put those
pages on the node where the task is running.

The setting of these flags does not affect anonymous data segment or
stack segment pages of a task.

By default, both kinds of memory spreading are off, and memory
pages are allocated on the node local to where the task is running,
except perhaps as modified by the task's NUMA mempolicy or cpuset
configuration, so long as sufficient free memory pages are available.

When new cpusets are created, they inherit the memory spread settings
of their parent.

Setting memory spreading causes allocations for the affected page
or slab caches to ignore the task's NUMA mempolicy and be spread
instead.  Tasks using mbind() or set_mempolicy() calls to set NUMA
mempolicies will not notice any change in these calls as a result of
their containing task's memory spread settings.  If memory spreading
is turned off, then the currently specified NUMA mempolicy once again
applies to memory page allocations.

Both 'cpuset.memory_spread_page' and 'cpuset.memory_spread_slab' are boolean flag
files.  By default they contain "0", meaning that the feature is off
for that cpuset.  If a "1" is written to that file, then that turns
the named feature on.
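Turning both flags on or off together is a common administrative step.  A small sketch (the helper name is made up; the directory argument stands in for a real cpuset directory such as /dev/cpuset/my_cpuset):

```shell
# set_spread DIR 0|1: write the given value into both memory
# spread flag files of the cpuset directory DIR.
set_spread() {
    /bin/echo "$2" > "$1/cpuset.memory_spread_page"
    /bin/echo "$2" > "$1/cpuset.memory_spread_slab"
}
```

For example, ``set_spread /dev/cpuset/my_cpuset 1`` would enable both kinds of spreading for that cpuset.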
The implementation is simple.

Setting the flag 'cpuset.memory_spread_page' turns on a per-process flag
PFA_SPREAD_PAGE for each task that is in that cpuset or subsequently
joins that cpuset.  The page allocation calls for the page cache
are modified to perform an inline check for this PFA_SPREAD_PAGE task
flag, and if set, a call to a new routine cpuset_mem_spread_node()
returns the node to prefer for the allocation.

Similarly, setting 'cpuset.memory_spread_slab' turns on the flag
PFA_SPREAD_SLAB, and appropriately marked slab caches will allocate
pages from the node returned by cpuset_mem_spread_node().

The cpuset_mem_spread_node() routine is also simple.  It uses the
value of a per-task rotor cpuset_mem_spread_rotor to select the next
node in the current task's mems_allowed to prefer for the allocation.

This memory placement policy is also known (in other contexts) as
round-robin or interleave.

This policy can provide substantial improvements for jobs that need
to place thread local data on the corresponding node, but that need
to access large file system data sets that need to be spread across
the several nodes in the job's cpuset in order to fit.  Without this
policy, especially for jobs that might have one thread reading in the
data set, the memory allocation across the nodes in the job's cpuset
can become very uneven.
1.7 What is sched_load_balance ?
--------------------------------

The kernel scheduler (kernel/sched/core.c) automatically load balances
tasks.  If one CPU is underutilized, kernel code running on that
CPU will look for tasks on other more overloaded CPUs and move those
tasks to itself, within the constraints of such placement mechanisms
as cpusets and sched_setaffinity.

The algorithmic cost of load balancing and its impact on key shared
kernel data structures such as the task list increases more than
linearly with the number of CPUs being balanced.  So the scheduler
has support to partition the system's CPUs into a number of sched
domains such that it only load balances within each sched domain.
Each sched domain covers some subset of the CPUs in the system;
no two sched domains overlap; some CPUs might not be in any sched
domain and hence won't be load balanced.

Put simply, it costs less to balance between two smaller sched domains
than one big one, but doing so means that overloads in one of the
two domains won't be load balanced to the other one.

By default, there is one sched domain covering all CPUs, including those
marked isolated using the kernel boot time "isolcpus=" argument.  However,
the isolated CPUs will not participate in load balancing, and will not
have tasks running on them unless explicitly assigned.

This default load balancing across all CPUs is not well suited for
the following two situations:

 1) On large systems, load balancing across many CPUs is expensive.
    If the system is managed using cpusets to place independent jobs
    on separate sets of CPUs, full load balancing is unnecessary.
 2) Systems supporting realtime on some CPUs need to minimize
    system overhead on those CPUs, including avoiding task load
    balancing if that is not needed.

When the per-cpuset flag "cpuset.sched_load_balance" is enabled (the default
setting), it requests that all the CPUs in that cpuset's allowed 'cpuset.cpus'
be contained in a single sched domain, ensuring that load balancing
can move a task (not otherwise pinned, as by sched_setaffinity)
from any CPU in that cpuset to any other.

When the per-cpuset flag "cpuset.sched_load_balance" is disabled, then the
scheduler will avoid load balancing across the CPUs in that cpuset,
--except-- in so far as is necessary because some overlapping cpuset
has "sched_load_balance" enabled.

So, for example, if the top cpuset has the flag "cpuset.sched_load_balance"
enabled, then the scheduler will have one sched domain covering all
CPUs, and the setting of the "cpuset.sched_load_balance" flag in any other
cpusets won't matter, as we're already fully load balancing.

Therefore in the above two situations, the top cpuset flag
"cpuset.sched_load_balance" should be disabled, and only some of the smaller,
child cpusets should have this flag enabled.
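That arrangement can be sketched as follows (the mount point and the child cpuset names "batch" and "rt" are assumptions for illustration; a real system may lay the hierarchy out differently):

```
# Top cpuset: no load balancing across the whole machine.
/bin/echo 0 > /dev/cpuset/cpuset.sched_load_balance

# Each child job cpuset balances only within its own CPUs.
/bin/echo 1 > /dev/cpuset/batch/cpuset.sched_load_balance
/bin/echo 1 > /dev/cpuset/rt/cpuset.sched_load_balance
```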
When doing this, you don't usually want to leave any unpinned tasks in
the top cpuset that might use non-trivial amounts of CPU, as such tasks
may be artificially constrained to some subset of CPUs, depending on
the particulars of this flag setting in descendant cpusets.  Even if
such a task could use spare CPU cycles in some other CPUs, the kernel
scheduler might not consider the possibility of load balancing that
task to that underused CPU.

Of course, tasks pinned to a particular CPU can be left in a cpuset
that disables "cpuset.sched_load_balance" as those tasks aren't going anywhere
else anyway.

There is an impedance mismatch here, between cpusets and sched domains.
Cpusets are hierarchical and nest.  Sched domains are flat; they don't
overlap and each CPU is in at most one sched domain.

It is necessary for sched domains to be flat because load balancing
across partially overlapping sets of CPUs would risk unstable dynamics
that would be beyond our understanding.  So if each of two partially
overlapping cpusets enables the flag 'cpuset.sched_load_balance', then we
form a single sched domain that is a superset of both.  We won't move
a task to a CPU outside its cpuset, but the scheduler load balancing
code might waste some compute cycles considering that possibility.

This mismatch is why there is not a simple one-to-one relation
between which cpusets have the flag "cpuset.sched_load_balance" enabled,
and the sched domain configuration.  If a cpuset enables the flag, it
will get balancing across all its CPUs, but if it disables the flag,
it will only be assured of no load balancing if no other overlapping
cpuset enables the flag.

If two cpusets have partially overlapping 'cpuset.cpus' allowed, and only
one of them has this flag enabled, then the other may find its
tasks only partially load balanced, just on the overlapping CPUs.
This is just the general case of the top_cpuset example given a few
paragraphs above.  In the general case, as in the top cpuset case,
don't leave tasks that might use non-trivial amounts of CPU in
such partially load balanced cpusets, as they may be artificially
constrained to some subset of the CPUs allowed to them, for lack of
load balancing to the other CPUs.

CPUs in "cpuset.isolcpus" were excluded from load balancing by the
isolcpus= kernel boot option, and will never be load balanced regardless
of the value of "cpuset.sched_load_balance" in any cpuset.
1.7.1 sched_load_balance implementation details.
------------------------------------------------

The per-cpuset flag 'cpuset.sched_load_balance' defaults to enabled (contrary
to most cpuset flags.)  When enabled for a cpuset, the kernel will
ensure that it can load balance across all the CPUs in that cpuset
(makes sure that all the CPUs in the cpus_allowed of that cpuset are
in the same sched domain.)

If two overlapping cpusets both have 'cpuset.sched_load_balance' enabled,
then they will be (must be) both in the same sched domain.

If, as is the default, the top cpuset has 'cpuset.sched_load_balance' enabled,
then by the above that means there is a single sched domain covering
the whole system, regardless of any other cpuset settings.

The kernel commits to user space that it will avoid load balancing
where it can.  It will pick as fine a granularity partition of sched
domains as it can while still providing load balancing for any set
of CPUs allowed to a cpuset having 'cpuset.sched_load_balance' enabled.

The internal kernel cpuset to scheduler interface passes from the
cpuset code to the scheduler code a partition of the load balanced
CPUs in the system.  This partition is a set of subsets (represented
as an array of struct cpumask) of CPUs, pairwise disjoint, that cover
all the CPUs that must be load balanced.

The cpuset code builds a new such partition and passes it to the
scheduler sched domain setup code, to have the sched domains rebuilt
as necessary, whenever:

 - the 'cpuset.sched_load_balance' flag of a cpuset with non-empty CPUs changes,
 - or CPUs come or go from a cpuset with this flag enabled,
 - or the 'cpuset.sched_relax_domain_level' value of a cpuset with non-empty
   CPUs and with this flag enabled changes,
 - or a cpuset with non-empty CPUs and with this flag enabled is removed,
 - or a cpu is offlined/onlined.

This partition exactly defines what sched domains the scheduler should
setup - one sched domain for each element (struct cpumask) in the
partition.

The scheduler remembers the currently active sched domain partitions.
When the scheduler routine partition_sched_domains() is invoked from
the cpuset code to update these sched domains, it compares the new
partition requested with the current, and updates its sched domains,
removing the old and adding the new, for each change.
1.8 What is sched_relax_domain_level ?
--------------------------------------

Within a sched domain, the scheduler migrates tasks in two ways: periodic
load balancing on the tick, and at the time of certain schedule events.

When a task is woken up, the scheduler tries to move it to an idle CPU.
For example, if a task A running on CPU X activates another task B
on the same CPU X, and if CPU Y is X's sibling and is idle,
then the scheduler migrates task B to CPU Y so that task B can start on
CPU Y without waiting for task A on CPU X.

And if a CPU runs out of tasks in its runqueue, it tries to pull
extra tasks from other busy CPUs to help them before it goes idle.

Of course it takes some search cost to find movable tasks and/or
idle CPUs, so the scheduler might not search all CPUs in the domain
every time.  In fact, on some architectures, the search range on
these events is limited to the same socket or node as the CPU,
while the load balance on tick searches all CPUs.

For example, assume CPU Z is relatively far from CPU X.  Even if CPU Z
is idle while CPU X and its siblings are busy, the scheduler can't migrate
the woken task B from X to Z since Z is out of its search range.
As a result, task B on CPU X needs to wait for task A, or for the load
balance on the next tick.  For some applications in special situations,
waiting one tick may be too long.

The 'cpuset.sched_relax_domain_level' file allows you to request changing
this search range as you like.  This file takes an int value which,
ideally, indicates the size of the search range in levels, as follows;
otherwise the initial value -1 indicates that the cpuset has no request.

====== ===========================================================
  -1   no request. use system default or follow request of others.
   0   no search.
   1   search siblings (hyperthreads in a core).
   2   search cores in a package.
   3   search cpus in a node [= system wide on non-NUMA system]
   4   search nodes in a chunk of node [on NUMA system]
   5   search system wide [on NUMA system]
====== ===========================================================

The system default is architecture dependent.  The system default
can be changed using the relax_domain_level= boot parameter.

This file is per-cpuset and affects the sched domain to which the cpuset
belongs.  Therefore if the flag 'cpuset.sched_load_balance' of a cpuset
is disabled, then 'cpuset.sched_relax_domain_level' has no effect since
there is no sched domain belonging to the cpuset.

If multiple cpusets are overlapping and hence form a single sched
domain, the largest value among them is used.  Be careful: if one
cpuset requests 0 and the others request -1, then 0 is used.

Note that modifying this file will have both good and bad effects,
and whether it is acceptable or not depends on your situation.
Don't modify this file if you are not sure.
|
||||||
|
If your situation is:
|
||||||
|
|
||||||
|
- The migration costs between each cpu can be assumed considerably
|
||||||
|
small(for you) due to your special application's behavior or
|
||||||
|
special hardware support for CPU cache etc.
|
||||||
|
- The searching cost doesn't have impact(for you) or you can make
|
||||||
|
the searching cost enough small by managing cpuset to compact etc.
|
||||||
|
- The latency is required even it sacrifices cache hit rate etc.
|
||||||
|
then increasing 'sched_relax_domain_level' would benefit you.
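
As a minimal sketch (assuming the v1 cpuset hierarchy is mounted at
/sys/fs/cgroup/cpuset and a cpuset named "big_job" already exists;
both names are only illustrative), the request is a plain write, and
the value 5 asks for a system-wide search range on a NUMA system::

  # cd /sys/fs/cgroup/cpuset/big_job
  # cat cpuset.sched_relax_domain_level
  -1
  # /bin/echo 5 > cpuset.sched_relax_domain_level

Remember that the request only takes effect while the cpuset is
covered by a sched domain, i.e. while 'cpuset.sched_load_balance'
applies to it.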

1.9 How do I use cpusets ?
--------------------------

In order to minimize the impact of cpusets on critical kernel code,
such as the scheduler, and due to the fact that the kernel does not
support one task updating the memory placement of another task
directly, the impact on a task of changing its cpuset CPU or Memory
Node placement, or of changing to which cpuset a task is attached, is
subtle.

If a cpuset has its Memory Nodes modified, then for each task attached
to that cpuset, the next time that the kernel attempts to allocate a
page of memory for that task, the kernel will notice the change in the
task's cpuset, and update its per-task memory placement to remain
within the new cpuset's memory placement.  If the task was using
mempolicy MPOL_BIND, and the nodes to which it was bound overlap with
its new cpuset, then the task will continue to use whatever subset of
MPOL_BIND nodes are still allowed in the new cpuset.  If the task was
using MPOL_BIND and now none of its MPOL_BIND nodes are allowed in the
new cpuset, then the task will be essentially treated as if it was
MPOL_BIND bound to the new cpuset (even though its NUMA placement, as
queried by get_mempolicy(), doesn't change).  If a task is moved from
one cpuset to another, then the kernel will adjust the task's memory
placement, as above, the next time that the kernel attempts to
allocate a page of memory for that task.

If a cpuset has its 'cpuset.cpus' modified, then each task in that
cpuset will have its allowed CPU placement changed immediately.
Similarly, if a task's pid is written to another cpuset's 'tasks'
file, then its allowed CPU placement is changed immediately.  If such
a task had been bound to some subset of its cpuset using the
sched_setaffinity() call, the task will be allowed to run on any CPU
allowed in its new cpuset, negating the effect of the prior
sched_setaffinity() call.

In summary, the memory placement of a task whose cpuset is changed is
updated by the kernel, on the next allocation of a page for that task,
and the processor placement is updated immediately.

Normally, once a page is allocated (given a physical page of main
memory) then that page stays on whatever node it was allocated, so
long as it remains allocated, even if the cpuset's memory placement
policy 'cpuset.mems' subsequently changes.  If the cpuset flag file
'cpuset.memory_migrate' is set true, then when tasks are attached to
that cpuset, any pages that task had allocated to it on nodes in its
previous cpuset are migrated to the task's new cpuset.  The relative
placement of the page within the cpuset is preserved during these
migration operations if possible.  For example, if the page was on the
second valid node of the prior cpuset, then the page will be placed on
the second valid node of the new cpuset.

Also if 'cpuset.memory_migrate' is set true, then if that cpuset's
'cpuset.mems' file is modified, pages allocated to tasks in that
cpuset, that were on nodes in the previous setting of 'cpuset.mems',
will be moved to nodes in the new setting of 'mems.'  Pages that were
not in the task's prior cpuset, or in the cpuset's prior
'cpuset.mems' setting, will not be moved.

There is an exception to the above.  If hotplug functionality is used
to remove all the CPUs that are currently assigned to a cpuset, then
all the tasks in that cpuset will be moved to the nearest ancestor
with non-empty cpus.  But the moving of some (or all) tasks might fail
if the cpuset is bound to another cgroup subsystem which has some
restrictions on task attaching.  In this failing case, those tasks
will stay in the original cpuset, and the kernel will automatically
update their cpus_allowed to allow all online CPUs.  When memory
hotplug functionality for removing Memory Nodes is available, a
similar exception is expected to apply there as well.  In general,
the kernel prefers to violate cpuset placement over starving a task
that has had all its allowed CPUs or Memory Nodes taken offline.

There is a second exception to the above.  GFP_ATOMIC requests are
kernel internal allocations that must be satisfied immediately.  The
kernel may drop some requests, in rare cases even panic, if a
GFP_ATOMIC alloc fails.  If the request cannot be satisfied within the
current task's cpuset, then we relax the cpuset, and look for memory
anywhere we can find it.  It's better to violate the cpuset than
stress the kernel.

To start a new job that is to be contained within a cpuset, the steps are:

 1) mkdir /sys/fs/cgroup/cpuset
 2) mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
 3) Create the new cpuset by doing mkdir's and write's (or echo's) in
    the /sys/fs/cgroup/cpuset virtual file system.
 4) Start a task that will be the "founding father" of the new job.
 5) Attach that task to the new cpuset by writing its pid to the
    /sys/fs/cgroup/cpuset tasks file for that cpuset.
 6) fork, exec or clone the job tasks from this founding father task.

For example, the following sequence of commands will setup a cpuset
named "Charlie", containing just CPUs 2 and 3, and Memory Node 1, and
then start a subshell 'sh' in that cpuset::

  mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
  cd /sys/fs/cgroup/cpuset
  mkdir Charlie
  cd Charlie
  /bin/echo 2-3 > cpuset.cpus
  /bin/echo 1 > cpuset.mems
  /bin/echo $$ > tasks
  sh
  # The subshell 'sh' is now running in cpuset Charlie
  # The next line should display '/Charlie'
  cat /proc/self/cpuset

There are ways to query or modify cpusets:

- via the cpuset file system directly, using the various cd, mkdir,
  echo, cat, rmdir commands from the shell, or their equivalent from C.
- via the C library libcpuset.
- via the C library libcgroup.
  (http://sourceforge.net/projects/libcg/)
- via the python application cset.
  (http://code.google.com/p/cpuset/)

The sched_setaffinity calls can also be done at the shell prompt using
SGI's runon or Robert Love's taskset.  The mbind and set_mempolicy
calls can be done at the shell prompt using the numactl command
(part of Andi Kleen's numa package).
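
For instance (a hedged sketch; numactl must be installed and Memory
Node 1 must exist on the system), an mbind-style policy can be applied
from the shell with::

  numactl --membind=1 ./my_program

where ./my_program stands for any command whose allocations you want
restricted to node 1.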

2. Usage Examples and Syntax
============================

2.1 Basic Usage
---------------

Creating, modifying and using cpusets can be done through the cpuset
virtual filesystem.

To mount it, type::

  # mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset

Then under /sys/fs/cgroup/cpuset you can find a tree that corresponds
to the tree of the cpusets in the system.  For instance,
/sys/fs/cgroup/cpuset is the cpuset that holds the whole system.

If you want to create a new cpuset under /sys/fs/cgroup/cpuset::

  # cd /sys/fs/cgroup/cpuset
  # mkdir my_cpuset

Now you want to do something with this cpuset::

  # cd my_cpuset

In this directory you can find several files::

  # ls
  cgroup.clone_children  cpuset.memory_pressure
  cgroup.event_control   cpuset.memory_spread_page
  cgroup.procs           cpuset.memory_spread_slab
  cpuset.cpu_exclusive   cpuset.mems
  cpuset.cpus            cpuset.sched_load_balance
  cpuset.mem_exclusive   cpuset.sched_relax_domain_level
  cpuset.mem_hardwall    notify_on_release
  cpuset.memory_migrate  tasks
Reading them will give you information about the state of this cpuset:
the CPUs and Memory Nodes it can use, the processes that are using it,
its properties.  By writing to these files you can manipulate the
cpuset.

Set some flags::

  # /bin/echo 1 > cpuset.cpu_exclusive

Add some cpus::

  # /bin/echo 0-7 > cpuset.cpus

Add some mems::

  # /bin/echo 0-7 > cpuset.mems

Now attach your shell to this cpuset::

  # /bin/echo $$ > tasks

You can also create cpusets inside your cpuset by using mkdir in this
directory::

  # mkdir my_sub_cs

To remove a cpuset, just use rmdir::

  # rmdir my_sub_cs

This will fail if the cpuset is in use (has cpusets inside, or has
processes attached).

Note that for legacy reasons, the "cpuset" filesystem exists as a
wrapper around the cgroup filesystem.

The command::

  mount -t cpuset X /sys/fs/cgroup/cpuset

is equivalent to::

  mount -t cgroup -ocpuset,noprefix X /sys/fs/cgroup/cpuset
  echo "/sbin/cpuset_release_agent" > /sys/fs/cgroup/cpuset/release_agent

2.2 Adding/removing cpus
------------------------

This is the syntax to use when writing in the cpus or mems files
in cpuset directories::

  # /bin/echo 1-4 > cpuset.cpus      -> set cpus list to cpus 1,2,3,4
  # /bin/echo 1,2,3,4 > cpuset.cpus  -> set cpus list to cpus 1,2,3,4

To add a CPU to a cpuset, write the new list of CPUs including the
CPU to be added.  To add 6 to the above cpuset::

  # /bin/echo 1-4,6 > cpuset.cpus    -> set cpus list to cpus 1,2,3,4,6

Similarly to remove a CPU from a cpuset, write the new list of CPUs
without the CPU to be removed.

To remove all the CPUs::

  # /bin/echo "" > cpuset.cpus       -> clear cpus list

2.3 Setting flags
-----------------

The syntax is very simple::

  # /bin/echo 1 > cpuset.cpu_exclusive  -> set flag 'cpuset.cpu_exclusive'
  # /bin/echo 0 > cpuset.cpu_exclusive  -> unset flag 'cpuset.cpu_exclusive'
|
||||||
|
2.4 Attaching processes
|
||||||
|
-----------------------
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
# /bin/echo PID > tasks
|
||||||
|
|
||||||
|
Note that it is PID, not PIDs. You can only attach ONE task at a time.
|
||||||
|
If you have several tasks to attach, you have to do it one after another::
|
||||||
|
|
||||||
|
# /bin/echo PID1 > tasks
|
||||||
|
# /bin/echo PID2 > tasks
|
||||||
|
...
|
||||||
|
# /bin/echo PIDn > tasks
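
Such a sequence is easy to script; a minimal sketch (the helper name,
PID list and cpuset path are only illustrative, and each write can
still fail individually, e.g. if a task has already exited)::

  # attach_all "<space-separated pids>" <cpuset directory>
  attach_all()
  {
          for pid in $1
          do
                  /bin/echo $pid > $2/tasks
          done
  }

  attach_all "1234 1235 1236" /sys/fs/cgroup/cpuset/my_cpuset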

3. Questions
============

Q:
   what's up with this '/bin/echo' ?

A:
   bash's builtin 'echo' command does not check calls to write()
   against errors.  If you use it in the cpuset file system, you
   won't be able to tell whether a command succeeded or failed.

Q:
   When I attach processes, only the first of the line gets really
   attached !

A:
   We can only return one error code per call to write().  So you
   should also put only ONE pid.

4. Contact
==========

Web: http://www.bullopensource.org/cpuset
28
Documentation/admin-guide/cgroup-v1/index.rst
Normal file
@ -0,0 +1,28 @@
========================
Control Groups version 1
========================

.. toctree::
   :maxdepth: 1

   cgroups

   blkio-controller
   cpuacct
   cpusets
   devices
   freezer-subsystem
   hugetlb
   memcg_test
   memory
   net_cls
   net_prio
   pids
   rdma

.. only:: subproject and html

   Indices
   =======

   * :ref:`genindex`
355
Documentation/admin-guide/cgroup-v1/memcg_test.rst
Normal file
@ -0,0 +1,355 @@
=====================================================
Memory Resource Controller(Memcg) Implementation Memo
=====================================================

Last Updated: 2010/2

Base Kernel Version: based on 2.6.33-rc7-mm(candidate for 34).

Because the VM is getting complex (one of the reasons is memcg...),
memcg's behavior is complex too.  This is a document for memcg's
internal behavior.  Please note that implementation details can be
changed.

(*) Topics on the API should be in
    Documentation/admin-guide/cgroup-v1/memory.rst.

0. How to record usage ?
========================

2 objects are used.

page_cgroup ....an object per page.

	Allocated at boot or memory hotplug.  Freed at memory hot removal.

swap_cgroup ... an entry per swp_entry.

	Allocated at swapon().  Freed at swapoff().

The page_cgroup has a USED bit, and double counting against a
page_cgroup never occurs.  swap_cgroup is used only when a charged
page is swapped out.
1. Charge
=========

A page/swp_entry may be charged (usage += PAGE_SIZE) at

	mem_cgroup_try_charge()

2. Uncharge
===========

A page/swp_entry may be uncharged (usage -= PAGE_SIZE) by

	mem_cgroup_uncharge()
	  Called when a page's refcount goes down to 0.

	mem_cgroup_uncharge_swap()
	  Called when swp_entry's refcnt goes down to 0.  A charge
	  against swap disappears.

3. charge-commit-cancel
=======================

Memcg pages are charged in two steps:

	- mem_cgroup_try_charge()
	- mem_cgroup_commit_charge() or mem_cgroup_cancel_charge()

At try_charge(), there are no flags to say "this page is charged".
At this point, usage += PAGE_SIZE.

At commit(), the page is associated with the memcg.

At cancel(), simply usage -= PAGE_SIZE.

In the explanation below, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
4. Anonymous
============

An anonymous page is newly allocated at

	- page fault into a MAP_ANONYMOUS mapping.
	- Copy-On-Write.

4.1 Swap-in.
	At swap-in, the page is taken from swap-cache.  There are 2
	cases.

	(a) If the SwapCache is newly allocated and read, it has no
	    charges.
	(b) If the SwapCache has been mapped by processes, it has been
	    charged already.

4.2 Swap-out.
	At swap-out, the typical state transition is as below.

	(a) add to swap cache. (marked as SwapCache)
	    swp_entry's refcnt += 1.
	(b) fully unmapped.
	    swp_entry's refcnt += # of ptes.
	(c) write back to swap.
	(d) delete from swap cache. (remove from SwapCache)
	    swp_entry's refcnt -= 1.

Finally, at task exit,

	(e) zap_pte() is called and swp_entry's refcnt -= 1 -> 0.
5. Page Cache
=============

Page Cache is charged at

	- add_to_page_cache_locked().

The logic is very clear.  (About migration, see below.)

Note:
	__remove_from_page_cache() is called by remove_from_page_cache()
	and __remove_mapping().

6. Shmem(tmpfs) Page Cache
==========================

The best way to understand shmem's page state transition is to read
mm/shmem.c.

But a brief explanation of the behavior of memcg around shmem will be
helpful to understand the logic.

Shmem's page (just leaf page, not direct/indirect block) can be on

	- radix-tree of shmem's inode.
	- SwapCache.
	- Both on radix-tree and SwapCache.  This happens at swap-in
	  and swap-out.

It's charged when...

	- A new page is added to shmem's radix-tree.
	- A swp page is read. (move a charge from swap_cgroup to
	  page_cgroup)
7. Page Migration
=================

	mem_cgroup_migrate()

8. LRU
======

Each memcg has its own private LRU.  Now, its handling is under the
global VM's control (meaning that it's handled under the global
pgdat->lru_lock).  Almost all routines around memcg's LRU are called
by the global LRU's list management functions under pgdat->lru_lock.

A special function is mem_cgroup_isolate_pages().  This scans memcg's
private LRU and calls __isolate_lru_page() to extract a page from the
LRU.

(By __isolate_lru_page(), the page is removed from both the global
and the private LRU.)

9. Typical Tests.
=================

Tests for racy cases.

9.1 Small limit to memcg.
-------------------------

When testing racy cases, it is a good idea to set memcg's limit very
small, rather than in GB.  Many races were found in tests under xKB
or xxMB limits.

(Memory behavior under GB and memory behavior under MB show very
different situations.)
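
A rough sketch of such a set-up (the group name and the 4M value are
only illustrative; any xKB/xxMB-scale limit will do)::

	# mount -t cgroup none /cgroup -o memory
	# mkdir /cgroup/small
	# echo 4M > /cgroup/small/memory.limit_in_bytes
	# echo $$ > /cgroup/small/tasks

then run the racy workload from this shell.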

9.2 Shmem
---------

Historically, memcg's shmem handling was poor and we saw some amount
of trouble here.  This is because shmem is page cache but can be
SwapCache.  Testing with shmem/tmpfs is always a good test.

9.3 Migration
-------------

For NUMA, migration is another special case.  To make testing easy,
cpuset is useful.  The following is a sample script to do migration::
	mount -t cgroup -o cpuset none /opt/cpuset

	mkdir /opt/cpuset/01
	echo 1 > /opt/cpuset/01/cpuset.cpus
	echo 0 > /opt/cpuset/01/cpuset.mems
	echo 1 > /opt/cpuset/01/cpuset.memory_migrate
	mkdir /opt/cpuset/02
	echo 1 > /opt/cpuset/02/cpuset.cpus
	echo 1 > /opt/cpuset/02/cpuset.mems
	echo 1 > /opt/cpuset/02/cpuset.memory_migrate

In the above set-up, when you move a task from 01 to 02, page
migration from node 0 to node 1 will occur.  The following is a
script to migrate all tasks under a cpuset::

	--
	move_task()
	{
		for pid in $1
		do
			/bin/echo $pid >$2/tasks 2>/dev/null
			echo -n $pid
			echo -n " "
		done
		echo END
	}

	G1_TASK=`cat ${G1}/tasks`
	G2_TASK=`cat ${G2}/tasks`
	move_task "${G1_TASK}" ${G2} &
	--
9.4 Memory hotplug
------------------

Memory hotplug testing is one of the good tests.

To offline memory, do the following::

	# echo offline > /sys/devices/system/memory/memoryXXX/state

(XXX is the place of memory.)

This is an easy way to test page migration, too.

9.5 mkdir/rmdir
---------------

When using hierarchy, the mkdir/rmdir test should be done.
Use tests like the following::

	echo 1 >/opt/cgroup/01/memory/use_hierarchy
	mkdir /opt/cgroup/01/child_a
	mkdir /opt/cgroup/01/child_b

	set limit to 01.
	add limit to 01/child_b
	run jobs under child_a and child_b

Create/delete the following groups at random while jobs are running::

	/opt/cgroup/01/child_a/child_aa
	/opt/cgroup/01/child_b/child_bb
	/opt/cgroup/01/child_c

Running new jobs in a new group is also good.
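
The random create/delete step can be driven by a simple loop; a rough
sketch (the group names follow the list above, and the sleep interval
is arbitrary)::

	while true
	do
		mkdir /opt/cgroup/01/child_a/child_aa 2>/dev/null
		mkdir /opt/cgroup/01/child_c 2>/dev/null
		sleep 1
		rmdir /opt/cgroup/01/child_a/child_aa 2>/dev/null
		rmdir /opt/cgroup/01/child_c 2>/dev/null
	done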

9.6 Mount with other subsystems
-------------------------------

Mounting with other subsystems is a good test because there are races
and lock dependencies with other cgroup subsystems.

example::

	# mount -t cgroup none /cgroup -o cpuset,memory,cpu,devices

and do task move, mkdir, rmdir etc. under this.

9.7 swapoff
-----------

Besides the management of swap being one of the complicated parts of
memcg, the call path of swap-in at swapoff is not the same as the
usual swap-in path.  It is worth being tested explicitly.

For example, a test like the following is good:

(Shell-A)::

	# mount -t cgroup none /cgroup -o memory
	# mkdir /cgroup/test
	# echo 40M > /cgroup/test/memory.limit_in_bytes
	# echo 0 > /cgroup/test/tasks

Run a malloc(100M) program under this.  You'll see 60M of swaps.

(Shell-B)::

	# move all tasks in /cgroup/test to /cgroup
	# /sbin/swapoff -a
	# rmdir /cgroup/test
	# kill malloc task.

Of course, the tmpfs vs. swapoff test should be done, too.
9.8 OOM-Killer
--------------

Out-of-memory caused by memcg's limit will kill tasks under the
memcg.  When hierarchy is used, a task under the hierarchy will be
killed by the kernel.

In this case, panic_on_oom shouldn't be invoked and tasks in other
groups shouldn't be killed.

It's not difficult to cause OOM under memcg, as follows.

Case A) when you can swapoff::

	#swapoff -a
	#echo 50M > /memory.limit_in_bytes

run 51M of malloc

Case B) when you use mem+swap limitation::

	#echo 50M > memory.limit_in_bytes
	#echo 50M > memory.memsw.limit_in_bytes

run 51M of malloc
9.9 Move charges at task migration
----------------------------------

Charges associated with a task can be moved along with task migration.

(Shell-A)::

	#mkdir /cgroup/A
	#echo $$ >/cgroup/A/tasks

run some programs which use some amount of memory in /cgroup/A.

(Shell-B)::

	#mkdir /cgroup/B
	#echo 1 >/cgroup/B/memory.move_charge_at_immigrate
	#echo "pid of the program running in group A" >/cgroup/B/tasks

You can see that charges have been moved by reading
``*.usage_in_bytes`` or memory.stat of both A and B.

See 8.2 of Documentation/admin-guide/cgroup-v1/memory.rst for what
value should be written to move_charge_at_immigrate.
9.10 Memory thresholds
----------------------

The memory controller implements memory thresholds using the cgroups
notification API.  You can use tools/cgroup/cgroup_event_listener.c
to test it.

(Shell-A) Create cgroup and run event listener::

	# mkdir /cgroup/A
	# ./cgroup_event_listener /cgroup/A/memory.usage_in_bytes 5M

(Shell-B) Add task to cgroup and try to allocate and free memory::

	# echo $$ >/cgroup/A/tasks
	# a="$(dd if=/dev/zero bs=1M count=10)"
	# a=

You will see a message from cgroup_event_listener every time you
cross the thresholds.

Use /cgroup/A/memory.memsw.usage_in_bytes to test memsw thresholds.

It's a good idea to test the root cgroup as well.
@ -9,7 +9,7 @@ This is the authoritative documentation on the design, interface and
|
|||||||
conventions of cgroup v2. It describes all userland-visible aspects
|
conventions of cgroup v2. It describes all userland-visible aspects
|
||||||
of cgroup including core and specific controller behaviors. All
|
of cgroup including core and specific controller behaviors. All
|
||||||
future changes must be reflected in this document. Documentation for
|
future changes must be reflected in this document. Documentation for
|
||||||
v1 is available under Documentation/cgroup-v1/.
|
v1 is available under Documentation/admin-guide/cgroup-v1/.
|
||||||
|
|
||||||
.. CONTENTS
|
.. CONTENTS
|
||||||
|
|
||||||
@ -1014,7 +1014,7 @@ All time durations are in microseconds.
|
|||||||
A read-only nested-key file which exists on non-root cgroups.
|
A read-only nested-key file which exists on non-root cgroups.
|
||||||
|
|
||||||
Shows pressure stall information for CPU. See
|
Shows pressure stall information for CPU. See
|
||||||
Documentation/accounting/psi.txt for details.
|
Documentation/accounting/psi.rst for details.
|
||||||
|
|
||||||
|
|
||||||
Memory
|
Memory
|
||||||
@ -1355,7 +1355,7 @@ PAGE_SIZE multiple when read back.
|
|||||||
A read-only nested-key file which exists on non-root cgroups.
|
A read-only nested-key file which exists on non-root cgroups.
|
||||||
|
|
||||||
Shows pressure stall information for memory. See
|
Shows pressure stall information for memory. See
|
||||||
Documentation/accounting/psi.txt for details.
|
Documentation/accounting/psi.rst for details.
|
||||||
|
|
||||||
|
|
||||||
Usage Guidelines
|
Usage Guidelines
|
||||||
@ -1498,7 +1498,7 @@ IO Interface Files
|
|||||||
A read-only nested-key file which exists on non-root cgroups.
|
A read-only nested-key file which exists on non-root cgroups.
|
||||||
|
|
||||||
Shows pressure stall information for IO. See
|
Shows pressure stall information for IO. See
|
||||||
Documentation/accounting/psi.txt for details.
|
Documentation/accounting/psi.rst for details.
|
||||||
|
|
||||||
|
|
||||||
Writeback
|
Writeback
|
||||||
@ -2124,7 +2124,7 @@ following two functions.
|
|||||||
a queue (device) has been associated with the bio and
|
a queue (device) has been associated with the bio and
|
||||||
before submission.
|
before submission.
|
||||||
|
|
||||||
wbc_account_io(@wbc, @page, @bytes)
|
wbc_account_cgroup_owner(@wbc, @page, @bytes)
|
||||||
Should be called for each data segment being written out.
|
Should be called for each data segment being written out.
|
||||||
While this function doesn't care exactly when it's called
|
While this function doesn't care exactly when it's called
|
||||||
during the writeback session, it's the easiest and most
|
during the writeback session, it's the easiest and most
|
||||||
|
Some files were not shown because too many files have changed in this diff