diff --git a/.gitignore b/.gitignore index 7587ef56b92d..8f5422cba6e2 100644 --- a/.gitignore +++ b/.gitignore @@ -30,6 +30,7 @@ *.lz4 *.lzma *.lzo +*.mod *.mod.c *.o *.o.* diff --git a/.mailmap b/.mailmap index 07a777f9d687..0fef932de3db 100644 --- a/.mailmap +++ b/.mailmap @@ -81,6 +81,7 @@ Greg Kroah-Hartman Greg Kroah-Hartman Greg Kroah-Hartman Gregory CLEMENT +Hanjun Guo Henk Vergonet Henrik Kretzschmar Henrik Rydberg @@ -238,6 +239,7 @@ Vlad Dogaru Vladimir Davydov Vladimir Davydov Takashi YOSHII +Will Deacon Yakir Yang Yusuke Goda Gustavo Padovan diff --git a/CREDITS b/CREDITS index 681335f42491..a7387605fbdc 100644 --- a/CREDITS +++ b/CREDITS @@ -1770,7 +1770,6 @@ S: USA N: Dave Jones E: davej@codemonkey.org.uk -W: http://www.codemonkey.org.uk D: Assorted VIA x86 support. D: 2.5 AGPGART overhaul. D: CPUFREQ maintenance. @@ -1800,7 +1799,7 @@ S: 2300 Copenhagen S. S: Denmark N: Jozsef Kadlecsik -E: kadlec@blackhole.kfki.hu +E: kadlec@netfilter.org P: 1024D/470DB964 4CB3 1A05 713E 9BF7 FAC5 5809 DD8C B7B1 470D B964 D: netfilter: TCP window tracking code D: netfilter: raw table @@ -3120,7 +3119,7 @@ S: France N: Rik van Riel E: riel@redhat.com W: http://www.surriel.com/ -D: Linux-MM site, Documentation/sysctl/*, swap/mm readaround +D: Linux-MM site, Documentation/admin-guide/sysctl/*, swap/mm readaround D: kswapd fixes, random kernel hacker, rmap VM, D: nl.linux.org administrator, minor scheduler additions S: Red Hat Boston diff --git a/Documentation/ABI/obsolete/sysfs-driver-hid-roccat-pyra b/Documentation/ABI/obsolete/sysfs-driver-hid-roccat-pyra index 16020b31ae64..5d41ebadf15e 100644 --- a/Documentation/ABI/obsolete/sysfs-driver-hid-roccat-pyra +++ b/Documentation/ABI/obsolete/sysfs-driver-hid-roccat-pyra @@ -5,7 +5,7 @@ Description: It is possible to switch the cpi setting of the mouse with the press of a button. When read, this file returns the raw number of the actual cpi setting reported by the mouse. This number has to be further - processed to receive the real dpi value. + processed to receive the real dpi value: VALUE DPI 1 400 diff --git a/Documentation/ABI/obsolete/sysfs-gpio b/Documentation/ABI/obsolete/sysfs-gpio index 40d41ea1a3f5..e0d4e5e2dd90 100644 --- a/Documentation/ABI/obsolete/sysfs-gpio +++ b/Documentation/ABI/obsolete/sysfs-gpio @@ -11,7 +11,7 @@ Description: Kernel code may export it for complete or partial access. GPIOs are identified as they are inside the kernel, using integers in - the range 0..INT_MAX. See Documentation/gpio for more information. + the range 0..INT_MAX. See Documentation/admin-guide/gpio for more information. /sys/class/gpio /export ... asks the kernel to export a GPIO to userspace diff --git a/Documentation/ABI/removed/sysfs-class-rfkill b/Documentation/ABI/removed/sysfs-class-rfkill index 3ce6231f20b2..9c08c7f98ffb 100644 --- a/Documentation/ABI/removed/sysfs-class-rfkill +++ b/Documentation/ABI/removed/sysfs-class-rfkill @@ -1,6 +1,6 @@ rfkill - radio frequency (RF) connector kill switch support -For details to this subsystem look at Documentation/rfkill.txt. +For details to this subsystem look at Documentation/driver-api/rfkill.rst. What: /sys/class/rfkill/rfkill[0-9]+/claim Date: 09-Jul-2007 diff --git a/Documentation/ABI/stable/sysfs-class-infiniband b/Documentation/ABI/stable/sysfs-class-infiniband index 17211ceb9bf4..aed21b8916a2 100644 --- a/Documentation/ABI/stable/sysfs-class-infiniband +++ b/Documentation/ABI/stable/sysfs-class-infiniband @@ -423,23 +423,6 @@ Description: (e.g. driver restart on the VM which owns the VF). -sysfs interface for NetEffect RNIC Low-Level iWARP driver (nes) ---------------------------------------------------------------- - -What: /sys/class/infiniband/nesX/hw_rev -What: /sys/class/infiniband/nesX/hca_type -What: /sys/class/infiniband/nesX/board_id -Date: Feb, 2008 -KernelVersion: v2.6.25 -Contact: linux-rdma@vger.kernel.org -Description: - hw_rev: (RO) Hardware revision number - - hca_type: (RO) Host Channel Adapter type (NEX020) - - board_id: (RO) Manufacturing board id - - sysfs interface for Chelsio T4/T5 RDMA driver (cxgb4) ----------------------------------------------------- diff --git a/Documentation/ABI/stable/sysfs-class-rfkill b/Documentation/ABI/stable/sysfs-class-rfkill index 80151a409d67..5b154f922643 100644 --- a/Documentation/ABI/stable/sysfs-class-rfkill +++ b/Documentation/ABI/stable/sysfs-class-rfkill @@ -1,6 +1,6 @@ rfkill - radio frequency (RF) connector kill switch support -For details to this subsystem look at Documentation/rfkill.txt. +For details to this subsystem look at Documentation/driver-api/rfkill.rst. For the deprecated /sys/class/rfkill/*/claim knobs of this interface look in Documentation/ABI/removed/sysfs-class-rfkill. diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node index f7ce68fbd4b9..df8413cf1468 100644 --- a/Documentation/ABI/stable/sysfs-devices-node +++ b/Documentation/ABI/stable/sysfs-devices-node @@ -61,7 +61,7 @@ Date: October 2002 Contact: Linux Memory Management list Description: The node's hit/miss statistics, in units of pages. - See Documentation/numastat.txt + See Documentation/admin-guide/numastat.rst What: /sys/devices/system/node/nodeX/distance Date: October 2002 diff --git a/Documentation/ABI/stable/sysfs-driver-mlxreg-io b/Documentation/ABI/stable/sysfs-driver-mlxreg-io index 156319fc5b80..8ca498447aeb 100644 --- a/Documentation/ABI/stable/sysfs-driver-mlxreg-io +++ b/Documentation/ABI/stable/sysfs-driver-mlxreg-io @@ -1,5 +1,4 @@ -What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/ - asic_health +What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/asic_health Date: June 2018 KernelVersion: 4.19 @@ -9,9 +8,8 @@ Description: This file shows ASIC health status. The possible values are: The files are read only. -What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/ - cpld1_version - cpld2_version +What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/cpld1_version +What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/cpld2_version Date: June 2018 KernelVersion: 4.19 Contact: Vadim Pasternak @@ -20,8 +18,7 @@ Description: These files show with which CPLD versions have been burned The files are read only. -What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/ - fan_dir +What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/fan_dir Date: December 2018 KernelVersion: 5.0 @@ -32,8 +29,7 @@ Description: This file shows the system fans direction: The files are read only. -What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/ - jtag_enable +What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/jtag_enable Date: November 2018 KernelVersion: 5.0 @@ -43,8 +39,7 @@ Description: These files show with which CPLD versions have been burned The files are read only. -What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/ - jtag_enable +What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/jtag_enable Date: November 2018 KernelVersion: 5.0 @@ -87,16 +82,15 @@ Description: These files allow asserting system power cycling, switching The files are write only. -What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/ - reset_aux_pwr_or_ref - reset_asic_thermal - reset_hotswap_or_halt - reset_hotswap_or_wd - reset_fw_reset - reset_long_pb - reset_main_pwr_fail - reset_short_pb - reset_sw_reset +What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_aux_pwr_or_ref +What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_asic_thermal +What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_hotswap_or_halt +What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_hotswap_or_wd +What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_fw_reset +What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_long_pb +What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_main_pwr_fail +What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_short_pb +What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_sw_reset Date: June 2018 KernelVersion: 4.19 Contact: Vadim Pasternak @@ -110,11 +104,10 @@ Description: These files show the system reset cause, as following: power The files are read only. -What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/ - reset_comex_pwr_fail - reset_from_comex - reset_system - reset_voltmon_upgrade_fail +What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_comex_pwr_fail +What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_from_comex +What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_system +What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_voltmon_upgrade_fail Date: November 2018 KernelVersion: 5.0 @@ -127,3 +120,23 @@ Description: These files show the system reset cause, as following: ComEx the last reset cause. The files are read only. + +Date: June 2019 +KernelVersion: 5.3 +Contact: Vadim Pasternak +Description: These files show the system reset cause, as following: + COMEX thermal shutdown; wathchdog power off or reset was derived + by one of the next components: COMEX, switch board or by Small Form + Factor mezzanine, reset requested from ASIC, reset cuased by BIOS + reload. Value 1 in file means this is reset cause, 0 - otherwise. + Only one of the above causes could be 1 at the same time, representing + only last reset cause. + + The files are read only. + +What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_comex_thermal +What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_comex_wd +What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_from_asic +What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_reload_bios +What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_sff_wd +What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_swb_wd diff --git a/Documentation/ABI/testing/debugfs-cec-error-inj b/Documentation/ABI/testing/debugfs-cec-error-inj index 122b65c5fe62..4c3596c6d25b 100644 --- a/Documentation/ABI/testing/debugfs-cec-error-inj +++ b/Documentation/ABI/testing/debugfs-cec-error-inj @@ -1,6 +1,6 @@ What: /sys/kernel/debug/cec/*/error-inj Date: March 2018 -Contact: Hans Verkuil +Contact: Hans Verkuil Description: The CEC Framework allows for CEC error injection commands through diff --git a/Documentation/ABI/testing/debugfs-cros-ec b/Documentation/ABI/testing/debugfs-cros-ec new file mode 100644 index 000000000000..1fe0add99a2a --- /dev/null +++ b/Documentation/ABI/testing/debugfs-cros-ec @@ -0,0 +1,56 @@ +What: /sys/kernel/debug//console_log +Date: September 2017 +KernelVersion: 4.13 +Description: + If the EC supports the CONSOLE_READ command type, this file + can be used to grab the EC logs. The kernel polls for the log + and keeps its own buffer but userspace should grab this and + write it out to some logs. + +What: /sys/kernel/debug//panicinfo +Date: September 2017 +KernelVersion: 4.13 +Description: + This file dumps the EC panic information from the previous + reboot. This file will only exist if the PANIC_INFO command + type is supported by the EC. + +What: /sys/kernel/debug//pdinfo +Date: June 2018 +KernelVersion: 4.17 +Description: + This file provides the port role, muxes and power debug + information for all the USB PD/type-C ports available. If + the are no ports available, this file will be just an empty + file. + +What: /sys/kernel/debug//uptime +Date: June 2019 +KernelVersion: 5.3 +Description: + A u32 providing the time since EC booted in ms. This is + is used for synchronizing the AP host time with the EC + log. An error is returned if the command is not supported + by the EC or there is a communication problem. + +What: /sys/kernel/debug//last_resume_result +Date: June 2019 +KernelVersion: 5.3 +Description: + Some ECs have a feature where they will track transitions to + the (Intel) processor's SLP_S0 line, in order to detect cases + where a system failed to go into S0ix. When the system resumes, + an EC with this feature will return a summary of SLP_S0 + transitions that occurred. The last_resume_result file returns + the most recent response from the AP's resume message to the EC. + + The bottom 31 bits contain a count of the number of SLP_S0 + transitions that occurred since the suspend message was + received. Bit 31 is set if the EC attempted to wake the + system due to a timeout when watching for SLP_S0 transitions. + Callers can use this to detect a wake from the EC due to + S0ix timeouts. The result will be zero if no suspend + transitions have been attempted, or the EC does not support + this feature. + + Output will be in the format: "0x%08x\n". diff --git a/Documentation/ABI/testing/debugfs-driver-habanalabs b/Documentation/ABI/testing/debugfs-driver-habanalabs index 2f5b80be07a3..f0ac14b70ecb 100644 --- a/Documentation/ABI/testing/debugfs-driver-habanalabs +++ b/Documentation/ABI/testing/debugfs-driver-habanalabs @@ -3,7 +3,10 @@ Date: Jan 2019 KernelVersion: 5.1 Contact: oded.gabbay@gmail.com Description: Sets the device address to be used for read or write through - PCI bar. The acceptable value is a string that starts with "0x" + PCI bar, or the device VA of a host mapped memory to be read or + written directly from the host. The latter option is allowed + only when the IOMMU is disabled. + The acceptable value is a string that starts with "0x" What: /sys/kernel/debug/habanalabs/hl/command_buffers Date: Jan 2019 @@ -33,10 +36,12 @@ Contact: oded.gabbay@gmail.com Description: Allows the root user to read or write directly through the device's PCI bar. Writing to this file generates a write transaction while reading from the file generates a read - transcation. This custom interface is needed (instead of using + transaction. This custom interface is needed (instead of using the generic Linux user-space PCI mapping) because the DDR bar is very small compared to the DDR memory and only the driver can - move the bar before and after the transaction + move the bar before and after the transaction. + If the IOMMU is disabled, it also allows the root user to read + or write from the host a device VA of a host mapped memory What: /sys/kernel/debug/habanalabs/hl/device Date: Jan 2019 @@ -46,6 +51,13 @@ Description: Enables the root user to set the device to specific state. Valid values are "disable", "enable", "suspend", "resume". User can read this property to see the valid values +What: /sys/kernel/debug/habanalabs/hl/engines +Date: Jul 2019 +KernelVersion: 5.3 +Contact: oded.gabbay@gmail.com +Description: Displays the status registers values of the device engines and + their derived idle status + What: /sys/kernel/debug/habanalabs/hl/i2c_addr Date: Jan 2019 KernelVersion: 5.1 diff --git a/Documentation/ABI/testing/debugfs-wilco-ec b/Documentation/ABI/testing/debugfs-wilco-ec index 73a5a66ddca6..9d8d9d2def5b 100644 --- a/Documentation/ABI/testing/debugfs-wilco-ec +++ b/Documentation/ABI/testing/debugfs-wilco-ec @@ -23,11 +23,9 @@ Description: For writing, bytes 0-1 indicate the message type, one of enum wilco_ec_msg_type. Byte 2+ consist of the data passed in the - request, starting at MBOX[0] - - At least three bytes are required for writing, two for the type - and at least a single byte of data. Only the first - EC_MAILBOX_DATA_SIZE bytes of MBOX will be used. + request, starting at MBOX[0]. At least three bytes are required + for writing, two for the type and at least a single byte of + data. Example: // Request EC info type 3 (EC firmware build date) @@ -40,7 +38,7 @@ Description: $ cat /sys/kernel/debug/wilco_ec/raw 00 00 31 32 2f 32 31 2f 31 38 00 38 00 01 00 2f 00 ..12/21/18.8... - Note that the first 32 bytes of the received MBOX[] will be - printed, even if some of the data is junk. It is up to you to - know how many of the first bytes of data are the actual - response. + Note that the first 16 bytes of the received MBOX[] will be + printed, even if some of the data is junk, and skipping bytes + 17 to 32. It is up to you to know how many of the first bytes of + data are the actual response. diff --git a/Documentation/ABI/testing/ima_policy b/Documentation/ABI/testing/ima_policy index 74c6702de74e..fc376a323908 100644 --- a/Documentation/ABI/testing/ima_policy +++ b/Documentation/ABI/testing/ima_policy @@ -24,11 +24,11 @@ Description: [euid=] [fowner=] [fsname=]] lsm: [[subj_user=] [subj_role=] [subj_type=] [obj_user=] [obj_role=] [obj_type=]] - option: [[appraise_type=]] [permit_directio] - + option: [[appraise_type=]] [template=] [permit_directio] base: func:= [BPRM_CHECK][MMAP_CHECK][CREDS_CHECK][FILE_CHECK][MODULE_CHECK] [FIRMWARE_CHECK] [KEXEC_KERNEL_CHECK] [KEXEC_INITRAMFS_CHECK] + [KEXEC_CMDLINE] mask:= [[^]MAY_READ] [[^]MAY_WRITE] [[^]MAY_APPEND] [[^]MAY_EXEC] fsmagic:= hex value @@ -38,6 +38,8 @@ Description: fowner:= decimal value lsm: are LSM specific option: appraise_type:= [imasig] + template:= name of a defined IMA template type + (eg, ima-ng). Only valid when action is "measure". pcr:= decimal value default policy: diff --git a/Documentation/ABI/testing/procfs-diskstats b/Documentation/ABI/testing/procfs-diskstats index abac31d216de..2c44b4f1b060 100644 --- a/Documentation/ABI/testing/procfs-diskstats +++ b/Documentation/ABI/testing/procfs-diskstats @@ -29,4 +29,4 @@ Description: 17 - sectors discarded 18 - time spent discarding - For more details refer to Documentation/iostats.txt + For more details refer to Documentation/admin-guide/iostats.rst diff --git a/Documentation/ABI/testing/procfs-smaps_rollup b/Documentation/ABI/testing/procfs-smaps_rollup index 0a54ed0d63c9..274df44d8b1b 100644 --- a/Documentation/ABI/testing/procfs-smaps_rollup +++ b/Documentation/ABI/testing/procfs-smaps_rollup @@ -3,18 +3,28 @@ Date: August 2017 Contact: Daniel Colascione Description: This file provides pre-summed memory information for a - process. The format is identical to /proc/pid/smaps, + process. The format is almost identical to /proc/pid/smaps, except instead of an entry for each VMA in a process, smaps_rollup has a single entry (tagged "[rollup]") for which each field is the sum of the corresponding fields from all the maps in /proc/pid/smaps. - For more details, see the procfs man page. + Additionally, the fields Pss_Anon, Pss_File and Pss_Shmem + are not present in /proc/pid/smaps. These fields represent + the sum of the Pss field of each type (anon, file, shmem). + For more details, see Documentation/filesystems/proc.txt + and the procfs man page. Typical output looks like this: 00100000-ff709000 ---p 00000000 00:00 0 [rollup] + Size: 1192 kB + KernelPageSize: 4 kB + MMUPageSize: 4 kB Rss: 884 kB Pss: 385 kB + Pss_Anon: 301 kB + Pss_File: 80 kB + Pss_Shmem: 4 kB Shared_Clean: 696 kB Shared_Dirty: 0 kB Private_Clean: 120 kB diff --git a/Documentation/ABI/testing/pstore b/Documentation/ABI/testing/pstore index 5fca9f5e10a3..d45209abdb1b 100644 --- a/Documentation/ABI/testing/pstore +++ b/Documentation/ABI/testing/pstore @@ -1,6 +1,6 @@ -Where: /sys/fs/pstore/... (or /dev/pstore/...) +What: /sys/fs/pstore/... (or /dev/pstore/...) Date: March 2011 -Kernel Version: 2.6.39 +KernelVersion: 2.6.39 Contact: tony.luck@intel.com Description: Generic interface to platform dependent persistent storage. diff --git a/Documentation/ABI/testing/sysfs-block b/Documentation/ABI/testing/sysfs-block index dfad7427817c..f8c7c7126bb1 100644 --- a/Documentation/ABI/testing/sysfs-block +++ b/Documentation/ABI/testing/sysfs-block @@ -15,7 +15,7 @@ Description: 9 - I/Os currently in progress 10 - time spent doing I/Os (ms) 11 - weighted time spent doing I/Os (ms) - For more details refer Documentation/iostats.txt + For more details refer Documentation/admin-guide/iostats.rst What: /sys/block///stat diff --git a/Documentation/ABI/testing/sysfs-block-device b/Documentation/ABI/testing/sysfs-block-device index 82ef6eab042d..17f2bc7dd261 100644 --- a/Documentation/ABI/testing/sysfs-block-device +++ b/Documentation/ABI/testing/sysfs-block-device @@ -45,7 +45,7 @@ Description: - Values below -2 are rejected with -EINVAL For more information, see - Documentation/laptops/disk-shock-protection.txt + Documentation/admin-guide/laptops/disk-shock-protection.rst What: /sys/block/*/device/ncq_prio_enable diff --git a/Documentation/ABI/testing/sysfs-bus-css b/Documentation/ABI/testing/sysfs-bus-css index 2979c40c10e9..966f8504bd7b 100644 --- a/Documentation/ABI/testing/sysfs-bus-css +++ b/Documentation/ABI/testing/sysfs-bus-css @@ -33,3 +33,26 @@ Description: Contains the PIM/PAM/POM values, as reported by the in sync with the values current in the channel subsystem). Note: This is an I/O-subchannel specific attribute. Users: s390-tools, HAL + +What: /sys/bus/css/devices/.../driver_override +Date: June 2019 +Contact: Cornelia Huck + linux-s390@vger.kernel.org +Description: This file allows the driver for a device to be specified. When + specified, only a driver with a name matching the value written + to driver_override will have an opportunity to bind to the + device. The override is specified by writing a string to the + driver_override file (echo vfio-ccw > driver_override) and + may be cleared with an empty string (echo > driver_override). + This returns the device to standard matching rules binding. + Writing to driver_override does not automatically unbind the + device from its current driver or make any attempt to + automatically load the specified driver. If no driver with a + matching name is currently loaded in the kernel, the device + will not bind to any driver. This also allows devices to + opt-out of driver binding using a driver_override name such as + "none". Only a single driver may be specified in the override, + there is no support for parsing delimiters. + Note that unlike the mechanism of the same name for pci, this + file does not allow to override basic matching rules. I.e., + the driver must still match the subchannel type of the device. diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-format b/Documentation/ABI/testing/sysfs-bus-event_source-devices-format index 77f47ff5ee02..5bb793ec926c 100644 --- a/Documentation/ABI/testing/sysfs-bus-event_source-devices-format +++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-format @@ -1,6 +1,6 @@ -Where: /sys/bus/event_source/devices//format +What: /sys/bus/event_source/devices//format Date: January 2012 -Kernel Version: 3.3 +KernelVersion: 3.3 Contact: Jiri Olsa Description: Attribute group to describe the magic bits that go into diff --git a/Documentation/ABI/testing/sysfs-bus-i2c-devices-hm6352 b/Documentation/ABI/testing/sysfs-bus-i2c-devices-hm6352 index feb2e4a87075..4a251b7f11e4 100644 --- a/Documentation/ABI/testing/sysfs-bus-i2c-devices-hm6352 +++ b/Documentation/ABI/testing/sysfs-bus-i2c-devices-hm6352 @@ -1,20 +1,20 @@ -Where: /sys/bus/i2c/devices/.../heading0_input +What: /sys/bus/i2c/devices/.../heading0_input Date: April 2010 -Kernel Version: 2.6.36? +KernelVersion: 2.6.36? Contact: alan.cox@intel.com Description: Reports the current heading from the compass as a floating point value in degrees. -Where: /sys/bus/i2c/devices/.../power_state +What: /sys/bus/i2c/devices/.../power_state Date: April 2010 -Kernel Version: 2.6.36? +KernelVersion: 2.6.36? Contact: alan.cox@intel.com Description: Sets the power state of the device. 0 sets the device into sleep mode, 1 wakes it up. -Where: /sys/bus/i2c/devices/.../calibration +What: /sys/bus/i2c/devices/.../calibration Date: April 2010 -Kernel Version: 2.6.36? +KernelVersion: 2.6.36? Contact: alan.cox@intel.com Description: Sets the calibration on or off (1 = on, 0 = off). See the chip data sheet. diff --git a/Documentation/ABI/testing/sysfs-bus-iio b/Documentation/ABI/testing/sysfs-bus-iio index 6aef7dbbde44..680451695422 100644 --- a/Documentation/ABI/testing/sysfs-bus-iio +++ b/Documentation/ABI/testing/sysfs-bus-iio @@ -61,8 +61,11 @@ What: /sys/bus/iio/devices/triggerX/sampling_frequency_available KernelVersion: 2.6.35 Contact: linux-iio@vger.kernel.org Description: - When the internal sampling clock can only take a small - discrete set of values, this file lists those available. + When the internal sampling clock can only take a specific set of + frequencies, we can specify the available values with: + - a small discrete set of values like "0 2 4 6 8" + - a range with minimum, step and maximum frequencies like + "[min step max]" What: /sys/bus/iio/devices/iio:deviceX/oversampling_ratio KernelVersion: 2.6.38 diff --git a/Documentation/ABI/testing/sysfs-bus-iio-cros-ec b/Documentation/ABI/testing/sysfs-bus-iio-cros-ec index 0e95c2ca105c..6158f831c761 100644 --- a/Documentation/ABI/testing/sysfs-bus-iio-cros-ec +++ b/Documentation/ABI/testing/sysfs-bus-iio-cros-ec @@ -18,11 +18,11 @@ Description: values are 'base' and 'lid'. What: /sys/bus/iio/devices/iio:deviceX/id -Date: Septembre 2017 +Date: September 2017 KernelVersion: 4.14 Contact: linux-iio@vger.kernel.org Description: - This attribute is exposed by the CrOS EC legacy accelerometer - driver and represents the sensor ID as exposed by the EC. This - ID is used by the Android sensor service hardware abstraction - layer (sensor HAL) through the Android container on ChromeOS. + This attribute is exposed by the CrOS EC sensors driver and + represents the sensor ID as exposed by the EC. This ID is used + by the Android sensor service hardware abstraction layer (sensor + HAL) through the Android container on ChromeOS. diff --git a/Documentation/ABI/testing/sysfs-bus-iio-distance-srf08 b/Documentation/ABI/testing/sysfs-bus-iio-distance-srf08 index 0a1ca1487fa9..a133fd8d081a 100644 --- a/Documentation/ABI/testing/sysfs-bus-iio-distance-srf08 +++ b/Documentation/ABI/testing/sysfs-bus-iio-distance-srf08 @@ -1,4 +1,4 @@ -What /sys/bus/iio/devices/iio:deviceX/sensor_sensitivity +What: /sys/bus/iio/devices/iio:deviceX/sensor_sensitivity Date: January 2017 KernelVersion: 4.11 Contact: linux-iio@vger.kernel.org @@ -6,7 +6,7 @@ Description: Show or set the gain boost of the amp, from 0-31 range. default 31 -What /sys/bus/iio/devices/iio:deviceX/sensor_max_range +What: /sys/bus/iio/devices/iio:deviceX/sensor_max_range Date: January 2017 KernelVersion: 4.11 Contact: linux-iio@vger.kernel.org diff --git a/Documentation/ABI/testing/sysfs-bus-iio-frequency-adf4371 b/Documentation/ABI/testing/sysfs-bus-iio-frequency-adf4371 new file mode 100644 index 000000000000..302de64cb424 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-bus-iio-frequency-adf4371 @@ -0,0 +1,44 @@ +What: /sys/bus/iio/devices/iio:deviceX/out_altvoltageY_frequency +KernelVersion: +Contact: linux-iio@vger.kernel.org +Description: + Stores the PLL frequency in Hz for channel Y. + Reading returns the actual frequency in Hz. + The ADF4371 has an integrated VCO with fundamendal output + frequency ranging from 4000000000 Hz 8000000000 Hz. + + out_altvoltage0_frequency: + A divide by 1, 2, 4, 8, 16, 32 or circuit generates + frequencies from 62500000 Hz to 8000000000 Hz. + out_altvoltage1_frequency: + This channel duplicates the channel 0 frequency + out_altvoltage2_frequency: + A frequency doubler generates frequencies from + 8000000000 Hz to 16000000000 Hz. + out_altvoltage3_frequency: + A frequency quadrupler generates frequencies from + 16000000000 Hz to 32000000000 Hz. + + Note: writes to one of the channels will affect the frequency of + all the other channels, since it involves changing the VCO + fundamental output frequency. + +What: /sys/bus/iio/devices/iio:deviceX/out_altvoltageY_name +KernelVersion: +Contact: linux-iio@vger.kernel.org +Description: + Reading returns the datasheet name for channel Y: + + out_altvoltage0_name: RF8x + out_altvoltage1_name: RFAUX8x + out_altvoltage2_name: RF16x + out_altvoltage3_name: RF32x + +What: /sys/bus/iio/devices/iio:deviceX/out_altvoltageY_powerdown +KernelVersion: +Contact: linux-iio@vger.kernel.org +Description: + This attribute allows the user to power down the PLL and it's + RFOut buffers. + Writing 1 causes the specified channel to power down. + Clearing returns to normal operation. diff --git a/Documentation/ABI/testing/sysfs-bus-iio-proximity-as3935 b/Documentation/ABI/testing/sysfs-bus-iio-proximity-as3935 index 9a17ab5036a4..c59d95346341 100644 --- a/Documentation/ABI/testing/sysfs-bus-iio-proximity-as3935 +++ b/Documentation/ABI/testing/sysfs-bus-iio-proximity-as3935 @@ -1,4 +1,4 @@ -What /sys/bus/iio/devices/iio:deviceX/in_proximity_input +What: /sys/bus/iio/devices/iio:deviceX/in_proximity_input Date: March 2014 KernelVersion: 3.15 Contact: Matt Ranostay @@ -6,7 +6,7 @@ Description: Get the current distance in meters of storm (1km steps) 1000-40000 = distance in meters -What /sys/bus/iio/devices/iio:deviceX/sensor_sensitivity +What: /sys/bus/iio/devices/iio:deviceX/sensor_sensitivity Date: March 2014 KernelVersion: 3.15 Contact: Matt Ranostay diff --git a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats index 4b0318c99507..3c9a8c4a25eb 100644 --- a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats +++ b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats @@ -9,9 +9,9 @@ errors may be "seen" / reported by the link partner and not the problematic endpoint itself (which may report all counters as 0 as it never saw any problems). -Where: /sys/bus/pci/devices//aer_dev_correctable +What: /sys/bus/pci/devices//aer_dev_correctable Date: July 2018 -Kernel Version: 4.19.0 +KernelVersion: 4.19.0 Contact: linux-pci@vger.kernel.org, rajatja@google.com Description: List of correctable errors seen and reported by this PCI device using ERR_COR. Note that since multiple errors may @@ -31,9 +31,9 @@ Header Log Overflow 0 TOTAL_ERR_COR 2 ------------------------------------------------------------------------- -Where: /sys/bus/pci/devices//aer_dev_fatal +What: /sys/bus/pci/devices//aer_dev_fatal Date: July 2018 -Kernel Version: 4.19.0 +KernelVersion: 4.19.0 Contact: linux-pci@vger.kernel.org, rajatja@google.com Description: List of uncorrectable fatal errors seen and reported by this PCI device using ERR_FATAL. Note that since multiple errors may @@ -62,9 +62,9 @@ TLP Prefix Blocked Error 0 TOTAL_ERR_FATAL 0 ------------------------------------------------------------------------- -Where: /sys/bus/pci/devices//aer_dev_nonfatal +What: /sys/bus/pci/devices//aer_dev_nonfatal Date: July 2018 -Kernel Version: 4.19.0 +KernelVersion: 4.19.0 Contact: linux-pci@vger.kernel.org, rajatja@google.com Description: List of uncorrectable nonfatal errors seen and reported by this PCI device using ERR_NONFATAL. Note that since multiple errors @@ -103,20 +103,20 @@ collectors) that are AER capable. These indicate the number of error messages as device, so these counters include them and are thus cumulative of all the error messages on the PCI hierarchy originating at that root port. -Where: /sys/bus/pci/devices//aer_stats/aer_rootport_total_err_cor +What: /sys/bus/pci/devices//aer_stats/aer_rootport_total_err_cor Date: July 2018 -Kernel Version: 4.19.0 +KernelVersion: 4.19.0 Contact: linux-pci@vger.kernel.org, rajatja@google.com Description: Total number of ERR_COR messages reported to rootport. -Where: /sys/bus/pci/devices//aer_stats/aer_rootport_total_err_fatal +What: /sys/bus/pci/devices//aer_stats/aer_rootport_total_err_fatal Date: July 2018 -Kernel Version: 4.19.0 +KernelVersion: 4.19.0 Contact: linux-pci@vger.kernel.org, rajatja@google.com Description: Total number of ERR_FATAL messages reported to rootport. -Where: /sys/bus/pci/devices//aer_stats/aer_rootport_total_err_nonfatal +What: /sys/bus/pci/devices//aer_stats/aer_rootport_total_err_nonfatal Date: July 2018 -Kernel Version: 4.19.0 +KernelVersion: 4.19.0 Contact: linux-pci@vger.kernel.org, rajatja@google.com Description: Total number of ERR_NONFATAL messages reported to rootport. diff --git a/Documentation/ABI/testing/sysfs-bus-pci-devices-cciss b/Documentation/ABI/testing/sysfs-bus-pci-devices-cciss index 53d99edd1d75..92a94e1068c2 100644 --- a/Documentation/ABI/testing/sysfs-bus-pci-devices-cciss +++ b/Documentation/ABI/testing/sysfs-bus-pci-devices-cciss @@ -1,68 +1,68 @@ -Where: /sys/bus/pci/devices//ccissX/cXdY/model +What: /sys/bus/pci/devices//ccissX/cXdY/model Date: March 2009 -Kernel Version: 2.6.30 +KernelVersion: 2.6.30 Contact: iss_storagedev@hp.com Description: Displays the SCSI INQUIRY page 0 model for logical drive Y of controller X. -Where: /sys/bus/pci/devices//ccissX/cXdY/rev +What: /sys/bus/pci/devices//ccissX/cXdY/rev Date: March 2009 -Kernel Version: 2.6.30 +KernelVersion: 2.6.30 Contact: iss_storagedev@hp.com Description: Displays the SCSI INQUIRY page 0 revision for logical drive Y of controller X. -Where: /sys/bus/pci/devices//ccissX/cXdY/unique_id +What: /sys/bus/pci/devices//ccissX/cXdY/unique_id Date: March 2009 -Kernel Version: 2.6.30 +KernelVersion: 2.6.30 Contact: iss_storagedev@hp.com Description: Displays the SCSI INQUIRY page 83 serial number for logical drive Y of controller X. -Where: /sys/bus/pci/devices//ccissX/cXdY/vendor +What: /sys/bus/pci/devices//ccissX/cXdY/vendor Date: March 2009 -Kernel Version: 2.6.30 +KernelVersion: 2.6.30 Contact: iss_storagedev@hp.com Description: Displays the SCSI INQUIRY page 0 vendor for logical drive Y of controller X. -Where: /sys/bus/pci/devices//ccissX/cXdY/block:cciss!cXdY +What: /sys/bus/pci/devices//ccissX/cXdY/block:cciss!cXdY Date: March 2009 -Kernel Version: 2.6.30 +KernelVersion: 2.6.30 Contact: iss_storagedev@hp.com Description: A symbolic link to /sys/block/cciss!cXdY -Where: /sys/bus/pci/devices//ccissX/rescan +What: /sys/bus/pci/devices//ccissX/rescan Date: August 2009 -Kernel Version: 2.6.31 +KernelVersion: 2.6.31 Contact: iss_storagedev@hp.com Description: Kicks of a rescan of the controller to discover logical drive topology changes. -Where: /sys/bus/pci/devices//ccissX/cXdY/lunid +What: /sys/bus/pci/devices//ccissX/cXdY/lunid Date: August 2009 -Kernel Version: 2.6.31 +KernelVersion: 2.6.31 Contact: iss_storagedev@hp.com Description: Displays the 8-byte LUN ID used to address logical drive Y of controller X. -Where: /sys/bus/pci/devices//ccissX/cXdY/raid_level +What: /sys/bus/pci/devices//ccissX/cXdY/raid_level Date: August 2009 -Kernel Version: 2.6.31 +KernelVersion: 2.6.31 Contact: iss_storagedev@hp.com Description: Displays the RAID level of logical drive Y of controller X. -Where: /sys/bus/pci/devices//ccissX/cXdY/usage_count +What: /sys/bus/pci/devices//ccissX/cXdY/usage_count Date: August 2009 -Kernel Version: 2.6.31 +KernelVersion: 2.6.31 Contact: iss_storagedev@hp.com Description: Displays the usage count (number of opens) of logical drive Y of controller X. -Where: /sys/bus/pci/devices//ccissX/resettable +What: /sys/bus/pci/devices//ccissX/resettable Date: February 2011 -Kernel Version: 2.6.38 +KernelVersion: 2.6.38 Contact: iss_storagedev@hp.com Description: Value of 1 indicates the controller can honor the reset_devices kernel parameter. Value of 0 indicates reset_devices cannot be @@ -71,9 +71,9 @@ Description: Value of 1 indicates the controller can honor the reset_devices a dump device, as kdump requires resetting the device in order to work reliably. -Where: /sys/bus/pci/devices//ccissX/transport_mode +What: /sys/bus/pci/devices//ccissX/transport_mode Date: July 2011 -Kernel Version: 3.0 +KernelVersion: 3.0 Contact: iss_storagedev@hp.com Description: Value of "simple" indicates that the controller has been placed in "simple mode". Value of "performant" indicates that the diff --git a/Documentation/ABI/testing/sysfs-bus-siox b/Documentation/ABI/testing/sysfs-bus-siox index fed7c3765a4e..c2a403f20b90 100644 --- a/Documentation/ABI/testing/sysfs-bus-siox +++ b/Documentation/ABI/testing/sysfs-bus-siox @@ -1,6 +1,6 @@ What: /sys/bus/siox/devices/siox-X/active KernelVersion: 4.16 -Contact: Gavin Schenk , Uwe Kleine-König +Contact: Thorsten Scherer , Uwe Kleine-König Description: On reading represents the current state of the bus. If it contains a "0" the bus is stopped and connected devices are @@ -12,7 +12,7 @@ Description: What: /sys/bus/siox/devices/siox-X/device_add KernelVersion: 4.16 -Contact: Gavin Schenk , Uwe Kleine-König +Contact: Thorsten Scherer , Uwe Kleine-König Description: Write-only file. Write @@ -27,13 +27,13 @@ Description: What: /sys/bus/siox/devices/siox-X/device_remove KernelVersion: 4.16 -Contact: Gavin Schenk , Uwe Kleine-König +Contact: Thorsten Scherer , Uwe Kleine-König Description: Write-only file. A single write removes the last device in the siox chain. What: /sys/bus/siox/devices/siox-X/poll_interval_ns KernelVersion: 4.16 -Contact: Gavin Schenk , Uwe Kleine-König +Contact: Thorsten Scherer , Uwe Kleine-König Description: Defines the interval between two poll cycles in nano seconds. Note this is rounded to jiffies on writing. On reading the current value @@ -41,33 +41,33 @@ Description: What: /sys/bus/siox/devices/siox-X-Y/connected KernelVersion: 4.16 -Contact: Gavin Schenk , Uwe Kleine-König +Contact: Thorsten Scherer , Uwe Kleine-König Description: Read-only value. "0" means the Yth device on siox bus X isn't "connected" i.e. communication with it is not ensured. "1" signals a working connection. What: /sys/bus/siox/devices/siox-X-Y/inbytes KernelVersion: 4.16 -Contact: Gavin Schenk , Uwe Kleine-König +Contact: Thorsten Scherer , Uwe Kleine-König Description: Read-only value reporting the inbytes value provided to siox-X/device_add What: /sys/bus/siox/devices/siox-X-Y/status_errors KernelVersion: 4.16 -Contact: Gavin Schenk , Uwe Kleine-König +Contact: Thorsten Scherer , Uwe Kleine-König Description: Counts the number of time intervals when the read status byte doesn't yield the expected value. What: /sys/bus/siox/devices/siox-X-Y/type KernelVersion: 4.16 -Contact: Gavin Schenk , Uwe Kleine-König +Contact: Thorsten Scherer , Uwe Kleine-König Description: Read-only value reporting the type value provided to siox-X/device_add. What: /sys/bus/siox/devices/siox-X-Y/watchdog KernelVersion: 4.16 -Contact: Gavin Schenk , Uwe Kleine-König +Contact: Thorsten Scherer , Uwe Kleine-König Description: Read-only value reporting if the watchdog of the siox device is active. "0" means the watchdog is not active and the device is expected to @@ -75,13 +75,13 @@ Description: What: /sys/bus/siox/devices/siox-X-Y/watchdog_errors KernelVersion: 4.16 -Contact: Gavin Schenk , Uwe Kleine-König +Contact: Thorsten Scherer , Uwe Kleine-König Description: Read-only value reporting the number to time intervals when the watchdog was active. What: /sys/bus/siox/devices/siox-X-Y/outbytes KernelVersion: 4.16 -Contact: Gavin Schenk , Uwe Kleine-König +Contact: Thorsten Scherer , Uwe Kleine-König Description: Read-only value reporting the outbytes value provided to siox-X/device_add. diff --git a/Documentation/ABI/testing/sysfs-bus-usb-devices-usbsevseg b/Documentation/ABI/testing/sysfs-bus-usb-devices-usbsevseg index 70d00dfa443d..9ade80f81f96 100644 --- a/Documentation/ABI/testing/sysfs-bus-usb-devices-usbsevseg +++ b/Documentation/ABI/testing/sysfs-bus-usb-devices-usbsevseg @@ -1,14 +1,14 @@ -Where: /sys/bus/usb/.../powered +What: /sys/bus/usb/.../powered Date: August 2008 -Kernel Version: 2.6.26 +KernelVersion: 2.6.26 Contact: Harrison Metzger Description: Controls whether the device's display will powered. A value of 0 is off and a non-zero value is on. -Where: /sys/bus/usb/.../mode_msb -Where: /sys/bus/usb/.../mode_lsb +What: /sys/bus/usb/.../mode_msb +What: /sys/bus/usb/.../mode_lsb Date: August 2008 -Kernel Version: 2.6.26 +KernelVersion: 2.6.26 Contact: Harrison Metzger Description: Controls the devices display mode. For a 6 character display the values are @@ -16,24 +16,24 @@ Description: Controls the devices display mode. for an 8 character display the values are MSB 0x08; LSB 0xFF. -Where: /sys/bus/usb/.../textmode +What: /sys/bus/usb/.../textmode Date: August 2008 -Kernel Version: 2.6.26 +KernelVersion: 2.6.26 Contact: Harrison Metzger Description: Controls the way the device interprets its text buffer. raw: each character controls its segment manually hex: each character is between 0-15 ascii: each character is between '0'-'9' and 'A'-'F'. -Where: /sys/bus/usb/.../text +What: /sys/bus/usb/.../text Date: August 2008 -Kernel Version: 2.6.26 +KernelVersion: 2.6.26 Contact: Harrison Metzger Description: The text (or data) for the device to display -Where: /sys/bus/usb/.../decimals +What: /sys/bus/usb/.../decimals Date: August 2008 -Kernel Version: 2.6.26 +KernelVersion: 2.6.26 Contact: Harrison Metzger Description: Controls the decimal places on the device. To set the nth decimal place, give this field diff --git a/Documentation/ABI/testing/sysfs-class-backlight-driver-lm3533 b/Documentation/ABI/testing/sysfs-class-backlight-driver-lm3533 index 77cf7ac949af..c0e0a9ae7b3d 100644 --- a/Documentation/ABI/testing/sysfs-class-backlight-driver-lm3533 +++ b/Documentation/ABI/testing/sysfs-class-backlight-driver-lm3533 @@ -4,7 +4,7 @@ KernelVersion: 3.5 Contact: Johan Hovold Description: Get the ALS output channel used as input in - ALS-current-control mode (0, 1), where + ALS-current-control mode (0, 1), where: 0 - out_current0 (backlight 0) 1 - out_current1 (backlight 1) @@ -28,7 +28,7 @@ Date: April 2012 KernelVersion: 3.5 Contact: Johan Hovold Description: - Set the brightness-mapping mode (0, 1), where + Set the brightness-mapping mode (0, 1), where: 0 - exponential mode 1 - linear mode @@ -38,7 +38,7 @@ Date: April 2012 KernelVersion: 3.5 Contact: Johan Hovold Description: - Set the PWM-input control mask (5 bits), where + Set the PWM-input control mask (5 bits), where: bit 5 - PWM-input enabled in Zone 4 bit 4 - PWM-input enabled in Zone 3 diff --git a/Documentation/ABI/testing/sysfs-class-cxl b/Documentation/ABI/testing/sysfs-class-cxl index bbbabffc682a..7970e3713e70 100644 --- a/Documentation/ABI/testing/sysfs-class-cxl +++ b/Documentation/ABI/testing/sysfs-class-cxl @@ -1,6 +1,6 @@ -Note: Attributes that are shared between devices are stored in the directory -pointed to by the symlink device/. -Example: The real path of the attribute /sys/class/cxl/afu0.0s/irqs_max is +Please note that attributes that are shared between devices are stored in +the directory pointed to by the symlink device/. +For example, the real path of the attribute /sys/class/cxl/afu0.0s/irqs_max is /sys/class/cxl/afu0.0s/device/irqs_max, i.e. /sys/class/cxl/afu0.0/irqs_max. diff --git a/Documentation/ABI/testing/sysfs-class-devfreq b/Documentation/ABI/testing/sysfs-class-devfreq index ee39acacf6f8..01196e19afca 100644 --- a/Documentation/ABI/testing/sysfs-class-devfreq +++ b/Documentation/ABI/testing/sysfs-class-devfreq @@ -47,7 +47,7 @@ Description: What: /sys/class/devfreq/.../trans_stat Date: October 2012 Contact: MyungJoo Ham -Descrtiption: +Description: This ABI shows the statistics of devfreq behavior on a specific device. It shows the time spent in each state and the number of transitions between states. diff --git a/Documentation/ABI/testing/sysfs-class-led-driver-lm3533 b/Documentation/ABI/testing/sysfs-class-led-driver-lm3533 index 620ebb3b9baa..e4c89b261546 100644 --- a/Documentation/ABI/testing/sysfs-class-led-driver-lm3533 +++ b/Documentation/ABI/testing/sysfs-class-led-driver-lm3533 @@ -4,7 +4,7 @@ KernelVersion: 3.5 Contact: Johan Hovold Description: Set the ALS output channel to use as input in - ALS-current-control mode (1, 2), where + ALS-current-control mode (1, 2), where: 1 - out_current1 2 - out_current2 @@ -22,7 +22,7 @@ Date: April 2012 KernelVersion: 3.5 Contact: Johan Hovold Description: - Set the pattern generator fall and rise times (0..7), where + Set the pattern generator fall and rise times (0..7), where: 0 - 2048 us 1 - 262 ms @@ -45,7 +45,7 @@ Date: April 2012 KernelVersion: 3.5 Contact: Johan Hovold Description: - Set the brightness-mapping mode (0, 1), where + Set the brightness-mapping mode (0, 1), where: 0 - exponential mode 1 - linear mode @@ -55,7 +55,7 @@ Date: April 2012 KernelVersion: 3.5 Contact: Johan Hovold Description: - Set the PWM-input control mask (5 bits), where + Set the PWM-input control mask (5 bits), where: bit 5 - PWM-input enabled in Zone 4 bit 4 - PWM-input enabled in Zone 3 diff --git a/Documentation/ABI/testing/sysfs-class-leds-gt683r b/Documentation/ABI/testing/sysfs-class-leds-gt683r index e4fae6026e79..6adab27f646e 100644 --- a/Documentation/ABI/testing/sysfs-class-leds-gt683r +++ b/Documentation/ABI/testing/sysfs-class-leds-gt683r @@ -5,7 +5,7 @@ Contact: Janne Kanniainen Description: Set the mode of LEDs. You should notice that changing the mode of one LED will update the mode of its two sibling devices as - well. + well. Possible values are: 0 - normal 1 - audio @@ -13,4 +13,4 @@ Description: Normal: LEDs are fully on when enabled Audio: LEDs brightness depends on sound level - Breathing: LEDs brightness varies at human breathing rate \ No newline at end of file + Breathing: LEDs brightness varies at human breathing rate diff --git a/Documentation/ABI/testing/sysfs-class-net-phydev b/Documentation/ABI/testing/sysfs-class-net-phydev index 2a5723343aba..206cbf538b59 100644 --- a/Documentation/ABI/testing/sysfs-class-net-phydev +++ b/Documentation/ABI/testing/sysfs-class-net-phydev @@ -41,3 +41,11 @@ Description: xgmii, moca, qsgmii, trgmii, 1000base-x, 2500base-x, rxaui, xaui, 10gbase-kr, unknown +What: /sys/class/mdio_bus///phy_standalone +Date: May 2019 +KernelVersion: 5.3 +Contact: netdev@vger.kernel.org +Description: + Boolean value indicating whether the PHY device is used in + standalone mode, without a net_device associated, by PHYLINK. + Attribute created only when this is the case. diff --git a/Documentation/ABI/testing/sysfs-class-net-qmi b/Documentation/ABI/testing/sysfs-class-net-qmi index 7122d6264c49..c310db4ccbc2 100644 --- a/Documentation/ABI/testing/sysfs-class-net-qmi +++ b/Documentation/ABI/testing/sysfs-class-net-qmi @@ -29,7 +29,7 @@ Contact: Bjørn Mork Description: Unsigned integer. - Write a number ranging from 1 to 127 to add a qmap mux + Write a number ranging from 1 to 254 to add a qmap mux based network device, supported by recent Qualcomm based modems. @@ -46,5 +46,5 @@ Contact: Bjørn Mork Description: Unsigned integer. - Write a number ranging from 1 to 127 to delete a previously + Write a number ranging from 1 to 254 to delete a previously created qmap mux based network device. diff --git a/Documentation/ABI/testing/sysfs-class-power b/Documentation/ABI/testing/sysfs-class-power index b77e30b9014e..27edc06e2495 100644 --- a/Documentation/ABI/testing/sysfs-class-power +++ b/Documentation/ABI/testing/sysfs-class-power @@ -376,10 +376,42 @@ Description: supply. Normally this is configured based on the type of connection made (e.g. A configured SDP should output a maximum of 500mA so the input current limit is set to the same value). + Use preferably input_power_limit, and for problems that can be + solved using power limit use input_current_limit. Access: Read, Write Valid values: Represented in microamps +What: /sys/class/power_supply//input_voltage_limit +Date: May 2019 +Contact: linux-pm@vger.kernel.org +Description: + This entry configures the incoming VBUS voltage limit currently + set in the supply. Normally this is configured based on + system-level knowledge or user input (e.g. This is part of the + Pixel C's thermal management strategy to effectively limit the + input power to 5V when the screen is on to meet Google's skin + temperature targets). Note that this feature should not be + used for safety critical things. + Use preferably input_power_limit, and for problems that can be + solved using power limit use input_voltage_limit. + + Access: Read, Write + Valid values: Represented in microvolts + +What: /sys/class/power_supply//input_power_limit +Date: May 2019 +Contact: linux-pm@vger.kernel.org +Description: + This entry configures the incoming power limit currently set + in the supply. Normally this is configured based on + system-level knowledge or user input. Use preferably this + feature to limit the incoming power and use current/voltage + limit only for problems that can be solved using power limit. + + Access: Read, Write + Valid values: Represented in microwatts + What: /sys/class/power_supply//online, Date: May 2007 Contact: linux-pm@vger.kernel.org diff --git a/Documentation/ABI/testing/sysfs-class-power-wilco b/Documentation/ABI/testing/sysfs-class-power-wilco new file mode 100644 index 000000000000..da1d6ffe5e3c --- /dev/null +++ b/Documentation/ABI/testing/sysfs-class-power-wilco @@ -0,0 +1,30 @@ +What: /sys/class/power_supply/wilco-charger/charge_type +Date: April 2019 +KernelVersion: 5.2 +Description: + What charging algorithm to use: + + Standard: Fully charges battery at a standard rate. + Adaptive: Battery settings adaptively optimized based on + typical battery usage pattern. + Fast: Battery charges over a shorter period. + Trickle: Extends battery lifespan, intended for users who + primarily use their Chromebook while connected to AC. + Custom: A low and high threshold percentage is specified. + Charging begins when level drops below + charge_control_start_threshold, and ceases when + level is above charge_control_end_threshold. + +What: /sys/class/power_supply/wilco-charger/charge_control_start_threshold +Date: April 2019 +KernelVersion: 5.2 +Description: + Used when charge_type="Custom", as described above. Measured in + percentages. The valid range is [50, 95]. + +What: /sys/class/power_supply/wilco-charger/charge_control_end_threshold +Date: April 2019 +KernelVersion: 5.2 +Description: + Used when charge_type="Custom", as described above. Measured in + percentages. The valid range is [55, 100]. diff --git a/Documentation/ABI/testing/sysfs-class-powercap b/Documentation/ABI/testing/sysfs-class-powercap index db3b3ff70d84..ca491ec4e693 100644 --- a/Documentation/ABI/testing/sysfs-class-powercap +++ b/Documentation/ABI/testing/sysfs-class-powercap @@ -5,7 +5,7 @@ Contact: linux-pm@vger.kernel.org Description: The powercap/ class sub directory belongs to the power cap subsystem. Refer to - Documentation/power/powercap/powercap.txt for details. + Documentation/power/powercap/powercap.rst for details. What: /sys/class/powercap/ Date: September 2013 @@ -147,6 +147,6 @@ What: /sys/class/powercap/...//enabled Date: September 2013 KernelVersion: 3.13 Contact: linux-pm@vger.kernel.org -Description +Description: This allows to enable/disable power capping at power zone level. This applies to current power zone and its children. diff --git a/Documentation/ABI/testing/sysfs-class-switchtec b/Documentation/ABI/testing/sysfs-class-switchtec index 48cb4c15e430..76c7a661a595 100644 --- a/Documentation/ABI/testing/sysfs-class-switchtec +++ b/Documentation/ABI/testing/sysfs-class-switchtec @@ -1,6 +1,6 @@ switchtec - Microsemi Switchtec PCI Switch Management Endpoint -For details on this subsystem look at Documentation/switchtec.txt. +For details on this subsystem look at Documentation/driver-api/switchtec.rst. What: /sys/class/switchtec Date: 05-Jan-2017 diff --git a/Documentation/ABI/testing/sysfs-class-uwb_rc b/Documentation/ABI/testing/sysfs-class-uwb_rc index 85f4875d16ac..a0578751c1e3 100644 --- a/Documentation/ABI/testing/sysfs-class-uwb_rc +++ b/Documentation/ABI/testing/sysfs-class-uwb_rc @@ -125,12 +125,6 @@ Description: The EUI-48 of this device in colon separated hex octets. -What: /sys/class/uwb_rc/uwbN//BPST -Date: July 2008 -KernelVersion: 2.6.27 -Contact: linux-usb@vger.kernel.org -Description: - What: /sys/class/uwb_rc/uwbN//IEs Date: July 2008 KernelVersion: 2.6.27 diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu index 1528239f69b2..5f7d7b14fa44 100644 --- a/Documentation/ABI/testing/sysfs-devices-system-cpu +++ b/Documentation/ABI/testing/sysfs-devices-system-cpu @@ -34,7 +34,7 @@ Description: CPU topology files that describe kernel limits related to present: cpus that have been identified as being present in the system. - See Documentation/cputopology.txt for more information. + See Documentation/admin-guide/cputopology.rst for more information. What: /sys/devices/system/cpu/probe @@ -103,7 +103,7 @@ Description: CPU topology files that describe a logical CPU's relationship thread_siblings_list: human-readable list of cpu#'s hardware threads within the same core as cpu# - See Documentation/cputopology.txt for more information. + See Documentation/admin-guide/cputopology.rst for more information. What: /sys/devices/system/cpu/cpuidle/current_driver @@ -137,7 +137,8 @@ Description: Discover cpuidle policy and mechanism current_governor: (RW) displays current idle policy. Users can switch the governor at runtime by writing to this file. - See files in Documentation/cpuidle/ for more information. + See Documentation/admin-guide/pm/cpuidle.rst and + Documentation/driver-api/pm/cpuidle.rst for more information. What: /sys/devices/system/cpu/cpuX/cpuidle/stateN/name @@ -538,3 +539,26 @@ Description: Intel Energy and Performance Bias Hint (EPB) This attribute is present for all online CPUs supporting the Intel EPB feature. + +What: /sys/devices/system/cpu/umwait_control + /sys/devices/system/cpu/umwait_control/enable_c02 + /sys/devices/system/cpu/umwait_control/max_time +Date: May 2019 +Contact: Linux kernel mailing list +Description: Umwait control + + enable_c02: Read/write interface to control umwait C0.2 state + Read returns C0.2 state status: + 0: C0.2 is disabled + 1: C0.2 is enabled + + Write 'y' or '1' or 'on' to enable C0.2 state. + Write 'n' or '0' or 'off' to disable C0.2 state. + + The interface is case insensitive. + + max_time: Read/write interface to control umwait maximum time + in TSC-quanta that the CPU can reside in either C0.1 + or C0.2 state. The time is an unsigned 32-bit number. + Note that a value of zero means there is no limit. + Low order two bits must be zero. diff --git a/Documentation/ABI/testing/sysfs-driver-altera-cvp b/Documentation/ABI/testing/sysfs-driver-altera-cvp index 8cde64a71edb..fbd8078fd7ad 100644 --- a/Documentation/ABI/testing/sysfs-driver-altera-cvp +++ b/Documentation/ABI/testing/sysfs-driver-altera-cvp @@ -1,6 +1,6 @@ What: /sys/bus/pci/drivers/altera-cvp/chkcfg Date: May 2017 -Kernel Version: 4.13 +KernelVersion: 4.13 Contact: Anatolij Gustschin Description: Contains either 1 or 0 and controls if configuration diff --git a/Documentation/ABI/testing/sysfs-driver-habanalabs b/Documentation/ABI/testing/sysfs-driver-habanalabs index 78b2bcf316a3..f433fc6db3c6 100644 --- a/Documentation/ABI/testing/sysfs-driver-habanalabs +++ b/Documentation/ABI/testing/sysfs-driver-habanalabs @@ -62,18 +62,20 @@ What: /sys/class/habanalabs/hl/ic_clk Date: Jan 2019 KernelVersion: 5.1 Contact: oded.gabbay@gmail.com -Description: Allows the user to set the maximum clock frequency of the - Interconnect fabric. Writes to this parameter affect the device - only when the power management profile is set to "manual" mode. - The device IC clock might be set to lower value then the +Description: Allows the user to set the maximum clock frequency, in Hz, of + the Interconnect fabric. Writes to this parameter affect the + device only when the power management profile is set to "manual" + mode. The device IC clock might be set to lower value than the maximum. The user should read the ic_clk_curr to see the actual - frequency value of the IC + frequency value of the IC. This property is valid only for the + Goya ASIC family What: /sys/class/habanalabs/hl/ic_clk_curr Date: Jan 2019 KernelVersion: 5.1 Contact: oded.gabbay@gmail.com -Description: Displays the current clock frequency of the Interconnect fabric +Description: Displays the current clock frequency, in Hz, of the Interconnect + fabric. This property is valid only for the Goya ASIC family What: /sys/class/habanalabs/hl/infineon_ver Date: Jan 2019 @@ -92,18 +94,20 @@ What: /sys/class/habanalabs/hl/mme_clk Date: Jan 2019 KernelVersion: 5.1 Contact: oded.gabbay@gmail.com -Description: Allows the user to set the maximum clock frequency of the - MME compute engine. Writes to this parameter affect the device - only when the power management profile is set to "manual" mode. - The device MME clock might be set to lower value then the +Description: Allows the user to set the maximum clock frequency, in Hz, of + the MME compute engine. Writes to this parameter affect the + device only when the power management profile is set to "manual" + mode. The device MME clock might be set to lower value than the maximum. The user should read the mme_clk_curr to see the actual - frequency value of the MME + frequency value of the MME. This property is valid only for the + Goya ASIC family What: /sys/class/habanalabs/hl/mme_clk_curr Date: Jan 2019 KernelVersion: 5.1 Contact: oded.gabbay@gmail.com -Description: Displays the current clock frequency of the MME compute engine +Description: Displays the current clock frequency, in Hz, of the MME compute + engine. This property is valid only for the Goya ASIC family What: /sys/class/habanalabs/hl/pci_addr Date: Jan 2019 @@ -163,18 +167,20 @@ What: /sys/class/habanalabs/hl/tpc_clk Date: Jan 2019 KernelVersion: 5.1 Contact: oded.gabbay@gmail.com -Description: Allows the user to set the maximum clock frequency of the - TPC compute engines. Writes to this parameter affect the device - only when the power management profile is set to "manual" mode. - The device TPC clock might be set to lower value then the +Description: Allows the user to set the maximum clock frequency, in Hz, of + the TPC compute engines. Writes to this parameter affect the + device only when the power management profile is set to "manual" + mode. The device TPC clock might be set to lower value than the maximum. The user should read the tpc_clk_curr to see the actual - frequency value of the TPC + frequency value of the TPC. This property is valid only for + Goya ASIC family What: /sys/class/habanalabs/hl/tpc_clk_curr Date: Jan 2019 KernelVersion: 5.1 Contact: oded.gabbay@gmail.com -Description: Displays the current clock frequency of the TPC compute engines +Description: Displays the current clock frequency, in Hz, of the TPC compute + engines. This property is valid only for the Goya ASIC family What: /sys/class/habanalabs/hl/uboot_ver Date: Jan 2019 diff --git a/Documentation/ABI/testing/sysfs-driver-hid b/Documentation/ABI/testing/sysfs-driver-hid index 48942cacb0bf..a59533410871 100644 --- a/Documentation/ABI/testing/sysfs-driver-hid +++ b/Documentation/ABI/testing/sysfs-driver-hid @@ -1,6 +1,6 @@ -What: For USB devices : /sys/bus/usb/devices/-:./::./report_descriptor - For BT devices : /sys/class/bluetooth/hci/::./report_descriptor - Symlink : /sys/class/hidraw/hidraw/device/report_descriptor +What: /sys/bus/usb/devices/-:./::./report_descriptor +What: /sys/class/bluetooth/hci/::./report_descriptor +What: /sys/class/hidraw/hidraw/device/report_descriptor Date: Jan 2011 KernelVersion: 2.0.39 Contact: Alan Ott @@ -9,9 +9,9 @@ Description: When read, this file returns the device's raw binary HID This file cannot be written. Users: HIDAPI library (http://www.signal11.us/oss/hidapi) -What: For USB devices : /sys/bus/usb/devices/-:./::./country - For BT devices : /sys/class/bluetooth/hci/::./country - Symlink : /sys/class/hidraw/hidraw/device/country +What: /sys/bus/usb/devices/-:./::./country +What: /sys/class/bluetooth/hci/::./country +What: /sys/class/hidraw/hidraw/device/country Date: February 2015 KernelVersion: 3.19 Contact: Olivier Gay diff --git a/Documentation/ABI/testing/sysfs-driver-hid-roccat-kone b/Documentation/ABI/testing/sysfs-driver-hid-roccat-kone index 3ca3971109bf..8f7982c70d72 100644 --- a/Documentation/ABI/testing/sysfs-driver-hid-roccat-kone +++ b/Documentation/ABI/testing/sysfs-driver-hid-roccat-kone @@ -5,7 +5,7 @@ Description: It is possible to switch the dpi setting of the mouse with the press of a button. When read, this file returns the raw number of the actual dpi setting reported by the mouse. This number has to be further - processed to receive the real dpi value. + processed to receive the real dpi value: VALUE DPI 1 800 diff --git a/Documentation/ABI/testing/sysfs-driver-ppi b/Documentation/ABI/testing/sysfs-driver-ppi index 9921ef285899..1a56fc507689 100644 --- a/Documentation/ABI/testing/sysfs-driver-ppi +++ b/Documentation/ABI/testing/sysfs-driver-ppi @@ -1,6 +1,6 @@ What: /sys/class/tpm/tpmX/ppi/ Date: August 2012 -Kernel Version: 3.6 +KernelVersion: 3.6 Contact: xiaoyan.zhang@intel.com Description: This folder includes the attributes related with PPI (Physical diff --git a/Documentation/ABI/testing/sysfs-driver-st b/Documentation/ABI/testing/sysfs-driver-st index ba5d77008a85..88cab66fd77f 100644 --- a/Documentation/ABI/testing/sysfs-driver-st +++ b/Documentation/ABI/testing/sysfs-driver-st @@ -1,6 +1,6 @@ What: /sys/bus/scsi/drivers/st/debug_flag Date: October 2015 -Kernel Version: ?.? +KernelVersion: ?.? Contact: shane.seymour@hpe.com Description: This file allows you to turn debug output from the st driver diff --git a/Documentation/ABI/testing/sysfs-driver-wacom b/Documentation/ABI/testing/sysfs-driver-wacom index 2aa5503ee200..afc48fc163b5 100644 --- a/Documentation/ABI/testing/sysfs-driver-wacom +++ b/Documentation/ABI/testing/sysfs-driver-wacom @@ -1,6 +1,6 @@ What: /sys/bus/hid/devices/::./speed Date: April 2010 -Kernel Version: 2.6.35 +KernelVersion: 2.6.35 Contact: linux-bluetooth@vger.kernel.org Description: The /sys/bus/hid/devices/::./speed file diff --git a/Documentation/ABI/testing/sysfs-fs-f2fs b/Documentation/ABI/testing/sysfs-fs-f2fs index 91822ce25831..dca326e0ee3e 100644 --- a/Documentation/ABI/testing/sysfs-fs-f2fs +++ b/Documentation/ABI/testing/sysfs-fs-f2fs @@ -243,3 +243,11 @@ Description: - Del: echo '[h/c]!extension' > /sys/fs/f2fs//extension_list - [h] means add/del hot file extension - [c] means add/del cold file extension + +What: /sys/fs/f2fs//unusable +Date April 2019 +Contact: "Daniel Rosenberg" +Description: + If checkpoint=disable, it displays the number of blocks that are unusable. + If checkpoint=enable it displays the enumber of blocks that would be unusable + if checkpoint=disable were to be set. diff --git a/Documentation/ABI/testing/sysfs-kernel-fscaps b/Documentation/ABI/testing/sysfs-kernel-fscaps index 50a3033b5e15..bcff34665192 100644 --- a/Documentation/ABI/testing/sysfs-kernel-fscaps +++ b/Documentation/ABI/testing/sysfs-kernel-fscaps @@ -2,7 +2,7 @@ What: /sys/kernel/fscaps Date: February 2011 KernelVersion: 2.6.38 Contact: Ludwig Nussel -Description +Description: Shows whether file system capabilities are honored when executing a binary diff --git a/Documentation/ABI/testing/sysfs-kernel-iommu_groups b/Documentation/ABI/testing/sysfs-kernel-iommu_groups index 35c64e00b35c..017f5bc3920c 100644 --- a/Documentation/ABI/testing/sysfs-kernel-iommu_groups +++ b/Documentation/ABI/testing/sysfs-kernel-iommu_groups @@ -24,3 +24,12 @@ Description: /sys/kernel/iommu_groups/reserved_regions list IOVA region is described on a single line: the 1st field is the base IOVA, the second is the end IOVA and the third field describes the type of the region. + +What: /sys/kernel/iommu_groups/reserved_regions +Date: June 2019 +KernelVersion: v5.3 +Contact: Eric Auger +Description: In case an RMRR is used only by graphics or USB devices + it is now exposed as "direct-relaxable" instead of "direct". + In device assignment use case, for instance, those RMRR + are considered to be relaxable and safe. diff --git a/Documentation/ABI/testing/sysfs-kernel-uids b/Documentation/ABI/testing/sysfs-kernel-uids index 28f14695a852..4182b7061816 100644 --- a/Documentation/ABI/testing/sysfs-kernel-uids +++ b/Documentation/ABI/testing/sysfs-kernel-uids @@ -11,4 +11,4 @@ Description: example would be, if User A has shares = 1024 and user B has shares = 2048, User B will get twice the CPU bandwidth user A will. For more details refer - Documentation/scheduler/sched-design-CFS.txt + Documentation/scheduler/sched-design-CFS.rst diff --git a/Documentation/ABI/testing/sysfs-kernel-vmcoreinfo b/Documentation/ABI/testing/sysfs-kernel-vmcoreinfo index 7bd81168e063..1f1087a5f075 100644 --- a/Documentation/ABI/testing/sysfs-kernel-vmcoreinfo +++ b/Documentation/ABI/testing/sysfs-kernel-vmcoreinfo @@ -4,7 +4,7 @@ KernelVersion: 2.6.24 Contact: Ken'ichi Ohmichi Kexec Mailing List Vivek Goyal -Description +Description: Shows physical address and size of vmcoreinfo ELF note. First value contains physical address of note in hex and second value contains the size of note in hex. This ELF diff --git a/Documentation/ABI/testing/sysfs-platform-asus-laptop b/Documentation/ABI/testing/sysfs-platform-asus-laptop index cd9d667c3da2..8b0e8205a6a2 100644 --- a/Documentation/ABI/testing/sysfs-platform-asus-laptop +++ b/Documentation/ABI/testing/sysfs-platform-asus-laptop @@ -31,7 +31,7 @@ Description: To control the LED display, use the following : echo 0x0T000DDD > /sys/devices/platform/asus_laptop/ where T control the 3 letters display, and DDD the 3 digits display. - The DDD table can be found in Documentation/laptops/asus-laptop.txt + The DDD table can be found in Documentation/admin-guide/laptops/asus-laptop.rst What: /sys/devices/platform/asus_laptop/bluetooth Date: January 2007 diff --git a/Documentation/ABI/testing/sysfs-platform-asus-wmi b/Documentation/ABI/testing/sysfs-platform-asus-wmi index 019e1e29370e..9e99f2909612 100644 --- a/Documentation/ABI/testing/sysfs-platform-asus-wmi +++ b/Documentation/ABI/testing/sysfs-platform-asus-wmi @@ -36,3 +36,13 @@ KernelVersion: 3.5 Contact: "AceLan Kao" Description: Resume on lid open. 1 means on, 0 means off. + +What: /sys/devices/platform//fan_boost_mode +Date: Sep 2019 +KernelVersion: 5.3 +Contact: "Yurii Pavlovskyi" +Description: + Fan boost mode: + * 0 - normal, + * 1 - overboost, + * 2 - silent diff --git a/Documentation/ABI/testing/sysfs-platform-i2c-demux-pinctrl b/Documentation/ABI/testing/sysfs-platform-i2c-demux-pinctrl index 3c3514815cd5..c394b808be19 100644 --- a/Documentation/ABI/testing/sysfs-platform-i2c-demux-pinctrl +++ b/Documentation/ABI/testing/sysfs-platform-i2c-demux-pinctrl @@ -1,7 +1,7 @@ What: /sys/devices/platform//available_masters Date: January 2016 KernelVersion: 4.6 -Contact: Wolfram Sang +Contact: Wolfram Sang Description: Reading the file will give you a list of masters which can be selected for a demultiplexed bus. The format is @@ -12,7 +12,7 @@ Description: What: /sys/devices/platform//current_master Date: January 2016 KernelVersion: 4.6 -Contact: Wolfram Sang +Contact: Wolfram Sang Description: This file selects/shows the active I2C master for a demultiplexed bus. It uses the value from the file 'available_masters'. diff --git a/Documentation/ABI/testing/sysfs-platform-wilco-ec b/Documentation/ABI/testing/sysfs-platform-wilco-ec new file mode 100644 index 000000000000..8827a734f933 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-platform-wilco-ec @@ -0,0 +1,40 @@ +What: /sys/bus/platform/devices/GOOG000C\:00/boot_on_ac +Date: April 2019 +KernelVersion: 5.3 +Description: + Boot on AC is a policy which makes the device boot from S5 + when AC power is connected. This is useful for users who + want to run their device headless or with a dock. + + Input should be parseable by kstrtou8() to 0 or 1. + +What: /sys/bus/platform/devices/GOOG000C\:00/build_date +Date: May 2019 +KernelVersion: 5.3 +Description: + Display Wilco Embedded Controller firmware build date. + Output will a MM/DD/YY string. + +What: /sys/bus/platform/devices/GOOG000C\:00/build_revision +Date: May 2019 +KernelVersion: 5.3 +Description: + Display Wilco Embedded Controller build revision. + Output will a version string be similar to the example below: + d2592cae0 + +What: /sys/bus/platform/devices/GOOG000C\:00/model_number +Date: May 2019 +KernelVersion: 5.3 +Description: + Display Wilco Embedded Controller model number. + Output will a version string be similar to the example below: + 08B6 + +What: /sys/bus/platform/devices/GOOG000C\:00/version +Date: May 2019 +KernelVersion: 5.3 +Description: + Display Wilco Embedded Controller firmware version. + The format of the string is x.y.z. Where x is major, y is minor + and z is the build number. For example: 95.00.06 diff --git a/Documentation/ABI/testing/sysfs-power b/Documentation/ABI/testing/sysfs-power index 18b7dc929234..3c5130355011 100644 --- a/Documentation/ABI/testing/sysfs-power +++ b/Documentation/ABI/testing/sysfs-power @@ -300,4 +300,4 @@ Description: attempt. Using this sysfs file will override any values that were - set using the kernel command line for disk offset. \ No newline at end of file + set using the kernel command line for disk offset. diff --git a/Documentation/logo.txt b/Documentation/COPYING-logo similarity index 100% rename from Documentation/logo.txt rename to Documentation/COPYING-logo diff --git a/Documentation/DMA-API-HOWTO.txt b/Documentation/DMA-API-HOWTO.txt index cb712a02f59f..358d495456d1 100644 --- a/Documentation/DMA-API-HOWTO.txt +++ b/Documentation/DMA-API-HOWTO.txt @@ -212,7 +212,7 @@ The standard 64-bit addressing device would do something like this:: If the device only supports 32-bit addressing for descriptors in the coherent allocations, but supports full 64-bits for streaming mappings -it would look like this: +it would look like this:: if (dma_set_mask(dev, DMA_BIT_MASK(64))) { dev_warn(dev, "mydev: No suitable DMA available\n"); diff --git a/Documentation/DMA-API.txt b/Documentation/DMA-API.txt index 0076150fdccb..e47c63bd4887 100644 --- a/Documentation/DMA-API.txt +++ b/Documentation/DMA-API.txt @@ -198,7 +198,7 @@ call to set the mask to the value returned. :: size_t - dma_direct_max_mapping_size(struct device *dev); + dma_max_mapping_size(struct device *dev); Returns the maximum size of a mapping for the device. The size parameter of the mapping functions like dma_map_single(), dma_map_page() and diff --git a/Documentation/EDID/HOWTO.txt b/Documentation/EDID/HOWTO.txt deleted file mode 100644 index 539871c3b785..000000000000 --- a/Documentation/EDID/HOWTO.txt +++ /dev/null @@ -1,49 +0,0 @@ -In the good old days when graphics parameters were configured explicitly -in a file called xorg.conf, even broken hardware could be managed. - -Today, with the advent of Kernel Mode Setting, a graphics board is -either correctly working because all components follow the standards - -or the computer is unusable, because the screen remains dark after -booting or it displays the wrong area. Cases when this happens are: -- The graphics board does not recognize the monitor. -- The graphics board is unable to detect any EDID data. -- The graphics board incorrectly forwards EDID data to the driver. -- The monitor sends no or bogus EDID data. -- A KVM sends its own EDID data instead of querying the connected monitor. -Adding the kernel parameter "nomodeset" helps in most cases, but causes -restrictions later on. - -As a remedy for such situations, the kernel configuration item -CONFIG_DRM_LOAD_EDID_FIRMWARE was introduced. It allows to provide an -individually prepared or corrected EDID data set in the /lib/firmware -directory from where it is loaded via the firmware interface. The code -(see drivers/gpu/drm/drm_edid_load.c) contains built-in data sets for -commonly used screen resolutions (800x600, 1024x768, 1280x1024, 1600x1200, -1680x1050, 1920x1080) as binary blobs, but the kernel source tree does -not contain code to create these data. In order to elucidate the origin -of the built-in binary EDID blobs and to facilitate the creation of -individual data for a specific misbehaving monitor, commented sources -and a Makefile environment are given here. - -To create binary EDID and C source code files from the existing data -material, simply type "make". - -If you want to create your own EDID file, copy the file 1024x768.S, -replace the settings with your own data and add a new target to the -Makefile. Please note that the EDID data structure expects the timing -values in a different way as compared to the standard X11 format. - -X11: -HTimings: hdisp hsyncstart hsyncend htotal -VTimings: vdisp vsyncstart vsyncend vtotal - -EDID: -#define XPIX hdisp -#define XBLANK htotal-hdisp -#define XOFFSET hsyncstart-hdisp -#define XPULSE hsyncend-hsyncstart - -#define YPIX vdisp -#define YBLANK vtotal-vdisp -#define YOFFSET vsyncstart-vdisp -#define YPULSE vsyncend-vsyncstart diff --git a/Documentation/Kconfig b/Documentation/Kconfig new file mode 100644 index 000000000000..66046fa1c341 --- /dev/null +++ b/Documentation/Kconfig @@ -0,0 +1,13 @@ +config WARN_MISSING_DOCUMENTS + + bool "Warn if there's a missing documentation file" + depends on COMPILE_TEST + help + It is not uncommon that a document gets renamed. + This option makes the Kernel to check for missing dependencies, + warning when something is missing. Works only if the Kernel + is built from a git tree. + + If unsure, select 'N'. + + diff --git a/Documentation/Makefile b/Documentation/Makefile index e889e7cb8511..e145e4db508b 100644 --- a/Documentation/Makefile +++ b/Documentation/Makefile @@ -4,6 +4,11 @@ subdir-y := devicetree/bindings/ +# Check for broken documentation file references +ifeq ($(CONFIG_WARN_MISSING_DOCUMENTS),y) +$(shell $(srctree)/scripts/documentation-file-ref-check --warn) +endif + # You can set these variables from the command line. SPHINXBUILD = sphinx-build SPHINXOPTS = @@ -23,11 +28,13 @@ ifeq ($(HAVE_SPHINX),0) .DEFAULT: $(warning The '$(SPHINXBUILD)' command was not found. Make sure you have Sphinx installed and in PATH, or set the SPHINXBUILD make variable to point to the full path of the '$(SPHINXBUILD)' executable.) @echo - @./scripts/sphinx-pre-install + @$(srctree)/scripts/sphinx-pre-install @echo " SKIP Sphinx $@ target." else # HAVE_SPHINX +export SPHINXOPTS = $(shell perl -e 'open IN,"sphinx-build --version 2>&1 |"; while () { if (m/([\d\.]+)/) { print "-jauto" if ($$1 >= "1.7") } ;} close IN') + # User-friendly check for pdflatex and latexmk HAVE_PDFLATEX := $(shell if which $(PDFLATEX) >/dev/null 2>&1; then echo 1; else echo 0; fi) HAVE_LATEXMK := $(shell if which latexmk >/dev/null 2>&1; then echo 1; else echo 0; fi) @@ -70,12 +77,14 @@ quiet_cmd_sphinx = SPHINX $@ --> file://$(abspath $(BUILDDIR)/$3/$4) $(abspath $(BUILDDIR)/$3/$4) htmldocs: + @$(srctree)/scripts/sphinx-pre-install --version-check @+$(foreach var,$(SPHINXDIRS),$(call loop_cmd,sphinx,html,$(var),,$(var))) linkcheckdocs: @$(foreach var,$(SPHINXDIRS),$(call loop_cmd,sphinx,linkcheck,$(var),,$(var))) latexdocs: + @$(srctree)/scripts/sphinx-pre-install --version-check @+$(foreach var,$(SPHINXDIRS),$(call loop_cmd,sphinx,latex,$(var),latex,$(var))) ifeq ($(HAVE_PDFLATEX),0) @@ -87,14 +96,17 @@ pdfdocs: else # HAVE_PDFLATEX pdfdocs: latexdocs + @$(srctree)/scripts/sphinx-pre-install --version-check $(foreach var,$(SPHINXDIRS), $(MAKE) PDFLATEX="$(PDFLATEX)" LATEXOPTS="$(LATEXOPTS)" -C $(BUILDDIR)/$(var)/latex || exit;) endif # HAVE_PDFLATEX epubdocs: + @$(srctree)/scripts/sphinx-pre-install --version-check @+$(foreach var,$(SPHINXDIRS),$(call loop_cmd,sphinx,epub,$(var),epub,$(var))) xmldocs: + @$(srctree)/scripts/sphinx-pre-install --version-check @+$(foreach var,$(SPHINXDIRS),$(call loop_cmd,sphinx,xml,$(var),xml,$(var))) endif # HAVE_SPHINX diff --git a/Documentation/PCI/MSI-HOWTO.txt b/Documentation/PCI/MSI-HOWTO.txt deleted file mode 100644 index 618e13d5e276..000000000000 --- a/Documentation/PCI/MSI-HOWTO.txt +++ /dev/null @@ -1,270 +0,0 @@ - The MSI Driver Guide HOWTO - Tom L Nguyen tom.l.nguyen@intel.com - 10/03/2003 - Revised Feb 12, 2004 by Martine Silbermann - email: Martine.Silbermann@hp.com - Revised Jun 25, 2004 by Tom L Nguyen - Revised Jul 9, 2008 by Matthew Wilcox - Copyright 2003, 2008 Intel Corporation - -1. About this guide - -This guide describes the basics of Message Signaled Interrupts (MSIs), -the advantages of using MSI over traditional interrupt mechanisms, how -to change your driver to use MSI or MSI-X and some basic diagnostics to -try if a device doesn't support MSIs. - - -2. What are MSIs? - -A Message Signaled Interrupt is a write from the device to a special -address which causes an interrupt to be received by the CPU. - -The MSI capability was first specified in PCI 2.2 and was later enhanced -in PCI 3.0 to allow each interrupt to be masked individually. The MSI-X -capability was also introduced with PCI 3.0. It supports more interrupts -per device than MSI and allows interrupts to be independently configured. - -Devices may support both MSI and MSI-X, but only one can be enabled at -a time. - - -3. Why use MSIs? - -There are three reasons why using MSIs can give an advantage over -traditional pin-based interrupts. - -Pin-based PCI interrupts are often shared amongst several devices. -To support this, the kernel must call each interrupt handler associated -with an interrupt, which leads to reduced performance for the system as -a whole. MSIs are never shared, so this problem cannot arise. - -When a device writes data to memory, then raises a pin-based interrupt, -it is possible that the interrupt may arrive before all the data has -arrived in memory (this becomes more likely with devices behind PCI-PCI -bridges). In order to ensure that all the data has arrived in memory, -the interrupt handler must read a register on the device which raised -the interrupt. PCI transaction ordering rules require that all the data -arrive in memory before the value may be returned from the register. -Using MSIs avoids this problem as the interrupt-generating write cannot -pass the data writes, so by the time the interrupt is raised, the driver -knows that all the data has arrived in memory. - -PCI devices can only support a single pin-based interrupt per function. -Often drivers have to query the device to find out what event has -occurred, slowing down interrupt handling for the common case. With -MSIs, a device can support more interrupts, allowing each interrupt -to be specialised to a different purpose. One possible design gives -infrequent conditions (such as errors) their own interrupt which allows -the driver to handle the normal interrupt handling path more efficiently. -Other possible designs include giving one interrupt to each packet queue -in a network card or each port in a storage controller. - - -4. How to use MSIs - -PCI devices are initialised to use pin-based interrupts. The device -driver has to set up the device to use MSI or MSI-X. Not all machines -support MSIs correctly, and for those machines, the APIs described below -will simply fail and the device will continue to use pin-based interrupts. - -4.1 Include kernel support for MSIs - -To support MSI or MSI-X, the kernel must be built with the CONFIG_PCI_MSI -option enabled. This option is only available on some architectures, -and it may depend on some other options also being set. For example, -on x86, you must also enable X86_UP_APIC or SMP in order to see the -CONFIG_PCI_MSI option. - -4.2 Using MSI - -Most of the hard work is done for the driver in the PCI layer. The driver -simply has to request that the PCI layer set up the MSI capability for this -device. - -To automatically use MSI or MSI-X interrupt vectors, use the following -function: - - int pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs, - unsigned int max_vecs, unsigned int flags); - -which allocates up to max_vecs interrupt vectors for a PCI device. It -returns the number of vectors allocated or a negative error. If the device -has a requirements for a minimum number of vectors the driver can pass a -min_vecs argument set to this limit, and the PCI core will return -ENOSPC -if it can't meet the minimum number of vectors. - -The flags argument is used to specify which type of interrupt can be used -by the device and the driver (PCI_IRQ_LEGACY, PCI_IRQ_MSI, PCI_IRQ_MSIX). -A convenient short-hand (PCI_IRQ_ALL_TYPES) is also available to ask for -any possible kind of interrupt. If the PCI_IRQ_AFFINITY flag is set, -pci_alloc_irq_vectors() will spread the interrupts around the available CPUs. - -To get the Linux IRQ numbers passed to request_irq() and free_irq() and the -vectors, use the following function: - - int pci_irq_vector(struct pci_dev *dev, unsigned int nr); - -Any allocated resources should be freed before removing the device using -the following function: - - void pci_free_irq_vectors(struct pci_dev *dev); - -If a device supports both MSI-X and MSI capabilities, this API will use the -MSI-X facilities in preference to the MSI facilities. MSI-X supports any -number of interrupts between 1 and 2048. In contrast, MSI is restricted to -a maximum of 32 interrupts (and must be a power of two). In addition, the -MSI interrupt vectors must be allocated consecutively, so the system might -not be able to allocate as many vectors for MSI as it could for MSI-X. On -some platforms, MSI interrupts must all be targeted at the same set of CPUs -whereas MSI-X interrupts can all be targeted at different CPUs. - -If a device supports neither MSI-X or MSI it will fall back to a single -legacy IRQ vector. - -The typical usage of MSI or MSI-X interrupts is to allocate as many vectors -as possible, likely up to the limit supported by the device. If nvec is -larger than the number supported by the device it will automatically be -capped to the supported limit, so there is no need to query the number of -vectors supported beforehand: - - nvec = pci_alloc_irq_vectors(pdev, 1, nvec, PCI_IRQ_ALL_TYPES) - if (nvec < 0) - goto out_err; - -If a driver is unable or unwilling to deal with a variable number of MSI -interrupts it can request a particular number of interrupts by passing that -number to pci_alloc_irq_vectors() function as both 'min_vecs' and -'max_vecs' parameters: - - ret = pci_alloc_irq_vectors(pdev, nvec, nvec, PCI_IRQ_ALL_TYPES); - if (ret < 0) - goto out_err; - -The most notorious example of the request type described above is enabling -the single MSI mode for a device. It could be done by passing two 1s as -'min_vecs' and 'max_vecs': - - ret = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_ALL_TYPES); - if (ret < 0) - goto out_err; - -Some devices might not support using legacy line interrupts, in which case -the driver can specify that only MSI or MSI-X is acceptable: - - nvec = pci_alloc_irq_vectors(pdev, 1, nvec, PCI_IRQ_MSI | PCI_IRQ_MSIX); - if (nvec < 0) - goto out_err; - -4.3 Legacy APIs - -The following old APIs to enable and disable MSI or MSI-X interrupts should -not be used in new code: - - pci_enable_msi() /* deprecated */ - pci_disable_msi() /* deprecated */ - pci_enable_msix_range() /* deprecated */ - pci_enable_msix_exact() /* deprecated */ - pci_disable_msix() /* deprecated */ - -Additionally there are APIs to provide the number of supported MSI or MSI-X -vectors: pci_msi_vec_count() and pci_msix_vec_count(). In general these -should be avoided in favor of letting pci_alloc_irq_vectors() cap the -number of vectors. If you have a legitimate special use case for the count -of vectors we might have to revisit that decision and add a -pci_nr_irq_vectors() helper that handles MSI and MSI-X transparently. - -4.4 Considerations when using MSIs - -4.4.1 Spinlocks - -Most device drivers have a per-device spinlock which is taken in the -interrupt handler. With pin-based interrupts or a single MSI, it is not -necessary to disable interrupts (Linux guarantees the same interrupt will -not be re-entered). If a device uses multiple interrupts, the driver -must disable interrupts while the lock is held. If the device sends -a different interrupt, the driver will deadlock trying to recursively -acquire the spinlock. Such deadlocks can be avoided by using -spin_lock_irqsave() or spin_lock_irq() which disable local interrupts -and acquire the lock (see Documentation/kernel-hacking/locking.rst). - -4.5 How to tell whether MSI/MSI-X is enabled on a device - -Using 'lspci -v' (as root) may show some devices with "MSI", "Message -Signalled Interrupts" or "MSI-X" capabilities. Each of these capabilities -has an 'Enable' flag which is followed with either "+" (enabled) -or "-" (disabled). - - -5. MSI quirks - -Several PCI chipsets or devices are known not to support MSIs. -The PCI stack provides three ways to disable MSIs: - -1. globally -2. on all devices behind a specific bridge -3. on a single device - -5.1. Disabling MSIs globally - -Some host chipsets simply don't support MSIs properly. If we're -lucky, the manufacturer knows this and has indicated it in the ACPI -FADT table. In this case, Linux automatically disables MSIs. -Some boards don't include this information in the table and so we have -to detect them ourselves. The complete list of these is found near the -quirk_disable_all_msi() function in drivers/pci/quirks.c. - -If you have a board which has problems with MSIs, you can pass pci=nomsi -on the kernel command line to disable MSIs on all devices. It would be -in your best interests to report the problem to linux-pci@vger.kernel.org -including a full 'lspci -v' so we can add the quirks to the kernel. - -5.2. Disabling MSIs below a bridge - -Some PCI bridges are not able to route MSIs between busses properly. -In this case, MSIs must be disabled on all devices behind the bridge. - -Some bridges allow you to enable MSIs by changing some bits in their -PCI configuration space (especially the Hypertransport chipsets such -as the nVidia nForce and Serverworks HT2000). As with host chipsets, -Linux mostly knows about them and automatically enables MSIs if it can. -If you have a bridge unknown to Linux, you can enable -MSIs in configuration space using whatever method you know works, then -enable MSIs on that bridge by doing: - - echo 1 > /sys/bus/pci/devices/$bridge/msi_bus - -where $bridge is the PCI address of the bridge you've enabled (eg -0000:00:0e.0). - -To disable MSIs, echo 0 instead of 1. Changing this value should be -done with caution as it could break interrupt handling for all devices -below this bridge. - -Again, please notify linux-pci@vger.kernel.org of any bridges that need -special handling. - -5.3. Disabling MSIs on a single device - -Some devices are known to have faulty MSI implementations. Usually this -is handled in the individual device driver, but occasionally it's necessary -to handle this with a quirk. Some drivers have an option to disable use -of MSI. While this is a convenient workaround for the driver author, -it is not good practice, and should not be emulated. - -5.4. Finding why MSIs are disabled on a device - -From the above three sections, you can see that there are many reasons -why MSIs may not be enabled for a given device. Your first step should -be to examine your dmesg carefully to determine whether MSIs are enabled -for your machine. You should also check your .config to be sure you -have enabled CONFIG_PCI_MSI. - -Then, 'lspci -t' gives the list of bridges above a device. Reading -/sys/bus/pci/devices/*/msi_bus will tell you whether MSIs are enabled (1) -or disabled (0). If 0 is found in any of the msi_bus files belonging -to bridges between the PCI root and the device, MSIs are disabled. - -It is also worth checking the device driver to see whether it supports MSIs. -For example, it may contain calls to pci_irq_alloc_vectors() with the -PCI_IRQ_MSI or PCI_IRQ_MSIX flags. diff --git a/Documentation/PCI/PCIEBUS-HOWTO.txt b/Documentation/PCI/PCIEBUS-HOWTO.txt deleted file mode 100644 index 15f0bb3b5045..000000000000 --- a/Documentation/PCI/PCIEBUS-HOWTO.txt +++ /dev/null @@ -1,198 +0,0 @@ - The PCI Express Port Bus Driver Guide HOWTO - Tom L Nguyen tom.l.nguyen@intel.com - 11/03/2004 - -1. About this guide - -This guide describes the basics of the PCI Express Port Bus driver -and provides information on how to enable the service drivers to -register/unregister with the PCI Express Port Bus Driver. - -2. Copyright 2004 Intel Corporation - -3. What is the PCI Express Port Bus Driver - -A PCI Express Port is a logical PCI-PCI Bridge structure. There -are two types of PCI Express Port: the Root Port and the Switch -Port. The Root Port originates a PCI Express link from a PCI Express -Root Complex and the Switch Port connects PCI Express links to -internal logical PCI buses. The Switch Port, which has its secondary -bus representing the switch's internal routing logic, is called the -switch's Upstream Port. The switch's Downstream Port is bridging from -switch's internal routing bus to a bus representing the downstream -PCI Express link from the PCI Express Switch. - -A PCI Express Port can provide up to four distinct functions, -referred to in this document as services, depending on its port type. -PCI Express Port's services include native hotplug support (HP), -power management event support (PME), advanced error reporting -support (AER), and virtual channel support (VC). These services may -be handled by a single complex driver or be individually distributed -and handled by corresponding service drivers. - -4. Why use the PCI Express Port Bus Driver? - -In existing Linux kernels, the Linux Device Driver Model allows a -physical device to be handled by only a single driver. The PCI -Express Port is a PCI-PCI Bridge device with multiple distinct -services. To maintain a clean and simple solution each service -may have its own software service driver. In this case several -service drivers will compete for a single PCI-PCI Bridge device. -For example, if the PCI Express Root Port native hotplug service -driver is loaded first, it claims a PCI-PCI Bridge Root Port. The -kernel therefore does not load other service drivers for that Root -Port. In other words, it is impossible to have multiple service -drivers load and run on a PCI-PCI Bridge device simultaneously -using the current driver model. - -To enable multiple service drivers running simultaneously requires -having a PCI Express Port Bus driver, which manages all populated -PCI Express Ports and distributes all provided service requests -to the corresponding service drivers as required. Some key -advantages of using the PCI Express Port Bus driver are listed below: - - - Allow multiple service drivers to run simultaneously on - a PCI-PCI Bridge Port device. - - - Allow service drivers implemented in an independent - staged approach. - - - Allow one service driver to run on multiple PCI-PCI Bridge - Port devices. - - - Manage and distribute resources of a PCI-PCI Bridge Port - device to requested service drivers. - -5. Configuring the PCI Express Port Bus Driver vs. Service Drivers - -5.1 Including the PCI Express Port Bus Driver Support into the Kernel - -Including the PCI Express Port Bus driver depends on whether the PCI -Express support is included in the kernel config. The kernel will -automatically include the PCI Express Port Bus driver as a kernel -driver when the PCI Express support is enabled in the kernel. - -5.2 Enabling Service Driver Support - -PCI device drivers are implemented based on Linux Device Driver Model. -All service drivers are PCI device drivers. As discussed above, it is -impossible to load any service driver once the kernel has loaded the -PCI Express Port Bus Driver. To meet the PCI Express Port Bus Driver -Model requires some minimal changes on existing service drivers that -imposes no impact on the functionality of existing service drivers. - -A service driver is required to use the two APIs shown below to -register its service with the PCI Express Port Bus driver (see -section 5.2.1 & 5.2.2). It is important that a service driver -initializes the pcie_port_service_driver data structure, included in -header file /include/linux/pcieport_if.h, before calling these APIs. -Failure to do so will result an identity mismatch, which prevents -the PCI Express Port Bus driver from loading a service driver. - -5.2.1 pcie_port_service_register - -int pcie_port_service_register(struct pcie_port_service_driver *new) - -This API replaces the Linux Driver Model's pci_register_driver API. A -service driver should always calls pcie_port_service_register at -module init. Note that after service driver being loaded, calls -such as pci_enable_device(dev) and pci_set_master(dev) are no longer -necessary since these calls are executed by the PCI Port Bus driver. - -5.2.2 pcie_port_service_unregister - -void pcie_port_service_unregister(struct pcie_port_service_driver *new) - -pcie_port_service_unregister replaces the Linux Driver Model's -pci_unregister_driver. It's always called by service driver when a -module exits. - -5.2.3 Sample Code - -Below is sample service driver code to initialize the port service -driver data structure. - -static struct pcie_port_service_id service_id[] = { { - .vendor = PCI_ANY_ID, - .device = PCI_ANY_ID, - .port_type = PCIE_RC_PORT, - .service_type = PCIE_PORT_SERVICE_AER, - }, { /* end: all zeroes */ } -}; - -static struct pcie_port_service_driver root_aerdrv = { - .name = (char *)device_name, - .id_table = &service_id[0], - - .probe = aerdrv_load, - .remove = aerdrv_unload, - - .suspend = aerdrv_suspend, - .resume = aerdrv_resume, -}; - -Below is a sample code for registering/unregistering a service -driver. - -static int __init aerdrv_service_init(void) -{ - int retval = 0; - - retval = pcie_port_service_register(&root_aerdrv); - if (!retval) { - /* - * FIX ME - */ - } - return retval; -} - -static void __exit aerdrv_service_exit(void) -{ - pcie_port_service_unregister(&root_aerdrv); -} - -module_init(aerdrv_service_init); -module_exit(aerdrv_service_exit); - -6. Possible Resource Conflicts - -Since all service drivers of a PCI-PCI Bridge Port device are -allowed to run simultaneously, below lists a few of possible resource -conflicts with proposed solutions. - -6.1 MSI and MSI-X Vector Resource - -Once MSI or MSI-X interrupts are enabled on a device, it stays in this -mode until they are disabled again. Since service drivers of the same -PCI-PCI Bridge port share the same physical device, if an individual -service driver enables or disables MSI/MSI-X mode it may result -unpredictable behavior. - -To avoid this situation all service drivers are not permitted to -switch interrupt mode on its device. The PCI Express Port Bus driver -is responsible for determining the interrupt mode and this should be -transparent to service drivers. Service drivers need to know only -the vector IRQ assigned to the field irq of struct pcie_device, which -is passed in when the PCI Express Port Bus driver probes each service -driver. Service drivers should use (struct pcie_device*)dev->irq to -call request_irq/free_irq. In addition, the interrupt mode is stored -in the field interrupt_mode of struct pcie_device. - -6.3 PCI Memory/IO Mapped Regions - -Service drivers for PCI Express Power Management (PME), Advanced -Error Reporting (AER), Hot-Plug (HP) and Virtual Channel (VC) access -PCI configuration space on the PCI Express port. In all cases the -registers accessed are independent of each other. This patch assumes -that all service drivers will be well behaved and not overwrite -other service driver's configuration settings. - -6.4 PCI Config Registers - -Each service driver runs its PCI config operations on its own -capability structure except the PCI Express capability structure, in -which Root Control register and Device Control register are shared -between PME and AER. This patch assumes that all service drivers -will be well behaved and not overwrite other service driver's -configuration settings. diff --git a/Documentation/PCI/acpi-info.rst b/Documentation/PCI/acpi-info.rst new file mode 100644 index 000000000000..060217081c79 --- /dev/null +++ b/Documentation/PCI/acpi-info.rst @@ -0,0 +1,192 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================================== +ACPI considerations for PCI host bridges +======================================== + +The general rule is that the ACPI namespace should describe everything the +OS might use unless there's another way for the OS to find it [1, 2]. + +For example, there's no standard hardware mechanism for enumerating PCI +host bridges, so the ACPI namespace must describe each host bridge, the +method for accessing PCI config space below it, the address space windows +the host bridge forwards to PCI (using _CRS), and the routing of legacy +INTx interrupts (using _PRT). + +PCI devices, which are below the host bridge, generally do not need to be +described via ACPI. The OS can discover them via the standard PCI +enumeration mechanism, using config accesses to discover and identify +devices and read and size their BARs. However, ACPI may describe PCI +devices if it provides power management or hotplug functionality for them +or if the device has INTx interrupts connected by platform interrupt +controllers and a _PRT is needed to describe those connections. + +ACPI resource description is done via _CRS objects of devices in the ACPI +namespace [2].   The _CRS is like a generalized PCI BAR: the OS can read +_CRS and figure out what resource is being consumed even if it doesn't have +a driver for the device [3].  That's important because it means an old OS +can work correctly even on a system with new devices unknown to the OS. +The new devices might not do anything, but the OS can at least make sure no +resources conflict with them. + +Static tables like MCFG, HPET, ECDT, etc., are *not* mechanisms for +reserving address space. The static tables are for things the OS needs to +know early in boot, before it can parse the ACPI namespace. If a new table +is defined, an old OS needs to operate correctly even though it ignores the +table. _CRS allows that because it is generic and understood by the old +OS; a static table does not. + +If the OS is expected to manage a non-discoverable device described via +ACPI, that device will have a specific _HID/_CID that tells the OS what +driver to bind to it, and the _CRS tells the OS and the driver where the +device's registers are. + +PCI host bridges are PNP0A03 or PNP0A08 devices.  Their _CRS should +describe all the address space they consume.  This includes all the windows +they forward down to the PCI bus, as well as registers of the host bridge +itself that are not forwarded to PCI.  The host bridge registers include +things like secondary/subordinate bus registers that determine the bus +range below the bridge, window registers that describe the apertures, etc. +These are all device-specific, non-architected things, so the only way a +PNP0A03/PNP0A08 driver can manage them is via _PRS/_CRS/_SRS, which contain +the device-specific details.  The host bridge registers also include ECAM +space, since it is consumed by the host bridge. + +ACPI defines a Consumer/Producer bit to distinguish the bridge registers +("Consumer") from the bridge apertures ("Producer") [4, 5], but early +BIOSes didn't use that bit correctly. The result is that the current ACPI +spec defines Consumer/Producer only for the Extended Address Space +descriptors; the bit should be ignored in the older QWord/DWord/Word +Address Space descriptors. Consequently, OSes have to assume all +QWord/DWord/Word descriptors are windows. + +Prior to the addition of Extended Address Space descriptors, the failure of +Consumer/Producer meant there was no way to describe bridge registers in +the PNP0A03/PNP0A08 device itself. The workaround was to describe the +bridge registers (including ECAM space) in PNP0C02 catch-all devices [6]. +With the exception of ECAM, the bridge register space is device-specific +anyway, so the generic PNP0A03/PNP0A08 driver (pci_root.c) has no need to +know about it.   + +New architectures should be able to use "Consumer" Extended Address Space +descriptors in the PNP0A03 device for bridge registers, including ECAM, +although a strict interpretation of [6] might prohibit this. Old x86 and +ia64 kernels assume all address space descriptors, including "Consumer" +Extended Address Space ones, are windows, so it would not be safe to +describe bridge registers this way on those architectures. + +PNP0C02 "motherboard" devices are basically a catch-all.  There's no +programming model for them other than "don't use these resources for +anything else."  So a PNP0C02 _CRS should claim any address space that is +(1) not claimed by _CRS under any other device object in the ACPI namespace +and (2) should not be assigned by the OS to something else. + +The PCIe spec requires the Enhanced Configuration Access Method (ECAM) +unless there's a standard firmware interface for config access, e.g., the +ia64 SAL interface [7]. A host bridge consumes ECAM memory address space +and converts memory accesses into PCI configuration accesses. The spec +defines the ECAM address space layout and functionality; only the base of +the address space is device-specific. An ACPI OS learns the base address +from either the static MCFG table or a _CBA method in the PNP0A03 device. + +The MCFG table must describe the ECAM space of non-hot pluggable host +bridges [8]. Since MCFG is a static table and can't be updated by hotplug, +a _CBA method in the PNP0A03 device describes the ECAM space of a +hot-pluggable host bridge [9]. Note that for both MCFG and _CBA, the base +address always corresponds to bus 0, even if the bus range below the bridge +(which is reported via _CRS) doesn't start at 0. + + +[1] ACPI 6.2, sec 6.1: + For any device that is on a non-enumerable type of bus (for example, an + ISA bus), OSPM enumerates the devices' identifier(s) and the ACPI + system firmware must supply an _HID object ... for each device to + enable OSPM to do that. + +[2] ACPI 6.2, sec 3.7: + The OS enumerates motherboard devices simply by reading through the + ACPI Namespace looking for devices with hardware IDs. + + Each device enumerated by ACPI includes ACPI-defined objects in the + ACPI Namespace that report the hardware resources the device could + occupy [_PRS], an object that reports the resources that are currently + used by the device [_CRS], and objects for configuring those resources + [_SRS]. The information is used by the Plug and Play OS (OSPM) to + configure the devices. + +[3] ACPI 6.2, sec 6.2: + OSPM uses device configuration objects to configure hardware resources + for devices enumerated via ACPI. Device configuration objects provide + information about current and possible resource requirements, the + relationship between shared resources, and methods for configuring + hardware resources. + + When OSPM enumerates a device, it calls _PRS to determine the resource + requirements of the device. It may also call _CRS to find the current + resource settings for the device. Using this information, the Plug and + Play system determines what resources the device should consume and + sets those resources by calling the device’s _SRS control method. + + In ACPI, devices can consume resources (for example, legacy keyboards), + provide resources (for example, a proprietary PCI bridge), or do both. + Unless otherwise specified, resources for a device are assumed to be + taken from the nearest matching resource above the device in the device + hierarchy. + +[4] ACPI 6.2, sec 6.4.3.5.1, 2, 3, 4: + QWord/DWord/Word Address Space Descriptor (.1, .2, .3) + General Flags: Bit [0] Ignored + + Extended Address Space Descriptor (.4) + General Flags: Bit [0] Consumer/Producer: + + * 1 – This device consumes this resource + * 0 – This device produces and consumes this resource + +[5] ACPI 6.2, sec 19.6.43: + ResourceUsage specifies whether the Memory range is consumed by + this device (ResourceConsumer) or passed on to child devices + (ResourceProducer). If nothing is specified, then + ResourceConsumer is assumed. + +[6] PCI Firmware 3.2, sec 4.1.2: + If the operating system does not natively comprehend reserving the + MMCFG region, the MMCFG region must be reserved by firmware. The + address range reported in the MCFG table or by _CBA method (see Section + 4.1.3) must be reserved by declaring a motherboard resource. For most + systems, the motherboard resource would appear at the root of the ACPI + namespace (under \_SB) in a node with a _HID of EISAID (PNP0C02), and + the resources in this case should not be claimed in the root PCI bus’s + _CRS. The resources can optionally be returned in Int15 E820 or + EFIGetMemoryMap as reserved memory but must always be reported through + ACPI as a motherboard resource. + +[7] PCI Express 4.0, sec 7.2.2: + For systems that are PC-compatible, or that do not implement a + processor-architecture-specific firmware interface standard that allows + access to the Configuration Space, the ECAM is required as defined in + this section. + +[8] PCI Firmware 3.2, sec 4.1.2: + The MCFG table is an ACPI table that is used to communicate the base + addresses corresponding to the non-hot removable PCI Segment Groups + range within a PCI Segment Group available to the operating system at + boot. This is required for the PC-compatible systems. + + The MCFG table is only used to communicate the base addresses + corresponding to the PCI Segment Groups available to the system at + boot. + +[9] PCI Firmware 3.2, sec 4.1.3: + The _CBA (Memory mapped Configuration Base Address) control method is + an optional ACPI object that returns the 64-bit memory mapped + configuration base address for the hot plug capable host bridge. The + base address returned by _CBA is processor-relative address. The _CBA + control method evaluates to an Integer. + + This control method appears under a host bridge object. When the _CBA + method appears under an active host bridge object, the operating system + evaluates this structure to identify the memory mapped configuration + base address corresponding to the PCI Segment Group for the bus number + range specified in _CRS method. An ACPI name space object that contains + the _CBA method must also contain a corresponding _SEG method. diff --git a/Documentation/PCI/acpi-info.txt b/Documentation/PCI/acpi-info.txt deleted file mode 100644 index 3ffa3b03970e..000000000000 --- a/Documentation/PCI/acpi-info.txt +++ /dev/null @@ -1,187 +0,0 @@ - ACPI considerations for PCI host bridges - -The general rule is that the ACPI namespace should describe everything the -OS might use unless there's another way for the OS to find it [1, 2]. - -For example, there's no standard hardware mechanism for enumerating PCI -host bridges, so the ACPI namespace must describe each host bridge, the -method for accessing PCI config space below it, the address space windows -the host bridge forwards to PCI (using _CRS), and the routing of legacy -INTx interrupts (using _PRT). - -PCI devices, which are below the host bridge, generally do not need to be -described via ACPI. The OS can discover them via the standard PCI -enumeration mechanism, using config accesses to discover and identify -devices and read and size their BARs. However, ACPI may describe PCI -devices if it provides power management or hotplug functionality for them -or if the device has INTx interrupts connected by platform interrupt -controllers and a _PRT is needed to describe those connections. - -ACPI resource description is done via _CRS objects of devices in the ACPI -namespace [2].   The _CRS is like a generalized PCI BAR: the OS can read -_CRS and figure out what resource is being consumed even if it doesn't have -a driver for the device [3].  That's important because it means an old OS -can work correctly even on a system with new devices unknown to the OS. -The new devices might not do anything, but the OS can at least make sure no -resources conflict with them. - -Static tables like MCFG, HPET, ECDT, etc., are *not* mechanisms for -reserving address space. The static tables are for things the OS needs to -know early in boot, before it can parse the ACPI namespace. If a new table -is defined, an old OS needs to operate correctly even though it ignores the -table. _CRS allows that because it is generic and understood by the old -OS; a static table does not. - -If the OS is expected to manage a non-discoverable device described via -ACPI, that device will have a specific _HID/_CID that tells the OS what -driver to bind to it, and the _CRS tells the OS and the driver where the -device's registers are. - -PCI host bridges are PNP0A03 or PNP0A08 devices.  Their _CRS should -describe all the address space they consume.  This includes all the windows -they forward down to the PCI bus, as well as registers of the host bridge -itself that are not forwarded to PCI.  The host bridge registers include -things like secondary/subordinate bus registers that determine the bus -range below the bridge, window registers that describe the apertures, etc. -These are all device-specific, non-architected things, so the only way a -PNP0A03/PNP0A08 driver can manage them is via _PRS/_CRS/_SRS, which contain -the device-specific details.  The host bridge registers also include ECAM -space, since it is consumed by the host bridge. - -ACPI defines a Consumer/Producer bit to distinguish the bridge registers -("Consumer") from the bridge apertures ("Producer") [4, 5], but early -BIOSes didn't use that bit correctly. The result is that the current ACPI -spec defines Consumer/Producer only for the Extended Address Space -descriptors; the bit should be ignored in the older QWord/DWord/Word -Address Space descriptors. Consequently, OSes have to assume all -QWord/DWord/Word descriptors are windows. - -Prior to the addition of Extended Address Space descriptors, the failure of -Consumer/Producer meant there was no way to describe bridge registers in -the PNP0A03/PNP0A08 device itself. The workaround was to describe the -bridge registers (including ECAM space) in PNP0C02 catch-all devices [6]. -With the exception of ECAM, the bridge register space is device-specific -anyway, so the generic PNP0A03/PNP0A08 driver (pci_root.c) has no need to -know about it.   - -New architectures should be able to use "Consumer" Extended Address Space -descriptors in the PNP0A03 device for bridge registers, including ECAM, -although a strict interpretation of [6] might prohibit this. Old x86 and -ia64 kernels assume all address space descriptors, including "Consumer" -Extended Address Space ones, are windows, so it would not be safe to -describe bridge registers this way on those architectures. - -PNP0C02 "motherboard" devices are basically a catch-all.  There's no -programming model for them other than "don't use these resources for -anything else."  So a PNP0C02 _CRS should claim any address space that is -(1) not claimed by _CRS under any other device object in the ACPI namespace -and (2) should not be assigned by the OS to something else. - -The PCIe spec requires the Enhanced Configuration Access Method (ECAM) -unless there's a standard firmware interface for config access, e.g., the -ia64 SAL interface [7]. A host bridge consumes ECAM memory address space -and converts memory accesses into PCI configuration accesses. The spec -defines the ECAM address space layout and functionality; only the base of -the address space is device-specific. An ACPI OS learns the base address -from either the static MCFG table or a _CBA method in the PNP0A03 device. - -The MCFG table must describe the ECAM space of non-hot pluggable host -bridges [8]. Since MCFG is a static table and can't be updated by hotplug, -a _CBA method in the PNP0A03 device describes the ECAM space of a -hot-pluggable host bridge [9]. Note that for both MCFG and _CBA, the base -address always corresponds to bus 0, even if the bus range below the bridge -(which is reported via _CRS) doesn't start at 0. - - -[1] ACPI 6.2, sec 6.1: - For any device that is on a non-enumerable type of bus (for example, an - ISA bus), OSPM enumerates the devices' identifier(s) and the ACPI - system firmware must supply an _HID object ... for each device to - enable OSPM to do that. - -[2] ACPI 6.2, sec 3.7: - The OS enumerates motherboard devices simply by reading through the - ACPI Namespace looking for devices with hardware IDs. - - Each device enumerated by ACPI includes ACPI-defined objects in the - ACPI Namespace that report the hardware resources the device could - occupy [_PRS], an object that reports the resources that are currently - used by the device [_CRS], and objects for configuring those resources - [_SRS]. The information is used by the Plug and Play OS (OSPM) to - configure the devices. - -[3] ACPI 6.2, sec 6.2: - OSPM uses device configuration objects to configure hardware resources - for devices enumerated via ACPI. Device configuration objects provide - information about current and possible resource requirements, the - relationship between shared resources, and methods for configuring - hardware resources. - - When OSPM enumerates a device, it calls _PRS to determine the resource - requirements of the device. It may also call _CRS to find the current - resource settings for the device. Using this information, the Plug and - Play system determines what resources the device should consume and - sets those resources by calling the device’s _SRS control method. - - In ACPI, devices can consume resources (for example, legacy keyboards), - provide resources (for example, a proprietary PCI bridge), or do both. - Unless otherwise specified, resources for a device are assumed to be - taken from the nearest matching resource above the device in the device - hierarchy. - -[4] ACPI 6.2, sec 6.4.3.5.1, 2, 3, 4: - QWord/DWord/Word Address Space Descriptor (.1, .2, .3) - General Flags: Bit [0] Ignored - - Extended Address Space Descriptor (.4) - General Flags: Bit [0] Consumer/Producer: - 1–This device consumes this resource - 0–This device produces and consumes this resource - -[5] ACPI 6.2, sec 19.6.43: - ResourceUsage specifies whether the Memory range is consumed by - this device (ResourceConsumer) or passed on to child devices - (ResourceProducer). If nothing is specified, then - ResourceConsumer is assumed. - -[6] PCI Firmware 3.2, sec 4.1.2: - If the operating system does not natively comprehend reserving the - MMCFG region, the MMCFG region must be reserved by firmware. The - address range reported in the MCFG table or by _CBA method (see Section - 4.1.3) must be reserved by declaring a motherboard resource. For most - systems, the motherboard resource would appear at the root of the ACPI - namespace (under \_SB) in a node with a _HID of EISAID (PNP0C02), and - the resources in this case should not be claimed in the root PCI bus’s - _CRS. The resources can optionally be returned in Int15 E820 or - EFIGetMemoryMap as reserved memory but must always be reported through - ACPI as a motherboard resource. - -[7] PCI Express 4.0, sec 7.2.2: - For systems that are PC-compatible, or that do not implement a - processor-architecture-specific firmware interface standard that allows - access to the Configuration Space, the ECAM is required as defined in - this section. - -[8] PCI Firmware 3.2, sec 4.1.2: - The MCFG table is an ACPI table that is used to communicate the base - addresses corresponding to the non-hot removable PCI Segment Groups - range within a PCI Segment Group available to the operating system at - boot. This is required for the PC-compatible systems. - - The MCFG table is only used to communicate the base addresses - corresponding to the PCI Segment Groups available to the system at - boot. - -[9] PCI Firmware 3.2, sec 4.1.3: - The _CBA (Memory mapped Configuration Base Address) control method is - an optional ACPI object that returns the 64-bit memory mapped - configuration base address for the hot plug capable host bridge. The - base address returned by _CBA is processor-relative address. The _CBA - control method evaluates to an Integer. - - This control method appears under a host bridge object. When the _CBA - method appears under an active host bridge object, the operating system - evaluates this structure to identify the memory mapped configuration - base address corresponding to the PCI Segment Group for the bus number - range specified in _CRS method. An ACPI name space object that contains - the _CBA method must also contain a corresponding _SEG method. diff --git a/Documentation/PCI/endpoint/index.rst b/Documentation/PCI/endpoint/index.rst new file mode 100644 index 000000000000..d114ea74b444 --- /dev/null +++ b/Documentation/PCI/endpoint/index.rst @@ -0,0 +1,13 @@ +.. SPDX-License-Identifier: GPL-2.0 + +====================== +PCI Endpoint Framework +====================== + +.. toctree:: + :maxdepth: 2 + + pci-endpoint + pci-endpoint-cfs + pci-test-function + pci-test-howto diff --git a/Documentation/PCI/endpoint/pci-endpoint-cfs.rst b/Documentation/PCI/endpoint/pci-endpoint-cfs.rst new file mode 100644 index 000000000000..b6d39cdec56e --- /dev/null +++ b/Documentation/PCI/endpoint/pci-endpoint-cfs.rst @@ -0,0 +1,118 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================================= +Configuring PCI Endpoint Using CONFIGFS +======================================= + +:Author: Kishon Vijay Abraham I + +The PCI Endpoint Core exposes configfs entry (pci_ep) to configure the +PCI endpoint function and to bind the endpoint function +with the endpoint controller. (For introducing other mechanisms to +configure the PCI Endpoint Function refer to [1]). + +Mounting configfs +================= + +The PCI Endpoint Core layer creates pci_ep directory in the mounted configfs +directory. configfs can be mounted using the following command:: + + mount -t configfs none /sys/kernel/config + +Directory Structure +=================== + +The pci_ep configfs has two directories at its root: controllers and +functions. Every EPC device present in the system will have an entry in +the *controllers* directory and and every EPF driver present in the system +will have an entry in the *functions* directory. +:: + + /sys/kernel/config/pci_ep/ + .. controllers/ + .. functions/ + +Creating EPF Device +=================== + +Every registered EPF driver will be listed in controllers directory. The +entries corresponding to EPF driver will be created by the EPF core. +:: + + /sys/kernel/config/pci_ep/functions/ + .. / + ... / + ... / + .. / + ... / + ... / + +In order to create a of the type probed by , the +user has to create a directory inside . + +Every directory consists of the following entries that can be +used to configure the standard configuration header of the endpoint function. +(These entries are created by the framework when any new is +created) +:: + + .. / + ... / + ... vendorid + ... deviceid + ... revid + ... progif_code + ... subclass_code + ... baseclass_code + ... cache_line_size + ... subsys_vendor_id + ... subsys_id + ... interrupt_pin + +EPC Device +========== + +Every registered EPC device will be listed in controllers directory. The +entries corresponding to EPC device will be created by the EPC core. +:: + + /sys/kernel/config/pci_ep/controllers/ + .. / + ... / + ... / + ... start + .. / + ... / + ... / + ... start + +The directory will have a list of symbolic links to +. These symbolic links should be created by the user to +represent the functions present in the endpoint device. + +The directory will also have a *start* field. Once +"1" is written to this field, the endpoint device will be ready to +establish the link with the host. This is usually done after +all the EPF devices are created and linked with the EPC device. +:: + + | controllers/ + | / + | + | start + | functions/ + | / + | / + | vendorid + | deviceid + | revid + | progif_code + | subclass_code + | baseclass_code + | cache_line_size + | subsys_vendor_id + | subsys_id + | interrupt_pin + | function + +[1] :doc:`pci-endpoint` diff --git a/Documentation/PCI/endpoint/pci-endpoint-cfs.txt b/Documentation/PCI/endpoint/pci-endpoint-cfs.txt deleted file mode 100644 index d740f29960a4..000000000000 --- a/Documentation/PCI/endpoint/pci-endpoint-cfs.txt +++ /dev/null @@ -1,105 +0,0 @@ - CONFIGURING PCI ENDPOINT USING CONFIGFS - Kishon Vijay Abraham I - -The PCI Endpoint Core exposes configfs entry (pci_ep) to configure the -PCI endpoint function and to bind the endpoint function -with the endpoint controller. (For introducing other mechanisms to -configure the PCI Endpoint Function refer to [1]). - -*) Mounting configfs - -The PCI Endpoint Core layer creates pci_ep directory in the mounted configfs -directory. configfs can be mounted using the following command. - - mount -t configfs none /sys/kernel/config - -*) Directory Structure - -The pci_ep configfs has two directories at its root: controllers and -functions. Every EPC device present in the system will have an entry in -the *controllers* directory and and every EPF driver present in the system -will have an entry in the *functions* directory. - -/sys/kernel/config/pci_ep/ - .. controllers/ - .. functions/ - -*) Creating EPF Device - -Every registered EPF driver will be listed in controllers directory. The -entries corresponding to EPF driver will be created by the EPF core. - -/sys/kernel/config/pci_ep/functions/ - .. / - ... / - ... / - .. / - ... / - ... / - -In order to create a of the type probed by , the -user has to create a directory inside . - -Every directory consists of the following entries that can be -used to configure the standard configuration header of the endpoint function. -(These entries are created by the framework when any new is -created) - - .. / - ... / - ... vendorid - ... deviceid - ... revid - ... progif_code - ... subclass_code - ... baseclass_code - ... cache_line_size - ... subsys_vendor_id - ... subsys_id - ... interrupt_pin - -*) EPC Device - -Every registered EPC device will be listed in controllers directory. The -entries corresponding to EPC device will be created by the EPC core. - -/sys/kernel/config/pci_ep/controllers/ - .. / - ... / - ... / - ... start - .. / - ... / - ... / - ... start - -The directory will have a list of symbolic links to -. These symbolic links should be created by the user to -represent the functions present in the endpoint device. - -The directory will also have a *start* field. Once -"1" is written to this field, the endpoint device will be ready to -establish the link with the host. This is usually done after -all the EPF devices are created and linked with the EPC device. - - - | controllers/ - | / - | - | start - | functions/ - | / - | / - | vendorid - | deviceid - | revid - | progif_code - | subclass_code - | baseclass_code - | cache_line_size - | subsys_vendor_id - | subsys_id - | interrupt_pin - | function - -[1] -> Documentation/PCI/endpoint/pci-endpoint.txt diff --git a/Documentation/PCI/endpoint/pci-endpoint.rst b/Documentation/PCI/endpoint/pci-endpoint.rst new file mode 100644 index 000000000000..0e2311b5617b --- /dev/null +++ b/Documentation/PCI/endpoint/pci-endpoint.rst @@ -0,0 +1,231 @@ +.. SPDX-License-Identifier: GPL-2.0 + +:Author: Kishon Vijay Abraham I + +This document is a guide to use the PCI Endpoint Framework in order to create +endpoint controller driver, endpoint function driver, and using configfs +interface to bind the function driver to the controller driver. + +Introduction +============ + +Linux has a comprehensive PCI subsystem to support PCI controllers that +operates in Root Complex mode. The subsystem has capability to scan PCI bus, +assign memory resources and IRQ resources, load PCI driver (based on +vendor ID, device ID), support other services like hot-plug, power management, +advanced error reporting and virtual channels. + +However the PCI controller IP integrated in some SoCs is capable of operating +either in Root Complex mode or Endpoint mode. PCI Endpoint Framework will +add endpoint mode support in Linux. This will help to run Linux in an +EP system which can have a wide variety of use cases from testing or +validation, co-processor accelerator, etc. + +PCI Endpoint Core +================= + +The PCI Endpoint Core layer comprises 3 components: the Endpoint Controller +library, the Endpoint Function library, and the configfs layer to bind the +endpoint function with the endpoint controller. + +PCI Endpoint Controller(EPC) Library +------------------------------------ + +The EPC library provides APIs to be used by the controller that can operate +in endpoint mode. It also provides APIs to be used by function driver/library +in order to implement a particular endpoint function. + +APIs for the PCI controller Driver +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +This section lists the APIs that the PCI Endpoint core provides to be used +by the PCI controller driver. + +* devm_pci_epc_create()/pci_epc_create() + + The PCI controller driver should implement the following ops: + + * write_header: ops to populate configuration space header + * set_bar: ops to configure the BAR + * clear_bar: ops to reset the BAR + * alloc_addr_space: ops to allocate in PCI controller address space + * free_addr_space: ops to free the allocated address space + * raise_irq: ops to raise a legacy, MSI or MSI-X interrupt + * start: ops to start the PCI link + * stop: ops to stop the PCI link + + The PCI controller driver can then create a new EPC device by invoking + devm_pci_epc_create()/pci_epc_create(). + +* devm_pci_epc_destroy()/pci_epc_destroy() + + The PCI controller driver can destroy the EPC device created by either + devm_pci_epc_create() or pci_epc_create() using devm_pci_epc_destroy() or + pci_epc_destroy(). + +* pci_epc_linkup() + + In order to notify all the function devices that the EPC device to which + they are linked has established a link with the host, the PCI controller + driver should invoke pci_epc_linkup(). + +* pci_epc_mem_init() + + Initialize the pci_epc_mem structure used for allocating EPC addr space. + +* pci_epc_mem_exit() + + Cleanup the pci_epc_mem structure allocated during pci_epc_mem_init(). + + +APIs for the PCI Endpoint Function Driver +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +This section lists the APIs that the PCI Endpoint core provides to be used +by the PCI endpoint function driver. + +* pci_epc_write_header() + + The PCI endpoint function driver should use pci_epc_write_header() to + write the standard configuration header to the endpoint controller. + +* pci_epc_set_bar() + + The PCI endpoint function driver should use pci_epc_set_bar() to configure + the Base Address Register in order for the host to assign PCI addr space. + Register space of the function driver is usually configured + using this API. + +* pci_epc_clear_bar() + + The PCI endpoint function driver should use pci_epc_clear_bar() to reset + the BAR. + +* pci_epc_raise_irq() + + The PCI endpoint function driver should use pci_epc_raise_irq() to raise + Legacy Interrupt, MSI or MSI-X Interrupt. + +* pci_epc_mem_alloc_addr() + + The PCI endpoint function driver should use pci_epc_mem_alloc_addr(), to + allocate memory address from EPC addr space which is required to access + RC's buffer + +* pci_epc_mem_free_addr() + + The PCI endpoint function driver should use pci_epc_mem_free_addr() to + free the memory space allocated using pci_epc_mem_alloc_addr(). + +Other APIs +~~~~~~~~~~ + +There are other APIs provided by the EPC library. These are used for binding +the EPF device with EPC device. pci-ep-cfs.c can be used as reference for +using these APIs. + +* pci_epc_get() + + Get a reference to the PCI endpoint controller based on the device name of + the controller. + +* pci_epc_put() + + Release the reference to the PCI endpoint controller obtained using + pci_epc_get() + +* pci_epc_add_epf() + + Add a PCI endpoint function to a PCI endpoint controller. A PCIe device + can have up to 8 functions according to the specification. + +* pci_epc_remove_epf() + + Remove the PCI endpoint function from PCI endpoint controller. + +* pci_epc_start() + + The PCI endpoint function driver should invoke pci_epc_start() once it + has configured the endpoint function and wants to start the PCI link. + +* pci_epc_stop() + + The PCI endpoint function driver should invoke pci_epc_stop() to stop + the PCI LINK. + + +PCI Endpoint Function(EPF) Library +---------------------------------- + +The EPF library provides APIs to be used by the function driver and the EPC +library to provide endpoint mode functionality. + +APIs for the PCI Endpoint Function Driver +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +This section lists the APIs that the PCI Endpoint core provides to be used +by the PCI endpoint function driver. + +* pci_epf_register_driver() + + The PCI Endpoint Function driver should implement the following ops: + * bind: ops to perform when a EPC device has been bound to EPF device + * unbind: ops to perform when a binding has been lost between a EPC + device and EPF device + * linkup: ops to perform when the EPC device has established a + connection with a host system + + The PCI Function driver can then register the PCI EPF driver by using + pci_epf_register_driver(). + +* pci_epf_unregister_driver() + + The PCI Function driver can unregister the PCI EPF driver by using + pci_epf_unregister_driver(). + +* pci_epf_alloc_space() + + The PCI Function driver can allocate space for a particular BAR using + pci_epf_alloc_space(). + +* pci_epf_free_space() + + The PCI Function driver can free the allocated space + (using pci_epf_alloc_space) by invoking pci_epf_free_space(). + +APIs for the PCI Endpoint Controller Library +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +This section lists the APIs that the PCI Endpoint core provides to be used +by the PCI endpoint controller library. + +* pci_epf_linkup() + + The PCI endpoint controller library invokes pci_epf_linkup() when the + EPC device has established the connection to the host. + +Other APIs +~~~~~~~~~~ + +There are other APIs provided by the EPF library. These are used to notify +the function driver when the EPF device is bound to the EPC device. +pci-ep-cfs.c can be used as reference for using these APIs. + +* pci_epf_create() + + Create a new PCI EPF device by passing the name of the PCI EPF device. + This name will be used to bind the the EPF device to a EPF driver. + +* pci_epf_destroy() + + Destroy the created PCI EPF device. + +* pci_epf_bind() + + pci_epf_bind() should be invoked when the EPF device has been bound to + a EPC device. + +* pci_epf_unbind() + + pci_epf_unbind() should be invoked when the binding between EPC device + and EPF device is lost. diff --git a/Documentation/PCI/endpoint/pci-endpoint.txt b/Documentation/PCI/endpoint/pci-endpoint.txt deleted file mode 100644 index e86a96b66a6a..000000000000 --- a/Documentation/PCI/endpoint/pci-endpoint.txt +++ /dev/null @@ -1,215 +0,0 @@ - PCI ENDPOINT FRAMEWORK - Kishon Vijay Abraham I - -This document is a guide to use the PCI Endpoint Framework in order to create -endpoint controller driver, endpoint function driver, and using configfs -interface to bind the function driver to the controller driver. - -1. Introduction - -Linux has a comprehensive PCI subsystem to support PCI controllers that -operates in Root Complex mode. The subsystem has capability to scan PCI bus, -assign memory resources and IRQ resources, load PCI driver (based on -vendor ID, device ID), support other services like hot-plug, power management, -advanced error reporting and virtual channels. - -However the PCI controller IP integrated in some SoCs is capable of operating -either in Root Complex mode or Endpoint mode. PCI Endpoint Framework will -add endpoint mode support in Linux. This will help to run Linux in an -EP system which can have a wide variety of use cases from testing or -validation, co-processor accelerator, etc. - -2. PCI Endpoint Core - -The PCI Endpoint Core layer comprises 3 components: the Endpoint Controller -library, the Endpoint Function library, and the configfs layer to bind the -endpoint function with the endpoint controller. - -2.1 PCI Endpoint Controller(EPC) Library - -The EPC library provides APIs to be used by the controller that can operate -in endpoint mode. It also provides APIs to be used by function driver/library -in order to implement a particular endpoint function. - -2.1.1 APIs for the PCI controller Driver - -This section lists the APIs that the PCI Endpoint core provides to be used -by the PCI controller driver. - -*) devm_pci_epc_create()/pci_epc_create() - - The PCI controller driver should implement the following ops: - * write_header: ops to populate configuration space header - * set_bar: ops to configure the BAR - * clear_bar: ops to reset the BAR - * alloc_addr_space: ops to allocate in PCI controller address space - * free_addr_space: ops to free the allocated address space - * raise_irq: ops to raise a legacy, MSI or MSI-X interrupt - * start: ops to start the PCI link - * stop: ops to stop the PCI link - - The PCI controller driver can then create a new EPC device by invoking - devm_pci_epc_create()/pci_epc_create(). - -*) devm_pci_epc_destroy()/pci_epc_destroy() - - The PCI controller driver can destroy the EPC device created by either - devm_pci_epc_create() or pci_epc_create() using devm_pci_epc_destroy() or - pci_epc_destroy(). - -*) pci_epc_linkup() - - In order to notify all the function devices that the EPC device to which - they are linked has established a link with the host, the PCI controller - driver should invoke pci_epc_linkup(). - -*) pci_epc_mem_init() - - Initialize the pci_epc_mem structure used for allocating EPC addr space. - -*) pci_epc_mem_exit() - - Cleanup the pci_epc_mem structure allocated during pci_epc_mem_init(). - -2.1.2 APIs for the PCI Endpoint Function Driver - -This section lists the APIs that the PCI Endpoint core provides to be used -by the PCI endpoint function driver. - -*) pci_epc_write_header() - - The PCI endpoint function driver should use pci_epc_write_header() to - write the standard configuration header to the endpoint controller. - -*) pci_epc_set_bar() - - The PCI endpoint function driver should use pci_epc_set_bar() to configure - the Base Address Register in order for the host to assign PCI addr space. - Register space of the function driver is usually configured - using this API. - -*) pci_epc_clear_bar() - - The PCI endpoint function driver should use pci_epc_clear_bar() to reset - the BAR. - -*) pci_epc_raise_irq() - - The PCI endpoint function driver should use pci_epc_raise_irq() to raise - Legacy Interrupt, MSI or MSI-X Interrupt. - -*) pci_epc_mem_alloc_addr() - - The PCI endpoint function driver should use pci_epc_mem_alloc_addr(), to - allocate memory address from EPC addr space which is required to access - RC's buffer - -*) pci_epc_mem_free_addr() - - The PCI endpoint function driver should use pci_epc_mem_free_addr() to - free the memory space allocated using pci_epc_mem_alloc_addr(). - -2.1.3 Other APIs - -There are other APIs provided by the EPC library. These are used for binding -the EPF device with EPC device. pci-ep-cfs.c can be used as reference for -using these APIs. - -*) pci_epc_get() - - Get a reference to the PCI endpoint controller based on the device name of - the controller. - -*) pci_epc_put() - - Release the reference to the PCI endpoint controller obtained using - pci_epc_get() - -*) pci_epc_add_epf() - - Add a PCI endpoint function to a PCI endpoint controller. A PCIe device - can have up to 8 functions according to the specification. - -*) pci_epc_remove_epf() - - Remove the PCI endpoint function from PCI endpoint controller. - -*) pci_epc_start() - - The PCI endpoint function driver should invoke pci_epc_start() once it - has configured the endpoint function and wants to start the PCI link. - -*) pci_epc_stop() - - The PCI endpoint function driver should invoke pci_epc_stop() to stop - the PCI LINK. - -2.2 PCI Endpoint Function(EPF) Library - -The EPF library provides APIs to be used by the function driver and the EPC -library to provide endpoint mode functionality. - -2.2.1 APIs for the PCI Endpoint Function Driver - -This section lists the APIs that the PCI Endpoint core provides to be used -by the PCI endpoint function driver. - -*) pci_epf_register_driver() - - The PCI Endpoint Function driver should implement the following ops: - * bind: ops to perform when a EPC device has been bound to EPF device - * unbind: ops to perform when a binding has been lost between a EPC - device and EPF device - * linkup: ops to perform when the EPC device has established a - connection with a host system - - The PCI Function driver can then register the PCI EPF driver by using - pci_epf_register_driver(). - -*) pci_epf_unregister_driver() - - The PCI Function driver can unregister the PCI EPF driver by using - pci_epf_unregister_driver(). - -*) pci_epf_alloc_space() - - The PCI Function driver can allocate space for a particular BAR using - pci_epf_alloc_space(). - -*) pci_epf_free_space() - - The PCI Function driver can free the allocated space - (using pci_epf_alloc_space) by invoking pci_epf_free_space(). - -2.2.2 APIs for the PCI Endpoint Controller Library -This section lists the APIs that the PCI Endpoint core provides to be used -by the PCI endpoint controller library. - -*) pci_epf_linkup() - - The PCI endpoint controller library invokes pci_epf_linkup() when the - EPC device has established the connection to the host. - -2.2.2 Other APIs -There are other APIs provided by the EPF library. These are used to notify -the function driver when the EPF device is bound to the EPC device. -pci-ep-cfs.c can be used as reference for using these APIs. - -*) pci_epf_create() - - Create a new PCI EPF device by passing the name of the PCI EPF device. - This name will be used to bind the the EPF device to a EPF driver. - -*) pci_epf_destroy() - - Destroy the created PCI EPF device. - -*) pci_epf_bind() - - pci_epf_bind() should be invoked when the EPF device has been bound to - a EPC device. - -*) pci_epf_unbind() - - pci_epf_unbind() should be invoked when the binding between EPC device - and EPF device is lost. diff --git a/Documentation/PCI/endpoint/pci-test-function.rst b/Documentation/PCI/endpoint/pci-test-function.rst new file mode 100644 index 000000000000..3c8521d7aa31 --- /dev/null +++ b/Documentation/PCI/endpoint/pci-test-function.rst @@ -0,0 +1,103 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================= +PCI Test Function +================= + +:Author: Kishon Vijay Abraham I + +Traditionally PCI RC has always been validated by using standard +PCI cards like ethernet PCI cards or USB PCI cards or SATA PCI cards. +However with the addition of EP-core in linux kernel, it is possible +to configure a PCI controller that can operate in EP mode to work as +a test device. + +The PCI endpoint test device is a virtual device (defined in software) +used to test the endpoint functionality and serve as a sample driver +for other PCI endpoint devices (to use the EP framework). + +The PCI endpoint test device has the following registers: + + 1) PCI_ENDPOINT_TEST_MAGIC + 2) PCI_ENDPOINT_TEST_COMMAND + 3) PCI_ENDPOINT_TEST_STATUS + 4) PCI_ENDPOINT_TEST_SRC_ADDR + 5) PCI_ENDPOINT_TEST_DST_ADDR + 6) PCI_ENDPOINT_TEST_SIZE + 7) PCI_ENDPOINT_TEST_CHECKSUM + 8) PCI_ENDPOINT_TEST_IRQ_TYPE + 9) PCI_ENDPOINT_TEST_IRQ_NUMBER + +* PCI_ENDPOINT_TEST_MAGIC + +This register will be used to test BAR0. A known pattern will be written +and read back from MAGIC register to verify BAR0. + +* PCI_ENDPOINT_TEST_COMMAND + +This register will be used by the host driver to indicate the function +that the endpoint device must perform. + +======== ================================================================ +Bitfield Description +======== ================================================================ +Bit 0 raise legacy IRQ +Bit 1 raise MSI IRQ +Bit 2 raise MSI-X IRQ +Bit 3 read command (read data from RC buffer) +Bit 4 write command (write data to RC buffer) +Bit 5 copy command (copy data from one RC buffer to another RC buffer) +======== ================================================================ + +* PCI_ENDPOINT_TEST_STATUS + +This register reflects the status of the PCI endpoint device. + +======== ============================== +Bitfield Description +======== ============================== +Bit 0 read success +Bit 1 read fail +Bit 2 write success +Bit 3 write fail +Bit 4 copy success +Bit 5 copy fail +Bit 6 IRQ raised +Bit 7 source address is invalid +Bit 8 destination address is invalid +======== ============================== + +* PCI_ENDPOINT_TEST_SRC_ADDR + +This register contains the source address (RC buffer address) for the +COPY/READ command. + +* PCI_ENDPOINT_TEST_DST_ADDR + +This register contains the destination address (RC buffer address) for +the COPY/WRITE command. + +* PCI_ENDPOINT_TEST_IRQ_TYPE + +This register contains the interrupt type (Legacy/MSI) triggered +for the READ/WRITE/COPY and raise IRQ (Legacy/MSI) commands. + +Possible types: + +====== == +Legacy 0 +MSI 1 +MSI-X 2 +====== == + +* PCI_ENDPOINT_TEST_IRQ_NUMBER + +This register contains the triggered ID interrupt. + +Admissible values: + +====== =========== +Legacy 0 +MSI [1 .. 32] +MSI-X [1 .. 2048] +====== =========== diff --git a/Documentation/PCI/endpoint/pci-test-function.txt b/Documentation/PCI/endpoint/pci-test-function.txt deleted file mode 100644 index 5916f1f592bb..000000000000 --- a/Documentation/PCI/endpoint/pci-test-function.txt +++ /dev/null @@ -1,87 +0,0 @@ - PCI TEST - Kishon Vijay Abraham I - -Traditionally PCI RC has always been validated by using standard -PCI cards like ethernet PCI cards or USB PCI cards or SATA PCI cards. -However with the addition of EP-core in linux kernel, it is possible -to configure a PCI controller that can operate in EP mode to work as -a test device. - -The PCI endpoint test device is a virtual device (defined in software) -used to test the endpoint functionality and serve as a sample driver -for other PCI endpoint devices (to use the EP framework). - -The PCI endpoint test device has the following registers: - - 1) PCI_ENDPOINT_TEST_MAGIC - 2) PCI_ENDPOINT_TEST_COMMAND - 3) PCI_ENDPOINT_TEST_STATUS - 4) PCI_ENDPOINT_TEST_SRC_ADDR - 5) PCI_ENDPOINT_TEST_DST_ADDR - 6) PCI_ENDPOINT_TEST_SIZE - 7) PCI_ENDPOINT_TEST_CHECKSUM - 8) PCI_ENDPOINT_TEST_IRQ_TYPE - 9) PCI_ENDPOINT_TEST_IRQ_NUMBER - -*) PCI_ENDPOINT_TEST_MAGIC - -This register will be used to test BAR0. A known pattern will be written -and read back from MAGIC register to verify BAR0. - -*) PCI_ENDPOINT_TEST_COMMAND: - -This register will be used by the host driver to indicate the function -that the endpoint device must perform. - -Bitfield Description: - Bit 0 : raise legacy IRQ - Bit 1 : raise MSI IRQ - Bit 2 : raise MSI-X IRQ - Bit 3 : read command (read data from RC buffer) - Bit 4 : write command (write data to RC buffer) - Bit 5 : copy command (copy data from one RC buffer to another - RC buffer) - -*) PCI_ENDPOINT_TEST_STATUS - -This register reflects the status of the PCI endpoint device. - -Bitfield Description: - Bit 0 : read success - Bit 1 : read fail - Bit 2 : write success - Bit 3 : write fail - Bit 4 : copy success - Bit 5 : copy fail - Bit 6 : IRQ raised - Bit 7 : source address is invalid - Bit 8 : destination address is invalid - -*) PCI_ENDPOINT_TEST_SRC_ADDR - -This register contains the source address (RC buffer address) for the -COPY/READ command. - -*) PCI_ENDPOINT_TEST_DST_ADDR - -This register contains the destination address (RC buffer address) for -the COPY/WRITE command. - -*) PCI_ENDPOINT_TEST_IRQ_TYPE - -This register contains the interrupt type (Legacy/MSI) triggered -for the READ/WRITE/COPY and raise IRQ (Legacy/MSI) commands. - -Possible types: - - Legacy : 0 - - MSI : 1 - - MSI-X : 2 - -*) PCI_ENDPOINT_TEST_IRQ_NUMBER - -This register contains the triggered ID interrupt. - -Admissible values: - - Legacy : 0 - - MSI : [1 .. 32] - - MSI-X : [1 .. 2048] diff --git a/Documentation/PCI/endpoint/pci-test-howto.rst b/Documentation/PCI/endpoint/pci-test-howto.rst new file mode 100644 index 000000000000..909f770a07d6 --- /dev/null +++ b/Documentation/PCI/endpoint/pci-test-howto.rst @@ -0,0 +1,235 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=================== +PCI Test User Guide +=================== + +:Author: Kishon Vijay Abraham I + +This document is a guide to help users use pci-epf-test function driver +and pci_endpoint_test host driver for testing PCI. The list of steps to +be followed in the host side and EP side is given below. + +Endpoint Device +=============== + +Endpoint Controller Devices +--------------------------- + +To find the list of endpoint controller devices in the system:: + + # ls /sys/class/pci_epc/ + 51000000.pcie_ep + +If PCI_ENDPOINT_CONFIGFS is enabled:: + + # ls /sys/kernel/config/pci_ep/controllers + 51000000.pcie_ep + + +Endpoint Function Drivers +------------------------- + +To find the list of endpoint function drivers in the system:: + + # ls /sys/bus/pci-epf/drivers + pci_epf_test + +If PCI_ENDPOINT_CONFIGFS is enabled:: + + # ls /sys/kernel/config/pci_ep/functions + pci_epf_test + + +Creating pci-epf-test Device +---------------------------- + +PCI endpoint function device can be created using the configfs. To create +pci-epf-test device, the following commands can be used:: + + # mount -t configfs none /sys/kernel/config + # cd /sys/kernel/config/pci_ep/ + # mkdir functions/pci_epf_test/func1 + +The "mkdir func1" above creates the pci-epf-test function device that will +be probed by pci_epf_test driver. + +The PCI endpoint framework populates the directory with the following +configurable fields:: + + # ls functions/pci_epf_test/func1 + baseclass_code interrupt_pin progif_code subsys_id + cache_line_size msi_interrupts revid subsys_vendorid + deviceid msix_interrupts subclass_code vendorid + +The PCI endpoint function driver populates these entries with default values +when the device is bound to the driver. The pci-epf-test driver populates +vendorid with 0xffff and interrupt_pin with 0x0001:: + + # cat functions/pci_epf_test/func1/vendorid + 0xffff + # cat functions/pci_epf_test/func1/interrupt_pin + 0x0001 + + +Configuring pci-epf-test Device +------------------------------- + +The user can configure the pci-epf-test device using configfs entry. In order +to change the vendorid and the number of MSI interrupts used by the function +device, the following commands can be used:: + + # echo 0x104c > functions/pci_epf_test/func1/vendorid + # echo 0xb500 > functions/pci_epf_test/func1/deviceid + # echo 16 > functions/pci_epf_test/func1/msi_interrupts + # echo 8 > functions/pci_epf_test/func1/msix_interrupts + + +Binding pci-epf-test Device to EP Controller +-------------------------------------------- + +In order for the endpoint function device to be useful, it has to be bound to +a PCI endpoint controller driver. Use the configfs to bind the function +device to one of the controller driver present in the system:: + + # ln -s functions/pci_epf_test/func1 controllers/51000000.pcie_ep/ + +Once the above step is completed, the PCI endpoint is ready to establish a link +with the host. + + +Start the Link +-------------- + +In order for the endpoint device to establish a link with the host, the _start_ +field should be populated with '1':: + + # echo 1 > controllers/51000000.pcie_ep/start + + +RootComplex Device +================== + +lspci Output +------------ + +Note that the devices listed here correspond to the value populated in 1.4 +above:: + + 00:00.0 PCI bridge: Texas Instruments Device 8888 (rev 01) + 01:00.0 Unassigned class [ff00]: Texas Instruments Device b500 + + +Using Endpoint Test function Device +----------------------------------- + +pcitest.sh added in tools/pci/ can be used to run all the default PCI endpoint +tests. To compile this tool the following commands should be used:: + + # cd + # make -C tools/pci + +or if you desire to compile and install in your system:: + + # cd + # make -C tools/pci install + +The tool and script will be located in /usr/bin/ + + +pcitest.sh Output +~~~~~~~~~~~~~~~~~ +:: + + # pcitest.sh + BAR tests + + BAR0: OKAY + BAR1: OKAY + BAR2: OKAY + BAR3: OKAY + BAR4: NOT OKAY + BAR5: NOT OKAY + + Interrupt tests + + SET IRQ TYPE TO LEGACY: OKAY + LEGACY IRQ: NOT OKAY + SET IRQ TYPE TO MSI: OKAY + MSI1: OKAY + MSI2: OKAY + MSI3: OKAY + MSI4: OKAY + MSI5: OKAY + MSI6: OKAY + MSI7: OKAY + MSI8: OKAY + MSI9: OKAY + MSI10: OKAY + MSI11: OKAY + MSI12: OKAY + MSI13: OKAY + MSI14: OKAY + MSI15: OKAY + MSI16: OKAY + MSI17: NOT OKAY + MSI18: NOT OKAY + MSI19: NOT OKAY + MSI20: NOT OKAY + MSI21: NOT OKAY + MSI22: NOT OKAY + MSI23: NOT OKAY + MSI24: NOT OKAY + MSI25: NOT OKAY + MSI26: NOT OKAY + MSI27: NOT OKAY + MSI28: NOT OKAY + MSI29: NOT OKAY + MSI30: NOT OKAY + MSI31: NOT OKAY + MSI32: NOT OKAY + SET IRQ TYPE TO MSI-X: OKAY + MSI-X1: OKAY + MSI-X2: OKAY + MSI-X3: OKAY + MSI-X4: OKAY + MSI-X5: OKAY + MSI-X6: OKAY + MSI-X7: OKAY + MSI-X8: OKAY + MSI-X9: NOT OKAY + MSI-X10: NOT OKAY + MSI-X11: NOT OKAY + MSI-X12: NOT OKAY + MSI-X13: NOT OKAY + MSI-X14: NOT OKAY + MSI-X15: NOT OKAY + MSI-X16: NOT OKAY + [...] + MSI-X2047: NOT OKAY + MSI-X2048: NOT OKAY + + Read Tests + + SET IRQ TYPE TO MSI: OKAY + READ ( 1 bytes): OKAY + READ ( 1024 bytes): OKAY + READ ( 1025 bytes): OKAY + READ (1024000 bytes): OKAY + READ (1024001 bytes): OKAY + + Write Tests + + WRITE ( 1 bytes): OKAY + WRITE ( 1024 bytes): OKAY + WRITE ( 1025 bytes): OKAY + WRITE (1024000 bytes): OKAY + WRITE (1024001 bytes): OKAY + + Copy Tests + + COPY ( 1 bytes): OKAY + COPY ( 1024 bytes): OKAY + COPY ( 1025 bytes): OKAY + COPY (1024000 bytes): OKAY + COPY (1024001 bytes): OKAY diff --git a/Documentation/PCI/endpoint/pci-test-howto.txt b/Documentation/PCI/endpoint/pci-test-howto.txt deleted file mode 100644 index 040479f437a5..000000000000 --- a/Documentation/PCI/endpoint/pci-test-howto.txt +++ /dev/null @@ -1,206 +0,0 @@ - PCI TEST USERGUIDE - Kishon Vijay Abraham I - -This document is a guide to help users use pci-epf-test function driver -and pci_endpoint_test host driver for testing PCI. The list of steps to -be followed in the host side and EP side is given below. - -1. Endpoint Device - -1.1 Endpoint Controller Devices - -To find the list of endpoint controller devices in the system: - - # ls /sys/class/pci_epc/ - 51000000.pcie_ep - -If PCI_ENDPOINT_CONFIGFS is enabled - # ls /sys/kernel/config/pci_ep/controllers - 51000000.pcie_ep - -1.2 Endpoint Function Drivers - -To find the list of endpoint function drivers in the system: - - # ls /sys/bus/pci-epf/drivers - pci_epf_test - -If PCI_ENDPOINT_CONFIGFS is enabled - # ls /sys/kernel/config/pci_ep/functions - pci_epf_test - -1.3 Creating pci-epf-test Device - -PCI endpoint function device can be created using the configfs. To create -pci-epf-test device, the following commands can be used - - # mount -t configfs none /sys/kernel/config - # cd /sys/kernel/config/pci_ep/ - # mkdir functions/pci_epf_test/func1 - -The "mkdir func1" above creates the pci-epf-test function device that will -be probed by pci_epf_test driver. - -The PCI endpoint framework populates the directory with the following -configurable fields. - - # ls functions/pci_epf_test/func1 - baseclass_code interrupt_pin progif_code subsys_id - cache_line_size msi_interrupts revid subsys_vendorid - deviceid msix_interrupts subclass_code vendorid - -The PCI endpoint function driver populates these entries with default values -when the device is bound to the driver. The pci-epf-test driver populates -vendorid with 0xffff and interrupt_pin with 0x0001 - - # cat functions/pci_epf_test/func1/vendorid - 0xffff - # cat functions/pci_epf_test/func1/interrupt_pin - 0x0001 - -1.4 Configuring pci-epf-test Device - -The user can configure the pci-epf-test device using configfs entry. In order -to change the vendorid and the number of MSI interrupts used by the function -device, the following commands can be used. - - # echo 0x104c > functions/pci_epf_test/func1/vendorid - # echo 0xb500 > functions/pci_epf_test/func1/deviceid - # echo 16 > functions/pci_epf_test/func1/msi_interrupts - # echo 8 > functions/pci_epf_test/func1/msix_interrupts - -1.5 Binding pci-epf-test Device to EP Controller - -In order for the endpoint function device to be useful, it has to be bound to -a PCI endpoint controller driver. Use the configfs to bind the function -device to one of the controller driver present in the system. - - # ln -s functions/pci_epf_test/func1 controllers/51000000.pcie_ep/ - -Once the above step is completed, the PCI endpoint is ready to establish a link -with the host. - -1.6 Start the Link - -In order for the endpoint device to establish a link with the host, the _start_ -field should be populated with '1'. - - # echo 1 > controllers/51000000.pcie_ep/start - -2. RootComplex Device - -2.1 lspci Output - -Note that the devices listed here correspond to the value populated in 1.4 above - - 00:00.0 PCI bridge: Texas Instruments Device 8888 (rev 01) - 01:00.0 Unassigned class [ff00]: Texas Instruments Device b500 - -2.2 Using Endpoint Test function Device - -pcitest.sh added in tools/pci/ can be used to run all the default PCI endpoint -tests. To compile this tool the following commands should be used: - - # cd - # make -C tools/pci - -or if you desire to compile and install in your system: - - # cd - # make -C tools/pci install - -The tool and script will be located in /usr/bin/ - -2.2.1 pcitest.sh Output - # pcitest.sh - BAR tests - - BAR0: OKAY - BAR1: OKAY - BAR2: OKAY - BAR3: OKAY - BAR4: NOT OKAY - BAR5: NOT OKAY - - Interrupt tests - - SET IRQ TYPE TO LEGACY: OKAY - LEGACY IRQ: NOT OKAY - SET IRQ TYPE TO MSI: OKAY - MSI1: OKAY - MSI2: OKAY - MSI3: OKAY - MSI4: OKAY - MSI5: OKAY - MSI6: OKAY - MSI7: OKAY - MSI8: OKAY - MSI9: OKAY - MSI10: OKAY - MSI11: OKAY - MSI12: OKAY - MSI13: OKAY - MSI14: OKAY - MSI15: OKAY - MSI16: OKAY - MSI17: NOT OKAY - MSI18: NOT OKAY - MSI19: NOT OKAY - MSI20: NOT OKAY - MSI21: NOT OKAY - MSI22: NOT OKAY - MSI23: NOT OKAY - MSI24: NOT OKAY - MSI25: NOT OKAY - MSI26: NOT OKAY - MSI27: NOT OKAY - MSI28: NOT OKAY - MSI29: NOT OKAY - MSI30: NOT OKAY - MSI31: NOT OKAY - MSI32: NOT OKAY - SET IRQ TYPE TO MSI-X: OKAY - MSI-X1: OKAY - MSI-X2: OKAY - MSI-X3: OKAY - MSI-X4: OKAY - MSI-X5: OKAY - MSI-X6: OKAY - MSI-X7: OKAY - MSI-X8: OKAY - MSI-X9: NOT OKAY - MSI-X10: NOT OKAY - MSI-X11: NOT OKAY - MSI-X12: NOT OKAY - MSI-X13: NOT OKAY - MSI-X14: NOT OKAY - MSI-X15: NOT OKAY - MSI-X16: NOT OKAY - [...] - MSI-X2047: NOT OKAY - MSI-X2048: NOT OKAY - - Read Tests - - SET IRQ TYPE TO MSI: OKAY - READ ( 1 bytes): OKAY - READ ( 1024 bytes): OKAY - READ ( 1025 bytes): OKAY - READ (1024000 bytes): OKAY - READ (1024001 bytes): OKAY - - Write Tests - - WRITE ( 1 bytes): OKAY - WRITE ( 1024 bytes): OKAY - WRITE ( 1025 bytes): OKAY - WRITE (1024000 bytes): OKAY - WRITE (1024001 bytes): OKAY - - Copy Tests - - COPY ( 1 bytes): OKAY - COPY ( 1024 bytes): OKAY - COPY ( 1025 bytes): OKAY - COPY (1024000 bytes): OKAY - COPY (1024001 bytes): OKAY diff --git a/Documentation/PCI/index.rst b/Documentation/PCI/index.rst new file mode 100644 index 000000000000..f4c6121868c3 --- /dev/null +++ b/Documentation/PCI/index.rst @@ -0,0 +1,18 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================= +Linux PCI Bus Subsystem +======================= + +.. toctree:: + :maxdepth: 2 + :numbered: + + pci + picebus-howto + pci-iov-howto + msi-howto + acpi-info + pci-error-recovery + pcieaer-howto + endpoint/index diff --git a/Documentation/PCI/msi-howto.rst b/Documentation/PCI/msi-howto.rst new file mode 100644 index 000000000000..994cbb660ade --- /dev/null +++ b/Documentation/PCI/msi-howto.rst @@ -0,0 +1,287 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: + +========================== +The MSI Driver Guide HOWTO +========================== + +:Authors: Tom L Nguyen; Martine Silbermann; Matthew Wilcox + +:Copyright: 2003, 2008 Intel Corporation + +About this guide +================ + +This guide describes the basics of Message Signaled Interrupts (MSIs), +the advantages of using MSI over traditional interrupt mechanisms, how +to change your driver to use MSI or MSI-X and some basic diagnostics to +try if a device doesn't support MSIs. + + +What are MSIs? +============== + +A Message Signaled Interrupt is a write from the device to a special +address which causes an interrupt to be received by the CPU. + +The MSI capability was first specified in PCI 2.2 and was later enhanced +in PCI 3.0 to allow each interrupt to be masked individually. The MSI-X +capability was also introduced with PCI 3.0. It supports more interrupts +per device than MSI and allows interrupts to be independently configured. + +Devices may support both MSI and MSI-X, but only one can be enabled at +a time. + + +Why use MSIs? +============= + +There are three reasons why using MSIs can give an advantage over +traditional pin-based interrupts. + +Pin-based PCI interrupts are often shared amongst several devices. +To support this, the kernel must call each interrupt handler associated +with an interrupt, which leads to reduced performance for the system as +a whole. MSIs are never shared, so this problem cannot arise. + +When a device writes data to memory, then raises a pin-based interrupt, +it is possible that the interrupt may arrive before all the data has +arrived in memory (this becomes more likely with devices behind PCI-PCI +bridges). In order to ensure that all the data has arrived in memory, +the interrupt handler must read a register on the device which raised +the interrupt. PCI transaction ordering rules require that all the data +arrive in memory before the value may be returned from the register. +Using MSIs avoids this problem as the interrupt-generating write cannot +pass the data writes, so by the time the interrupt is raised, the driver +knows that all the data has arrived in memory. + +PCI devices can only support a single pin-based interrupt per function. +Often drivers have to query the device to find out what event has +occurred, slowing down interrupt handling for the common case. With +MSIs, a device can support more interrupts, allowing each interrupt +to be specialised to a different purpose. One possible design gives +infrequent conditions (such as errors) their own interrupt which allows +the driver to handle the normal interrupt handling path more efficiently. +Other possible designs include giving one interrupt to each packet queue +in a network card or each port in a storage controller. + + +How to use MSIs +=============== + +PCI devices are initialised to use pin-based interrupts. The device +driver has to set up the device to use MSI or MSI-X. Not all machines +support MSIs correctly, and for those machines, the APIs described below +will simply fail and the device will continue to use pin-based interrupts. + +Include kernel support for MSIs +------------------------------- + +To support MSI or MSI-X, the kernel must be built with the CONFIG_PCI_MSI +option enabled. This option is only available on some architectures, +and it may depend on some other options also being set. For example, +on x86, you must also enable X86_UP_APIC or SMP in order to see the +CONFIG_PCI_MSI option. + +Using MSI +--------- + +Most of the hard work is done for the driver in the PCI layer. The driver +simply has to request that the PCI layer set up the MSI capability for this +device. + +To automatically use MSI or MSI-X interrupt vectors, use the following +function:: + + int pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs, + unsigned int max_vecs, unsigned int flags); + +which allocates up to max_vecs interrupt vectors for a PCI device. It +returns the number of vectors allocated or a negative error. If the device +has a requirements for a minimum number of vectors the driver can pass a +min_vecs argument set to this limit, and the PCI core will return -ENOSPC +if it can't meet the minimum number of vectors. + +The flags argument is used to specify which type of interrupt can be used +by the device and the driver (PCI_IRQ_LEGACY, PCI_IRQ_MSI, PCI_IRQ_MSIX). +A convenient short-hand (PCI_IRQ_ALL_TYPES) is also available to ask for +any possible kind of interrupt. If the PCI_IRQ_AFFINITY flag is set, +pci_alloc_irq_vectors() will spread the interrupts around the available CPUs. + +To get the Linux IRQ numbers passed to request_irq() and free_irq() and the +vectors, use the following function:: + + int pci_irq_vector(struct pci_dev *dev, unsigned int nr); + +Any allocated resources should be freed before removing the device using +the following function:: + + void pci_free_irq_vectors(struct pci_dev *dev); + +If a device supports both MSI-X and MSI capabilities, this API will use the +MSI-X facilities in preference to the MSI facilities. MSI-X supports any +number of interrupts between 1 and 2048. In contrast, MSI is restricted to +a maximum of 32 interrupts (and must be a power of two). In addition, the +MSI interrupt vectors must be allocated consecutively, so the system might +not be able to allocate as many vectors for MSI as it could for MSI-X. On +some platforms, MSI interrupts must all be targeted at the same set of CPUs +whereas MSI-X interrupts can all be targeted at different CPUs. + +If a device supports neither MSI-X or MSI it will fall back to a single +legacy IRQ vector. + +The typical usage of MSI or MSI-X interrupts is to allocate as many vectors +as possible, likely up to the limit supported by the device. If nvec is +larger than the number supported by the device it will automatically be +capped to the supported limit, so there is no need to query the number of +vectors supported beforehand:: + + nvec = pci_alloc_irq_vectors(pdev, 1, nvec, PCI_IRQ_ALL_TYPES) + if (nvec < 0) + goto out_err; + +If a driver is unable or unwilling to deal with a variable number of MSI +interrupts it can request a particular number of interrupts by passing that +number to pci_alloc_irq_vectors() function as both 'min_vecs' and +'max_vecs' parameters:: + + ret = pci_alloc_irq_vectors(pdev, nvec, nvec, PCI_IRQ_ALL_TYPES); + if (ret < 0) + goto out_err; + +The most notorious example of the request type described above is enabling +the single MSI mode for a device. It could be done by passing two 1s as +'min_vecs' and 'max_vecs':: + + ret = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_ALL_TYPES); + if (ret < 0) + goto out_err; + +Some devices might not support using legacy line interrupts, in which case +the driver can specify that only MSI or MSI-X is acceptable:: + + nvec = pci_alloc_irq_vectors(pdev, 1, nvec, PCI_IRQ_MSI | PCI_IRQ_MSIX); + if (nvec < 0) + goto out_err; + +Legacy APIs +----------- + +The following old APIs to enable and disable MSI or MSI-X interrupts should +not be used in new code:: + + pci_enable_msi() /* deprecated */ + pci_disable_msi() /* deprecated */ + pci_enable_msix_range() /* deprecated */ + pci_enable_msix_exact() /* deprecated */ + pci_disable_msix() /* deprecated */ + +Additionally there are APIs to provide the number of supported MSI or MSI-X +vectors: pci_msi_vec_count() and pci_msix_vec_count(). In general these +should be avoided in favor of letting pci_alloc_irq_vectors() cap the +number of vectors. If you have a legitimate special use case for the count +of vectors we might have to revisit that decision and add a +pci_nr_irq_vectors() helper that handles MSI and MSI-X transparently. + +Considerations when using MSIs +------------------------------ + +Spinlocks +~~~~~~~~~ + +Most device drivers have a per-device spinlock which is taken in the +interrupt handler. With pin-based interrupts or a single MSI, it is not +necessary to disable interrupts (Linux guarantees the same interrupt will +not be re-entered). If a device uses multiple interrupts, the driver +must disable interrupts while the lock is held. If the device sends +a different interrupt, the driver will deadlock trying to recursively +acquire the spinlock. Such deadlocks can be avoided by using +spin_lock_irqsave() or spin_lock_irq() which disable local interrupts +and acquire the lock (see Documentation/kernel-hacking/locking.rst). + +How to tell whether MSI/MSI-X is enabled on a device +---------------------------------------------------- + +Using 'lspci -v' (as root) may show some devices with "MSI", "Message +Signalled Interrupts" or "MSI-X" capabilities. Each of these capabilities +has an 'Enable' flag which is followed with either "+" (enabled) +or "-" (disabled). + + +MSI quirks +========== + +Several PCI chipsets or devices are known not to support MSIs. +The PCI stack provides three ways to disable MSIs: + +1. globally +2. on all devices behind a specific bridge +3. on a single device + +Disabling MSIs globally +----------------------- + +Some host chipsets simply don't support MSIs properly. If we're +lucky, the manufacturer knows this and has indicated it in the ACPI +FADT table. In this case, Linux automatically disables MSIs. +Some boards don't include this information in the table and so we have +to detect them ourselves. The complete list of these is found near the +quirk_disable_all_msi() function in drivers/pci/quirks.c. + +If you have a board which has problems with MSIs, you can pass pci=nomsi +on the kernel command line to disable MSIs on all devices. It would be +in your best interests to report the problem to linux-pci@vger.kernel.org +including a full 'lspci -v' so we can add the quirks to the kernel. + +Disabling MSIs below a bridge +----------------------------- + +Some PCI bridges are not able to route MSIs between busses properly. +In this case, MSIs must be disabled on all devices behind the bridge. + +Some bridges allow you to enable MSIs by changing some bits in their +PCI configuration space (especially the Hypertransport chipsets such +as the nVidia nForce and Serverworks HT2000). As with host chipsets, +Linux mostly knows about them and automatically enables MSIs if it can. +If you have a bridge unknown to Linux, you can enable +MSIs in configuration space using whatever method you know works, then +enable MSIs on that bridge by doing:: + + echo 1 > /sys/bus/pci/devices/$bridge/msi_bus + +where $bridge is the PCI address of the bridge you've enabled (eg +0000:00:0e.0). + +To disable MSIs, echo 0 instead of 1. Changing this value should be +done with caution as it could break interrupt handling for all devices +below this bridge. + +Again, please notify linux-pci@vger.kernel.org of any bridges that need +special handling. + +Disabling MSIs on a single device +--------------------------------- + +Some devices are known to have faulty MSI implementations. Usually this +is handled in the individual device driver, but occasionally it's necessary +to handle this with a quirk. Some drivers have an option to disable use +of MSI. While this is a convenient workaround for the driver author, +it is not good practice, and should not be emulated. + +Finding why MSIs are disabled on a device +----------------------------------------- + +From the above three sections, you can see that there are many reasons +why MSIs may not be enabled for a given device. Your first step should +be to examine your dmesg carefully to determine whether MSIs are enabled +for your machine. You should also check your .config to be sure you +have enabled CONFIG_PCI_MSI. + +Then, 'lspci -t' gives the list of bridges above a device. Reading +`/sys/bus/pci/devices/*/msi_bus` will tell you whether MSIs are enabled (1) +or disabled (0). If 0 is found in any of the msi_bus files belonging +to bridges between the PCI root and the device, MSIs are disabled. + +It is also worth checking the device driver to see whether it supports MSIs. +For example, it may contain calls to pci_irq_alloc_vectors() with the +PCI_IRQ_MSI or PCI_IRQ_MSIX flags. diff --git a/Documentation/PCI/pci-error-recovery.rst b/Documentation/PCI/pci-error-recovery.rst new file mode 100644 index 000000000000..83db42092935 --- /dev/null +++ b/Documentation/PCI/pci-error-recovery.rst @@ -0,0 +1,424 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================== +PCI Error Recovery +================== + + +:Authors: - Linas Vepstas + - Richard Lary + - Mike Mason + + +Many PCI bus controllers are able to detect a variety of hardware +PCI errors on the bus, such as parity errors on the data and address +buses, as well as SERR and PERR errors. Some of the more advanced +chipsets are able to deal with these errors; these include PCI-E chipsets, +and the PCI-host bridges found on IBM Power4, Power5 and Power6-based +pSeries boxes. A typical action taken is to disconnect the affected device, +halting all I/O to it. The goal of a disconnection is to avoid system +corruption; for example, to halt system memory corruption due to DMA's +to "wild" addresses. Typically, a reconnection mechanism is also +offered, so that the affected PCI device(s) are reset and put back +into working condition. The reset phase requires coordination +between the affected device drivers and the PCI controller chip. +This document describes a generic API for notifying device drivers +of a bus disconnection, and then performing error recovery. +This API is currently implemented in the 2.6.16 and later kernels. + +Reporting and recovery is performed in several steps. First, when +a PCI hardware error has resulted in a bus disconnect, that event +is reported as soon as possible to all affected device drivers, +including multiple instances of a device driver on multi-function +cards. This allows device drivers to avoid deadlocking in spinloops, +waiting for some i/o-space register to change, when it never will. +It also gives the drivers a chance to defer incoming I/O as +needed. + +Next, recovery is performed in several stages. Most of the complexity +is forced by the need to handle multi-function devices, that is, +devices that have multiple device drivers associated with them. +In the first stage, each driver is allowed to indicate what type +of reset it desires, the choices being a simple re-enabling of I/O +or requesting a slot reset. + +If any driver requests a slot reset, that is what will be done. + +After a reset and/or a re-enabling of I/O, all drivers are +again notified, so that they may then perform any device setup/config +that may be required. After these have all completed, a final +"resume normal operations" event is sent out. + +The biggest reason for choosing a kernel-based implementation rather +than a user-space implementation was the need to deal with bus +disconnects of PCI devices attached to storage media, and, in particular, +disconnects from devices holding the root file system. If the root +file system is disconnected, a user-space mechanism would have to go +through a large number of contortions to complete recovery. Almost all +of the current Linux file systems are not tolerant of disconnection +from/reconnection to their underlying block device. By contrast, +bus errors are easy to manage in the device driver. Indeed, most +device drivers already handle very similar recovery procedures; +for example, the SCSI-generic layer already provides significant +mechanisms for dealing with SCSI bus errors and SCSI bus resets. + + +Detailed Design +=============== + +Design and implementation details below, based on a chain of +public email discussions with Ben Herrenschmidt, circa 5 April 2005. + +The error recovery API support is exposed to the driver in the form of +a structure of function pointers pointed to by a new field in struct +pci_driver. A driver that fails to provide the structure is "non-aware", +and the actual recovery steps taken are platform dependent. The +arch/powerpc implementation will simulate a PCI hotplug remove/add. + +This structure has the form:: + + struct pci_error_handlers + { + int (*error_detected)(struct pci_dev *dev, enum pci_channel_state); + int (*mmio_enabled)(struct pci_dev *dev); + int (*slot_reset)(struct pci_dev *dev); + void (*resume)(struct pci_dev *dev); + }; + +The possible channel states are:: + + enum pci_channel_state { + pci_channel_io_normal, /* I/O channel is in normal state */ + pci_channel_io_frozen, /* I/O to channel is blocked */ + pci_channel_io_perm_failure, /* PCI card is dead */ + }; + +Possible return values are:: + + enum pci_ers_result { + PCI_ERS_RESULT_NONE, /* no result/none/not supported in device driver */ + PCI_ERS_RESULT_CAN_RECOVER, /* Device driver can recover without slot reset */ + PCI_ERS_RESULT_NEED_RESET, /* Device driver wants slot to be reset. */ + PCI_ERS_RESULT_DISCONNECT, /* Device has completely failed, is unrecoverable */ + PCI_ERS_RESULT_RECOVERED, /* Device driver is fully recovered and operational */ + }; + +A driver does not have to implement all of these callbacks; however, +if it implements any, it must implement error_detected(). If a callback +is not implemented, the corresponding feature is considered unsupported. +For example, if mmio_enabled() and resume() aren't there, then it +is assumed that the driver is not doing any direct recovery and requires +a slot reset. Typically a driver will want to know about +a slot_reset(). + +The actual steps taken by a platform to recover from a PCI error +event will be platform-dependent, but will follow the general +sequence described below. + +STEP 0: Error Event +------------------- +A PCI bus error is detected by the PCI hardware. On powerpc, the slot +is isolated, in that all I/O is blocked: all reads return 0xffffffff, +all writes are ignored. + + +STEP 1: Notification +-------------------- +Platform calls the error_detected() callback on every instance of +every driver affected by the error. + +At this point, the device might not be accessible anymore, depending on +the platform (the slot will be isolated on powerpc). The driver may +already have "noticed" the error because of a failing I/O, but this +is the proper "synchronization point", that is, it gives the driver +a chance to cleanup, waiting for pending stuff (timers, whatever, etc...) +to complete; it can take semaphores, schedule, etc... everything but +touch the device. Within this function and after it returns, the driver +shouldn't do any new IOs. Called in task context. This is sort of a +"quiesce" point. See note about interrupts at the end of this doc. + +All drivers participating in this system must implement this call. +The driver must return one of the following result codes: + + - PCI_ERS_RESULT_CAN_RECOVER + Driver returns this if it thinks it might be able to recover + the HW by just banging IOs or if it wants to be given + a chance to extract some diagnostic information (see + mmio_enable, below). + - PCI_ERS_RESULT_NEED_RESET + Driver returns this if it can't recover without a + slot reset. + - PCI_ERS_RESULT_DISCONNECT + Driver returns this if it doesn't want to recover at all. + +The next step taken will depend on the result codes returned by the +drivers. + +If all drivers on the segment/slot return PCI_ERS_RESULT_CAN_RECOVER, +then the platform should re-enable IOs on the slot (or do nothing in +particular, if the platform doesn't isolate slots), and recovery +proceeds to STEP 2 (MMIO Enable). + +If any driver requested a slot reset (by returning PCI_ERS_RESULT_NEED_RESET), +then recovery proceeds to STEP 4 (Slot Reset). + +If the platform is unable to recover the slot, the next step +is STEP 6 (Permanent Failure). + +.. note:: + + The current powerpc implementation assumes that a device driver will + *not* schedule or semaphore in this routine; the current powerpc + implementation uses one kernel thread to notify all devices; + thus, if one device sleeps/schedules, all devices are affected. + Doing better requires complex multi-threaded logic in the error + recovery implementation (e.g. waiting for all notification threads + to "join" before proceeding with recovery.) This seems excessively + complex and not worth implementing. + + The current powerpc implementation doesn't much care if the device + attempts I/O at this point, or not. I/O's will fail, returning + a value of 0xff on read, and writes will be dropped. If more than + EEH_MAX_FAILS I/O's are attempted to a frozen adapter, EEH + assumes that the device driver has gone into an infinite loop + and prints an error to syslog. A reboot is then required to + get the device working again. + +STEP 2: MMIO Enabled +-------------------- +The platform re-enables MMIO to the device (but typically not the +DMA), and then calls the mmio_enabled() callback on all affected +device drivers. + +This is the "early recovery" call. IOs are allowed again, but DMA is +not, with some restrictions. This is NOT a callback for the driver to +start operations again, only to peek/poke at the device, extract diagnostic +information, if any, and eventually do things like trigger a device local +reset or some such, but not restart operations. This callback is made if +all drivers on a segment agree that they can try to recover and if no automatic +link reset was performed by the HW. If the platform can't just re-enable IOs +without a slot reset or a link reset, it will not call this callback, and +instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset) + +.. note:: + + The following is proposed; no platform implements this yet: + Proposal: All I/O's should be done _synchronously_ from within + this callback, errors triggered by them will be returned via + the normal pci_check_whatever() API, no new error_detected() + callback will be issued due to an error happening here. However, + such an error might cause IOs to be re-blocked for the whole + segment, and thus invalidate the recovery that other devices + on the same segment might have done, forcing the whole segment + into one of the next states, that is, link reset or slot reset. + +The driver should return one of the following result codes: + - PCI_ERS_RESULT_RECOVERED + Driver returns this if it thinks the device is fully + functional and thinks it is ready to start + normal driver operations again. There is no + guarantee that the driver will actually be + allowed to proceed, as another driver on the + same segment might have failed and thus triggered a + slot reset on platforms that support it. + + - PCI_ERS_RESULT_NEED_RESET + Driver returns this if it thinks the device is not + recoverable in its current state and it needs a slot + reset to proceed. + + - PCI_ERS_RESULT_DISCONNECT + Same as above. Total failure, no recovery even after + reset driver dead. (To be defined more precisely) + +The next step taken depends on the results returned by the drivers. +If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform +proceeds to either STEP3 (Link Reset) or to STEP 5 (Resume Operations). + +If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform +proceeds to STEP 4 (Slot Reset) + +STEP 3: Link Reset +------------------ +The platform resets the link. This is a PCI-Express specific step +and is done whenever a fatal error has been detected that can be +"solved" by resetting the link. + +STEP 4: Slot Reset +------------------ + +In response to a return value of PCI_ERS_RESULT_NEED_RESET, the +the platform will perform a slot reset on the requesting PCI device(s). +The actual steps taken by a platform to perform a slot reset +will be platform-dependent. Upon completion of slot reset, the +platform will call the device slot_reset() callback. + +Powerpc platforms implement two levels of slot reset: +soft reset(default) and fundamental(optional) reset. + +Powerpc soft reset consists of asserting the adapter #RST line and then +restoring the PCI BAR's and PCI configuration header to a state +that is equivalent to what it would be after a fresh system +power-on followed by power-on BIOS/system firmware initialization. +Soft reset is also known as hot-reset. + +Powerpc fundamental reset is supported by PCI Express cards only +and results in device's state machines, hardware logic, port states and +configuration registers to initialize to their default conditions. + +For most PCI devices, a soft reset will be sufficient for recovery. +Optional fundamental reset is provided to support a limited number +of PCI Express devices for which a soft reset is not sufficient +for recovery. + +If the platform supports PCI hotplug, then the reset might be +performed by toggling the slot electrical power off/on. + +It is important for the platform to restore the PCI config space +to the "fresh poweron" state, rather than the "last state". After +a slot reset, the device driver will almost always use its standard +device initialization routines, and an unusual config space setup +may result in hung devices, kernel panics, or silent data corruption. + +This call gives drivers the chance to re-initialize the hardware +(re-download firmware, etc.). At this point, the driver may assume +that the card is in a fresh state and is fully functional. The slot +is unfrozen and the driver has full access to PCI config space, +memory mapped I/O space and DMA. Interrupts (Legacy, MSI, or MSI-X) +will also be available. + +Drivers should not restart normal I/O processing operations +at this point. If all device drivers report success on this +callback, the platform will call resume() to complete the sequence, +and let the driver restart normal I/O processing. + +A driver can still return a critical failure for this function if +it can't get the device operational after reset. If the platform +previously tried a soft reset, it might now try a hard reset (power +cycle) and then call slot_reset() again. It the device still can't +be recovered, there is nothing more that can be done; the platform +will typically report a "permanent failure" in such a case. The +device will be considered "dead" in this case. + +Drivers for multi-function cards will need to coordinate among +themselves as to which driver instance will perform any "one-shot" +or global device initialization. For example, the Symbios sym53cxx2 +driver performs device init only from PCI function 0:: + + + if (PCI_FUNC(pdev->devfn) == 0) + + sym_reset_scsi_bus(np, 0); + +Result codes: + - PCI_ERS_RESULT_DISCONNECT + Same as above. + +Drivers for PCI Express cards that require a fundamental reset must +set the needs_freset bit in the pci_dev structure in their probe function. +For example, the QLogic qla2xxx driver sets the needs_freset bit for certain +PCI card types:: + + + /* Set EEH reset type to fundamental if required by hba */ + + if (IS_QLA24XX(ha) || IS_QLA25XX(ha) || IS_QLA81XX(ha)) + + pdev->needs_freset = 1; + + + +Platform proceeds either to STEP 5 (Resume Operations) or STEP 6 (Permanent +Failure). + +.. note:: + + The current powerpc implementation does not try a power-cycle + reset if the driver returned PCI_ERS_RESULT_DISCONNECT. + However, it probably should. + + +STEP 5: Resume Operations +------------------------- +The platform will call the resume() callback on all affected device +drivers if all drivers on the segment have returned +PCI_ERS_RESULT_RECOVERED from one of the 3 previous callbacks. +The goal of this callback is to tell the driver to restart activity, +that everything is back and running. This callback does not return +a result code. + +At this point, if a new error happens, the platform will restart +a new error recovery sequence. + +STEP 6: Permanent Failure +------------------------- +A "permanent failure" has occurred, and the platform cannot recover +the device. The platform will call error_detected() with a +pci_channel_state value of pci_channel_io_perm_failure. + +The device driver should, at this point, assume the worst. It should +cancel all pending I/O, refuse all new I/O, returning -EIO to +higher layers. The device driver should then clean up all of its +memory and remove itself from kernel operations, much as it would +during system shutdown. + +The platform will typically notify the system operator of the +permanent failure in some way. If the device is hotplug-capable, +the operator will probably want to remove and replace the device. +Note, however, not all failures are truly "permanent". Some are +caused by over-heating, some by a poorly seated card. Many +PCI error events are caused by software bugs, e.g. DMA's to +wild addresses or bogus split transactions due to programming +errors. See the discussion in powerpc/eeh-pci-error-recovery.txt +for additional detail on real-life experience of the causes of +software errors. + + +Conclusion; General Remarks +--------------------------- +The way the callbacks are called is platform policy. A platform with +no slot reset capability may want to just "ignore" drivers that can't +recover (disconnect them) and try to let other cards on the same segment +recover. Keep in mind that in most real life cases, though, there will +be only one driver per segment. + +Now, a note about interrupts. If you get an interrupt and your +device is dead or has been isolated, there is a problem :) +The current policy is to turn this into a platform policy. +That is, the recovery API only requires that: + + - There is no guarantee that interrupt delivery can proceed from any + device on the segment starting from the error detection and until the + slot_reset callback is called, at which point interrupts are expected + to be fully operational. + + - There is no guarantee that interrupt delivery is stopped, that is, + a driver that gets an interrupt after detecting an error, or that detects + an error within the interrupt handler such that it prevents proper + ack'ing of the interrupt (and thus removal of the source) should just + return IRQ_NOTHANDLED. It's up to the platform to deal with that + condition, typically by masking the IRQ source during the duration of + the error handling. It is expected that the platform "knows" which + interrupts are routed to error-management capable slots and can deal + with temporarily disabling that IRQ number during error processing (this + isn't terribly complex). That means some IRQ latency for other devices + sharing the interrupt, but there is simply no other way. High end + platforms aren't supposed to share interrupts between many devices + anyway :) + +.. note:: + + Implementation details for the powerpc platform are discussed in + the file Documentation/powerpc/eeh-pci-error-recovery.txt + + As of this writing, there is a growing list of device drivers with + patches implementing error recovery. Not all of these patches are in + mainline yet. These may be used as "examples": + + - drivers/scsi/ipr + - drivers/scsi/sym53c8xx_2 + - drivers/scsi/qla2xxx + - drivers/scsi/lpfc + - drivers/next/bnx2.c + - drivers/next/e100.c + - drivers/net/e1000 + - drivers/net/e1000e + - drivers/net/ixgb + - drivers/net/ixgbe + - drivers/net/cxgb3 + - drivers/net/s2io.c + - drivers/net/qlge diff --git a/Documentation/PCI/pci-error-recovery.txt b/Documentation/PCI/pci-error-recovery.txt deleted file mode 100644 index 0b6bb3ef449e..000000000000 --- a/Documentation/PCI/pci-error-recovery.txt +++ /dev/null @@ -1,413 +0,0 @@ - - PCI Error Recovery - ------------------ - February 2, 2006 - - Current document maintainer: - Linas Vepstas - updated by Richard Lary - and Mike Mason on 27-Jul-2009 - - -Many PCI bus controllers are able to detect a variety of hardware -PCI errors on the bus, such as parity errors on the data and address -buses, as well as SERR and PERR errors. Some of the more advanced -chipsets are able to deal with these errors; these include PCI-E chipsets, -and the PCI-host bridges found on IBM Power4, Power5 and Power6-based -pSeries boxes. A typical action taken is to disconnect the affected device, -halting all I/O to it. The goal of a disconnection is to avoid system -corruption; for example, to halt system memory corruption due to DMA's -to "wild" addresses. Typically, a reconnection mechanism is also -offered, so that the affected PCI device(s) are reset and put back -into working condition. The reset phase requires coordination -between the affected device drivers and the PCI controller chip. -This document describes a generic API for notifying device drivers -of a bus disconnection, and then performing error recovery. -This API is currently implemented in the 2.6.16 and later kernels. - -Reporting and recovery is performed in several steps. First, when -a PCI hardware error has resulted in a bus disconnect, that event -is reported as soon as possible to all affected device drivers, -including multiple instances of a device driver on multi-function -cards. This allows device drivers to avoid deadlocking in spinloops, -waiting for some i/o-space register to change, when it never will. -It also gives the drivers a chance to defer incoming I/O as -needed. - -Next, recovery is performed in several stages. Most of the complexity -is forced by the need to handle multi-function devices, that is, -devices that have multiple device drivers associated with them. -In the first stage, each driver is allowed to indicate what type -of reset it desires, the choices being a simple re-enabling of I/O -or requesting a slot reset. - -If any driver requests a slot reset, that is what will be done. - -After a reset and/or a re-enabling of I/O, all drivers are -again notified, so that they may then perform any device setup/config -that may be required. After these have all completed, a final -"resume normal operations" event is sent out. - -The biggest reason for choosing a kernel-based implementation rather -than a user-space implementation was the need to deal with bus -disconnects of PCI devices attached to storage media, and, in particular, -disconnects from devices holding the root file system. If the root -file system is disconnected, a user-space mechanism would have to go -through a large number of contortions to complete recovery. Almost all -of the current Linux file systems are not tolerant of disconnection -from/reconnection to their underlying block device. By contrast, -bus errors are easy to manage in the device driver. Indeed, most -device drivers already handle very similar recovery procedures; -for example, the SCSI-generic layer already provides significant -mechanisms for dealing with SCSI bus errors and SCSI bus resets. - - -Detailed Design ---------------- -Design and implementation details below, based on a chain of -public email discussions with Ben Herrenschmidt, circa 5 April 2005. - -The error recovery API support is exposed to the driver in the form of -a structure of function pointers pointed to by a new field in struct -pci_driver. A driver that fails to provide the structure is "non-aware", -and the actual recovery steps taken are platform dependent. The -arch/powerpc implementation will simulate a PCI hotplug remove/add. - -This structure has the form: -struct pci_error_handlers -{ - int (*error_detected)(struct pci_dev *dev, enum pci_channel_state); - int (*mmio_enabled)(struct pci_dev *dev); - int (*slot_reset)(struct pci_dev *dev); - void (*resume)(struct pci_dev *dev); -}; - -The possible channel states are: -enum pci_channel_state { - pci_channel_io_normal, /* I/O channel is in normal state */ - pci_channel_io_frozen, /* I/O to channel is blocked */ - pci_channel_io_perm_failure, /* PCI card is dead */ -}; - -Possible return values are: -enum pci_ers_result { - PCI_ERS_RESULT_NONE, /* no result/none/not supported in device driver */ - PCI_ERS_RESULT_CAN_RECOVER, /* Device driver can recover without slot reset */ - PCI_ERS_RESULT_NEED_RESET, /* Device driver wants slot to be reset. */ - PCI_ERS_RESULT_DISCONNECT, /* Device has completely failed, is unrecoverable */ - PCI_ERS_RESULT_RECOVERED, /* Device driver is fully recovered and operational */ -}; - -A driver does not have to implement all of these callbacks; however, -if it implements any, it must implement error_detected(). If a callback -is not implemented, the corresponding feature is considered unsupported. -For example, if mmio_enabled() and resume() aren't there, then it -is assumed that the driver is not doing any direct recovery and requires -a slot reset. Typically a driver will want to know about -a slot_reset(). - -The actual steps taken by a platform to recover from a PCI error -event will be platform-dependent, but will follow the general -sequence described below. - -STEP 0: Error Event -------------------- -A PCI bus error is detected by the PCI hardware. On powerpc, the slot -is isolated, in that all I/O is blocked: all reads return 0xffffffff, -all writes are ignored. - - -STEP 1: Notification --------------------- -Platform calls the error_detected() callback on every instance of -every driver affected by the error. - -At this point, the device might not be accessible anymore, depending on -the platform (the slot will be isolated on powerpc). The driver may -already have "noticed" the error because of a failing I/O, but this -is the proper "synchronization point", that is, it gives the driver -a chance to cleanup, waiting for pending stuff (timers, whatever, etc...) -to complete; it can take semaphores, schedule, etc... everything but -touch the device. Within this function and after it returns, the driver -shouldn't do any new IOs. Called in task context. This is sort of a -"quiesce" point. See note about interrupts at the end of this doc. - -All drivers participating in this system must implement this call. -The driver must return one of the following result codes: - - PCI_ERS_RESULT_CAN_RECOVER: - Driver returns this if it thinks it might be able to recover - the HW by just banging IOs or if it wants to be given - a chance to extract some diagnostic information (see - mmio_enable, below). - - PCI_ERS_RESULT_NEED_RESET: - Driver returns this if it can't recover without a - slot reset. - - PCI_ERS_RESULT_DISCONNECT: - Driver returns this if it doesn't want to recover at all. - -The next step taken will depend on the result codes returned by the -drivers. - -If all drivers on the segment/slot return PCI_ERS_RESULT_CAN_RECOVER, -then the platform should re-enable IOs on the slot (or do nothing in -particular, if the platform doesn't isolate slots), and recovery -proceeds to STEP 2 (MMIO Enable). - -If any driver requested a slot reset (by returning PCI_ERS_RESULT_NEED_RESET), -then recovery proceeds to STEP 4 (Slot Reset). - -If the platform is unable to recover the slot, the next step -is STEP 6 (Permanent Failure). - ->>> The current powerpc implementation assumes that a device driver will ->>> *not* schedule or semaphore in this routine; the current powerpc ->>> implementation uses one kernel thread to notify all devices; ->>> thus, if one device sleeps/schedules, all devices are affected. ->>> Doing better requires complex multi-threaded logic in the error ->>> recovery implementation (e.g. waiting for all notification threads ->>> to "join" before proceeding with recovery.) This seems excessively ->>> complex and not worth implementing. - ->>> The current powerpc implementation doesn't much care if the device ->>> attempts I/O at this point, or not. I/O's will fail, returning ->>> a value of 0xff on read, and writes will be dropped. If more than ->>> EEH_MAX_FAILS I/O's are attempted to a frozen adapter, EEH ->>> assumes that the device driver has gone into an infinite loop ->>> and prints an error to syslog. A reboot is then required to ->>> get the device working again. - -STEP 2: MMIO Enabled -------------------- -The platform re-enables MMIO to the device (but typically not the -DMA), and then calls the mmio_enabled() callback on all affected -device drivers. - -This is the "early recovery" call. IOs are allowed again, but DMA is -not, with some restrictions. This is NOT a callback for the driver to -start operations again, only to peek/poke at the device, extract diagnostic -information, if any, and eventually do things like trigger a device local -reset or some such, but not restart operations. This callback is made if -all drivers on a segment agree that they can try to recover and if no automatic -link reset was performed by the HW. If the platform can't just re-enable IOs -without a slot reset or a link reset, it will not call this callback, and -instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset) - ->>> The following is proposed; no platform implements this yet: ->>> Proposal: All I/O's should be done _synchronously_ from within ->>> this callback, errors triggered by them will be returned via ->>> the normal pci_check_whatever() API, no new error_detected() ->>> callback will be issued due to an error happening here. However, ->>> such an error might cause IOs to be re-blocked for the whole ->>> segment, and thus invalidate the recovery that other devices ->>> on the same segment might have done, forcing the whole segment ->>> into one of the next states, that is, link reset or slot reset. - -The driver should return one of the following result codes: - - PCI_ERS_RESULT_RECOVERED - Driver returns this if it thinks the device is fully - functional and thinks it is ready to start - normal driver operations again. There is no - guarantee that the driver will actually be - allowed to proceed, as another driver on the - same segment might have failed and thus triggered a - slot reset on platforms that support it. - - - PCI_ERS_RESULT_NEED_RESET - Driver returns this if it thinks the device is not - recoverable in its current state and it needs a slot - reset to proceed. - - - PCI_ERS_RESULT_DISCONNECT - Same as above. Total failure, no recovery even after - reset driver dead. (To be defined more precisely) - -The next step taken depends on the results returned by the drivers. -If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform -proceeds to either STEP3 (Link Reset) or to STEP 5 (Resume Operations). - -If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform -proceeds to STEP 4 (Slot Reset) - -STEP 3: Link Reset ------------------- -The platform resets the link. This is a PCI-Express specific step -and is done whenever a fatal error has been detected that can be -"solved" by resetting the link. - -STEP 4: Slot Reset ------------------- - -In response to a return value of PCI_ERS_RESULT_NEED_RESET, the -the platform will perform a slot reset on the requesting PCI device(s). -The actual steps taken by a platform to perform a slot reset -will be platform-dependent. Upon completion of slot reset, the -platform will call the device slot_reset() callback. - -Powerpc platforms implement two levels of slot reset: -soft reset(default) and fundamental(optional) reset. - -Powerpc soft reset consists of asserting the adapter #RST line and then -restoring the PCI BAR's and PCI configuration header to a state -that is equivalent to what it would be after a fresh system -power-on followed by power-on BIOS/system firmware initialization. -Soft reset is also known as hot-reset. - -Powerpc fundamental reset is supported by PCI Express cards only -and results in device's state machines, hardware logic, port states and -configuration registers to initialize to their default conditions. - -For most PCI devices, a soft reset will be sufficient for recovery. -Optional fundamental reset is provided to support a limited number -of PCI Express devices for which a soft reset is not sufficient -for recovery. - -If the platform supports PCI hotplug, then the reset might be -performed by toggling the slot electrical power off/on. - -It is important for the platform to restore the PCI config space -to the "fresh poweron" state, rather than the "last state". After -a slot reset, the device driver will almost always use its standard -device initialization routines, and an unusual config space setup -may result in hung devices, kernel panics, or silent data corruption. - -This call gives drivers the chance to re-initialize the hardware -(re-download firmware, etc.). At this point, the driver may assume -that the card is in a fresh state and is fully functional. The slot -is unfrozen and the driver has full access to PCI config space, -memory mapped I/O space and DMA. Interrupts (Legacy, MSI, or MSI-X) -will also be available. - -Drivers should not restart normal I/O processing operations -at this point. If all device drivers report success on this -callback, the platform will call resume() to complete the sequence, -and let the driver restart normal I/O processing. - -A driver can still return a critical failure for this function if -it can't get the device operational after reset. If the platform -previously tried a soft reset, it might now try a hard reset (power -cycle) and then call slot_reset() again. It the device still can't -be recovered, there is nothing more that can be done; the platform -will typically report a "permanent failure" in such a case. The -device will be considered "dead" in this case. - -Drivers for multi-function cards will need to coordinate among -themselves as to which driver instance will perform any "one-shot" -or global device initialization. For example, the Symbios sym53cxx2 -driver performs device init only from PCI function 0: - -+ if (PCI_FUNC(pdev->devfn) == 0) -+ sym_reset_scsi_bus(np, 0); - - Result codes: - - PCI_ERS_RESULT_DISCONNECT - Same as above. - -Drivers for PCI Express cards that require a fundamental reset must -set the needs_freset bit in the pci_dev structure in their probe function. -For example, the QLogic qla2xxx driver sets the needs_freset bit for certain -PCI card types: - -+ /* Set EEH reset type to fundamental if required by hba */ -+ if (IS_QLA24XX(ha) || IS_QLA25XX(ha) || IS_QLA81XX(ha)) -+ pdev->needs_freset = 1; -+ - -Platform proceeds either to STEP 5 (Resume Operations) or STEP 6 (Permanent -Failure). - ->>> The current powerpc implementation does not try a power-cycle ->>> reset if the driver returned PCI_ERS_RESULT_DISCONNECT. ->>> However, it probably should. - - -STEP 5: Resume Operations -------------------------- -The platform will call the resume() callback on all affected device -drivers if all drivers on the segment have returned -PCI_ERS_RESULT_RECOVERED from one of the 3 previous callbacks. -The goal of this callback is to tell the driver to restart activity, -that everything is back and running. This callback does not return -a result code. - -At this point, if a new error happens, the platform will restart -a new error recovery sequence. - -STEP 6: Permanent Failure -------------------------- -A "permanent failure" has occurred, and the platform cannot recover -the device. The platform will call error_detected() with a -pci_channel_state value of pci_channel_io_perm_failure. - -The device driver should, at this point, assume the worst. It should -cancel all pending I/O, refuse all new I/O, returning -EIO to -higher layers. The device driver should then clean up all of its -memory and remove itself from kernel operations, much as it would -during system shutdown. - -The platform will typically notify the system operator of the -permanent failure in some way. If the device is hotplug-capable, -the operator will probably want to remove and replace the device. -Note, however, not all failures are truly "permanent". Some are -caused by over-heating, some by a poorly seated card. Many -PCI error events are caused by software bugs, e.g. DMA's to -wild addresses or bogus split transactions due to programming -errors. See the discussion in powerpc/eeh-pci-error-recovery.txt -for additional detail on real-life experience of the causes of -software errors. - - -Conclusion; General Remarks ---------------------------- -The way the callbacks are called is platform policy. A platform with -no slot reset capability may want to just "ignore" drivers that can't -recover (disconnect them) and try to let other cards on the same segment -recover. Keep in mind that in most real life cases, though, there will -be only one driver per segment. - -Now, a note about interrupts. If you get an interrupt and your -device is dead or has been isolated, there is a problem :) -The current policy is to turn this into a platform policy. -That is, the recovery API only requires that: - - - There is no guarantee that interrupt delivery can proceed from any -device on the segment starting from the error detection and until the -slot_reset callback is called, at which point interrupts are expected -to be fully operational. - - - There is no guarantee that interrupt delivery is stopped, that is, -a driver that gets an interrupt after detecting an error, or that detects -an error within the interrupt handler such that it prevents proper -ack'ing of the interrupt (and thus removal of the source) should just -return IRQ_NOTHANDLED. It's up to the platform to deal with that -condition, typically by masking the IRQ source during the duration of -the error handling. It is expected that the platform "knows" which -interrupts are routed to error-management capable slots and can deal -with temporarily disabling that IRQ number during error processing (this -isn't terribly complex). That means some IRQ latency for other devices -sharing the interrupt, but there is simply no other way. High end -platforms aren't supposed to share interrupts between many devices -anyway :) - ->>> Implementation details for the powerpc platform are discussed in ->>> the file Documentation/powerpc/eeh-pci-error-recovery.txt - ->>> As of this writing, there is a growing list of device drivers with ->>> patches implementing error recovery. Not all of these patches are in ->>> mainline yet. These may be used as "examples": ->>> ->>> drivers/scsi/ipr ->>> drivers/scsi/sym53c8xx_2 ->>> drivers/scsi/qla2xxx ->>> drivers/scsi/lpfc ->>> drivers/next/bnx2.c ->>> drivers/next/e100.c ->>> drivers/net/e1000 ->>> drivers/net/e1000e ->>> drivers/net/ixgb ->>> drivers/net/ixgbe ->>> drivers/net/cxgb3 ->>> drivers/net/s2io.c ->>> drivers/net/qlge - -The End -------- diff --git a/Documentation/PCI/pci-iov-howto.rst b/Documentation/PCI/pci-iov-howto.rst new file mode 100644 index 000000000000..b9fd003206f1 --- /dev/null +++ b/Documentation/PCI/pci-iov-howto.rst @@ -0,0 +1,172 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: + +==================================== +PCI Express I/O Virtualization Howto +==================================== + +:Copyright: |copy| 2009 Intel Corporation +:Authors: - Yu Zhao + - Donald Dutile + +Overview +======== + +What is SR-IOV +-------------- + +Single Root I/O Virtualization (SR-IOV) is a PCI Express Extended +capability which makes one physical device appear as multiple virtual +devices. The physical device is referred to as Physical Function (PF) +while the virtual devices are referred to as Virtual Functions (VF). +Allocation of the VF can be dynamically controlled by the PF via +registers encapsulated in the capability. By default, this feature is +not enabled and the PF behaves as traditional PCIe device. Once it's +turned on, each VF's PCI configuration space can be accessed by its own +Bus, Device and Function Number (Routing ID). And each VF also has PCI +Memory Space, which is used to map its register set. VF device driver +operates on the register set so it can be functional and appear as a +real existing PCI device. + +User Guide +========== + +How can I enable SR-IOV capability +---------------------------------- + +Multiple methods are available for SR-IOV enablement. +In the first method, the device driver (PF driver) will control the +enabling and disabling of the capability via API provided by SR-IOV core. +If the hardware has SR-IOV capability, loading its PF driver would +enable it and all VFs associated with the PF. Some PF drivers require +a module parameter to be set to determine the number of VFs to enable. +In the second method, a write to the sysfs file sriov_numvfs will +enable and disable the VFs associated with a PCIe PF. This method +enables per-PF, VF enable/disable values versus the first method, +which applies to all PFs of the same device. Additionally, the +PCI SRIOV core support ensures that enable/disable operations are +valid to reduce duplication in multiple drivers for the same +checks, e.g., check numvfs == 0 if enabling VFs, ensure +numvfs <= totalvfs. +The second method is the recommended method for new/future VF devices. + +How can I use the Virtual Functions +----------------------------------- + +The VF is treated as hot-plugged PCI devices in the kernel, so they +should be able to work in the same way as real PCI devices. The VF +requires device driver that is same as a normal PCI device's. + +Developer Guide +=============== + +SR-IOV API +---------- + +To enable SR-IOV capability: + +(a) For the first method, in the driver:: + + int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn); + +'nr_virtfn' is number of VFs to be enabled. + +(b) For the second method, from sysfs:: + + echo 'nr_virtfn' > \ + /sys/bus/pci/devices//sriov_numvfs + +To disable SR-IOV capability: + +(a) For the first method, in the driver:: + + void pci_disable_sriov(struct pci_dev *dev); + +(b) For the second method, from sysfs:: + + echo 0 > \ + /sys/bus/pci/devices//sriov_numvfs + +To enable auto probing VFs by a compatible driver on the host, run +command below before enabling SR-IOV capabilities. This is the +default behavior. +:: + + echo 1 > \ + /sys/bus/pci/devices//sriov_drivers_autoprobe + +To disable auto probing VFs by a compatible driver on the host, run +command below before enabling SR-IOV capabilities. Updating this +entry will not affect VFs which are already probed. +:: + + echo 0 > \ + /sys/bus/pci/devices//sriov_drivers_autoprobe + +Usage example +------------- + +Following piece of code illustrates the usage of the SR-IOV API. +:: + + static int dev_probe(struct pci_dev *dev, const struct pci_device_id *id) + { + pci_enable_sriov(dev, NR_VIRTFN); + + ... + + return 0; + } + + static void dev_remove(struct pci_dev *dev) + { + pci_disable_sriov(dev); + + ... + } + + static int dev_suspend(struct pci_dev *dev, pm_message_t state) + { + ... + + return 0; + } + + static int dev_resume(struct pci_dev *dev) + { + ... + + return 0; + } + + static void dev_shutdown(struct pci_dev *dev) + { + ... + } + + static int dev_sriov_configure(struct pci_dev *dev, int numvfs) + { + if (numvfs > 0) { + ... + pci_enable_sriov(dev, numvfs); + ... + return numvfs; + } + if (numvfs == 0) { + .... + pci_disable_sriov(dev); + ... + return 0; + } + } + + static struct pci_driver dev_driver = { + .name = "SR-IOV Physical Function driver", + .id_table = dev_id_table, + .probe = dev_probe, + .remove = dev_remove, + .suspend = dev_suspend, + .resume = dev_resume, + .shutdown = dev_shutdown, + .sriov_configure = dev_sriov_configure, + }; diff --git a/Documentation/PCI/pci-iov-howto.txt b/Documentation/PCI/pci-iov-howto.txt deleted file mode 100644 index d2a84151e99c..000000000000 --- a/Documentation/PCI/pci-iov-howto.txt +++ /dev/null @@ -1,147 +0,0 @@ - PCI Express I/O Virtualization Howto - Copyright (C) 2009 Intel Corporation - Yu Zhao - - Update: November 2012 - -- sysfs-based SRIOV enable-/disable-ment - Donald Dutile - -1. Overview - -1.1 What is SR-IOV - -Single Root I/O Virtualization (SR-IOV) is a PCI Express Extended -capability which makes one physical device appear as multiple virtual -devices. The physical device is referred to as Physical Function (PF) -while the virtual devices are referred to as Virtual Functions (VF). -Allocation of the VF can be dynamically controlled by the PF via -registers encapsulated in the capability. By default, this feature is -not enabled and the PF behaves as traditional PCIe device. Once it's -turned on, each VF's PCI configuration space can be accessed by its own -Bus, Device and Function Number (Routing ID). And each VF also has PCI -Memory Space, which is used to map its register set. VF device driver -operates on the register set so it can be functional and appear as a -real existing PCI device. - -2. User Guide - -2.1 How can I enable SR-IOV capability - -Multiple methods are available for SR-IOV enablement. -In the first method, the device driver (PF driver) will control the -enabling and disabling of the capability via API provided by SR-IOV core. -If the hardware has SR-IOV capability, loading its PF driver would -enable it and all VFs associated with the PF. Some PF drivers require -a module parameter to be set to determine the number of VFs to enable. -In the second method, a write to the sysfs file sriov_numvfs will -enable and disable the VFs associated with a PCIe PF. This method -enables per-PF, VF enable/disable values versus the first method, -which applies to all PFs of the same device. Additionally, the -PCI SRIOV core support ensures that enable/disable operations are -valid to reduce duplication in multiple drivers for the same -checks, e.g., check numvfs == 0 if enabling VFs, ensure -numvfs <= totalvfs. -The second method is the recommended method for new/future VF devices. - -2.2 How can I use the Virtual Functions - -The VF is treated as hot-plugged PCI devices in the kernel, so they -should be able to work in the same way as real PCI devices. The VF -requires device driver that is same as a normal PCI device's. - -3. Developer Guide - -3.1 SR-IOV API - -To enable SR-IOV capability: -(a) For the first method, in the driver: - int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn); - 'nr_virtfn' is number of VFs to be enabled. -(b) For the second method, from sysfs: - echo 'nr_virtfn' > \ - /sys/bus/pci/devices//sriov_numvfs - -To disable SR-IOV capability: -(a) For the first method, in the driver: - void pci_disable_sriov(struct pci_dev *dev); -(b) For the second method, from sysfs: - echo 0 > \ - /sys/bus/pci/devices//sriov_numvfs - -To enable auto probing VFs by a compatible driver on the host, run -command below before enabling SR-IOV capabilities. This is the -default behavior. - echo 1 > \ - /sys/bus/pci/devices//sriov_drivers_autoprobe - -To disable auto probing VFs by a compatible driver on the host, run -command below before enabling SR-IOV capabilities. Updating this -entry will not affect VFs which are already probed. - echo 0 > \ - /sys/bus/pci/devices//sriov_drivers_autoprobe - -3.2 Usage example - -Following piece of code illustrates the usage of the SR-IOV API. - -static int dev_probe(struct pci_dev *dev, const struct pci_device_id *id) -{ - pci_enable_sriov(dev, NR_VIRTFN); - - ... - - return 0; -} - -static void dev_remove(struct pci_dev *dev) -{ - pci_disable_sriov(dev); - - ... -} - -static int dev_suspend(struct pci_dev *dev, pm_message_t state) -{ - ... - - return 0; -} - -static int dev_resume(struct pci_dev *dev) -{ - ... - - return 0; -} - -static void dev_shutdown(struct pci_dev *dev) -{ - ... -} - -static int dev_sriov_configure(struct pci_dev *dev, int numvfs) -{ - if (numvfs > 0) { - ... - pci_enable_sriov(dev, numvfs); - ... - return numvfs; - } - if (numvfs == 0) { - .... - pci_disable_sriov(dev); - ... - return 0; - } -} - -static struct pci_driver dev_driver = { - .name = "SR-IOV Physical Function driver", - .id_table = dev_id_table, - .probe = dev_probe, - .remove = dev_remove, - .suspend = dev_suspend, - .resume = dev_resume, - .shutdown = dev_shutdown, - .sriov_configure = dev_sriov_configure, -}; diff --git a/Documentation/PCI/pci.rst b/Documentation/PCI/pci.rst new file mode 100644 index 000000000000..6864f9a70f5f --- /dev/null +++ b/Documentation/PCI/pci.rst @@ -0,0 +1,578 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================== +How To Write Linux PCI Drivers +============================== + +:Authors: - Martin Mares + - Grant Grundler + +The world of PCI is vast and full of (mostly unpleasant) surprises. +Since each CPU architecture implements different chip-sets and PCI devices +have different requirements (erm, "features"), the result is the PCI support +in the Linux kernel is not as trivial as one would wish. This short paper +tries to introduce all potential driver authors to Linux APIs for +PCI device drivers. + +A more complete resource is the third edition of "Linux Device Drivers" +by Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman. +LDD3 is available for free (under Creative Commons License) from: +http://lwn.net/Kernel/LDD3/. + +However, keep in mind that all documents are subject to "bit rot". +Refer to the source code if things are not working as described here. + +Please send questions/comments/patches about Linux PCI API to the +"Linux PCI" mailing list. + + +Structure of PCI drivers +======================== +PCI drivers "discover" PCI devices in a system via pci_register_driver(). +Actually, it's the other way around. When the PCI generic code discovers +a new device, the driver with a matching "description" will be notified. +Details on this below. + +pci_register_driver() leaves most of the probing for devices to +the PCI layer and supports online insertion/removal of devices [thus +supporting hot-pluggable PCI, CardBus, and Express-Card in a single driver]. +pci_register_driver() call requires passing in a table of function +pointers and thus dictates the high level structure of a driver. + +Once the driver knows about a PCI device and takes ownership, the +driver generally needs to perform the following initialization: + + - Enable the device + - Request MMIO/IOP resources + - Set the DMA mask size (for both coherent and streaming DMA) + - Allocate and initialize shared control data (pci_allocate_coherent()) + - Access device configuration space (if needed) + - Register IRQ handler (request_irq()) + - Initialize non-PCI (i.e. LAN/SCSI/etc parts of the chip) + - Enable DMA/processing engines + +When done using the device, and perhaps the module needs to be unloaded, +the driver needs to take the follow steps: + + - Disable the device from generating IRQs + - Release the IRQ (free_irq()) + - Stop all DMA activity + - Release DMA buffers (both streaming and coherent) + - Unregister from other subsystems (e.g. scsi or netdev) + - Release MMIO/IOP resources + - Disable the device + +Most of these topics are covered in the following sections. +For the rest look at LDD3 or . + +If the PCI subsystem is not configured (CONFIG_PCI is not set), most of +the PCI functions described below are defined as inline functions either +completely empty or just returning an appropriate error codes to avoid +lots of ifdefs in the drivers. + + +pci_register_driver() call +========================== + +PCI device drivers call ``pci_register_driver()`` during their +initialization with a pointer to a structure describing the driver +(``struct pci_driver``): + +.. kernel-doc:: include/linux/pci.h + :functions: pci_driver + +The ID table is an array of ``struct pci_device_id`` entries ending with an +all-zero entry. Definitions with static const are generally preferred. + +.. kernel-doc:: include/linux/mod_devicetable.h + :functions: pci_device_id + +Most drivers only need ``PCI_DEVICE()`` or ``PCI_DEVICE_CLASS()`` to set up +a pci_device_id table. + +New PCI IDs may be added to a device driver pci_ids table at runtime +as shown below:: + + echo "vendor device subvendor subdevice class class_mask driver_data" > \ + /sys/bus/pci/drivers/{driver}/new_id + +All fields are passed in as hexadecimal values (no leading 0x). +The vendor and device fields are mandatory, the others are optional. Users +need pass only as many optional fields as necessary: + + - subvendor and subdevice fields default to PCI_ANY_ID (FFFFFFFF) + - class and classmask fields default to 0 + - driver_data defaults to 0UL. + +Note that driver_data must match the value used by any of the pci_device_id +entries defined in the driver. This makes the driver_data field mandatory +if all the pci_device_id entries have a non-zero driver_data value. + +Once added, the driver probe routine will be invoked for any unclaimed +PCI devices listed in its (newly updated) pci_ids list. + +When the driver exits, it just calls pci_unregister_driver() and the PCI layer +automatically calls the remove hook for all devices handled by the driver. + + +"Attributes" for driver functions/data +-------------------------------------- + +Please mark the initialization and cleanup functions where appropriate +(the corresponding macros are defined in ): + + ====== ================================================= + __init Initialization code. Thrown away after the driver + initializes. + __exit Exit code. Ignored for non-modular drivers. + ====== ================================================= + +Tips on when/where to use the above attributes: + - The module_init()/module_exit() functions (and all + initialization functions called _only_ from these) + should be marked __init/__exit. + + - Do not mark the struct pci_driver. + + - Do NOT mark a function if you are not sure which mark to use. + Better to not mark the function than mark the function wrong. + + +How to find PCI devices manually +================================ + +PCI drivers should have a really good reason for not using the +pci_register_driver() interface to search for PCI devices. +The main reason PCI devices are controlled by multiple drivers +is because one PCI device implements several different HW services. +E.g. combined serial/parallel port/floppy controller. + +A manual search may be performed using the following constructs: + +Searching by vendor and device ID:: + + struct pci_dev *dev = NULL; + while (dev = pci_get_device(VENDOR_ID, DEVICE_ID, dev)) + configure_device(dev); + +Searching by class ID (iterate in a similar way):: + + pci_get_class(CLASS_ID, dev) + +Searching by both vendor/device and subsystem vendor/device ID:: + + pci_get_subsys(VENDOR_ID,DEVICE_ID, SUBSYS_VENDOR_ID, SUBSYS_DEVICE_ID, dev). + +You can use the constant PCI_ANY_ID as a wildcard replacement for +VENDOR_ID or DEVICE_ID. This allows searching for any device from a +specific vendor, for example. + +These functions are hotplug-safe. They increment the reference count on +the pci_dev that they return. You must eventually (possibly at module unload) +decrement the reference count on these devices by calling pci_dev_put(). + + +Device Initialization Steps +=========================== + +As noted in the introduction, most PCI drivers need the following steps +for device initialization: + + - Enable the device + - Request MMIO/IOP resources + - Set the DMA mask size (for both coherent and streaming DMA) + - Allocate and initialize shared control data (pci_allocate_coherent()) + - Access device configuration space (if needed) + - Register IRQ handler (request_irq()) + - Initialize non-PCI (i.e. LAN/SCSI/etc parts of the chip) + - Enable DMA/processing engines. + +The driver can access PCI config space registers at any time. +(Well, almost. When running BIST, config space can go away...but +that will just result in a PCI Bus Master Abort and config reads +will return garbage). + + +Enable the PCI device +--------------------- +Before touching any device registers, the driver needs to enable +the PCI device by calling pci_enable_device(). This will: + + - wake up the device if it was in suspended state, + - allocate I/O and memory regions of the device (if BIOS did not), + - allocate an IRQ (if BIOS did not). + +.. note:: + pci_enable_device() can fail! Check the return value. + +.. warning:: + OS BUG: we don't check resource allocations before enabling those + resources. The sequence would make more sense if we called + pci_request_resources() before calling pci_enable_device(). + Currently, the device drivers can't detect the bug when when two + devices have been allocated the same range. This is not a common + problem and unlikely to get fixed soon. + + This has been discussed before but not changed as of 2.6.19: + http://lkml.org/lkml/2006/3/2/194 + + +pci_set_master() will enable DMA by setting the bus master bit +in the PCI_COMMAND register. It also fixes the latency timer value if +it's set to something bogus by the BIOS. pci_clear_master() will +disable DMA by clearing the bus master bit. + +If the PCI device can use the PCI Memory-Write-Invalidate transaction, +call pci_set_mwi(). This enables the PCI_COMMAND bit for Mem-Wr-Inval +and also ensures that the cache line size register is set correctly. +Check the return value of pci_set_mwi() as not all architectures +or chip-sets may support Memory-Write-Invalidate. Alternatively, +if Mem-Wr-Inval would be nice to have but is not required, call +pci_try_set_mwi() to have the system do its best effort at enabling +Mem-Wr-Inval. + + +Request MMIO/IOP resources +-------------------------- +Memory (MMIO), and I/O port addresses should NOT be read directly +from the PCI device config space. Use the values in the pci_dev structure +as the PCI "bus address" might have been remapped to a "host physical" +address by the arch/chip-set specific kernel support. + +See Documentation/io-mapping.txt for how to access device registers +or device memory. + +The device driver needs to call pci_request_region() to verify +no other device is already using the same address resource. +Conversely, drivers should call pci_release_region() AFTER +calling pci_disable_device(). +The idea is to prevent two devices colliding on the same address range. + +.. tip:: + See OS BUG comment above. Currently (2.6.19), The driver can only + determine MMIO and IO Port resource availability _after_ calling + pci_enable_device(). + +Generic flavors of pci_request_region() are request_mem_region() +(for MMIO ranges) and request_region() (for IO Port ranges). +Use these for address resources that are not described by "normal" PCI +BARs. + +Also see pci_request_selected_regions() below. + + +Set the DMA mask size +--------------------- +.. note:: + If anything below doesn't make sense, please refer to + Documentation/DMA-API.txt. This section is just a reminder that + drivers need to indicate DMA capabilities of the device and is not + an authoritative source for DMA interfaces. + +While all drivers should explicitly indicate the DMA capability +(e.g. 32 or 64 bit) of the PCI bus master, devices with more than +32-bit bus master capability for streaming data need the driver +to "register" this capability by calling pci_set_dma_mask() with +appropriate parameters. In general this allows more efficient DMA +on systems where System RAM exists above 4G _physical_ address. + +Drivers for all PCI-X and PCIe compliant devices must call +pci_set_dma_mask() as they are 64-bit DMA devices. + +Similarly, drivers must also "register" this capability if the device +can directly address "consistent memory" in System RAM above 4G physical +address by calling pci_set_consistent_dma_mask(). +Again, this includes drivers for all PCI-X and PCIe compliant devices. +Many 64-bit "PCI" devices (before PCI-X) and some PCI-X devices are +64-bit DMA capable for payload ("streaming") data but not control +("consistent") data. + + +Setup shared control data +------------------------- +Once the DMA masks are set, the driver can allocate "consistent" (a.k.a. shared) +memory. See Documentation/DMA-API.txt for a full description of +the DMA APIs. This section is just a reminder that it needs to be done +before enabling DMA on the device. + + +Initialize device registers +--------------------------- +Some drivers will need specific "capability" fields programmed +or other "vendor specific" register initialized or reset. +E.g. clearing pending interrupts. + + +Register IRQ handler +-------------------- +While calling request_irq() is the last step described here, +this is often just another intermediate step to initialize a device. +This step can often be deferred until the device is opened for use. + +All interrupt handlers for IRQ lines should be registered with IRQF_SHARED +and use the devid to map IRQs to devices (remember that all PCI IRQ lines +can be shared). + +request_irq() will associate an interrupt handler and device handle +with an interrupt number. Historically interrupt numbers represent +IRQ lines which run from the PCI device to the Interrupt controller. +With MSI and MSI-X (more below) the interrupt number is a CPU "vector". + +request_irq() also enables the interrupt. Make sure the device is +quiesced and does not have any interrupts pending before registering +the interrupt handler. + +MSI and MSI-X are PCI capabilities. Both are "Message Signaled Interrupts" +which deliver interrupts to the CPU via a DMA write to a Local APIC. +The fundamental difference between MSI and MSI-X is how multiple +"vectors" get allocated. MSI requires contiguous blocks of vectors +while MSI-X can allocate several individual ones. + +MSI capability can be enabled by calling pci_alloc_irq_vectors() with the +PCI_IRQ_MSI and/or PCI_IRQ_MSIX flags before calling request_irq(). This +causes the PCI support to program CPU vector data into the PCI device +capability registers. Many architectures, chip-sets, or BIOSes do NOT +support MSI or MSI-X and a call to pci_alloc_irq_vectors with just +the PCI_IRQ_MSI and PCI_IRQ_MSIX flags will fail, so try to always +specify PCI_IRQ_LEGACY as well. + +Drivers that have different interrupt handlers for MSI/MSI-X and +legacy INTx should chose the right one based on the msi_enabled +and msix_enabled flags in the pci_dev structure after calling +pci_alloc_irq_vectors. + +There are (at least) two really good reasons for using MSI: + +1) MSI is an exclusive interrupt vector by definition. + This means the interrupt handler doesn't have to verify + its device caused the interrupt. + +2) MSI avoids DMA/IRQ race conditions. DMA to host memory is guaranteed + to be visible to the host CPU(s) when the MSI is delivered. This + is important for both data coherency and avoiding stale control data. + This guarantee allows the driver to omit MMIO reads to flush + the DMA stream. + +See drivers/infiniband/hw/mthca/ or drivers/net/tg3.c for examples +of MSI/MSI-X usage. + + +PCI device shutdown +=================== + +When a PCI device driver is being unloaded, most of the following +steps need to be performed: + + - Disable the device from generating IRQs + - Release the IRQ (free_irq()) + - Stop all DMA activity + - Release DMA buffers (both streaming and consistent) + - Unregister from other subsystems (e.g. scsi or netdev) + - Disable device from responding to MMIO/IO Port addresses + - Release MMIO/IO Port resource(s) + + +Stop IRQs on the device +----------------------- +How to do this is chip/device specific. If it's not done, it opens +the possibility of a "screaming interrupt" if (and only if) +the IRQ is shared with another device. + +When the shared IRQ handler is "unhooked", the remaining devices +using the same IRQ line will still need the IRQ enabled. Thus if the +"unhooked" device asserts IRQ line, the system will respond assuming +it was one of the remaining devices asserted the IRQ line. Since none +of the other devices will handle the IRQ, the system will "hang" until +it decides the IRQ isn't going to get handled and masks the IRQ (100,000 +iterations later). Once the shared IRQ is masked, the remaining devices +will stop functioning properly. Not a nice situation. + +This is another reason to use MSI or MSI-X if it's available. +MSI and MSI-X are defined to be exclusive interrupts and thus +are not susceptible to the "screaming interrupt" problem. + + +Release the IRQ +--------------- +Once the device is quiesced (no more IRQs), one can call free_irq(). +This function will return control once any pending IRQs are handled, +"unhook" the drivers IRQ handler from that IRQ, and finally release +the IRQ if no one else is using it. + + +Stop all DMA activity +--------------------- +It's extremely important to stop all DMA operations BEFORE attempting +to deallocate DMA control data. Failure to do so can result in memory +corruption, hangs, and on some chip-sets a hard crash. + +Stopping DMA after stopping the IRQs can avoid races where the +IRQ handler might restart DMA engines. + +While this step sounds obvious and trivial, several "mature" drivers +didn't get this step right in the past. + + +Release DMA buffers +------------------- +Once DMA is stopped, clean up streaming DMA first. +I.e. unmap data buffers and return buffers to "upstream" +owners if there is one. + +Then clean up "consistent" buffers which contain the control data. + +See Documentation/DMA-API.txt for details on unmapping interfaces. + + +Unregister from other subsystems +-------------------------------- +Most low level PCI device drivers support some other subsystem +like USB, ALSA, SCSI, NetDev, Infiniband, etc. Make sure your +driver isn't losing resources from that other subsystem. +If this happens, typically the symptom is an Oops (panic) when +the subsystem attempts to call into a driver that has been unloaded. + + +Disable Device from responding to MMIO/IO Port addresses +-------------------------------------------------------- +io_unmap() MMIO or IO Port resources and then call pci_disable_device(). +This is the symmetric opposite of pci_enable_device(). +Do not access device registers after calling pci_disable_device(). + + +Release MMIO/IO Port Resource(s) +-------------------------------- +Call pci_release_region() to mark the MMIO or IO Port range as available. +Failure to do so usually results in the inability to reload the driver. + + +How to access PCI config space +============================== + +You can use `pci_(read|write)_config_(byte|word|dword)` to access the config +space of a device represented by `struct pci_dev *`. All these functions return +0 when successful or an error code (`PCIBIOS_...`) which can be translated to a +text string by pcibios_strerror. Most drivers expect that accesses to valid PCI +devices don't fail. + +If you don't have a struct pci_dev available, you can call +`pci_bus_(read|write)_config_(byte|word|dword)` to access a given device +and function on that bus. + +If you access fields in the standard portion of the config header, please +use symbolic names of locations and bits declared in . + +If you need to access Extended PCI Capability registers, just call +pci_find_capability() for the particular capability and it will find the +corresponding register block for you. + + +Other interesting functions +=========================== + +============================= ================================================ +pci_get_domain_bus_and_slot() Find pci_dev corresponding to given domain, + bus and slot and number. If the device is + found, its reference count is increased. +pci_set_power_state() Set PCI Power Management state (0=D0 ... 3=D3) +pci_find_capability() Find specified capability in device's capability + list. +pci_resource_start() Returns bus start address for a given PCI region +pci_resource_end() Returns bus end address for a given PCI region +pci_resource_len() Returns the byte length of a PCI region +pci_set_drvdata() Set private driver data pointer for a pci_dev +pci_get_drvdata() Return private driver data pointer for a pci_dev +pci_set_mwi() Enable Memory-Write-Invalidate transactions. +pci_clear_mwi() Disable Memory-Write-Invalidate transactions. +============================= ================================================ + + +Miscellaneous hints +=================== + +When displaying PCI device names to the user (for example when a driver wants +to tell the user what card has it found), please use pci_name(pci_dev). + +Always refer to the PCI devices by a pointer to the pci_dev structure. +All PCI layer functions use this identification and it's the only +reasonable one. Don't use bus/slot/function numbers except for very +special purposes -- on systems with multiple primary buses their semantics +can be pretty complex. + +Don't try to turn on Fast Back to Back writes in your driver. All devices +on the bus need to be capable of doing it, so this is something which needs +to be handled by platform and generic code, not individual drivers. + + +Vendor and device identifications +================================= + +Do not add new device or vendor IDs to include/linux/pci_ids.h unless they +are shared across multiple drivers. You can add private definitions in +your driver if they're helpful, or just use plain hex constants. + +The device IDs are arbitrary hex numbers (vendor controlled) and normally used +only in a single location, the pci_device_id table. + +Please DO submit new vendor/device IDs to http://pci-ids.ucw.cz/. +There are mirrors of the pci.ids file at http://pciids.sourceforge.net/ +and https://github.com/pciutils/pciids. + + +Obsolete functions +================== + +There are several functions which you might come across when trying to +port an old driver to the new PCI interface. They are no longer present +in the kernel as they aren't compatible with hotplug or PCI domains or +having sane locking. + +================= =========================================== +pci_find_device() Superseded by pci_get_device() +pci_find_subsys() Superseded by pci_get_subsys() +pci_find_slot() Superseded by pci_get_domain_bus_and_slot() +pci_get_slot() Superseded by pci_get_domain_bus_and_slot() +================= =========================================== + +The alternative is the traditional PCI device driver that walks PCI +device lists. This is still possible but discouraged. + + +MMIO Space and "Write Posting" +============================== + +Converting a driver from using I/O Port space to using MMIO space +often requires some additional changes. Specifically, "write posting" +needs to be handled. Many drivers (e.g. tg3, acenic, sym53c8xx_2) +already do this. I/O Port space guarantees write transactions reach the PCI +device before the CPU can continue. Writes to MMIO space allow the CPU +to continue before the transaction reaches the PCI device. HW weenies +call this "Write Posting" because the write completion is "posted" to +the CPU before the transaction has reached its destination. + +Thus, timing sensitive code should add readl() where the CPU is +expected to wait before doing other work. The classic "bit banging" +sequence works fine for I/O Port space:: + + for (i = 8; --i; val >>= 1) { + outb(val & 1, ioport_reg); /* write bit */ + udelay(10); + } + +The same sequence for MMIO space should be:: + + for (i = 8; --i; val >>= 1) { + writeb(val & 1, mmio_reg); /* write bit */ + readb(safe_mmio_reg); /* flush posted write */ + udelay(10); + } + +It is important that "safe_mmio_reg" not have any side effects that +interferes with the correct operation of the device. + +Another case to watch out for is when resetting a PCI device. Use PCI +Configuration space reads to flush the writel(). This will gracefully +handle the PCI master abort on all platforms if the PCI device is +expected to not respond to a readl(). Most x86 platforms will allow +MMIO reads to master abort (a.k.a. "Soft Fail") and return garbage +(e.g. ~0). But many RISC platforms will crash (a.k.a."Hard Fail"). diff --git a/Documentation/PCI/pci.txt b/Documentation/PCI/pci.txt deleted file mode 100644 index badb26ac33dc..000000000000 --- a/Documentation/PCI/pci.txt +++ /dev/null @@ -1,636 +0,0 @@ - - How To Write Linux PCI Drivers - - by Martin Mares on 07-Feb-2000 - updated by Grant Grundler on 23-Dec-2006 - -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -The world of PCI is vast and full of (mostly unpleasant) surprises. -Since each CPU architecture implements different chip-sets and PCI devices -have different requirements (erm, "features"), the result is the PCI support -in the Linux kernel is not as trivial as one would wish. This short paper -tries to introduce all potential driver authors to Linux APIs for -PCI device drivers. - -A more complete resource is the third edition of "Linux Device Drivers" -by Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman. -LDD3 is available for free (under Creative Commons License) from: - - http://lwn.net/Kernel/LDD3/ - -However, keep in mind that all documents are subject to "bit rot". -Refer to the source code if things are not working as described here. - -Please send questions/comments/patches about Linux PCI API to the -"Linux PCI" mailing list. - - - -0. Structure of PCI drivers -~~~~~~~~~~~~~~~~~~~~~~~~~~~ -PCI drivers "discover" PCI devices in a system via pci_register_driver(). -Actually, it's the other way around. When the PCI generic code discovers -a new device, the driver with a matching "description" will be notified. -Details on this below. - -pci_register_driver() leaves most of the probing for devices to -the PCI layer and supports online insertion/removal of devices [thus -supporting hot-pluggable PCI, CardBus, and Express-Card in a single driver]. -pci_register_driver() call requires passing in a table of function -pointers and thus dictates the high level structure of a driver. - -Once the driver knows about a PCI device and takes ownership, the -driver generally needs to perform the following initialization: - - Enable the device - Request MMIO/IOP resources - Set the DMA mask size (for both coherent and streaming DMA) - Allocate and initialize shared control data (pci_allocate_coherent()) - Access device configuration space (if needed) - Register IRQ handler (request_irq()) - Initialize non-PCI (i.e. LAN/SCSI/etc parts of the chip) - Enable DMA/processing engines - -When done using the device, and perhaps the module needs to be unloaded, -the driver needs to take the follow steps: - Disable the device from generating IRQs - Release the IRQ (free_irq()) - Stop all DMA activity - Release DMA buffers (both streaming and coherent) - Unregister from other subsystems (e.g. scsi or netdev) - Release MMIO/IOP resources - Disable the device - -Most of these topics are covered in the following sections. -For the rest look at LDD3 or . - -If the PCI subsystem is not configured (CONFIG_PCI is not set), most of -the PCI functions described below are defined as inline functions either -completely empty or just returning an appropriate error codes to avoid -lots of ifdefs in the drivers. - - - -1. pci_register_driver() call -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -PCI device drivers call pci_register_driver() during their -initialization with a pointer to a structure describing the driver -(struct pci_driver): - - field name Description - ---------- ------------------------------------------------------ - id_table Pointer to table of device ID's the driver is - interested in. Most drivers should export this - table using MODULE_DEVICE_TABLE(pci,...). - - probe This probing function gets called (during execution - of pci_register_driver() for already existing - devices or later if a new device gets inserted) for - all PCI devices which match the ID table and are not - "owned" by the other drivers yet. This function gets - passed a "struct pci_dev *" for each device whose - entry in the ID table matches the device. The probe - function returns zero when the driver chooses to - take "ownership" of the device or an error code - (negative number) otherwise. - The probe function always gets called from process - context, so it can sleep. - - remove The remove() function gets called whenever a device - being handled by this driver is removed (either during - deregistration of the driver or when it's manually - pulled out of a hot-pluggable slot). - The remove function always gets called from process - context, so it can sleep. - - suspend Put device into low power state. - suspend_late Put device into low power state. - - resume_early Wake device from low power state. - resume Wake device from low power state. - - (Please see Documentation/power/pci.txt for descriptions - of PCI Power Management and the related functions.) - - shutdown Hook into reboot_notifier_list (kernel/sys.c). - Intended to stop any idling DMA operations. - Useful for enabling wake-on-lan (NIC) or changing - the power state of a device before reboot. - e.g. drivers/net/e100.c. - - err_handler See Documentation/PCI/pci-error-recovery.txt - - -The ID table is an array of struct pci_device_id entries ending with an -all-zero entry. Definitions with static const are generally preferred. - -Each entry consists of: - - vendor,device Vendor and device ID to match (or PCI_ANY_ID) - - subvendor, Subsystem vendor and device ID to match (or PCI_ANY_ID) - subdevice, - - class Device class, subclass, and "interface" to match. - See Appendix D of the PCI Local Bus Spec or - include/linux/pci_ids.h for a full list of classes. - Most drivers do not need to specify class/class_mask - as vendor/device is normally sufficient. - - class_mask limit which sub-fields of the class field are compared. - See drivers/scsi/sym53c8xx_2/ for example of usage. - - driver_data Data private to the driver. - Most drivers don't need to use driver_data field. - Best practice is to use driver_data as an index - into a static list of equivalent device types, - instead of using it as a pointer. - - -Most drivers only need PCI_DEVICE() or PCI_DEVICE_CLASS() to set up -a pci_device_id table. - -New PCI IDs may be added to a device driver pci_ids table at runtime -as shown below: - -echo "vendor device subvendor subdevice class class_mask driver_data" > \ -/sys/bus/pci/drivers/{driver}/new_id - -All fields are passed in as hexadecimal values (no leading 0x). -The vendor and device fields are mandatory, the others are optional. Users -need pass only as many optional fields as necessary: - o subvendor and subdevice fields default to PCI_ANY_ID (FFFFFFFF) - o class and classmask fields default to 0 - o driver_data defaults to 0UL. - -Note that driver_data must match the value used by any of the pci_device_id -entries defined in the driver. This makes the driver_data field mandatory -if all the pci_device_id entries have a non-zero driver_data value. - -Once added, the driver probe routine will be invoked for any unclaimed -PCI devices listed in its (newly updated) pci_ids list. - -When the driver exits, it just calls pci_unregister_driver() and the PCI layer -automatically calls the remove hook for all devices handled by the driver. - - -1.1 "Attributes" for driver functions/data - -Please mark the initialization and cleanup functions where appropriate -(the corresponding macros are defined in ): - - __init Initialization code. Thrown away after the driver - initializes. - __exit Exit code. Ignored for non-modular drivers. - -Tips on when/where to use the above attributes: - o The module_init()/module_exit() functions (and all - initialization functions called _only_ from these) - should be marked __init/__exit. - - o Do not mark the struct pci_driver. - - o Do NOT mark a function if you are not sure which mark to use. - Better to not mark the function than mark the function wrong. - - - -2. How to find PCI devices manually -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -PCI drivers should have a really good reason for not using the -pci_register_driver() interface to search for PCI devices. -The main reason PCI devices are controlled by multiple drivers -is because one PCI device implements several different HW services. -E.g. combined serial/parallel port/floppy controller. - -A manual search may be performed using the following constructs: - -Searching by vendor and device ID: - - struct pci_dev *dev = NULL; - while (dev = pci_get_device(VENDOR_ID, DEVICE_ID, dev)) - configure_device(dev); - -Searching by class ID (iterate in a similar way): - - pci_get_class(CLASS_ID, dev) - -Searching by both vendor/device and subsystem vendor/device ID: - - pci_get_subsys(VENDOR_ID,DEVICE_ID, SUBSYS_VENDOR_ID, SUBSYS_DEVICE_ID, dev). - -You can use the constant PCI_ANY_ID as a wildcard replacement for -VENDOR_ID or DEVICE_ID. This allows searching for any device from a -specific vendor, for example. - -These functions are hotplug-safe. They increment the reference count on -the pci_dev that they return. You must eventually (possibly at module unload) -decrement the reference count on these devices by calling pci_dev_put(). - - - -3. Device Initialization Steps -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -As noted in the introduction, most PCI drivers need the following steps -for device initialization: - - Enable the device - Request MMIO/IOP resources - Set the DMA mask size (for both coherent and streaming DMA) - Allocate and initialize shared control data (pci_allocate_coherent()) - Access device configuration space (if needed) - Register IRQ handler (request_irq()) - Initialize non-PCI (i.e. LAN/SCSI/etc parts of the chip) - Enable DMA/processing engines. - -The driver can access PCI config space registers at any time. -(Well, almost. When running BIST, config space can go away...but -that will just result in a PCI Bus Master Abort and config reads -will return garbage). - - -3.1 Enable the PCI device -~~~~~~~~~~~~~~~~~~~~~~~~~ -Before touching any device registers, the driver needs to enable -the PCI device by calling pci_enable_device(). This will: - o wake up the device if it was in suspended state, - o allocate I/O and memory regions of the device (if BIOS did not), - o allocate an IRQ (if BIOS did not). - -NOTE: pci_enable_device() can fail! Check the return value. - -[ OS BUG: we don't check resource allocations before enabling those - resources. The sequence would make more sense if we called - pci_request_resources() before calling pci_enable_device(). - Currently, the device drivers can't detect the bug when when two - devices have been allocated the same range. This is not a common - problem and unlikely to get fixed soon. - - This has been discussed before but not changed as of 2.6.19: - http://lkml.org/lkml/2006/3/2/194 -] - -pci_set_master() will enable DMA by setting the bus master bit -in the PCI_COMMAND register. It also fixes the latency timer value if -it's set to something bogus by the BIOS. pci_clear_master() will -disable DMA by clearing the bus master bit. - -If the PCI device can use the PCI Memory-Write-Invalidate transaction, -call pci_set_mwi(). This enables the PCI_COMMAND bit for Mem-Wr-Inval -and also ensures that the cache line size register is set correctly. -Check the return value of pci_set_mwi() as not all architectures -or chip-sets may support Memory-Write-Invalidate. Alternatively, -if Mem-Wr-Inval would be nice to have but is not required, call -pci_try_set_mwi() to have the system do its best effort at enabling -Mem-Wr-Inval. - - -3.2 Request MMIO/IOP resources -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Memory (MMIO), and I/O port addresses should NOT be read directly -from the PCI device config space. Use the values in the pci_dev structure -as the PCI "bus address" might have been remapped to a "host physical" -address by the arch/chip-set specific kernel support. - -See Documentation/io-mapping.txt for how to access device registers -or device memory. - -The device driver needs to call pci_request_region() to verify -no other device is already using the same address resource. -Conversely, drivers should call pci_release_region() AFTER -calling pci_disable_device(). -The idea is to prevent two devices colliding on the same address range. - -[ See OS BUG comment above. Currently (2.6.19), The driver can only - determine MMIO and IO Port resource availability _after_ calling - pci_enable_device(). ] - -Generic flavors of pci_request_region() are request_mem_region() -(for MMIO ranges) and request_region() (for IO Port ranges). -Use these for address resources that are not described by "normal" PCI -BARs. - -Also see pci_request_selected_regions() below. - - -3.3 Set the DMA mask size -~~~~~~~~~~~~~~~~~~~~~~~~~ -[ If anything below doesn't make sense, please refer to - Documentation/DMA-API.txt. This section is just a reminder that - drivers need to indicate DMA capabilities of the device and is not - an authoritative source for DMA interfaces. ] - -While all drivers should explicitly indicate the DMA capability -(e.g. 32 or 64 bit) of the PCI bus master, devices with more than -32-bit bus master capability for streaming data need the driver -to "register" this capability by calling pci_set_dma_mask() with -appropriate parameters. In general this allows more efficient DMA -on systems where System RAM exists above 4G _physical_ address. - -Drivers for all PCI-X and PCIe compliant devices must call -pci_set_dma_mask() as they are 64-bit DMA devices. - -Similarly, drivers must also "register" this capability if the device -can directly address "consistent memory" in System RAM above 4G physical -address by calling pci_set_consistent_dma_mask(). -Again, this includes drivers for all PCI-X and PCIe compliant devices. -Many 64-bit "PCI" devices (before PCI-X) and some PCI-X devices are -64-bit DMA capable for payload ("streaming") data but not control -("consistent") data. - - -3.4 Setup shared control data -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Once the DMA masks are set, the driver can allocate "consistent" (a.k.a. shared) -memory. See Documentation/DMA-API.txt for a full description of -the DMA APIs. This section is just a reminder that it needs to be done -before enabling DMA on the device. - - -3.5 Initialize device registers -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Some drivers will need specific "capability" fields programmed -or other "vendor specific" register initialized or reset. -E.g. clearing pending interrupts. - - -3.6 Register IRQ handler -~~~~~~~~~~~~~~~~~~~~~~~~ -While calling request_irq() is the last step described here, -this is often just another intermediate step to initialize a device. -This step can often be deferred until the device is opened for use. - -All interrupt handlers for IRQ lines should be registered with IRQF_SHARED -and use the devid to map IRQs to devices (remember that all PCI IRQ lines -can be shared). - -request_irq() will associate an interrupt handler and device handle -with an interrupt number. Historically interrupt numbers represent -IRQ lines which run from the PCI device to the Interrupt controller. -With MSI and MSI-X (more below) the interrupt number is a CPU "vector". - -request_irq() also enables the interrupt. Make sure the device is -quiesced and does not have any interrupts pending before registering -the interrupt handler. - -MSI and MSI-X are PCI capabilities. Both are "Message Signaled Interrupts" -which deliver interrupts to the CPU via a DMA write to a Local APIC. -The fundamental difference between MSI and MSI-X is how multiple -"vectors" get allocated. MSI requires contiguous blocks of vectors -while MSI-X can allocate several individual ones. - -MSI capability can be enabled by calling pci_alloc_irq_vectors() with the -PCI_IRQ_MSI and/or PCI_IRQ_MSIX flags before calling request_irq(). This -causes the PCI support to program CPU vector data into the PCI device -capability registers. Many architectures, chip-sets, or BIOSes do NOT -support MSI or MSI-X and a call to pci_alloc_irq_vectors with just -the PCI_IRQ_MSI and PCI_IRQ_MSIX flags will fail, so try to always -specify PCI_IRQ_LEGACY as well. - -Drivers that have different interrupt handlers for MSI/MSI-X and -legacy INTx should chose the right one based on the msi_enabled -and msix_enabled flags in the pci_dev structure after calling -pci_alloc_irq_vectors. - -There are (at least) two really good reasons for using MSI: -1) MSI is an exclusive interrupt vector by definition. - This means the interrupt handler doesn't have to verify - its device caused the interrupt. - -2) MSI avoids DMA/IRQ race conditions. DMA to host memory is guaranteed - to be visible to the host CPU(s) when the MSI is delivered. This - is important for both data coherency and avoiding stale control data. - This guarantee allows the driver to omit MMIO reads to flush - the DMA stream. - -See drivers/infiniband/hw/mthca/ or drivers/net/tg3.c for examples -of MSI/MSI-X usage. - - - -4. PCI device shutdown -~~~~~~~~~~~~~~~~~~~~~~~ - -When a PCI device driver is being unloaded, most of the following -steps need to be performed: - - Disable the device from generating IRQs - Release the IRQ (free_irq()) - Stop all DMA activity - Release DMA buffers (both streaming and consistent) - Unregister from other subsystems (e.g. scsi or netdev) - Disable device from responding to MMIO/IO Port addresses - Release MMIO/IO Port resource(s) - - -4.1 Stop IRQs on the device -~~~~~~~~~~~~~~~~~~~~~~~~~~~ -How to do this is chip/device specific. If it's not done, it opens -the possibility of a "screaming interrupt" if (and only if) -the IRQ is shared with another device. - -When the shared IRQ handler is "unhooked", the remaining devices -using the same IRQ line will still need the IRQ enabled. Thus if the -"unhooked" device asserts IRQ line, the system will respond assuming -it was one of the remaining devices asserted the IRQ line. Since none -of the other devices will handle the IRQ, the system will "hang" until -it decides the IRQ isn't going to get handled and masks the IRQ (100,000 -iterations later). Once the shared IRQ is masked, the remaining devices -will stop functioning properly. Not a nice situation. - -This is another reason to use MSI or MSI-X if it's available. -MSI and MSI-X are defined to be exclusive interrupts and thus -are not susceptible to the "screaming interrupt" problem. - - -4.2 Release the IRQ -~~~~~~~~~~~~~~~~~~~ -Once the device is quiesced (no more IRQs), one can call free_irq(). -This function will return control once any pending IRQs are handled, -"unhook" the drivers IRQ handler from that IRQ, and finally release -the IRQ if no one else is using it. - - -4.3 Stop all DMA activity -~~~~~~~~~~~~~~~~~~~~~~~~~ -It's extremely important to stop all DMA operations BEFORE attempting -to deallocate DMA control data. Failure to do so can result in memory -corruption, hangs, and on some chip-sets a hard crash. - -Stopping DMA after stopping the IRQs can avoid races where the -IRQ handler might restart DMA engines. - -While this step sounds obvious and trivial, several "mature" drivers -didn't get this step right in the past. - - -4.4 Release DMA buffers -~~~~~~~~~~~~~~~~~~~~~~~ -Once DMA is stopped, clean up streaming DMA first. -I.e. unmap data buffers and return buffers to "upstream" -owners if there is one. - -Then clean up "consistent" buffers which contain the control data. - -See Documentation/DMA-API.txt for details on unmapping interfaces. - - -4.5 Unregister from other subsystems -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Most low level PCI device drivers support some other subsystem -like USB, ALSA, SCSI, NetDev, Infiniband, etc. Make sure your -driver isn't losing resources from that other subsystem. -If this happens, typically the symptom is an Oops (panic) when -the subsystem attempts to call into a driver that has been unloaded. - - -4.6 Disable Device from responding to MMIO/IO Port addresses -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -io_unmap() MMIO or IO Port resources and then call pci_disable_device(). -This is the symmetric opposite of pci_enable_device(). -Do not access device registers after calling pci_disable_device(). - - -4.7 Release MMIO/IO Port Resource(s) -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Call pci_release_region() to mark the MMIO or IO Port range as available. -Failure to do so usually results in the inability to reload the driver. - - - -5. How to access PCI config space -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -You can use pci_(read|write)_config_(byte|word|dword) to access the config -space of a device represented by struct pci_dev *. All these functions return 0 -when successful or an error code (PCIBIOS_...) which can be translated to a text -string by pcibios_strerror. Most drivers expect that accesses to valid PCI -devices don't fail. - -If you don't have a struct pci_dev available, you can call -pci_bus_(read|write)_config_(byte|word|dword) to access a given device -and function on that bus. - -If you access fields in the standard portion of the config header, please -use symbolic names of locations and bits declared in . - -If you need to access Extended PCI Capability registers, just call -pci_find_capability() for the particular capability and it will find the -corresponding register block for you. - - - -6. Other interesting functions -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -pci_get_domain_bus_and_slot() Find pci_dev corresponding to given domain, - bus and slot and number. If the device is - found, its reference count is increased. -pci_set_power_state() Set PCI Power Management state (0=D0 ... 3=D3) -pci_find_capability() Find specified capability in device's capability - list. -pci_resource_start() Returns bus start address for a given PCI region -pci_resource_end() Returns bus end address for a given PCI region -pci_resource_len() Returns the byte length of a PCI region -pci_set_drvdata() Set private driver data pointer for a pci_dev -pci_get_drvdata() Return private driver data pointer for a pci_dev -pci_set_mwi() Enable Memory-Write-Invalidate transactions. -pci_clear_mwi() Disable Memory-Write-Invalidate transactions. - - - -7. Miscellaneous hints -~~~~~~~~~~~~~~~~~~~~~~ - -When displaying PCI device names to the user (for example when a driver wants -to tell the user what card has it found), please use pci_name(pci_dev). - -Always refer to the PCI devices by a pointer to the pci_dev structure. -All PCI layer functions use this identification and it's the only -reasonable one. Don't use bus/slot/function numbers except for very -special purposes -- on systems with multiple primary buses their semantics -can be pretty complex. - -Don't try to turn on Fast Back to Back writes in your driver. All devices -on the bus need to be capable of doing it, so this is something which needs -to be handled by platform and generic code, not individual drivers. - - - -8. Vendor and device identifications -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Do not add new device or vendor IDs to include/linux/pci_ids.h unless they -are shared across multiple drivers. You can add private definitions in -your driver if they're helpful, or just use plain hex constants. - -The device IDs are arbitrary hex numbers (vendor controlled) and normally used -only in a single location, the pci_device_id table. - -Please DO submit new vendor/device IDs to http://pci-ids.ucw.cz/. -There are mirrors of the pci.ids file at http://pciids.sourceforge.net/ -and https://github.com/pciutils/pciids. - - - -9. Obsolete functions -~~~~~~~~~~~~~~~~~~~~~ - -There are several functions which you might come across when trying to -port an old driver to the new PCI interface. They are no longer present -in the kernel as they aren't compatible with hotplug or PCI domains or -having sane locking. - -pci_find_device() Superseded by pci_get_device() -pci_find_subsys() Superseded by pci_get_subsys() -pci_find_slot() Superseded by pci_get_domain_bus_and_slot() -pci_get_slot() Superseded by pci_get_domain_bus_and_slot() - - -The alternative is the traditional PCI device driver that walks PCI -device lists. This is still possible but discouraged. - - - -10. MMIO Space and "Write Posting" -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Converting a driver from using I/O Port space to using MMIO space -often requires some additional changes. Specifically, "write posting" -needs to be handled. Many drivers (e.g. tg3, acenic, sym53c8xx_2) -already do this. I/O Port space guarantees write transactions reach the PCI -device before the CPU can continue. Writes to MMIO space allow the CPU -to continue before the transaction reaches the PCI device. HW weenies -call this "Write Posting" because the write completion is "posted" to -the CPU before the transaction has reached its destination. - -Thus, timing sensitive code should add readl() where the CPU is -expected to wait before doing other work. The classic "bit banging" -sequence works fine for I/O Port space: - - for (i = 8; --i; val >>= 1) { - outb(val & 1, ioport_reg); /* write bit */ - udelay(10); - } - -The same sequence for MMIO space should be: - - for (i = 8; --i; val >>= 1) { - writeb(val & 1, mmio_reg); /* write bit */ - readb(safe_mmio_reg); /* flush posted write */ - udelay(10); - } - -It is important that "safe_mmio_reg" not have any side effects that -interferes with the correct operation of the device. - -Another case to watch out for is when resetting a PCI device. Use PCI -Configuration space reads to flush the writel(). This will gracefully -handle the PCI master abort on all platforms if the PCI device is -expected to not respond to a readl(). Most x86 platforms will allow -MMIO reads to master abort (a.k.a. "Soft Fail") and return garbage -(e.g. ~0). But many RISC platforms will crash (a.k.a."Hard Fail"). - diff --git a/Documentation/PCI/pcieaer-howto.rst b/Documentation/PCI/pcieaer-howto.rst new file mode 100644 index 000000000000..18bdefaafd1a --- /dev/null +++ b/Documentation/PCI/pcieaer-howto.rst @@ -0,0 +1,311 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: + +=========================================================== +The PCI Express Advanced Error Reporting Driver Guide HOWTO +=========================================================== + +:Authors: - T. Long Nguyen + - Yanmin Zhang + +:Copyright: |copy| 2006 Intel Corporation + +Overview +=========== + +About this guide +---------------- + +This guide describes the basics of the PCI Express Advanced Error +Reporting (AER) driver and provides information on how to use it, as +well as how to enable the drivers of endpoint devices to conform with +PCI Express AER driver. + + +What is the PCI Express AER Driver? +----------------------------------- + +PCI Express error signaling can occur on the PCI Express link itself +or on behalf of transactions initiated on the link. PCI Express +defines two error reporting paradigms: the baseline capability and +the Advanced Error Reporting capability. The baseline capability is +required of all PCI Express components providing a minimum defined +set of error reporting requirements. Advanced Error Reporting +capability is implemented with a PCI Express advanced error reporting +extended capability structure providing more robust error reporting. + +The PCI Express AER driver provides the infrastructure to support PCI +Express Advanced Error Reporting capability. The PCI Express AER +driver provides three basic functions: + + - Gathers the comprehensive error information if errors occurred. + - Reports error to the users. + - Performs error recovery actions. + +AER driver only attaches root ports which support PCI-Express AER +capability. + + +User Guide +========== + +Include the PCI Express AER Root Driver into the Linux Kernel +------------------------------------------------------------- + +The PCI Express AER Root driver is a Root Port service driver attached +to the PCI Express Port Bus driver. If a user wants to use it, the driver +has to be compiled. Option CONFIG_PCIEAER supports this capability. It +depends on CONFIG_PCIEPORTBUS, so pls. set CONFIG_PCIEPORTBUS=y and +CONFIG_PCIEAER = y. + +Load PCI Express AER Root Driver +-------------------------------- + +Some systems have AER support in firmware. Enabling Linux AER support at +the same time the firmware handles AER may result in unpredictable +behavior. Therefore, Linux does not handle AER events unless the firmware +grants AER control to the OS via the ACPI _OSC method. See the PCI FW 3.0 +Specification for details regarding _OSC usage. + +AER error output +---------------- + +When a PCIe AER error is captured, an error message will be output to +console. If it's a correctable error, it is output as a warning. +Otherwise, it is printed as an error. So users could choose different +log level to filter out correctable error messages. + +Below shows an example:: + + 0000:50:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0500(Requester ID) + 0000:50:00.0: device [8086:0329] error status/mask=00100000/00000000 + 0000:50:00.0: [20] Unsupported Request (First) + 0000:50:00.0: TLP Header: 04000001 00200a03 05010000 00050100 + +In the example, 'Requester ID' means the ID of the device who sends +the error message to root port. Pls. refer to pci express specs for +other fields. + +AER Statistics / Counters +------------------------- + +When PCIe AER errors are captured, the counters / statistics are also exposed +in the form of sysfs attributes which are documented at +Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats + +Developer Guide +=============== + +To enable AER aware support requires a software driver to configure +the AER capability structure within its device and to provide callbacks. + +To support AER better, developers need understand how AER does work +firstly. + +PCI Express errors are classified into two types: correctable errors +and uncorrectable errors. This classification is based on the impacts +of those errors, which may result in degraded performance or function +failure. + +Correctable errors pose no impacts on the functionality of the +interface. The PCI Express protocol can recover without any software +intervention or any loss of data. These errors are detected and +corrected by hardware. Unlike correctable errors, uncorrectable +errors impact functionality of the interface. Uncorrectable errors +can cause a particular transaction or a particular PCI Express link +to be unreliable. Depending on those error conditions, uncorrectable +errors are further classified into non-fatal errors and fatal errors. +Non-fatal errors cause the particular transaction to be unreliable, +but the PCI Express link itself is fully functional. Fatal errors, on +the other hand, cause the link to be unreliable. + +When AER is enabled, a PCI Express device will automatically send an +error message to the PCIe root port above it when the device captures +an error. The Root Port, upon receiving an error reporting message, +internally processes and logs the error message in its PCI Express +capability structure. Error information being logged includes storing +the error reporting agent's requestor ID into the Error Source +Identification Registers and setting the error bits of the Root Error +Status Register accordingly. If AER error reporting is enabled in Root +Error Command Register, the Root Port generates an interrupt if an +error is detected. + +Note that the errors as described above are related to the PCI Express +hierarchy and links. These errors do not include any device specific +errors because device specific errors will still get sent directly to +the device driver. + +Configure the AER capability structure +-------------------------------------- + +AER aware drivers of PCI Express component need change the device +control registers to enable AER. They also could change AER registers, +including mask and severity registers. Helper function +pci_enable_pcie_error_reporting could be used to enable AER. See +section 3.3. + +Provide callbacks +----------------- + +callback reset_link to reset pci express link +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +This callback is used to reset the pci express physical link when a +fatal error happens. The root port aer service driver provides a +default reset_link function, but different upstream ports might +have different specifications to reset pci express link, so all +upstream ports should provide their own reset_link functions. + +In struct pcie_port_service_driver, a new pointer, reset_link, is +added. +:: + + pci_ers_result_t (*reset_link) (struct pci_dev *dev); + +Section 3.2.2.2 provides more detailed info on when to call +reset_link. + +PCI error-recovery callbacks +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The PCI Express AER Root driver uses error callbacks to coordinate +with downstream device drivers associated with a hierarchy in question +when performing error recovery actions. + +Data struct pci_driver has a pointer, err_handler, to point to +pci_error_handlers who consists of a couple of callback function +pointers. AER driver follows the rules defined in +pci-error-recovery.txt except pci express specific parts (e.g. +reset_link). Pls. refer to pci-error-recovery.txt for detailed +definitions of the callbacks. + +Below sections specify when to call the error callback functions. + +Correctable errors +~~~~~~~~~~~~~~~~~~ + +Correctable errors pose no impacts on the functionality of +the interface. The PCI Express protocol can recover without any +software intervention or any loss of data. These errors do not +require any recovery actions. The AER driver clears the device's +correctable error status register accordingly and logs these errors. + +Non-correctable (non-fatal and fatal) errors +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If an error message indicates a non-fatal error, performing link reset +at upstream is not required. The AER driver calls error_detected(dev, +pci_channel_io_normal) to all drivers associated within a hierarchy in +question. for example:: + + EndPoint<==>DownstreamPort B<==>UpstreamPort A<==>RootPort + +If Upstream port A captures an AER error, the hierarchy consists of +Downstream port B and EndPoint. + +A driver may return PCI_ERS_RESULT_CAN_RECOVER, +PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on +whether it can recover or the AER driver calls mmio_enabled as next. + +If an error message indicates a fatal error, kernel will broadcast +error_detected(dev, pci_channel_io_frozen) to all drivers within +a hierarchy in question. Then, performing link reset at upstream is +necessary. As different kinds of devices might use different approaches +to reset link, AER port service driver is required to provide the +function to reset link. Firstly, kernel looks for if the upstream +component has an aer driver. If it has, kernel uses the reset_link +callback of the aer driver. If the upstream component has no aer driver +and the port is downstream port, we will perform a hot reset as the +default by setting the Secondary Bus Reset bit of the Bridge Control +register associated with the downstream port. As for upstream ports, +they should provide their own aer service drivers with reset_link +function. If error_detected returns PCI_ERS_RESULT_CAN_RECOVER and +reset_link returns PCI_ERS_RESULT_RECOVERED, the error handling goes +to mmio_enabled. + +helper functions +---------------- +:: + + int pci_enable_pcie_error_reporting(struct pci_dev *dev); + +pci_enable_pcie_error_reporting enables the device to send error +messages to root port when an error is detected. Note that devices +don't enable the error reporting by default, so device drivers need +call this function to enable it. + +:: + + int pci_disable_pcie_error_reporting(struct pci_dev *dev); + +pci_disable_pcie_error_reporting disables the device to send error +messages to root port when an error is detected. + +:: + + int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev);` + +pci_cleanup_aer_uncorrect_error_status cleanups the uncorrectable +error status register. + +Frequent Asked Questions +------------------------ + +Q: + What happens if a PCI Express device driver does not provide an + error recovery handler (pci_driver->err_handler is equal to NULL)? + +A: + The devices attached with the driver won't be recovered. If the + error is fatal, kernel will print out warning messages. Please refer + to section 3 for more information. + +Q: + What happens if an upstream port service driver does not provide + callback reset_link? + +A: + Fatal error recovery will fail if the errors are reported by the + upstream ports who are attached by the service driver. + +Q: + How does this infrastructure deal with driver that is not PCI + Express aware? + +A: + This infrastructure calls the error callback functions of the + driver when an error happens. But if the driver is not aware of + PCI Express, the device might not report its own errors to root + port. + +Q: + What modifications will that driver need to make it compatible + with the PCI Express AER Root driver? + +A: + It could call the helper functions to enable AER in devices and + cleanup uncorrectable status register. Pls. refer to section 3.3. + + +Software error injection +======================== + +Debugging PCIe AER error recovery code is quite difficult because it +is hard to trigger real hardware errors. Software based error +injection can be used to fake various kinds of PCIe errors. + +First you should enable PCIe AER software error injection in kernel +configuration, that is, following item should be in your .config. + +CONFIG_PCIEAER_INJECT=y or CONFIG_PCIEAER_INJECT=m + +After reboot with new kernel or insert the module, a device file named +/dev/aer_inject should be created. + +Then, you need a user space tool named aer-inject, which can be gotten +from: + + https://git.kernel.org/cgit/linux/kernel/git/gong.chen/aer-inject.git/ + +More information about aer-inject can be found in the document comes +with its source code. diff --git a/Documentation/PCI/pcieaer-howto.txt b/Documentation/PCI/pcieaer-howto.txt deleted file mode 100644 index 48ce7903e3c6..000000000000 --- a/Documentation/PCI/pcieaer-howto.txt +++ /dev/null @@ -1,267 +0,0 @@ - The PCI Express Advanced Error Reporting Driver Guide HOWTO - T. Long Nguyen - Yanmin Zhang - 07/29/2006 - - -1. Overview - -1.1 About this guide - -This guide describes the basics of the PCI Express Advanced Error -Reporting (AER) driver and provides information on how to use it, as -well as how to enable the drivers of endpoint devices to conform with -PCI Express AER driver. - -1.2 Copyright (C) Intel Corporation 2006. - -1.3 What is the PCI Express AER Driver? - -PCI Express error signaling can occur on the PCI Express link itself -or on behalf of transactions initiated on the link. PCI Express -defines two error reporting paradigms: the baseline capability and -the Advanced Error Reporting capability. The baseline capability is -required of all PCI Express components providing a minimum defined -set of error reporting requirements. Advanced Error Reporting -capability is implemented with a PCI Express advanced error reporting -extended capability structure providing more robust error reporting. - -The PCI Express AER driver provides the infrastructure to support PCI -Express Advanced Error Reporting capability. The PCI Express AER -driver provides three basic functions: - -- Gathers the comprehensive error information if errors occurred. -- Reports error to the users. -- Performs error recovery actions. - -AER driver only attaches root ports which support PCI-Express AER -capability. - - -2. User Guide - -2.1 Include the PCI Express AER Root Driver into the Linux Kernel - -The PCI Express AER Root driver is a Root Port service driver attached -to the PCI Express Port Bus driver. If a user wants to use it, the driver -has to be compiled. Option CONFIG_PCIEAER supports this capability. It -depends on CONFIG_PCIEPORTBUS, so pls. set CONFIG_PCIEPORTBUS=y and -CONFIG_PCIEAER = y. - -2.2 Load PCI Express AER Root Driver - -Some systems have AER support in firmware. Enabling Linux AER support at -the same time the firmware handles AER may result in unpredictable -behavior. Therefore, Linux does not handle AER events unless the firmware -grants AER control to the OS via the ACPI _OSC method. See the PCI FW 3.0 -Specification for details regarding _OSC usage. - -2.3 AER error output - -When a PCIe AER error is captured, an error message will be output to -console. If it's a correctable error, it is output as a warning. -Otherwise, it is printed as an error. So users could choose different -log level to filter out correctable error messages. - -Below shows an example: -0000:50:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0500(Requester ID) -0000:50:00.0: device [8086:0329] error status/mask=00100000/00000000 -0000:50:00.0: [20] Unsupported Request (First) -0000:50:00.0: TLP Header: 04000001 00200a03 05010000 00050100 - -In the example, 'Requester ID' means the ID of the device who sends -the error message to root port. Pls. refer to pci express specs for -other fields. - -2.4 AER Statistics / Counters - -When PCIe AER errors are captured, the counters / statistics are also exposed -in the form of sysfs attributes which are documented at -Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats - -3. Developer Guide - -To enable AER aware support requires a software driver to configure -the AER capability structure within its device and to provide callbacks. - -To support AER better, developers need understand how AER does work -firstly. - -PCI Express errors are classified into two types: correctable errors -and uncorrectable errors. This classification is based on the impacts -of those errors, which may result in degraded performance or function -failure. - -Correctable errors pose no impacts on the functionality of the -interface. The PCI Express protocol can recover without any software -intervention or any loss of data. These errors are detected and -corrected by hardware. Unlike correctable errors, uncorrectable -errors impact functionality of the interface. Uncorrectable errors -can cause a particular transaction or a particular PCI Express link -to be unreliable. Depending on those error conditions, uncorrectable -errors are further classified into non-fatal errors and fatal errors. -Non-fatal errors cause the particular transaction to be unreliable, -but the PCI Express link itself is fully functional. Fatal errors, on -the other hand, cause the link to be unreliable. - -When AER is enabled, a PCI Express device will automatically send an -error message to the PCIe root port above it when the device captures -an error. The Root Port, upon receiving an error reporting message, -internally processes and logs the error message in its PCI Express -capability structure. Error information being logged includes storing -the error reporting agent's requestor ID into the Error Source -Identification Registers and setting the error bits of the Root Error -Status Register accordingly. If AER error reporting is enabled in Root -Error Command Register, the Root Port generates an interrupt if an -error is detected. - -Note that the errors as described above are related to the PCI Express -hierarchy and links. These errors do not include any device specific -errors because device specific errors will still get sent directly to -the device driver. - -3.1 Configure the AER capability structure - -AER aware drivers of PCI Express component need change the device -control registers to enable AER. They also could change AER registers, -including mask and severity registers. Helper function -pci_enable_pcie_error_reporting could be used to enable AER. See -section 3.3. - -3.2. Provide callbacks - -3.2.1 callback reset_link to reset pci express link - -This callback is used to reset the pci express physical link when a -fatal error happens. The root port aer service driver provides a -default reset_link function, but different upstream ports might -have different specifications to reset pci express link, so all -upstream ports should provide their own reset_link functions. - -In struct pcie_port_service_driver, a new pointer, reset_link, is -added. - -pci_ers_result_t (*reset_link) (struct pci_dev *dev); - -Section 3.2.2.2 provides more detailed info on when to call -reset_link. - -3.2.2 PCI error-recovery callbacks - -The PCI Express AER Root driver uses error callbacks to coordinate -with downstream device drivers associated with a hierarchy in question -when performing error recovery actions. - -Data struct pci_driver has a pointer, err_handler, to point to -pci_error_handlers who consists of a couple of callback function -pointers. AER driver follows the rules defined in -pci-error-recovery.txt except pci express specific parts (e.g. -reset_link). Pls. refer to pci-error-recovery.txt for detailed -definitions of the callbacks. - -Below sections specify when to call the error callback functions. - -3.2.2.1 Correctable errors - -Correctable errors pose no impacts on the functionality of -the interface. The PCI Express protocol can recover without any -software intervention or any loss of data. These errors do not -require any recovery actions. The AER driver clears the device's -correctable error status register accordingly and logs these errors. - -3.2.2.2 Non-correctable (non-fatal and fatal) errors - -If an error message indicates a non-fatal error, performing link reset -at upstream is not required. The AER driver calls error_detected(dev, -pci_channel_io_normal) to all drivers associated within a hierarchy in -question. for example, -EndPoint<==>DownstreamPort B<==>UpstreamPort A<==>RootPort. -If Upstream port A captures an AER error, the hierarchy consists of -Downstream port B and EndPoint. - -A driver may return PCI_ERS_RESULT_CAN_RECOVER, -PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on -whether it can recover or the AER driver calls mmio_enabled as next. - -If an error message indicates a fatal error, kernel will broadcast -error_detected(dev, pci_channel_io_frozen) to all drivers within -a hierarchy in question. Then, performing link reset at upstream is -necessary. As different kinds of devices might use different approaches -to reset link, AER port service driver is required to provide the -function to reset link. Firstly, kernel looks for if the upstream -component has an aer driver. If it has, kernel uses the reset_link -callback of the aer driver. If the upstream component has no aer driver -and the port is downstream port, we will perform a hot reset as the -default by setting the Secondary Bus Reset bit of the Bridge Control -register associated with the downstream port. As for upstream ports, -they should provide their own aer service drivers with reset_link -function. If error_detected returns PCI_ERS_RESULT_CAN_RECOVER and -reset_link returns PCI_ERS_RESULT_RECOVERED, the error handling goes -to mmio_enabled. - -3.3 helper functions - -3.3.1 int pci_enable_pcie_error_reporting(struct pci_dev *dev); -pci_enable_pcie_error_reporting enables the device to send error -messages to root port when an error is detected. Note that devices -don't enable the error reporting by default, so device drivers need -call this function to enable it. - -3.3.2 int pci_disable_pcie_error_reporting(struct pci_dev *dev); -pci_disable_pcie_error_reporting disables the device to send error -messages to root port when an error is detected. - -3.3.3 int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev); -pci_cleanup_aer_uncorrect_error_status cleanups the uncorrectable -error status register. - -3.4 Frequent Asked Questions - -Q: What happens if a PCI Express device driver does not provide an -error recovery handler (pci_driver->err_handler is equal to NULL)? - -A: The devices attached with the driver won't be recovered. If the -error is fatal, kernel will print out warning messages. Please refer -to section 3 for more information. - -Q: What happens if an upstream port service driver does not provide -callback reset_link? - -A: Fatal error recovery will fail if the errors are reported by the -upstream ports who are attached by the service driver. - -Q: How does this infrastructure deal with driver that is not PCI -Express aware? - -A: This infrastructure calls the error callback functions of the -driver when an error happens. But if the driver is not aware of -PCI Express, the device might not report its own errors to root -port. - -Q: What modifications will that driver need to make it compatible -with the PCI Express AER Root driver? - -A: It could call the helper functions to enable AER in devices and -cleanup uncorrectable status register. Pls. refer to section 3.3. - - -4. Software error injection - -Debugging PCIe AER error recovery code is quite difficult because it -is hard to trigger real hardware errors. Software based error -injection can be used to fake various kinds of PCIe errors. - -First you should enable PCIe AER software error injection in kernel -configuration, that is, following item should be in your .config. - -CONFIG_PCIEAER_INJECT=y or CONFIG_PCIEAER_INJECT=m - -After reboot with new kernel or insert the module, a device file named -/dev/aer_inject should be created. - -Then, you need a user space tool named aer-inject, which can be gotten -from: - https://git.kernel.org/cgit/linux/kernel/git/gong.chen/aer-inject.git/ - -More information about aer-inject can be found in the document comes -with its source code. diff --git a/Documentation/PCI/picebus-howto.rst b/Documentation/PCI/picebus-howto.rst new file mode 100644 index 000000000000..f882ff62c51f --- /dev/null +++ b/Documentation/PCI/picebus-howto.rst @@ -0,0 +1,220 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: + +=========================================== +The PCI Express Port Bus Driver Guide HOWTO +=========================================== + +:Author: Tom L Nguyen tom.l.nguyen@intel.com 11/03/2004 +:Copyright: |copy| 2004 Intel Corporation + +About this guide +================ + +This guide describes the basics of the PCI Express Port Bus driver +and provides information on how to enable the service drivers to +register/unregister with the PCI Express Port Bus Driver. + + +What is the PCI Express Port Bus Driver +======================================= + +A PCI Express Port is a logical PCI-PCI Bridge structure. There +are two types of PCI Express Port: the Root Port and the Switch +Port. The Root Port originates a PCI Express link from a PCI Express +Root Complex and the Switch Port connects PCI Express links to +internal logical PCI buses. The Switch Port, which has its secondary +bus representing the switch's internal routing logic, is called the +switch's Upstream Port. The switch's Downstream Port is bridging from +switch's internal routing bus to a bus representing the downstream +PCI Express link from the PCI Express Switch. + +A PCI Express Port can provide up to four distinct functions, +referred to in this document as services, depending on its port type. +PCI Express Port's services include native hotplug support (HP), +power management event support (PME), advanced error reporting +support (AER), and virtual channel support (VC). These services may +be handled by a single complex driver or be individually distributed +and handled by corresponding service drivers. + +Why use the PCI Express Port Bus Driver? +======================================== + +In existing Linux kernels, the Linux Device Driver Model allows a +physical device to be handled by only a single driver. The PCI +Express Port is a PCI-PCI Bridge device with multiple distinct +services. To maintain a clean and simple solution each service +may have its own software service driver. In this case several +service drivers will compete for a single PCI-PCI Bridge device. +For example, if the PCI Express Root Port native hotplug service +driver is loaded first, it claims a PCI-PCI Bridge Root Port. The +kernel therefore does not load other service drivers for that Root +Port. In other words, it is impossible to have multiple service +drivers load and run on a PCI-PCI Bridge device simultaneously +using the current driver model. + +To enable multiple service drivers running simultaneously requires +having a PCI Express Port Bus driver, which manages all populated +PCI Express Ports and distributes all provided service requests +to the corresponding service drivers as required. Some key +advantages of using the PCI Express Port Bus driver are listed below: + + - Allow multiple service drivers to run simultaneously on + a PCI-PCI Bridge Port device. + + - Allow service drivers implemented in an independent + staged approach. + + - Allow one service driver to run on multiple PCI-PCI Bridge + Port devices. + + - Manage and distribute resources of a PCI-PCI Bridge Port + device to requested service drivers. + +Configuring the PCI Express Port Bus Driver vs. Service Drivers +=============================================================== + +Including the PCI Express Port Bus Driver Support into the Kernel +----------------------------------------------------------------- + +Including the PCI Express Port Bus driver depends on whether the PCI +Express support is included in the kernel config. The kernel will +automatically include the PCI Express Port Bus driver as a kernel +driver when the PCI Express support is enabled in the kernel. + +Enabling Service Driver Support +------------------------------- + +PCI device drivers are implemented based on Linux Device Driver Model. +All service drivers are PCI device drivers. As discussed above, it is +impossible to load any service driver once the kernel has loaded the +PCI Express Port Bus Driver. To meet the PCI Express Port Bus Driver +Model requires some minimal changes on existing service drivers that +imposes no impact on the functionality of existing service drivers. + +A service driver is required to use the two APIs shown below to +register its service with the PCI Express Port Bus driver (see +section 5.2.1 & 5.2.2). It is important that a service driver +initializes the pcie_port_service_driver data structure, included in +header file /include/linux/pcieport_if.h, before calling these APIs. +Failure to do so will result an identity mismatch, which prevents +the PCI Express Port Bus driver from loading a service driver. + +pcie_port_service_register +~~~~~~~~~~~~~~~~~~~~~~~~~~ +:: + + int pcie_port_service_register(struct pcie_port_service_driver *new) + +This API replaces the Linux Driver Model's pci_register_driver API. A +service driver should always calls pcie_port_service_register at +module init. Note that after service driver being loaded, calls +such as pci_enable_device(dev) and pci_set_master(dev) are no longer +necessary since these calls are executed by the PCI Port Bus driver. + +pcie_port_service_unregister +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +:: + + void pcie_port_service_unregister(struct pcie_port_service_driver *new) + +pcie_port_service_unregister replaces the Linux Driver Model's +pci_unregister_driver. It's always called by service driver when a +module exits. + +Sample Code +~~~~~~~~~~~ + +Below is sample service driver code to initialize the port service +driver data structure. +:: + + static struct pcie_port_service_id service_id[] = { { + .vendor = PCI_ANY_ID, + .device = PCI_ANY_ID, + .port_type = PCIE_RC_PORT, + .service_type = PCIE_PORT_SERVICE_AER, + }, { /* end: all zeroes */ } + }; + + static struct pcie_port_service_driver root_aerdrv = { + .name = (char *)device_name, + .id_table = &service_id[0], + + .probe = aerdrv_load, + .remove = aerdrv_unload, + + .suspend = aerdrv_suspend, + .resume = aerdrv_resume, + }; + +Below is a sample code for registering/unregistering a service +driver. +:: + + static int __init aerdrv_service_init(void) + { + int retval = 0; + + retval = pcie_port_service_register(&root_aerdrv); + if (!retval) { + /* + * FIX ME + */ + } + return retval; + } + + static void __exit aerdrv_service_exit(void) + { + pcie_port_service_unregister(&root_aerdrv); + } + + module_init(aerdrv_service_init); + module_exit(aerdrv_service_exit); + +Possible Resource Conflicts +=========================== + +Since all service drivers of a PCI-PCI Bridge Port device are +allowed to run simultaneously, below lists a few of possible resource +conflicts with proposed solutions. + +MSI and MSI-X Vector Resource +----------------------------- + +Once MSI or MSI-X interrupts are enabled on a device, it stays in this +mode until they are disabled again. Since service drivers of the same +PCI-PCI Bridge port share the same physical device, if an individual +service driver enables or disables MSI/MSI-X mode it may result +unpredictable behavior. + +To avoid this situation all service drivers are not permitted to +switch interrupt mode on its device. The PCI Express Port Bus driver +is responsible for determining the interrupt mode and this should be +transparent to service drivers. Service drivers need to know only +the vector IRQ assigned to the field irq of struct pcie_device, which +is passed in when the PCI Express Port Bus driver probes each service +driver. Service drivers should use (struct pcie_device*)dev->irq to +call request_irq/free_irq. In addition, the interrupt mode is stored +in the field interrupt_mode of struct pcie_device. + +PCI Memory/IO Mapped Regions +---------------------------- + +Service drivers for PCI Express Power Management (PME), Advanced +Error Reporting (AER), Hot-Plug (HP) and Virtual Channel (VC) access +PCI configuration space on the PCI Express port. In all cases the +registers accessed are independent of each other. This patch assumes +that all service drivers will be well behaved and not overwrite +other service driver's configuration settings. + +PCI Config Registers +-------------------- + +Each service driver runs its PCI config operations on its own +capability structure except the PCI Express capability structure, in +which Root Control register and Device Control register are shared +between PME and AER. This patch assumes that all service drivers +will be well behaved and not overwrite other service driver's +configuration settings. diff --git a/Documentation/RCU/UP.rst b/Documentation/RCU/UP.rst new file mode 100644 index 000000000000..e26dda27430c --- /dev/null +++ b/Documentation/RCU/UP.rst @@ -0,0 +1,143 @@ +.. _up_doc: + +RCU on Uniprocessor Systems +=========================== + +A common misconception is that, on UP systems, the call_rcu() primitive +may immediately invoke its function. The basis of this misconception +is that since there is only one CPU, it should not be necessary to +wait for anything else to get done, since there are no other CPUs for +anything else to be happening on. Although this approach will *sort of* +work a surprising amount of the time, it is a very bad idea in general. +This document presents three examples that demonstrate exactly how bad +an idea this is. + +Example 1: softirq Suicide +-------------------------- + +Suppose that an RCU-based algorithm scans a linked list containing +elements A, B, and C in process context, and can delete elements from +this same list in softirq context. Suppose that the process-context scan +is referencing element B when it is interrupted by softirq processing, +which deletes element B, and then invokes call_rcu() to free element B +after a grace period. + +Now, if call_rcu() were to directly invoke its arguments, then upon return +from softirq, the list scan would find itself referencing a newly freed +element B. This situation can greatly decrease the life expectancy of +your kernel. + +This same problem can occur if call_rcu() is invoked from a hardware +interrupt handler. + +Example 2: Function-Call Fatality +--------------------------------- + +Of course, one could avert the suicide described in the preceding example +by having call_rcu() directly invoke its arguments only if it was called +from process context. However, this can fail in a similar manner. + +Suppose that an RCU-based algorithm again scans a linked list containing +elements A, B, and C in process contexts, but that it invokes a function +on each element as it is scanned. Suppose further that this function +deletes element B from the list, then passes it to call_rcu() for deferred +freeing. This may be a bit unconventional, but it is perfectly legal +RCU usage, since call_rcu() must wait for a grace period to elapse. +Therefore, in this case, allowing call_rcu() to immediately invoke +its arguments would cause it to fail to make the fundamental guarantee +underlying RCU, namely that call_rcu() defers invoking its arguments until +all RCU read-side critical sections currently executing have completed. + +Quick Quiz #1: + Why is it *not* legal to invoke synchronize_rcu() in this case? + +:ref:`Answers to Quick Quiz ` + +Example 3: Death by Deadlock +---------------------------- + +Suppose that call_rcu() is invoked while holding a lock, and that the +callback function must acquire this same lock. In this case, if +call_rcu() were to directly invoke the callback, the result would +be self-deadlock. + +In some cases, it would possible to restructure to code so that +the call_rcu() is delayed until after the lock is released. However, +there are cases where this can be quite ugly: + +1. If a number of items need to be passed to call_rcu() within + the same critical section, then the code would need to create + a list of them, then traverse the list once the lock was + released. + +2. In some cases, the lock will be held across some kernel API, + so that delaying the call_rcu() until the lock is released + requires that the data item be passed up via a common API. + It is far better to guarantee that callbacks are invoked + with no locks held than to have to modify such APIs to allow + arbitrary data items to be passed back up through them. + +If call_rcu() directly invokes the callback, painful locking restrictions +or API changes would be required. + +Quick Quiz #2: + What locking restriction must RCU callbacks respect? + +:ref:`Answers to Quick Quiz ` + +Summary +------- + +Permitting call_rcu() to immediately invoke its arguments breaks RCU, +even on a UP system. So do not do it! Even on a UP system, the RCU +infrastructure *must* respect grace periods, and *must* invoke callbacks +from a known environment in which no locks are held. + +Note that it *is* safe for synchronize_rcu() to return immediately on +UP systems, including PREEMPT SMP builds running on UP systems. + +Quick Quiz #3: + Why can't synchronize_rcu() return immediately on UP systems running + preemptable RCU? + +.. _answer_quick_quiz_up: + +Answer to Quick Quiz #1: + Why is it *not* legal to invoke synchronize_rcu() in this case? + + Because the calling function is scanning an RCU-protected linked + list, and is therefore within an RCU read-side critical section. + Therefore, the called function has been invoked within an RCU + read-side critical section, and is not permitted to block. + +Answer to Quick Quiz #2: + What locking restriction must RCU callbacks respect? + + Any lock that is acquired within an RCU callback must be acquired + elsewhere using an _bh variant of the spinlock primitive. + For example, if "mylock" is acquired by an RCU callback, then + a process-context acquisition of this lock must use something + like spin_lock_bh() to acquire the lock. Please note that + it is also OK to use _irq variants of spinlocks, for example, + spin_lock_irqsave(). + + If the process-context code were to simply use spin_lock(), + then, since RCU callbacks can be invoked from softirq context, + the callback might be called from a softirq that interrupted + the process-context critical section. This would result in + self-deadlock. + + This restriction might seem gratuitous, since very few RCU + callbacks acquire locks directly. However, a great many RCU + callbacks do acquire locks *indirectly*, for example, via + the kfree() primitive. + +Answer to Quick Quiz #3: + Why can't synchronize_rcu() return immediately on UP systems + running preemptable RCU? + + Because some other task might have been preempted in the middle + of an RCU read-side critical section. If synchronize_rcu() + simply immediately returned, it would prematurely signal the + end of the grace period, which would come as a nasty shock to + that other thread when it started running again. diff --git a/Documentation/RCU/UP.txt b/Documentation/RCU/UP.txt deleted file mode 100644 index 53bde717017b..000000000000 --- a/Documentation/RCU/UP.txt +++ /dev/null @@ -1,133 +0,0 @@ -RCU on Uniprocessor Systems - - -A common misconception is that, on UP systems, the call_rcu() primitive -may immediately invoke its function. The basis of this misconception -is that since there is only one CPU, it should not be necessary to -wait for anything else to get done, since there are no other CPUs for -anything else to be happening on. Although this approach will -sort- -of- -work a surprising amount of the time, it is a very bad idea in general. -This document presents three examples that demonstrate exactly how bad -an idea this is. - - -Example 1: softirq Suicide - -Suppose that an RCU-based algorithm scans a linked list containing -elements A, B, and C in process context, and can delete elements from -this same list in softirq context. Suppose that the process-context scan -is referencing element B when it is interrupted by softirq processing, -which deletes element B, and then invokes call_rcu() to free element B -after a grace period. - -Now, if call_rcu() were to directly invoke its arguments, then upon return -from softirq, the list scan would find itself referencing a newly freed -element B. This situation can greatly decrease the life expectancy of -your kernel. - -This same problem can occur if call_rcu() is invoked from a hardware -interrupt handler. - - -Example 2: Function-Call Fatality - -Of course, one could avert the suicide described in the preceding example -by having call_rcu() directly invoke its arguments only if it was called -from process context. However, this can fail in a similar manner. - -Suppose that an RCU-based algorithm again scans a linked list containing -elements A, B, and C in process contexts, but that it invokes a function -on each element as it is scanned. Suppose further that this function -deletes element B from the list, then passes it to call_rcu() for deferred -freeing. This may be a bit unconventional, but it is perfectly legal -RCU usage, since call_rcu() must wait for a grace period to elapse. -Therefore, in this case, allowing call_rcu() to immediately invoke -its arguments would cause it to fail to make the fundamental guarantee -underlying RCU, namely that call_rcu() defers invoking its arguments until -all RCU read-side critical sections currently executing have completed. - -Quick Quiz #1: why is it -not- legal to invoke synchronize_rcu() in - this case? - - -Example 3: Death by Deadlock - -Suppose that call_rcu() is invoked while holding a lock, and that the -callback function must acquire this same lock. In this case, if -call_rcu() were to directly invoke the callback, the result would -be self-deadlock. - -In some cases, it would possible to restructure to code so that -the call_rcu() is delayed until after the lock is released. However, -there are cases where this can be quite ugly: - -1. If a number of items need to be passed to call_rcu() within - the same critical section, then the code would need to create - a list of them, then traverse the list once the lock was - released. - -2. In some cases, the lock will be held across some kernel API, - so that delaying the call_rcu() until the lock is released - requires that the data item be passed up via a common API. - It is far better to guarantee that callbacks are invoked - with no locks held than to have to modify such APIs to allow - arbitrary data items to be passed back up through them. - -If call_rcu() directly invokes the callback, painful locking restrictions -or API changes would be required. - -Quick Quiz #2: What locking restriction must RCU callbacks respect? - - -Summary - -Permitting call_rcu() to immediately invoke its arguments breaks RCU, -even on a UP system. So do not do it! Even on a UP system, the RCU -infrastructure -must- respect grace periods, and -must- invoke callbacks -from a known environment in which no locks are held. - -Note that it -is- safe for synchronize_rcu() to return immediately on -UP systems, including !PREEMPT SMP builds running on UP systems. - -Quick Quiz #3: Why can't synchronize_rcu() return immediately on - UP systems running preemptable RCU? - - -Answer to Quick Quiz #1: - Why is it -not- legal to invoke synchronize_rcu() in this case? - - Because the calling function is scanning an RCU-protected linked - list, and is therefore within an RCU read-side critical section. - Therefore, the called function has been invoked within an RCU - read-side critical section, and is not permitted to block. - -Answer to Quick Quiz #2: - What locking restriction must RCU callbacks respect? - - Any lock that is acquired within an RCU callback must be - acquired elsewhere using an _irq variant of the spinlock - primitive. For example, if "mylock" is acquired by an - RCU callback, then a process-context acquisition of this - lock must use something like spin_lock_irqsave() to - acquire the lock. - - If the process-context code were to simply use spin_lock(), - then, since RCU callbacks can be invoked from softirq context, - the callback might be called from a softirq that interrupted - the process-context critical section. This would result in - self-deadlock. - - This restriction might seem gratuitous, since very few RCU - callbacks acquire locks directly. However, a great many RCU - callbacks do acquire locks -indirectly-, for example, via - the kfree() primitive. - -Answer to Quick Quiz #3: - Why can't synchronize_rcu() return immediately on UP systems - running preemptable RCU? - - Because some other task might have been preempted in the middle - of an RCU read-side critical section. If synchronize_rcu() - simply immediately returned, it would prematurely signal the - end of the grace period, which would come as a nasty shock to - that other thread when it started running again. diff --git a/Documentation/RCU/index.rst b/Documentation/RCU/index.rst new file mode 100644 index 000000000000..340a9725676c --- /dev/null +++ b/Documentation/RCU/index.rst @@ -0,0 +1,19 @@ +.. _rcu_concepts: + +============ +RCU concepts +============ + +.. toctree:: + :maxdepth: 1 + + rcu + listRCU + UP + +.. only:: subproject and html + + Indices + ======= + + * :ref:`genindex` diff --git a/Documentation/RCU/listRCU.rst b/Documentation/RCU/listRCU.rst new file mode 100644 index 000000000000..7956ff33042b --- /dev/null +++ b/Documentation/RCU/listRCU.rst @@ -0,0 +1,321 @@ +.. _list_rcu_doc: + +Using RCU to Protect Read-Mostly Linked Lists +============================================= + +One of the best applications of RCU is to protect read-mostly linked lists +("struct list_head" in list.h). One big advantage of this approach +is that all of the required memory barriers are included for you in +the list macros. This document describes several applications of RCU, +with the best fits first. + +Example 1: Read-Side Action Taken Outside of Lock, No In-Place Updates +---------------------------------------------------------------------- + +The best applications are cases where, if reader-writer locking were +used, the read-side lock would be dropped before taking any action +based on the results of the search. The most celebrated example is +the routing table. Because the routing table is tracking the state of +equipment outside of the computer, it will at times contain stale data. +Therefore, once the route has been computed, there is no need to hold +the routing table static during transmission of the packet. After all, +you can hold the routing table static all you want, but that won't keep +the external Internet from changing, and it is the state of the external +Internet that really matters. In addition, routing entries are typically +added or deleted, rather than being modified in place. + +A straightforward example of this use of RCU may be found in the +system-call auditing support. For example, a reader-writer locked +implementation of audit_filter_task() might be as follows:: + + static enum audit_state audit_filter_task(struct task_struct *tsk) + { + struct audit_entry *e; + enum audit_state state; + + read_lock(&auditsc_lock); + /* Note: audit_netlink_sem held by caller. */ + list_for_each_entry(e, &audit_tsklist, list) { + if (audit_filter_rules(tsk, &e->rule, NULL, &state)) { + read_unlock(&auditsc_lock); + return state; + } + } + read_unlock(&auditsc_lock); + return AUDIT_BUILD_CONTEXT; + } + +Here the list is searched under the lock, but the lock is dropped before +the corresponding value is returned. By the time that this value is acted +on, the list may well have been modified. This makes sense, since if +you are turning auditing off, it is OK to audit a few extra system calls. + +This means that RCU can be easily applied to the read side, as follows:: + + static enum audit_state audit_filter_task(struct task_struct *tsk) + { + struct audit_entry *e; + enum audit_state state; + + rcu_read_lock(); + /* Note: audit_netlink_sem held by caller. */ + list_for_each_entry_rcu(e, &audit_tsklist, list) { + if (audit_filter_rules(tsk, &e->rule, NULL, &state)) { + rcu_read_unlock(); + return state; + } + } + rcu_read_unlock(); + return AUDIT_BUILD_CONTEXT; + } + +The read_lock() and read_unlock() calls have become rcu_read_lock() +and rcu_read_unlock(), respectively, and the list_for_each_entry() has +become list_for_each_entry_rcu(). The _rcu() list-traversal primitives +insert the read-side memory barriers that are required on DEC Alpha CPUs. + +The changes to the update side are also straightforward. A reader-writer +lock might be used as follows for deletion and insertion:: + + static inline int audit_del_rule(struct audit_rule *rule, + struct list_head *list) + { + struct audit_entry *e; + + write_lock(&auditsc_lock); + list_for_each_entry(e, list, list) { + if (!audit_compare_rule(rule, &e->rule)) { + list_del(&e->list); + write_unlock(&auditsc_lock); + return 0; + } + } + write_unlock(&auditsc_lock); + return -EFAULT; /* No matching rule */ + } + + static inline int audit_add_rule(struct audit_entry *entry, + struct list_head *list) + { + write_lock(&auditsc_lock); + if (entry->rule.flags & AUDIT_PREPEND) { + entry->rule.flags &= ~AUDIT_PREPEND; + list_add(&entry->list, list); + } else { + list_add_tail(&entry->list, list); + } + write_unlock(&auditsc_lock); + return 0; + } + +Following are the RCU equivalents for these two functions:: + + static inline int audit_del_rule(struct audit_rule *rule, + struct list_head *list) + { + struct audit_entry *e; + + /* Do not use the _rcu iterator here, since this is the only + * deletion routine. */ + list_for_each_entry(e, list, list) { + if (!audit_compare_rule(rule, &e->rule)) { + list_del_rcu(&e->list); + call_rcu(&e->rcu, audit_free_rule); + return 0; + } + } + return -EFAULT; /* No matching rule */ + } + + static inline int audit_add_rule(struct audit_entry *entry, + struct list_head *list) + { + if (entry->rule.flags & AUDIT_PREPEND) { + entry->rule.flags &= ~AUDIT_PREPEND; + list_add_rcu(&entry->list, list); + } else { + list_add_tail_rcu(&entry->list, list); + } + return 0; + } + +Normally, the write_lock() and write_unlock() would be replaced by +a spin_lock() and a spin_unlock(), but in this case, all callers hold +audit_netlink_sem, so no additional locking is required. The auditsc_lock +can therefore be eliminated, since use of RCU eliminates the need for +writers to exclude readers. Normally, the write_lock() calls would +be converted into spin_lock() calls. + +The list_del(), list_add(), and list_add_tail() primitives have been +replaced by list_del_rcu(), list_add_rcu(), and list_add_tail_rcu(). +The _rcu() list-manipulation primitives add memory barriers that are +needed on weakly ordered CPUs (most of them!). The list_del_rcu() +primitive omits the pointer poisoning debug-assist code that would +otherwise cause concurrent readers to fail spectacularly. + +So, when readers can tolerate stale data and when entries are either added +or deleted, without in-place modification, it is very easy to use RCU! + +Example 2: Handling In-Place Updates +------------------------------------ + +The system-call auditing code does not update auditing rules in place. +However, if it did, reader-writer-locked code to do so might look as +follows (presumably, the field_count is only permitted to decrease, +otherwise, the added fields would need to be filled in):: + + static inline int audit_upd_rule(struct audit_rule *rule, + struct list_head *list, + __u32 newaction, + __u32 newfield_count) + { + struct audit_entry *e; + struct audit_newentry *ne; + + write_lock(&auditsc_lock); + /* Note: audit_netlink_sem held by caller. */ + list_for_each_entry(e, list, list) { + if (!audit_compare_rule(rule, &e->rule)) { + e->rule.action = newaction; + e->rule.file_count = newfield_count; + write_unlock(&auditsc_lock); + return 0; + } + } + write_unlock(&auditsc_lock); + return -EFAULT; /* No matching rule */ + } + +The RCU version creates a copy, updates the copy, then replaces the old +entry with the newly updated entry. This sequence of actions, allowing +concurrent reads while doing a copy to perform an update, is what gives +RCU ("read-copy update") its name. The RCU code is as follows:: + + static inline int audit_upd_rule(struct audit_rule *rule, + struct list_head *list, + __u32 newaction, + __u32 newfield_count) + { + struct audit_entry *e; + struct audit_newentry *ne; + + list_for_each_entry(e, list, list) { + if (!audit_compare_rule(rule, &e->rule)) { + ne = kmalloc(sizeof(*entry), GFP_ATOMIC); + if (ne == NULL) + return -ENOMEM; + audit_copy_rule(&ne->rule, &e->rule); + ne->rule.action = newaction; + ne->rule.file_count = newfield_count; + list_replace_rcu(&e->list, &ne->list); + call_rcu(&e->rcu, audit_free_rule); + return 0; + } + } + return -EFAULT; /* No matching rule */ + } + +Again, this assumes that the caller holds audit_netlink_sem. Normally, +the reader-writer lock would become a spinlock in this sort of code. + +Example 3: Eliminating Stale Data +--------------------------------- + +The auditing examples above tolerate stale data, as do most algorithms +that are tracking external state. Because there is a delay from the +time the external state changes before Linux becomes aware of the change, +additional RCU-induced staleness is normally not a problem. + +However, there are many examples where stale data cannot be tolerated. +One example in the Linux kernel is the System V IPC (see the ipc_lock() +function in ipc/util.c). This code checks a "deleted" flag under a +per-entry spinlock, and, if the "deleted" flag is set, pretends that the +entry does not exist. For this to be helpful, the search function must +return holding the per-entry spinlock, as ipc_lock() does in fact do. + +Quick Quiz: + Why does the search function need to return holding the per-entry lock for + this deleted-flag technique to be helpful? + +:ref:`Answer to Quick Quiz ` + +If the system-call audit module were to ever need to reject stale data, +one way to accomplish this would be to add a "deleted" flag and a "lock" +spinlock to the audit_entry structure, and modify audit_filter_task() +as follows:: + + static enum audit_state audit_filter_task(struct task_struct *tsk) + { + struct audit_entry *e; + enum audit_state state; + + rcu_read_lock(); + list_for_each_entry_rcu(e, &audit_tsklist, list) { + if (audit_filter_rules(tsk, &e->rule, NULL, &state)) { + spin_lock(&e->lock); + if (e->deleted) { + spin_unlock(&e->lock); + rcu_read_unlock(); + return AUDIT_BUILD_CONTEXT; + } + rcu_read_unlock(); + return state; + } + } + rcu_read_unlock(); + return AUDIT_BUILD_CONTEXT; + } + +Note that this example assumes that entries are only added and deleted. +Additional mechanism is required to deal correctly with the +update-in-place performed by audit_upd_rule(). For one thing, +audit_upd_rule() would need additional memory barriers to ensure +that the list_add_rcu() was really executed before the list_del_rcu(). + +The audit_del_rule() function would need to set the "deleted" +flag under the spinlock as follows:: + + static inline int audit_del_rule(struct audit_rule *rule, + struct list_head *list) + { + struct audit_entry *e; + + /* Do not need to use the _rcu iterator here, since this + * is the only deletion routine. */ + list_for_each_entry(e, list, list) { + if (!audit_compare_rule(rule, &e->rule)) { + spin_lock(&e->lock); + list_del_rcu(&e->list); + e->deleted = 1; + spin_unlock(&e->lock); + call_rcu(&e->rcu, audit_free_rule); + return 0; + } + } + return -EFAULT; /* No matching rule */ + } + +Summary +------- + +Read-mostly list-based data structures that can tolerate stale data are +the most amenable to use of RCU. The simplest case is where entries are +either added or deleted from the data structure (or atomically modified +in place), but non-atomic in-place modifications can be handled by making +a copy, updating the copy, then replacing the original with the copy. +If stale data cannot be tolerated, then a "deleted" flag may be used +in conjunction with a per-entry spinlock in order to allow the search +function to reject newly deleted data. + +.. _answer_quick_quiz_list: + +Answer to Quick Quiz: + Why does the search function need to return holding the per-entry + lock for this deleted-flag technique to be helpful? + + If the search function drops the per-entry lock before returning, + then the caller will be processing stale data in any case. If it + is really OK to be processing stale data, then you don't need a + "deleted" flag. If processing stale data really is a problem, + then you need to hold the per-entry lock across all of the code + that uses the value that was returned. diff --git a/Documentation/RCU/listRCU.txt b/Documentation/RCU/listRCU.txt deleted file mode 100644 index adb5a3782846..000000000000 --- a/Documentation/RCU/listRCU.txt +++ /dev/null @@ -1,315 +0,0 @@ -Using RCU to Protect Read-Mostly Linked Lists - - -One of the best applications of RCU is to protect read-mostly linked lists -("struct list_head" in list.h). One big advantage of this approach -is that all of the required memory barriers are included for you in -the list macros. This document describes several applications of RCU, -with the best fits first. - - -Example 1: Read-Side Action Taken Outside of Lock, No In-Place Updates - -The best applications are cases where, if reader-writer locking were -used, the read-side lock would be dropped before taking any action -based on the results of the search. The most celebrated example is -the routing table. Because the routing table is tracking the state of -equipment outside of the computer, it will at times contain stale data. -Therefore, once the route has been computed, there is no need to hold -the routing table static during transmission of the packet. After all, -you can hold the routing table static all you want, but that won't keep -the external Internet from changing, and it is the state of the external -Internet that really matters. In addition, routing entries are typically -added or deleted, rather than being modified in place. - -A straightforward example of this use of RCU may be found in the -system-call auditing support. For example, a reader-writer locked -implementation of audit_filter_task() might be as follows: - - static enum audit_state audit_filter_task(struct task_struct *tsk) - { - struct audit_entry *e; - enum audit_state state; - - read_lock(&auditsc_lock); - /* Note: audit_netlink_sem held by caller. */ - list_for_each_entry(e, &audit_tsklist, list) { - if (audit_filter_rules(tsk, &e->rule, NULL, &state)) { - read_unlock(&auditsc_lock); - return state; - } - } - read_unlock(&auditsc_lock); - return AUDIT_BUILD_CONTEXT; - } - -Here the list is searched under the lock, but the lock is dropped before -the corresponding value is returned. By the time that this value is acted -on, the list may well have been modified. This makes sense, since if -you are turning auditing off, it is OK to audit a few extra system calls. - -This means that RCU can be easily applied to the read side, as follows: - - static enum audit_state audit_filter_task(struct task_struct *tsk) - { - struct audit_entry *e; - enum audit_state state; - - rcu_read_lock(); - /* Note: audit_netlink_sem held by caller. */ - list_for_each_entry_rcu(e, &audit_tsklist, list) { - if (audit_filter_rules(tsk, &e->rule, NULL, &state)) { - rcu_read_unlock(); - return state; - } - } - rcu_read_unlock(); - return AUDIT_BUILD_CONTEXT; - } - -The read_lock() and read_unlock() calls have become rcu_read_lock() -and rcu_read_unlock(), respectively, and the list_for_each_entry() has -become list_for_each_entry_rcu(). The _rcu() list-traversal primitives -insert the read-side memory barriers that are required on DEC Alpha CPUs. - -The changes to the update side are also straightforward. A reader-writer -lock might be used as follows for deletion and insertion: - - static inline int audit_del_rule(struct audit_rule *rule, - struct list_head *list) - { - struct audit_entry *e; - - write_lock(&auditsc_lock); - list_for_each_entry(e, list, list) { - if (!audit_compare_rule(rule, &e->rule)) { - list_del(&e->list); - write_unlock(&auditsc_lock); - return 0; - } - } - write_unlock(&auditsc_lock); - return -EFAULT; /* No matching rule */ - } - - static inline int audit_add_rule(struct audit_entry *entry, - struct list_head *list) - { - write_lock(&auditsc_lock); - if (entry->rule.flags & AUDIT_PREPEND) { - entry->rule.flags &= ~AUDIT_PREPEND; - list_add(&entry->list, list); - } else { - list_add_tail(&entry->list, list); - } - write_unlock(&auditsc_lock); - return 0; - } - -Following are the RCU equivalents for these two functions: - - static inline int audit_del_rule(struct audit_rule *rule, - struct list_head *list) - { - struct audit_entry *e; - - /* Do not use the _rcu iterator here, since this is the only - * deletion routine. */ - list_for_each_entry(e, list, list) { - if (!audit_compare_rule(rule, &e->rule)) { - list_del_rcu(&e->list); - call_rcu(&e->rcu, audit_free_rule); - return 0; - } - } - return -EFAULT; /* No matching rule */ - } - - static inline int audit_add_rule(struct audit_entry *entry, - struct list_head *list) - { - if (entry->rule.flags & AUDIT_PREPEND) { - entry->rule.flags &= ~AUDIT_PREPEND; - list_add_rcu(&entry->list, list); - } else { - list_add_tail_rcu(&entry->list, list); - } - return 0; - } - -Normally, the write_lock() and write_unlock() would be replaced by -a spin_lock() and a spin_unlock(), but in this case, all callers hold -audit_netlink_sem, so no additional locking is required. The auditsc_lock -can therefore be eliminated, since use of RCU eliminates the need for -writers to exclude readers. Normally, the write_lock() calls would -be converted into spin_lock() calls. - -The list_del(), list_add(), and list_add_tail() primitives have been -replaced by list_del_rcu(), list_add_rcu(), and list_add_tail_rcu(). -The _rcu() list-manipulation primitives add memory barriers that are -needed on weakly ordered CPUs (most of them!). The list_del_rcu() -primitive omits the pointer poisoning debug-assist code that would -otherwise cause concurrent readers to fail spectacularly. - -So, when readers can tolerate stale data and when entries are either added -or deleted, without in-place modification, it is very easy to use RCU! - - -Example 2: Handling In-Place Updates - -The system-call auditing code does not update auditing rules in place. -However, if it did, reader-writer-locked code to do so might look as -follows (presumably, the field_count is only permitted to decrease, -otherwise, the added fields would need to be filled in): - - static inline int audit_upd_rule(struct audit_rule *rule, - struct list_head *list, - __u32 newaction, - __u32 newfield_count) - { - struct audit_entry *e; - struct audit_newentry *ne; - - write_lock(&auditsc_lock); - /* Note: audit_netlink_sem held by caller. */ - list_for_each_entry(e, list, list) { - if (!audit_compare_rule(rule, &e->rule)) { - e->rule.action = newaction; - e->rule.file_count = newfield_count; - write_unlock(&auditsc_lock); - return 0; - } - } - write_unlock(&auditsc_lock); - return -EFAULT; /* No matching rule */ - } - -The RCU version creates a copy, updates the copy, then replaces the old -entry with the newly updated entry. This sequence of actions, allowing -concurrent reads while doing a copy to perform an update, is what gives -RCU ("read-copy update") its name. The RCU code is as follows: - - static inline int audit_upd_rule(struct audit_rule *rule, - struct list_head *list, - __u32 newaction, - __u32 newfield_count) - { - struct audit_entry *e; - struct audit_newentry *ne; - - list_for_each_entry(e, list, list) { - if (!audit_compare_rule(rule, &e->rule)) { - ne = kmalloc(sizeof(*entry), GFP_ATOMIC); - if (ne == NULL) - return -ENOMEM; - audit_copy_rule(&ne->rule, &e->rule); - ne->rule.action = newaction; - ne->rule.file_count = newfield_count; - list_replace_rcu(&e->list, &ne->list); - call_rcu(&e->rcu, audit_free_rule); - return 0; - } - } - return -EFAULT; /* No matching rule */ - } - -Again, this assumes that the caller holds audit_netlink_sem. Normally, -the reader-writer lock would become a spinlock in this sort of code. - - -Example 3: Eliminating Stale Data - -The auditing examples above tolerate stale data, as do most algorithms -that are tracking external state. Because there is a delay from the -time the external state changes before Linux becomes aware of the change, -additional RCU-induced staleness is normally not a problem. - -However, there are many examples where stale data cannot be tolerated. -One example in the Linux kernel is the System V IPC (see the ipc_lock() -function in ipc/util.c). This code checks a "deleted" flag under a -per-entry spinlock, and, if the "deleted" flag is set, pretends that the -entry does not exist. For this to be helpful, the search function must -return holding the per-entry spinlock, as ipc_lock() does in fact do. - -Quick Quiz: Why does the search function need to return holding the - per-entry lock for this deleted-flag technique to be helpful? - -If the system-call audit module were to ever need to reject stale data, -one way to accomplish this would be to add a "deleted" flag and a "lock" -spinlock to the audit_entry structure, and modify audit_filter_task() -as follows: - - static enum audit_state audit_filter_task(struct task_struct *tsk) - { - struct audit_entry *e; - enum audit_state state; - - rcu_read_lock(); - list_for_each_entry_rcu(e, &audit_tsklist, list) { - if (audit_filter_rules(tsk, &e->rule, NULL, &state)) { - spin_lock(&e->lock); - if (e->deleted) { - spin_unlock(&e->lock); - rcu_read_unlock(); - return AUDIT_BUILD_CONTEXT; - } - rcu_read_unlock(); - return state; - } - } - rcu_read_unlock(); - return AUDIT_BUILD_CONTEXT; - } - -Note that this example assumes that entries are only added and deleted. -Additional mechanism is required to deal correctly with the -update-in-place performed by audit_upd_rule(). For one thing, -audit_upd_rule() would need additional memory barriers to ensure -that the list_add_rcu() was really executed before the list_del_rcu(). - -The audit_del_rule() function would need to set the "deleted" -flag under the spinlock as follows: - - static inline int audit_del_rule(struct audit_rule *rule, - struct list_head *list) - { - struct audit_entry *e; - - /* Do not need to use the _rcu iterator here, since this - * is the only deletion routine. */ - list_for_each_entry(e, list, list) { - if (!audit_compare_rule(rule, &e->rule)) { - spin_lock(&e->lock); - list_del_rcu(&e->list); - e->deleted = 1; - spin_unlock(&e->lock); - call_rcu(&e->rcu, audit_free_rule); - return 0; - } - } - return -EFAULT; /* No matching rule */ - } - - -Summary - -Read-mostly list-based data structures that can tolerate stale data are -the most amenable to use of RCU. The simplest case is where entries are -either added or deleted from the data structure (or atomically modified -in place), but non-atomic in-place modifications can be handled by making -a copy, updating the copy, then replacing the original with the copy. -If stale data cannot be tolerated, then a "deleted" flag may be used -in conjunction with a per-entry spinlock in order to allow the search -function to reject newly deleted data. - - -Answer to Quick Quiz - Why does the search function need to return holding the per-entry - lock for this deleted-flag technique to be helpful? - - If the search function drops the per-entry lock before returning, - then the caller will be processing stale data in any case. If it - is really OK to be processing stale data, then you don't need a - "deleted" flag. If processing stale data really is a problem, - then you need to hold the per-entry lock across all of the code - that uses the value that was returned. diff --git a/Documentation/RCU/rcu.rst b/Documentation/RCU/rcu.rst new file mode 100644 index 000000000000..8dfb437dacc3 --- /dev/null +++ b/Documentation/RCU/rcu.rst @@ -0,0 +1,92 @@ +.. _rcu_doc: + +RCU Concepts +============ + +The basic idea behind RCU (read-copy update) is to split destructive +operations into two parts, one that prevents anyone from seeing the data +item being destroyed, and one that actually carries out the destruction. +A "grace period" must elapse between the two parts, and this grace period +must be long enough that any readers accessing the item being deleted have +since dropped their references. For example, an RCU-protected deletion +from a linked list would first remove the item from the list, wait for +a grace period to elapse, then free the element. See the +Documentation/RCU/listRCU.rst file for more information on using RCU with +linked lists. + +Frequently Asked Questions +-------------------------- + +- Why would anyone want to use RCU? + + The advantage of RCU's two-part approach is that RCU readers need + not acquire any locks, perform any atomic instructions, write to + shared memory, or (on CPUs other than Alpha) execute any memory + barriers. The fact that these operations are quite expensive + on modern CPUs is what gives RCU its performance advantages + in read-mostly situations. The fact that RCU readers need not + acquire locks can also greatly simplify deadlock-avoidance code. + +- How can the updater tell when a grace period has completed + if the RCU readers give no indication when they are done? + + Just as with spinlocks, RCU readers are not permitted to + block, switch to user-mode execution, or enter the idle loop. + Therefore, as soon as a CPU is seen passing through any of these + three states, we know that that CPU has exited any previous RCU + read-side critical sections. So, if we remove an item from a + linked list, and then wait until all CPUs have switched context, + executed in user mode, or executed in the idle loop, we can + safely free up that item. + + Preemptible variants of RCU (CONFIG_PREEMPT_RCU) get the + same effect, but require that the readers manipulate CPU-local + counters. These counters allow limited types of blocking within + RCU read-side critical sections. SRCU also uses CPU-local + counters, and permits general blocking within RCU read-side + critical sections. These variants of RCU detect grace periods + by sampling these counters. + +- If I am running on a uniprocessor kernel, which can only do one + thing at a time, why should I wait for a grace period? + + See the Documentation/RCU/UP.rst file for more information. + +- How can I see where RCU is currently used in the Linux kernel? + + Search for "rcu_read_lock", "rcu_read_unlock", "call_rcu", + "rcu_read_lock_bh", "rcu_read_unlock_bh", "srcu_read_lock", + "srcu_read_unlock", "synchronize_rcu", "synchronize_net", + "synchronize_srcu", and the other RCU primitives. Or grab one + of the cscope databases from: + + (http://www.rdrop.com/users/paulmck/RCU/linuxusage/rculocktab.html). + +- What guidelines should I follow when writing code that uses RCU? + + See the checklist.txt file in this directory. + +- Why the name "RCU"? + + "RCU" stands for "read-copy update". The file Documentation/RCU/listRCU.rst + has more information on where this name came from, search for + "read-copy update" to find it. + +- I hear that RCU is patented? What is with that? + + Yes, it is. There are several known patents related to RCU, + search for the string "Patent" in RTFP.txt to find them. + Of these, one was allowed to lapse by the assignee, and the + others have been contributed to the Linux kernel under GPL. + There are now also LGPL implementations of user-level RCU + available (http://liburcu.org/). + +- I hear that RCU needs work in order to support realtime kernels? + + Realtime-friendly RCU can be enabled via the CONFIG_PREEMPT_RCU + kernel configuration parameter. + +- Where can I find more information on RCU? + + See the RTFP.txt file in this directory. + Or point your browser at (http://www.rdrop.com/users/paulmck/RCU/). diff --git a/Documentation/RCU/rcu.txt b/Documentation/RCU/rcu.txt deleted file mode 100644 index c818cf65c5a9..000000000000 --- a/Documentation/RCU/rcu.txt +++ /dev/null @@ -1,89 +0,0 @@ -RCU Concepts - - -The basic idea behind RCU (read-copy update) is to split destructive -operations into two parts, one that prevents anyone from seeing the data -item being destroyed, and one that actually carries out the destruction. -A "grace period" must elapse between the two parts, and this grace period -must be long enough that any readers accessing the item being deleted have -since dropped their references. For example, an RCU-protected deletion -from a linked list would first remove the item from the list, wait for -a grace period to elapse, then free the element. See the listRCU.txt -file for more information on using RCU with linked lists. - - -Frequently Asked Questions - -o Why would anyone want to use RCU? - - The advantage of RCU's two-part approach is that RCU readers need - not acquire any locks, perform any atomic instructions, write to - shared memory, or (on CPUs other than Alpha) execute any memory - barriers. The fact that these operations are quite expensive - on modern CPUs is what gives RCU its performance advantages - in read-mostly situations. The fact that RCU readers need not - acquire locks can also greatly simplify deadlock-avoidance code. - -o How can the updater tell when a grace period has completed - if the RCU readers give no indication when they are done? - - Just as with spinlocks, RCU readers are not permitted to - block, switch to user-mode execution, or enter the idle loop. - Therefore, as soon as a CPU is seen passing through any of these - three states, we know that that CPU has exited any previous RCU - read-side critical sections. So, if we remove an item from a - linked list, and then wait until all CPUs have switched context, - executed in user mode, or executed in the idle loop, we can - safely free up that item. - - Preemptible variants of RCU (CONFIG_PREEMPT_RCU) get the - same effect, but require that the readers manipulate CPU-local - counters. These counters allow limited types of blocking within - RCU read-side critical sections. SRCU also uses CPU-local - counters, and permits general blocking within RCU read-side - critical sections. These variants of RCU detect grace periods - by sampling these counters. - -o If I am running on a uniprocessor kernel, which can only do one - thing at a time, why should I wait for a grace period? - - See the UP.txt file in this directory. - -o How can I see where RCU is currently used in the Linux kernel? - - Search for "rcu_read_lock", "rcu_read_unlock", "call_rcu", - "rcu_read_lock_bh", "rcu_read_unlock_bh", "srcu_read_lock", - "srcu_read_unlock", "synchronize_rcu", "synchronize_net", - "synchronize_srcu", and the other RCU primitives. Or grab one - of the cscope databases from: - - http://www.rdrop.com/users/paulmck/RCU/linuxusage/rculocktab.html - -o What guidelines should I follow when writing code that uses RCU? - - See the checklist.txt file in this directory. - -o Why the name "RCU"? - - "RCU" stands for "read-copy update". The file listRCU.txt has - more information on where this name came from, search for - "read-copy update" to find it. - -o I hear that RCU is patented? What is with that? - - Yes, it is. There are several known patents related to RCU, - search for the string "Patent" in RTFP.txt to find them. - Of these, one was allowed to lapse by the assignee, and the - others have been contributed to the Linux kernel under GPL. - There are now also LGPL implementations of user-level RCU - available (http://liburcu.org/). - -o I hear that RCU needs work in order to support realtime kernels? - - Realtime-friendly RCU can be enabled via the CONFIG_PREEMPT_RCU - kernel configuration parameter. - -o Where can I find more information on RCU? - - See the RTFP.txt file in this directory. - Or point your browser at http://www.rdrop.com/users/paulmck/RCU/. diff --git a/Documentation/RCU/rcuref.txt b/Documentation/RCU/rcuref.txt index 613033ff2b9b..5e6429d66c24 100644 --- a/Documentation/RCU/rcuref.txt +++ b/Documentation/RCU/rcuref.txt @@ -12,6 +12,7 @@ please read on. Reference counting on elements of lists which are protected by traditional reader/writer spinlocks or semaphores are straightforward: +CODE LISTING A: 1. 2. add() search_and_reference() { { @@ -28,7 +29,8 @@ add() search_and_reference() release_referenced() delete() { { ... write_lock(&list_lock); - atomic_dec(&el->rc, relfunc) ... + if(atomic_dec_and_test(&el->rc)) ... + kfree(el); ... remove_element } write_unlock(&list_lock); ... @@ -44,6 +46,7 @@ search_and_reference() could potentially hold reference to an element which has already been deleted from the list/array. Use atomic_inc_not_zero() in this scenario as follows: +CODE LISTING B: 1. 2. add() search_and_reference() { { @@ -79,6 +82,7 @@ search_and_reference() code path. In such cases, the atomic_dec_and_test() may be moved from delete() to el_free() as follows: +CODE LISTING C: 1. 2. add() search_and_reference() { { @@ -114,6 +118,17 @@ element can therefore safely be freed. This in turn guarantees that if any reader finds the element, that reader may safely acquire a reference without checking the value of the reference counter. +A clear advantage of the RCU-based pattern in listing C over the one +in listing B is that any call to search_and_reference() that locates +a given object will succeed in obtaining a reference to that object, +even given a concurrent invocation of delete() for that same object. +Similarly, a clear advantage of both listings B and C over listing A is +that a call to delete() is not delayed even if there are an arbitrarily +large number of calls to search_and_reference() searching for the same +object that delete() was invoked on. Instead, all that is delayed is +the eventual invocation of kfree(), which is usually not a problem on +modern computer systems, even the small ones. + In cases where delete() can sleep, synchronize_rcu() can be called from delete(), so that el_free() can be subsumed into delete as follows: @@ -130,3 +145,7 @@ delete() kfree(el); ... } + +As additional examples in the kernel, the pattern in listing C is used by +reference counting of struct pid, while the pattern in listing B is used by +struct posix_acl. diff --git a/Documentation/RCU/stallwarn.txt b/Documentation/RCU/stallwarn.txt index 1ab70c37921f..13e88fc00f01 100644 --- a/Documentation/RCU/stallwarn.txt +++ b/Documentation/RCU/stallwarn.txt @@ -153,7 +153,7 @@ rcupdate.rcu_task_stall_timeout This boot/sysfs parameter controls the RCU-tasks stall warning interval. A value of zero or less suppresses RCU-tasks stall warnings. A positive value sets the stall-warning interval - in jiffies. An RCU-tasks stall warning starts with the line: + in seconds. An RCU-tasks stall warning starts with the line: INFO: rcu_tasks detected stalls on tasks: diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt index 981651a8b65d..7e1a8721637a 100644 --- a/Documentation/RCU/whatisRCU.txt +++ b/Documentation/RCU/whatisRCU.txt @@ -212,7 +212,7 @@ synchronize_rcu() rcu_assign_pointer() - typeof(p) rcu_assign_pointer(p, typeof(p) v); + void rcu_assign_pointer(p, typeof(p) v); Yes, rcu_assign_pointer() -is- implemented as a macro, though it would be cool to be able to declare a function in this manner. @@ -220,9 +220,9 @@ rcu_assign_pointer() The updater uses this function to assign a new value to an RCU-protected pointer, in order to safely communicate the change - in value from the updater to the reader. This function returns - the new value, and also executes any memory-barrier instructions - required for a given CPU architecture. + in value from the updater to the reader. This macro does not + evaluate to an rvalue, but it does execute any memory-barrier + instructions required for a given CPU architecture. Perhaps just as important, it serves to document (1) which pointers are protected by RCU and (2) the point at which a diff --git a/Documentation/accounting/cgroupstats.rst b/Documentation/accounting/cgroupstats.rst new file mode 100644 index 000000000000..b9afc48f4ea2 --- /dev/null +++ b/Documentation/accounting/cgroupstats.rst @@ -0,0 +1,31 @@ +================== +Control Groupstats +================== + +Control Groupstats is inspired by the discussion at +http://lkml.org/lkml/2007/4/11/187 and implements per cgroup statistics as +suggested by Andrew Morton in http://lkml.org/lkml/2007/4/11/263. + +Per cgroup statistics infrastructure re-uses code from the taskstats +interface. A new set of cgroup operations are registered with commands +and attributes specific to cgroups. It should be very easy to +extend per cgroup statistics, by adding members to the cgroupstats +structure. + +The current model for cgroupstats is a pull, a push model (to post +statistics on interesting events), should be very easy to add. Currently +user space requests for statistics by passing the cgroup path. +Statistics about the state of all the tasks in the cgroup is returned to +user space. + +NOTE: We currently rely on delay accounting for extracting information +about tasks blocked on I/O. If CONFIG_TASK_DELAY_ACCT is disabled, this +information will not be available. + +To extract cgroup statistics a utility very similar to getdelays.c +has been developed, the sample output of the utility is shown below:: + + ~/balbir/cgroupstats # ./getdelays -C "/sys/fs/cgroup/a" + sleeping 1, blocked 0, running 1, stopped 0, uninterruptible 0 + ~/balbir/cgroupstats # ./getdelays -C "/sys/fs/cgroup" + sleeping 155, blocked 0, running 1, stopped 0, uninterruptible 2 diff --git a/Documentation/accounting/cgroupstats.txt b/Documentation/accounting/cgroupstats.txt deleted file mode 100644 index d16a9849e60e..000000000000 --- a/Documentation/accounting/cgroupstats.txt +++ /dev/null @@ -1,27 +0,0 @@ -Control Groupstats is inspired by the discussion at -http://lkml.org/lkml/2007/4/11/187 and implements per cgroup statistics as -suggested by Andrew Morton in http://lkml.org/lkml/2007/4/11/263. - -Per cgroup statistics infrastructure re-uses code from the taskstats -interface. A new set of cgroup operations are registered with commands -and attributes specific to cgroups. It should be very easy to -extend per cgroup statistics, by adding members to the cgroupstats -structure. - -The current model for cgroupstats is a pull, a push model (to post -statistics on interesting events), should be very easy to add. Currently -user space requests for statistics by passing the cgroup path. -Statistics about the state of all the tasks in the cgroup is returned to -user space. - -NOTE: We currently rely on delay accounting for extracting information -about tasks blocked on I/O. If CONFIG_TASK_DELAY_ACCT is disabled, this -information will not be available. - -To extract cgroup statistics a utility very similar to getdelays.c -has been developed, the sample output of the utility is shown below - -~/balbir/cgroupstats # ./getdelays -C "/sys/fs/cgroup/a" -sleeping 1, blocked 0, running 1, stopped 0, uninterruptible 0 -~/balbir/cgroupstats # ./getdelays -C "/sys/fs/cgroup" -sleeping 155, blocked 0, running 1, stopped 0, uninterruptible 2 diff --git a/Documentation/accounting/delay-accounting.rst b/Documentation/accounting/delay-accounting.rst new file mode 100644 index 000000000000..7cc7f5852da0 --- /dev/null +++ b/Documentation/accounting/delay-accounting.rst @@ -0,0 +1,126 @@ +================ +Delay accounting +================ + +Tasks encounter delays in execution when they wait +for some kernel resource to become available e.g. a +runnable task may wait for a free CPU to run on. + +The per-task delay accounting functionality measures +the delays experienced by a task while + +a) waiting for a CPU (while being runnable) +b) completion of synchronous block I/O initiated by the task +c) swapping in pages +d) memory reclaim + +and makes these statistics available to userspace through +the taskstats interface. + +Such delays provide feedback for setting a task's cpu priority, +io priority and rss limit values appropriately. Long delays for +important tasks could be a trigger for raising its corresponding priority. + +The functionality, through its use of the taskstats interface, also provides +delay statistics aggregated for all tasks (or threads) belonging to a +thread group (corresponding to a traditional Unix process). This is a commonly +needed aggregation that is more efficiently done by the kernel. + +Userspace utilities, particularly resource management applications, can also +aggregate delay statistics into arbitrary groups. To enable this, delay +statistics of a task are available both during its lifetime as well as on its +exit, ensuring continuous and complete monitoring can be done. + + +Interface +--------- + +Delay accounting uses the taskstats interface which is described +in detail in a separate document in this directory. Taskstats returns a +generic data structure to userspace corresponding to per-pid and per-tgid +statistics. The delay accounting functionality populates specific fields of +this structure. See + + include/linux/taskstats.h + +for a description of the fields pertaining to delay accounting. +It will generally be in the form of counters returning the cumulative +delay seen for cpu, sync block I/O, swapin, memory reclaim etc. + +Taking the difference of two successive readings of a given +counter (say cpu_delay_total) for a task will give the delay +experienced by the task waiting for the corresponding resource +in that interval. + +When a task exits, records containing the per-task statistics +are sent to userspace without requiring a command. If it is the last exiting +task of a thread group, the per-tgid statistics are also sent. More details +are given in the taskstats interface description. + +The getdelays.c userspace utility in tools/accounting directory allows simple +commands to be run and the corresponding delay statistics to be displayed. It +also serves as an example of using the taskstats interface. + +Usage +----- + +Compile the kernel with:: + + CONFIG_TASK_DELAY_ACCT=y + CONFIG_TASKSTATS=y + +Delay accounting is enabled by default at boot up. +To disable, add:: + + nodelayacct + +to the kernel boot options. The rest of the instructions +below assume this has not been done. + +After the system has booted up, use a utility +similar to getdelays.c to access the delays +seen by a given task or a task group (tgid). +The utility also allows a given command to be +executed and the corresponding delays to be +seen. + +General format of the getdelays command:: + + getdelays [-t tgid] [-p pid] [-c cmd...] + + +Get delays, since system boot, for pid 10:: + + # ./getdelays -p 10 + (output similar to next case) + +Get sum of delays, since system boot, for all pids with tgid 5:: + + # ./getdelays -t 5 + + + CPU count real total virtual total delay total + 7876 92005750 100000000 24001500 + IO count delay total + 0 0 + SWAP count delay total + 0 0 + RECLAIM count delay total + 0 0 + +Get delays seen in executing a given simple command:: + + # ./getdelays -c ls / + + bin data1 data3 data5 dev home media opt root srv sys usr + boot data2 data4 data6 etc lib mnt proc sbin subdomain tmp var + + + CPU count real total virtual total delay total + 6 4000250 4000000 0 + IO count delay total + 0 0 + SWAP count delay total + 0 0 + RECLAIM count delay total + 0 0 diff --git a/Documentation/accounting/delay-accounting.txt b/Documentation/accounting/delay-accounting.txt deleted file mode 100644 index 042ea59b5853..000000000000 --- a/Documentation/accounting/delay-accounting.txt +++ /dev/null @@ -1,117 +0,0 @@ -Delay accounting ----------------- - -Tasks encounter delays in execution when they wait -for some kernel resource to become available e.g. a -runnable task may wait for a free CPU to run on. - -The per-task delay accounting functionality measures -the delays experienced by a task while - -a) waiting for a CPU (while being runnable) -b) completion of synchronous block I/O initiated by the task -c) swapping in pages -d) memory reclaim - -and makes these statistics available to userspace through -the taskstats interface. - -Such delays provide feedback for setting a task's cpu priority, -io priority and rss limit values appropriately. Long delays for -important tasks could be a trigger for raising its corresponding priority. - -The functionality, through its use of the taskstats interface, also provides -delay statistics aggregated for all tasks (or threads) belonging to a -thread group (corresponding to a traditional Unix process). This is a commonly -needed aggregation that is more efficiently done by the kernel. - -Userspace utilities, particularly resource management applications, can also -aggregate delay statistics into arbitrary groups. To enable this, delay -statistics of a task are available both during its lifetime as well as on its -exit, ensuring continuous and complete monitoring can be done. - - -Interface ---------- - -Delay accounting uses the taskstats interface which is described -in detail in a separate document in this directory. Taskstats returns a -generic data structure to userspace corresponding to per-pid and per-tgid -statistics. The delay accounting functionality populates specific fields of -this structure. See - include/linux/taskstats.h -for a description of the fields pertaining to delay accounting. -It will generally be in the form of counters returning the cumulative -delay seen for cpu, sync block I/O, swapin, memory reclaim etc. - -Taking the difference of two successive readings of a given -counter (say cpu_delay_total) for a task will give the delay -experienced by the task waiting for the corresponding resource -in that interval. - -When a task exits, records containing the per-task statistics -are sent to userspace without requiring a command. If it is the last exiting -task of a thread group, the per-tgid statistics are also sent. More details -are given in the taskstats interface description. - -The getdelays.c userspace utility in tools/accounting directory allows simple -commands to be run and the corresponding delay statistics to be displayed. It -also serves as an example of using the taskstats interface. - -Usage ------ - -Compile the kernel with - CONFIG_TASK_DELAY_ACCT=y - CONFIG_TASKSTATS=y - -Delay accounting is enabled by default at boot up. -To disable, add - nodelayacct -to the kernel boot options. The rest of the instructions -below assume this has not been done. - -After the system has booted up, use a utility -similar to getdelays.c to access the delays -seen by a given task or a task group (tgid). -The utility also allows a given command to be -executed and the corresponding delays to be -seen. - -General format of the getdelays command - -getdelays [-t tgid] [-p pid] [-c cmd...] - - -Get delays, since system boot, for pid 10 -# ./getdelays -p 10 -(output similar to next case) - -Get sum of delays, since system boot, for all pids with tgid 5 -# ./getdelays -t 5 - - -CPU count real total virtual total delay total - 7876 92005750 100000000 24001500 -IO count delay total - 0 0 -SWAP count delay total - 0 0 -RECLAIM count delay total - 0 0 - -Get delays seen in executing a given simple command -# ./getdelays -c ls / - -bin data1 data3 data5 dev home media opt root srv sys usr -boot data2 data4 data6 etc lib mnt proc sbin subdomain tmp var - - -CPU count real total virtual total delay total - 6 4000250 4000000 0 -IO count delay total - 0 0 -SWAP count delay total - 0 0 -RECLAIM count delay total - 0 0 diff --git a/Documentation/accounting/index.rst b/Documentation/accounting/index.rst new file mode 100644 index 000000000000..9369d8bf32be --- /dev/null +++ b/Documentation/accounting/index.rst @@ -0,0 +1,14 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========== +Accounting +========== + +.. toctree:: + :maxdepth: 1 + + cgroupstats + delay-accounting + psi + taskstats + taskstats-struct diff --git a/Documentation/accounting/psi.rst b/Documentation/accounting/psi.rst new file mode 100644 index 000000000000..621111ce5740 --- /dev/null +++ b/Documentation/accounting/psi.rst @@ -0,0 +1,182 @@ +================================ +PSI - Pressure Stall Information +================================ + +:Date: April, 2018 +:Author: Johannes Weiner + +When CPU, memory or IO devices are contended, workloads experience +latency spikes, throughput losses, and run the risk of OOM kills. + +Without an accurate measure of such contention, users are forced to +either play it safe and under-utilize their hardware resources, or +roll the dice and frequently suffer the disruptions resulting from +excessive overcommit. + +The psi feature identifies and quantifies the disruptions caused by +such resource crunches and the time impact it has on complex workloads +or even entire systems. + +Having an accurate measure of productivity losses caused by resource +scarcity aids users in sizing workloads to hardware--or provisioning +hardware according to workload demand. + +As psi aggregates this information in realtime, systems can be managed +dynamically using techniques such as load shedding, migrating jobs to +other systems or data centers, or strategically pausing or killing low +priority or restartable batch jobs. + +This allows maximizing hardware utilization without sacrificing +workload health or risking major disruptions such as OOM kills. + +Pressure interface +================== + +Pressure information for each resource is exported through the +respective file in /proc/pressure/ -- cpu, memory, and io. + +The format for CPU is as such:: + + some avg10=0.00 avg60=0.00 avg300=0.00 total=0 + +and for memory and IO:: + + some avg10=0.00 avg60=0.00 avg300=0.00 total=0 + full avg10=0.00 avg60=0.00 avg300=0.00 total=0 + +The "some" line indicates the share of time in which at least some +tasks are stalled on a given resource. + +The "full" line indicates the share of time in which all non-idle +tasks are stalled on a given resource simultaneously. In this state +actual CPU cycles are going to waste, and a workload that spends +extended time in this state is considered to be thrashing. This has +severe impact on performance, and it's useful to distinguish this +situation from a state where some tasks are stalled but the CPU is +still doing productive work. As such, time spent in this subset of the +stall state is tracked separately and exported in the "full" averages. + +The ratios (in %) are tracked as recent trends over ten, sixty, and +three hundred second windows, which gives insight into short term events +as well as medium and long term trends. The total absolute stall time +(in us) is tracked and exported as well, to allow detection of latency +spikes which wouldn't necessarily make a dent in the time averages, +or to average trends over custom time frames. + +Monitoring for pressure thresholds +================================== + +Users can register triggers and use poll() to be woken up when resource +pressure exceeds certain thresholds. + +A trigger describes the maximum cumulative stall time over a specific +time window, e.g. 100ms of total stall time within any 500ms window to +generate a wakeup event. + +To register a trigger user has to open psi interface file under +/proc/pressure/ representing the resource to be monitored and write the +desired threshold and time window. The open file descriptor should be +used to wait for trigger events using select(), poll() or epoll(). +The following format is used:: + +