KVM/arm fixes for 5.3
Merge tag 'kvmarm-fixes-for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

- A bunch of switch/case fall-through annotations, fixing one actual bug
- Fix PMU reset bug
- Add missing exception class debug strings
commit 0e1c438c44

.gitignore (vendored, 1 change)
@@ -30,6 +30,7 @@
 *.lz4
 *.lzma
 *.lzo
+*.mod
 *.mod.c
 *.o
 *.o.*

CREDITS (3 changes)
@@ -1770,7 +1770,6 @@ S: USA

 N: Dave Jones
 E: davej@codemonkey.org.uk
-W: http://www.codemonkey.org.uk
 D: Assorted VIA x86 support.
 D: 2.5 AGPGART overhaul.
 D: CPUFREQ maintenance.
@@ -3120,7 +3119,7 @@ S: France
 N: Rik van Riel
 E: riel@redhat.com
 W: http://www.surriel.com/
-D: Linux-MM site, Documentation/sysctl/*, swap/mm readaround
+D: Linux-MM site, Documentation/admin-guide/sysctl/*, swap/mm readaround
 D: kswapd fixes, random kernel hacker, rmap VM,
 D: nl.linux.org administrator, minor scheduler additions
 S: Red Hat Boston

@@ -11,7 +11,7 @@ Description:
 		Kernel code may export it for complete or partial access.

 		GPIOs are identified as they are inside the kernel, using integers in
-		the range 0..INT_MAX. See Documentation/gpio for more information.
+		the range 0..INT_MAX. See Documentation/admin-guide/gpio for more information.

 	/sys/class/gpio
 		/export ... asks the kernel to export a GPIO to userspace

@@ -1,6 +1,6 @@
 rfkill - radio frequency (RF) connector kill switch support

-For details to this subsystem look at Documentation/rfkill.txt.
+For details to this subsystem look at Documentation/driver-api/rfkill.rst.

 What:		/sys/class/rfkill/rfkill[0-9]+/claim
 Date:		09-Jul-2007

@@ -423,23 +423,6 @@ Description:
 		(e.g. driver restart on the VM which owns the VF).

-
-sysfs interface for NetEffect RNIC Low-Level iWARP driver (nes)
----------------------------------------------------------------
-
-What:		/sys/class/infiniband/nesX/hw_rev
-What:		/sys/class/infiniband/nesX/hca_type
-What:		/sys/class/infiniband/nesX/board_id
-Date:		Feb, 2008
-KernelVersion:	v2.6.25
-Contact:	linux-rdma@vger.kernel.org
-Description:
-		hw_rev:		(RO) Hardware revision number
-
-		hca_type:	(RO) Host Channel Adapter type (NEX020)
-
-		board_id:	(RO) Manufacturing board id
-
 sysfs interface for Chelsio T4/T5 RDMA driver (cxgb4)
 -----------------------------------------------------

@@ -1,6 +1,6 @@
 rfkill - radio frequency (RF) connector kill switch support

-For details to this subsystem look at Documentation/rfkill.txt.
+For details to this subsystem look at Documentation/driver-api/rfkill.rst.

 For the deprecated /sys/class/rfkill/*/claim knobs of this interface look in
 Documentation/ABI/removed/sysfs-class-rfkill.

@@ -61,7 +61,7 @@ Date: October 2002
 Contact:	Linux Memory Management list <linux-mm@kvack.org>
 Description:
 		The node's hit/miss statistics, in units of pages.
-		See Documentation/numastat.txt
+		See Documentation/admin-guide/numastat.rst

 What:		/sys/devices/system/node/nodeX/distance
 Date:		October 2002

@@ -120,3 +120,23 @@ Description: These files show the system reset cause, as following: ComEx
 		the last reset cause.

 		The files are read only.
+
+What:		/sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_comex_thermal
+What:		/sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_comex_wd
+What:		/sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_from_asic
+What:		/sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_reload_bios
+What:		/sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_sff_wd
+What:		/sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_swb_wd
+Date:		June 2019
+KernelVersion:	5.3
+Contact:	Vadim Pasternak <vadimp@mellanox.com>
+Description:	These files show the system reset cause, as following:
+		COMEX thermal shutdown; watchdog power off or reset was derived
+		by one of the next components: COMEX, switch board or by Small Form
+		Factor mezzanine, reset requested from ASIC, reset caused by BIOS
+		reload. Value 1 in file means this is reset cause, 0 - otherwise.
+		Only one of the above causes could be 1 at the same time, representing
+		only last reset cause.
+
+		The files are read only.

@@ -29,4 +29,4 @@ Description:
 		17 - sectors discarded
 		18 - time spent discarding

-		For more details refer to Documentation/iostats.txt
+		For more details refer to Documentation/admin-guide/iostats.rst

@@ -15,7 +15,7 @@ Description:
 		9 - I/Os currently in progress
 		10 - time spent doing I/Os (ms)
 		11 - weighted time spent doing I/Os (ms)
-		For more details refer Documentation/iostats.txt
+		For more details refer Documentation/admin-guide/iostats.rst


 What:		/sys/block/<disk>/<part>/stat

@@ -45,7 +45,7 @@ Description:
 		- Values below -2 are rejected with -EINVAL

 		For more information, see
-		Documentation/laptops/disk-shock-protection.txt
+		Documentation/admin-guide/laptops/disk-shock-protection.rst


 What:		/sys/block/*/device/ncq_prio_enable

@@ -376,10 +376,42 @@ Description:
 		supply. Normally this is configured based on the type of
 		connection made (e.g. A configured SDP should output a maximum
 		of 500mA so the input current limit is set to the same value).
+		Use preferably input_power_limit, and for problems that can be
+		solved using power limit use input_current_limit.

 		Access: Read, Write
 		Valid values: Represented in microamps

+What:		/sys/class/power_supply/<supply_name>/input_voltage_limit
+Date:		May 2019
+Contact:	linux-pm@vger.kernel.org
+Description:
+		This entry configures the incoming VBUS voltage limit currently
+		set in the supply. Normally this is configured based on
+		system-level knowledge or user input (e.g. This is part of the
+		Pixel C's thermal management strategy to effectively limit the
+		input power to 5V when the screen is on to meet Google's skin
+		temperature targets). Note that this feature should not be
+		used for safety critical things.
+		Use preferably input_power_limit, and for problems that can be
+		solved using power limit use input_voltage_limit.
+
+		Access: Read, Write
+		Valid values: Represented in microvolts
+
+What:		/sys/class/power_supply/<supply_name>/input_power_limit
+Date:		May 2019
+Contact:	linux-pm@vger.kernel.org
+Description:
+		This entry configures the incoming power limit currently set
+		in the supply. Normally this is configured based on
+		system-level knowledge or user input. Use preferably this
+		feature to limit the incoming power and use current/voltage
+		limit only for problems that can be solved using power limit.
+
+		Access: Read, Write
+		Valid values: Represented in microwatts
+
 What:		/sys/class/power_supply/<supply_name>/online,
 Date:		May 2007
 Contact:	linux-pm@vger.kernel.org

Documentation/ABI/testing/sysfs-class-power-wilco (new file, 30 lines)
@@ -0,0 +1,30 @@
+What:		/sys/class/power_supply/wilco-charger/charge_type
+Date:		April 2019
+KernelVersion:	5.2
+Description:
+		What charging algorithm to use:
+
+		Standard: Fully charges battery at a standard rate.
+		Adaptive: Battery settings adaptively optimized based on
+		typical battery usage pattern.
+		Fast: Battery charges over a shorter period.
+		Trickle: Extends battery lifespan, intended for users who
+		primarily use their Chromebook while connected to AC.
+		Custom: A low and high threshold percentage is specified.
+		Charging begins when level drops below
+		charge_control_start_threshold, and ceases when
+		level is above charge_control_end_threshold.
+
+What:		/sys/class/power_supply/wilco-charger/charge_control_start_threshold
+Date:		April 2019
+KernelVersion:	5.2
+Description:
+		Used when charge_type="Custom", as described above. Measured in
+		percentages. The valid range is [50, 95].
+
+What:		/sys/class/power_supply/wilco-charger/charge_control_end_threshold
+Date:		April 2019
+KernelVersion:	5.2
+Description:
+		Used when charge_type="Custom", as described above. Measured in
+		percentages. The valid range is [55, 100].

@@ -5,7 +5,7 @@ Contact: linux-pm@vger.kernel.org
 Description:
 		The powercap/ class sub directory belongs to the power cap
 		subsystem. Refer to
-		Documentation/power/powercap/powercap.txt for details.
+		Documentation/power/powercap/powercap.rst for details.

 What:		/sys/class/powercap/<control type>
 Date:		September 2013

@@ -1,6 +1,6 @@
 switchtec - Microsemi Switchtec PCI Switch Management Endpoint

-For details on this subsystem look at Documentation/switchtec.txt.
+For details on this subsystem look at Documentation/driver-api/switchtec.rst.

 What:		/sys/class/switchtec
 Date:		05-Jan-2017

@@ -34,7 +34,7 @@ Description: CPU topology files that describe kernel limits related to
 		present: cpus that have been identified as being present in
 		the system.

-		See Documentation/cputopology.txt for more information.
+		See Documentation/admin-guide/cputopology.rst for more information.


 What:		/sys/devices/system/cpu/probe
@@ -103,7 +103,7 @@ Description: CPU topology files that describe a logical CPU's relationship
 		thread_siblings_list: human-readable list of cpu#'s hardware
 		threads within the same core as cpu#

-		See Documentation/cputopology.txt for more information.
+		See Documentation/admin-guide/cputopology.rst for more information.


 What:		/sys/devices/system/cpu/cpuidle/current_driver

@@ -31,7 +31,7 @@ Description:
 		To control the LED display, use the following :
 		echo 0x0T000DDD > /sys/devices/platform/asus_laptop/
 		where T control the 3 letters display, and DDD the 3 digits display.
-		The DDD table can be found in Documentation/laptops/asus-laptop.txt
+		The DDD table can be found in Documentation/admin-guide/laptops/asus-laptop.rst

 What:		/sys/devices/platform/asus_laptop/bluetooth
 Date:		January 2007

@@ -36,3 +36,13 @@ KernelVersion: 3.5
 Contact:	"AceLan Kao" <acelan.kao@canonical.com>
 Description:
 		Resume on lid open. 1 means on, 0 means off.
+
+What:		/sys/devices/platform/<platform>/fan_boost_mode
+Date:		Sep 2019
+KernelVersion:	5.3
+Contact:	"Yurii Pavlovskyi" <yurii.pavlovskyi@gmail.com>
+Description:
+		Fan boost mode:
+		* 0 - normal,
+		* 1 - overboost,
+		* 2 - silent

@@ -1,7 +1,7 @@
 What:		/sys/devices/platform/<i2c-demux-name>/available_masters
 Date:		January 2016
 KernelVersion:	4.6
-Contact:	Wolfram Sang <wsa@the-dreams.de>
+Contact:	Wolfram Sang <wsa+renesas@sang-engineering.com>
 Description:
 		Reading the file will give you a list of masters which can be
 		selected for a demultiplexed bus. The format is
@@ -12,7 +12,7 @@ Description:
 What:		/sys/devices/platform/<i2c-demux-name>/current_master
 Date:		January 2016
 KernelVersion:	4.6
-Contact:	Wolfram Sang <wsa@the-dreams.de>
+Contact:	Wolfram Sang <wsa+renesas@sang-engineering.com>
 Description:
 		This file selects/shows the active I2C master for a demultiplexed
 		bus. It uses the <index> value from the file 'available_masters'.

@@ -212,7 +212,7 @@ The standard 64-bit addressing device would do something like this::

 If the device only supports 32-bit addressing for descriptors in the
 coherent allocations, but supports full 64-bits for streaming mappings
-it would look like this:
+it would look like this::

 	if (dma_set_mask(dev, DMA_BIT_MASK(64))) {
 		dev_warn(dev, "mydev: No suitable DMA available\n");

@@ -1,58 +0,0 @@
-:orphan:
-
-====
-EDID
-====
-
-In the good old days when graphics parameters were configured explicitly
-in a file called xorg.conf, even broken hardware could be managed.
-
-Today, with the advent of Kernel Mode Setting, a graphics board is
-either correctly working because all components follow the standards -
-or the computer is unusable, because the screen remains dark after
-booting or it displays the wrong area. Cases when this happens are:
-- The graphics board does not recognize the monitor.
-- The graphics board is unable to detect any EDID data.
-- The graphics board incorrectly forwards EDID data to the driver.
-- The monitor sends no or bogus EDID data.
-- A KVM sends its own EDID data instead of querying the connected monitor.
-Adding the kernel parameter "nomodeset" helps in most cases, but causes
-restrictions later on.
-
-As a remedy for such situations, the kernel configuration item
-CONFIG_DRM_LOAD_EDID_FIRMWARE was introduced. It allows to provide an
-individually prepared or corrected EDID data set in the /lib/firmware
-directory from where it is loaded via the firmware interface. The code
-(see drivers/gpu/drm/drm_edid_load.c) contains built-in data sets for
-commonly used screen resolutions (800x600, 1024x768, 1280x1024, 1600x1200,
-1680x1050, 1920x1080) as binary blobs, but the kernel source tree does
-not contain code to create these data. In order to elucidate the origin
-of the built-in binary EDID blobs and to facilitate the creation of
-individual data for a specific misbehaving monitor, commented sources
-and a Makefile environment are given here.
-
-To create binary EDID and C source code files from the existing data
-material, simply type "make".
-
-If you want to create your own EDID file, copy the file 1024x768.S,
-replace the settings with your own data and add a new target to the
-Makefile. Please note that the EDID data structure expects the timing
-values in a different way as compared to the standard X11 format.
-
-X11:
-HTimings:
-  hdisp hsyncstart hsyncend htotal
-VTimings:
-  vdisp vsyncstart vsyncend vtotal
-
-EDID::
-
-#define XPIX hdisp
-#define XBLANK htotal-hdisp
-#define XOFFSET hsyncstart-hdisp
-#define XPULSE hsyncend-hsyncstart
-
-#define YPIX vdisp
-#define YBLANK vtotal-vdisp
-#define YOFFSET vsyncstart-vdisp
-#define YPULSE vsyncend-vsyncstart

@ -1,270 +0,0 @@
|
|||||||
The MSI Driver Guide HOWTO
|
|
||||||
Tom L Nguyen tom.l.nguyen@intel.com
|
|
||||||
10/03/2003
|
|
||||||
Revised Feb 12, 2004 by Martine Silbermann
|
|
||||||
email: Martine.Silbermann@hp.com
|
|
||||||
Revised Jun 25, 2004 by Tom L Nguyen
|
|
||||||
Revised Jul 9, 2008 by Matthew Wilcox <willy@linux.intel.com>
|
|
||||||
Copyright 2003, 2008 Intel Corporation
|
|
||||||
|
|
||||||
1. About this guide
|
|
||||||
|
|
||||||
This guide describes the basics of Message Signaled Interrupts (MSIs),
|
|
||||||
the advantages of using MSI over traditional interrupt mechanisms, how
|
|
||||||
to change your driver to use MSI or MSI-X and some basic diagnostics to
|
|
||||||
try if a device doesn't support MSIs.
|
|
||||||
|
|
||||||
|
|
||||||
2. What are MSIs?
|
|
||||||
|
|
||||||
A Message Signaled Interrupt is a write from the device to a special
|
|
||||||
address which causes an interrupt to be received by the CPU.
|
|
||||||
|
|
||||||
The MSI capability was first specified in PCI 2.2 and was later enhanced
|
|
||||||
in PCI 3.0 to allow each interrupt to be masked individually. The MSI-X
|
|
||||||
capability was also introduced with PCI 3.0. It supports more interrupts
|
|
||||||
per device than MSI and allows interrupts to be independently configured.
|
|
||||||
|
|
||||||
Devices may support both MSI and MSI-X, but only one can be enabled at
|
|
||||||
a time.
|
|
||||||
|
|
||||||
|
|
||||||
3. Why use MSIs?
|
|
||||||
|
|
||||||
There are three reasons why using MSIs can give an advantage over
|
|
||||||
traditional pin-based interrupts.
|
|
||||||
|
|
||||||
Pin-based PCI interrupts are often shared amongst several devices.
|
|
||||||
To support this, the kernel must call each interrupt handler associated
|
|
||||||
with an interrupt, which leads to reduced performance for the system as
|
|
||||||
a whole. MSIs are never shared, so this problem cannot arise.
|
|
||||||
|
|
||||||
When a device writes data to memory, then raises a pin-based interrupt,
|
|
||||||
it is possible that the interrupt may arrive before all the data has
|
|
||||||
arrived in memory (this becomes more likely with devices behind PCI-PCI
|
|
||||||
bridges). In order to ensure that all the data has arrived in memory,
|
|
||||||
the interrupt handler must read a register on the device which raised
|
|
||||||
the interrupt. PCI transaction ordering rules require that all the data
|
|
||||||
arrive in memory before the value may be returned from the register.
|
|
||||||
Using MSIs avoids this problem as the interrupt-generating write cannot
|
|
||||||
pass the data writes, so by the time the interrupt is raised, the driver
|
|
||||||
knows that all the data has arrived in memory.
|
|
||||||
|
|
||||||
PCI devices can only support a single pin-based interrupt per function.
|
|
||||||
Often drivers have to query the device to find out what event has
|
|
||||||
occurred, slowing down interrupt handling for the common case. With
|
|
||||||
MSIs, a device can support more interrupts, allowing each interrupt
|
|
||||||
to be specialised to a different purpose. One possible design gives
|
|
||||||
infrequent conditions (such as errors) their own interrupt which allows
|
|
||||||
the driver to handle the normal interrupt handling path more efficiently.
|
|
||||||
Other possible designs include giving one interrupt to each packet queue
|
|
||||||
in a network card or each port in a storage controller.
|
|
||||||
|
|
||||||
|
|
||||||
4. How to use MSIs
|
|
||||||
|
|
||||||
PCI devices are initialised to use pin-based interrupts. The device
|
|
||||||
driver has to set up the device to use MSI or MSI-X. Not all machines
|
|
||||||
support MSIs correctly, and for those machines, the APIs described below
|
|
||||||
will simply fail and the device will continue to use pin-based interrupts.
|
|
||||||
|
|
||||||
4.1 Include kernel support for MSIs
|
|
||||||
|
|
||||||
To support MSI or MSI-X, the kernel must be built with the CONFIG_PCI_MSI
|
|
||||||
option enabled. This option is only available on some architectures,
|
|
||||||
and it may depend on some other options also being set. For example,
|
|
||||||
on x86, you must also enable X86_UP_APIC or SMP in order to see the
|
|
||||||
CONFIG_PCI_MSI option.
|
|
||||||
|
|
||||||
4.2 Using MSI
|
|
||||||
|
|
||||||
Most of the hard work is done for the driver in the PCI layer. The driver
|
|
||||||
simply has to request that the PCI layer set up the MSI capability for this
|
|
||||||
device.
|
|
||||||
|
|
||||||
To automatically use MSI or MSI-X interrupt vectors, use the following
|
|
||||||
function:
|
|
||||||
|
|
||||||
int pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
|
|
||||||
unsigned int max_vecs, unsigned int flags);
|
|
||||||
|
|
||||||
which allocates up to max_vecs interrupt vectors for a PCI device. It
|
|
||||||
returns the number of vectors allocated or a negative error. If the device
|
|
||||||
has a requirements for a minimum number of vectors the driver can pass a
|
|
||||||
min_vecs argument set to this limit, and the PCI core will return -ENOSPC
|
|
||||||
if it can't meet the minimum number of vectors.
|
|
||||||
|
|
||||||
The flags argument is used to specify which type of interrupt can be used
|
|
||||||
by the device and the driver (PCI_IRQ_LEGACY, PCI_IRQ_MSI, PCI_IRQ_MSIX).
|
|
||||||
A convenient short-hand (PCI_IRQ_ALL_TYPES) is also available to ask for
|
|
||||||
any possible kind of interrupt. If the PCI_IRQ_AFFINITY flag is set,
|
|
||||||
pci_alloc_irq_vectors() will spread the interrupts around the available CPUs.
|
|
||||||
|
|
||||||
To get the Linux IRQ numbers passed to request_irq() and free_irq() and the
|
|
||||||
vectors, use the following function:
|
|
||||||
|
|
||||||
int pci_irq_vector(struct pci_dev *dev, unsigned int nr);
|
|
||||||
|
|
||||||
Any allocated resources should be freed before removing the device using
|
|
||||||
the following function:
|
|
||||||
|
|
||||||
void pci_free_irq_vectors(struct pci_dev *dev);
|
|
||||||
|
|
||||||
If a device supports both MSI-X and MSI capabilities, this API will use the
|
|
||||||
MSI-X facilities in preference to the MSI facilities. MSI-X supports any
|
|
||||||
number of interrupts between 1 and 2048. In contrast, MSI is restricted to
|
|
||||||
a maximum of 32 interrupts (and must be a power of two). In addition, the
|
|
||||||
MSI interrupt vectors must be allocated consecutively, so the system might
|
|
||||||
not be able to allocate as many vectors for MSI as it could for MSI-X. On
|
|
||||||
some platforms, MSI interrupts must all be targeted at the same set of CPUs
|
|
||||||
whereas MSI-X interrupts can all be targeted at different CPUs.
|
|
||||||
|
|
||||||
If a device supports neither MSI-X or MSI it will fall back to a single
|
|
||||||
legacy IRQ vector.
|
|
||||||
|
|
||||||
The typical usage of MSI or MSI-X interrupts is to allocate as many vectors
|
|
||||||
as possible, likely up to the limit supported by the device. If nvec is
|
|
||||||
larger than the number supported by the device it will automatically be
|
|
||||||
capped to the supported limit, so there is no need to query the number of
|
|
||||||
vectors supported beforehand:
|
|
||||||
|
|
||||||
nvec = pci_alloc_irq_vectors(pdev, 1, nvec, PCI_IRQ_ALL_TYPES)
|
|
||||||
if (nvec < 0)
|
|
||||||
goto out_err;
|
|
||||||
|
|
||||||
If a driver is unable or unwilling to deal with a variable number of MSI
|
|
||||||
interrupts it can request a particular number of interrupts by passing that
|
|
||||||
number to pci_alloc_irq_vectors() function as both 'min_vecs' and
|
|
||||||
'max_vecs' parameters:
|
|
||||||
|
|
||||||
ret = pci_alloc_irq_vectors(pdev, nvec, nvec, PCI_IRQ_ALL_TYPES);
|
|
||||||
if (ret < 0)
|
|
||||||
goto out_err;
|
|
||||||
|
|
||||||
The most notorious example of the request type described above is enabling
|
|
||||||
the single MSI mode for a device. It could be done by passing two 1s as
|
|
||||||
'min_vecs' and 'max_vecs':
|
|
||||||
|
|
||||||
ret = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_ALL_TYPES);
|
|
||||||
if (ret < 0)
|
|
||||||
goto out_err;
|
|
||||||
|
|
||||||
Some devices might not support using legacy line interrupts, in which case
|
|
||||||
the driver can specify that only MSI or MSI-X is acceptable:
|
|
||||||
|
|
||||||
nvec = pci_alloc_irq_vectors(pdev, 1, nvec, PCI_IRQ_MSI | PCI_IRQ_MSIX);
|
|
||||||
if (nvec < 0)
|
|
||||||
goto out_err;
|
|
||||||
|
|
||||||
4.3 Legacy APIs
|
|
||||||
|
|
||||||
The following old APIs to enable and disable MSI or MSI-X interrupts should
|
|
||||||
not be used in new code:
|
|
||||||
|
|
||||||
pci_enable_msi() /* deprecated */
|
|
||||||
pci_disable_msi() /* deprecated */
|
|
||||||
pci_enable_msix_range() /* deprecated */
|
|
||||||
pci_enable_msix_exact() /* deprecated */
|
|
||||||
pci_disable_msix() /* deprecated */
|
|
||||||
|
|
||||||
Additionally there are APIs to provide the number of supported MSI or MSI-X
|
|
||||||
vectors: pci_msi_vec_count() and pci_msix_vec_count(). In general these
|
|
||||||
should be avoided in favor of letting pci_alloc_irq_vectors() cap the
|
|
||||||
number of vectors. If you have a legitimate special use case for the count
|
|
||||||
of vectors we might have to revisit that decision and add a
|
|
||||||
pci_nr_irq_vectors() helper that handles MSI and MSI-X transparently.
|
|
||||||
|
|
||||||
4.4 Considerations when using MSIs
|
|
||||||
|
|
||||||
4.4.1 Spinlocks
|
|
||||||
|
|
||||||
Most device drivers have a per-device spinlock which is taken in the
|
|
||||||
interrupt handler. With pin-based interrupts or a single MSI, it is not
|
|
||||||
necessary to disable interrupts (Linux guarantees the same interrupt will
|
|
||||||
not be re-entered). If a device uses multiple interrupts, the driver
|
|
||||||
must disable interrupts while the lock is held. If the device sends
|
|
||||||
a different interrupt, the driver will deadlock trying to recursively
|
|
||||||
acquire the spinlock. Such deadlocks can be avoided by using
|
|
||||||
spin_lock_irqsave() or spin_lock_irq() which disable local interrupts
|
|
||||||
and acquire the lock (see Documentation/kernel-hacking/locking.rst).
|
|
||||||
|
|
||||||
4.5 How to tell whether MSI/MSI-X is enabled on a device
|
|
||||||
|
|
||||||
Using 'lspci -v' (as root) may show some devices with "MSI", "Message
|
|
||||||
Signalled Interrupts" or "MSI-X" capabilities. Each of these capabilities
|
|
||||||
has an 'Enable' flag which is followed with either "+" (enabled)
|
|
||||||
or "-" (disabled).
|
|
||||||
|
|
||||||
|
|
||||||
5. MSI quirks

Several PCI chipsets or devices are known not to support MSIs.
The PCI stack provides three ways to disable MSIs:

1. globally
2. on all devices behind a specific bridge
3. on a single device

5.1. Disabling MSIs globally

Some host chipsets simply don't support MSIs properly.  If we're
lucky, the manufacturer knows this and has indicated it in the ACPI
FADT table.  In this case, Linux automatically disables MSIs.
Some boards don't include this information in the table and so we have
to detect them ourselves.  The complete list of these is found near the
quirk_disable_all_msi() function in drivers/pci/quirks.c.

If you have a board which has problems with MSIs, you can pass pci=nomsi
on the kernel command line to disable MSIs on all devices.  It would be
in your best interests to report the problem to linux-pci@vger.kernel.org
including a full 'lspci -v' so we can add the quirks to the kernel.

5.2. Disabling MSIs below a bridge

Some PCI bridges are not able to route MSIs between busses properly.
In this case, MSIs must be disabled on all devices behind the bridge.

Some bridges allow you to enable MSIs by changing some bits in their
PCI configuration space (especially the Hypertransport chipsets such
as the nVidia nForce and Serverworks HT2000).  As with host chipsets,
Linux mostly knows about them and automatically enables MSIs if it can.
If you have a bridge unknown to Linux, you can enable
MSIs in configuration space using whatever method you know works, then
enable MSIs on that bridge by doing:

	echo 1 > /sys/bus/pci/devices/$bridge/msi_bus

where $bridge is the PCI address of the bridge you've enabled (e.g.
0000:00:0e.0).

To disable MSIs, echo 0 instead of 1.  Changing this value should be
done with caution as it could break interrupt handling for all devices
below this bridge.

Again, please notify linux-pci@vger.kernel.org of any bridges that need
special handling.

5.3. Disabling MSIs on a single device

Some devices are known to have faulty MSI implementations.  Usually this
is handled in the individual device driver, but occasionally it's necessary
to handle this with a quirk.  Some drivers have an option to disable use
of MSI.  While this is a convenient workaround for the driver author,
it is not good practice, and should not be emulated.

5.4. Finding why MSIs are disabled on a device

From the above three sections, you can see that there are many reasons
why MSIs may not be enabled for a given device.  Your first step should
be to examine your dmesg carefully to determine whether MSIs are enabled
for your machine.  You should also check your .config to be sure you
have enabled CONFIG_PCI_MSI.

Then, 'lspci -t' gives the list of bridges above a device.  Reading
/sys/bus/pci/devices/*/msi_bus will tell you whether MSIs are enabled (1)
or disabled (0).  If 0 is found in any of the msi_bus files belonging
to bridges between the PCI root and the device, MSIs are disabled.

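This walk over the msi_bus files can be scripted.  The sketch below builds
a throwaway mock of the sysfs layout so it runs anywhere; on a real system
you would iterate over /sys/bus/pci/devices instead, and the bridge
addresses used here are hypothetical.

```shell
# Print the msi_bus flag of every bridge in a (mock) sysfs tree.
root=$(mktemp -d)                       # stand-in for /sys/bus/pci/devices
mkdir -p "$root/0000:00:1c.0" "$root/0000:00:1c.4"
echo 1 > "$root/0000:00:1c.0/msi_bus"   # MSIs allowed below this bridge
echo 0 > "$root/0000:00:1c.4/msi_bus"   # MSIs disabled below this one
for f in "$root"/*/msi_bus; do
	printf '%s: %s\n' "$(basename "$(dirname "$f")")" "$(cat "$f")"
done
```

A 0 printed for any bridge on the path between the PCI root and the device
means MSIs are disabled for everything below that bridge.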
It is also worth checking the device driver to see whether it supports MSIs.
For example, it may contain calls to pci_alloc_irq_vectors() with the
PCI_IRQ_MSI or PCI_IRQ_MSIX flags.
@@ -1,198 +0,0 @@
The PCI Express Port Bus Driver Guide HOWTO
	Tom L Nguyen tom.l.nguyen@intel.com
	11/03/2004

1. About this guide

This guide describes the basics of the PCI Express Port Bus driver
and provides information on how to enable the service drivers to
register/unregister with the PCI Express Port Bus Driver.

2. Copyright 2004 Intel Corporation

3. What is the PCI Express Port Bus Driver

A PCI Express Port is a logical PCI-PCI Bridge structure.  There
are two types of PCI Express Port: the Root Port and the Switch
Port.  The Root Port originates a PCI Express link from a PCI Express
Root Complex and the Switch Port connects PCI Express links to
internal logical PCI buses.  The Switch Port, which has its secondary
bus representing the switch's internal routing logic, is called the
switch's Upstream Port.  A switch's Downstream Port bridges from the
switch's internal routing bus to a bus representing the downstream
PCI Express link from the PCI Express Switch.

A PCI Express Port can provide up to four distinct functions,
referred to in this document as services, depending on its port type.
A PCI Express Port's services include native hotplug support (HP),
power management event support (PME), advanced error reporting
support (AER), and virtual channel support (VC).  These services may
be handled by a single complex driver or be individually distributed
and handled by corresponding service drivers.

4. Why use the PCI Express Port Bus Driver?

In existing Linux kernels, the Linux Device Driver Model allows a
physical device to be handled by only a single driver.  The PCI
Express Port is a PCI-PCI Bridge device with multiple distinct
services.  To maintain a clean and simple solution, each service
may have its own software service driver.  In this case several
service drivers will compete for a single PCI-PCI Bridge device.
For example, if the PCI Express Root Port native hotplug service
driver is loaded first, it claims a PCI-PCI Bridge Root Port.  The
kernel therefore does not load other service drivers for that Root
Port.  In other words, it is impossible to have multiple service
drivers load and run on a PCI-PCI Bridge device simultaneously
using the current driver model.

Enabling multiple service drivers to run simultaneously requires
a PCI Express Port Bus driver, which manages all populated
PCI Express Ports and distributes all provided service requests
to the corresponding service drivers as required.  Some key
advantages of using the PCI Express Port Bus driver are listed below:

- Allow multiple service drivers to run simultaneously on
  a PCI-PCI Bridge Port device.

- Allow service drivers to be implemented in an independent,
  staged approach.

- Allow one service driver to run on multiple PCI-PCI Bridge
  Port devices.

- Manage and distribute resources of a PCI-PCI Bridge Port
  device to requested service drivers.

5. Configuring the PCI Express Port Bus Driver vs. Service Drivers

5.1 Including the PCI Express Port Bus Driver Support into the Kernel

Including the PCI Express Port Bus driver depends on whether PCI
Express support is included in the kernel config.  The kernel will
automatically include the PCI Express Port Bus driver as a kernel
driver when PCI Express support is enabled in the kernel.

5.2 Enabling Service Driver Support

PCI device drivers are implemented based on the Linux Device Driver Model.
All service drivers are PCI device drivers.  As discussed above, it is
impossible to load any service driver once the kernel has loaded the
PCI Express Port Bus Driver.  Meeting the PCI Express Port Bus Driver
Model requires some minimal changes to existing service drivers; these
changes have no impact on the functionality of existing service drivers.

A service driver is required to use the two APIs shown below to
register its service with the PCI Express Port Bus driver (see
sections 5.2.1 & 5.2.2).  It is important that a service driver
initializes the pcie_port_service_driver data structure, included in
header file /include/linux/pcieport_if.h, before calling these APIs.
Failure to do so will result in an identity mismatch, which prevents
the PCI Express Port Bus driver from loading a service driver.

5.2.1 pcie_port_service_register

int pcie_port_service_register(struct pcie_port_service_driver *new)

This API replaces the Linux Driver Model's pci_register_driver API.  A
service driver should always call pcie_port_service_register at
module init.  Note that after a service driver is loaded, calls
such as pci_enable_device(dev) and pci_set_master(dev) are no longer
necessary since these calls are executed by the PCI Port Bus driver.

5.2.2 pcie_port_service_unregister

void pcie_port_service_unregister(struct pcie_port_service_driver *new)

pcie_port_service_unregister replaces the Linux Driver Model's
pci_unregister_driver.  It is always called by a service driver when a
module exits.

5.2.3 Sample Code

Below is sample service driver code to initialize the port service
driver data structure.

	static struct pcie_port_service_id service_id[] = { {
		.vendor = PCI_ANY_ID,
		.device = PCI_ANY_ID,
		.port_type = PCIE_RC_PORT,
		.service_type = PCIE_PORT_SERVICE_AER,
		}, { /* end: all zeroes */ }
	};

	static struct pcie_port_service_driver root_aerdrv = {
		.name		= (char *)device_name,
		.id_table	= &service_id[0],

		.probe		= aerdrv_load,
		.remove		= aerdrv_unload,

		.suspend	= aerdrv_suspend,
		.resume		= aerdrv_resume,
	};

Below is sample code for registering/unregistering a service
driver.

	static int __init aerdrv_service_init(void)
	{
		int retval = 0;

		retval = pcie_port_service_register(&root_aerdrv);
		if (!retval) {
			/*
			 * FIX ME
			 */
		}
		return retval;
	}

	static void __exit aerdrv_service_exit(void)
	{
		pcie_port_service_unregister(&root_aerdrv);
	}

	module_init(aerdrv_service_init);
	module_exit(aerdrv_service_exit);

6. Possible Resource Conflicts

Since all service drivers of a PCI-PCI Bridge Port device are
allowed to run simultaneously, a few possible resource conflicts and
proposed solutions are listed below.

6.1 MSI and MSI-X Vector Resource

Once MSI or MSI-X interrupts are enabled on a device, it stays in this
mode until they are disabled again.  Since service drivers of the same
PCI-PCI Bridge port share the same physical device, if an individual
service driver enables or disables MSI/MSI-X mode it may result in
unpredictable behavior.

To avoid this situation, no service driver is permitted to switch
the interrupt mode on its device.  The PCI Express Port Bus driver
is responsible for determining the interrupt mode and this should be
transparent to service drivers.  Service drivers need to know only
the vector IRQ assigned to the field irq of struct pcie_device, which
is passed in when the PCI Express Port Bus driver probes each service
driver.  Service drivers should use (struct pcie_device*)dev->irq to
call request_irq/free_irq.  In addition, the interrupt mode is stored
in the field interrupt_mode of struct pcie_device.

6.3 PCI Memory/IO Mapped Regions

Service drivers for PCI Express Power Management (PME), Advanced
Error Reporting (AER), Hot-Plug (HP) and Virtual Channel (VC) access
PCI configuration space on the PCI Express port.  In all cases the
registers accessed are independent of each other.  This patch assumes
that all service drivers will be well behaved and not overwrite
other service drivers' configuration settings.

6.4 PCI Config Registers

Each service driver runs its PCI config operations on its own
capability structure except the PCI Express capability structure, in
which the Root Control register and Device Control register are shared
between PME and AER.  This patch assumes that all service drivers
will be well behaved and not overwrite other service drivers'
configuration settings.
Documentation/PCI/acpi-info.rst (new file, 192 lines)
@@ -0,0 +1,192 @@
.. SPDX-License-Identifier: GPL-2.0

========================================
ACPI considerations for PCI host bridges
========================================

The general rule is that the ACPI namespace should describe everything the
OS might use unless there's another way for the OS to find it [1, 2].

For example, there's no standard hardware mechanism for enumerating PCI
host bridges, so the ACPI namespace must describe each host bridge, the
method for accessing PCI config space below it, the address space windows
the host bridge forwards to PCI (using _CRS), and the routing of legacy
INTx interrupts (using _PRT).

PCI devices, which are below the host bridge, generally do not need to be
described via ACPI.  The OS can discover them via the standard PCI
enumeration mechanism, using config accesses to discover and identify
devices and read and size their BARs.  However, ACPI may describe PCI
devices if it provides power management or hotplug functionality for them
or if the device has INTx interrupts connected by platform interrupt
controllers and a _PRT is needed to describe those connections.

ACPI resource description is done via _CRS objects of devices in the ACPI
namespace [2].  The _CRS is like a generalized PCI BAR: the OS can read
_CRS and figure out what resource is being consumed even if it doesn't have
a driver for the device [3].  That's important because it means an old OS
can work correctly even on a system with new devices unknown to the OS.
The new devices might not do anything, but the OS can at least make sure no
resources conflict with them.

Static tables like MCFG, HPET, ECDT, etc., are *not* mechanisms for
reserving address space.  The static tables are for things the OS needs to
know early in boot, before it can parse the ACPI namespace.  If a new table
is defined, an old OS needs to operate correctly even though it ignores the
table.  _CRS allows that because it is generic and understood by the old
OS; a static table does not.

If the OS is expected to manage a non-discoverable device described via
ACPI, that device will have a specific _HID/_CID that tells the OS what
driver to bind to it, and the _CRS tells the OS and the driver where the
device's registers are.

PCI host bridges are PNP0A03 or PNP0A08 devices.  Their _CRS should
describe all the address space they consume.  This includes all the windows
they forward down to the PCI bus, as well as registers of the host bridge
itself that are not forwarded to PCI.  The host bridge registers include
things like secondary/subordinate bus registers that determine the bus
range below the bridge, window registers that describe the apertures, etc.
These are all device-specific, non-architected things, so the only way a
PNP0A03/PNP0A08 driver can manage them is via _PRS/_CRS/_SRS, which contain
the device-specific details.  The host bridge registers also include ECAM
space, since it is consumed by the host bridge.

ACPI defines a Consumer/Producer bit to distinguish the bridge registers
("Consumer") from the bridge apertures ("Producer") [4, 5], but early
BIOSes didn't use that bit correctly.  The result is that the current ACPI
spec defines Consumer/Producer only for the Extended Address Space
descriptors; the bit should be ignored in the older QWord/DWord/Word
Address Space descriptors.  Consequently, OSes have to assume all
QWord/DWord/Word descriptors are windows.

Prior to the addition of Extended Address Space descriptors, the failure of
Consumer/Producer meant there was no way to describe bridge registers in
the PNP0A03/PNP0A08 device itself.  The workaround was to describe the
bridge registers (including ECAM space) in PNP0C02 catch-all devices [6].
With the exception of ECAM, the bridge register space is device-specific
anyway, so the generic PNP0A03/PNP0A08 driver (pci_root.c) has no need to
know about it.

New architectures should be able to use "Consumer" Extended Address Space
descriptors in the PNP0A03 device for bridge registers, including ECAM,
although a strict interpretation of [6] might prohibit this.  Old x86 and
ia64 kernels assume all address space descriptors, including "Consumer"
Extended Address Space ones, are windows, so it would not be safe to
describe bridge registers this way on those architectures.

PNP0C02 "motherboard" devices are basically a catch-all.  There's no
programming model for them other than "don't use these resources for
anything else."  So a PNP0C02 _CRS should claim any address space that is
(1) not claimed by _CRS under any other device object in the ACPI namespace
and (2) should not be assigned by the OS to something else.

The PCIe spec requires the Enhanced Configuration Access Method (ECAM)
unless there's a standard firmware interface for config access, e.g., the
ia64 SAL interface [7].  A host bridge consumes ECAM memory address space
and converts memory accesses into PCI configuration accesses.  The spec
defines the ECAM address space layout and functionality; only the base of
the address space is device-specific.  An ACPI OS learns the base address
from either the static MCFG table or a _CBA method in the PNP0A03 device.

The MCFG table must describe the ECAM space of non-hot pluggable host
bridges [8].  Since MCFG is a static table and can't be updated by hotplug,
a _CBA method in the PNP0A03 device describes the ECAM space of a
hot-pluggable host bridge [9].  Note that for both MCFG and _CBA, the base
address always corresponds to bus 0, even if the bus range below the bridge
(which is reported via _CRS) doesn't start at 0.

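As an aside, the two discovery paths can be distinguished on a running
Linux system by checking whether firmware exported a static MCFG table at
all.  The sketch below uses a mock directory so it runs anywhere; on a real
system the exported tables appear under /sys/firmware/acpi/tables.  This is
an illustrative check, not part of the spec text above.

```shell
# Decide which ECAM discovery path applies, using a mock tables dir.
tables=$(mktemp -d)                   # stand-in for /sys/firmware/acpi/tables
touch "$tables/FACP" "$tables/MCFG"   # pretend firmware exported these
if [ -e "$tables/MCFG" ]; then
	echo "MCFG present: static ECAM base addresses"
else
	echo "no MCFG: look for _CBA on PNP0A03 devices"
fi
```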

[1] ACPI 6.2, sec 6.1:
    For any device that is on a non-enumerable type of bus (for example, an
    ISA bus), OSPM enumerates the devices' identifier(s) and the ACPI
    system firmware must supply an _HID object ... for each device to
    enable OSPM to do that.

[2] ACPI 6.2, sec 3.7:
    The OS enumerates motherboard devices simply by reading through the
    ACPI Namespace looking for devices with hardware IDs.

    Each device enumerated by ACPI includes ACPI-defined objects in the
    ACPI Namespace that report the hardware resources the device could
    occupy [_PRS], an object that reports the resources that are currently
    used by the device [_CRS], and objects for configuring those resources
    [_SRS].  The information is used by the Plug and Play OS (OSPM) to
    configure the devices.

[3] ACPI 6.2, sec 6.2:
    OSPM uses device configuration objects to configure hardware resources
    for devices enumerated via ACPI.  Device configuration objects provide
    information about current and possible resource requirements, the
    relationship between shared resources, and methods for configuring
    hardware resources.

    When OSPM enumerates a device, it calls _PRS to determine the resource
    requirements of the device.  It may also call _CRS to find the current
    resource settings for the device.  Using this information, the Plug and
    Play system determines what resources the device should consume and
    sets those resources by calling the device's _SRS control method.

    In ACPI, devices can consume resources (for example, legacy keyboards),
    provide resources (for example, a proprietary PCI bridge), or do both.
    Unless otherwise specified, resources for a device are assumed to be
    taken from the nearest matching resource above the device in the device
    hierarchy.

[4] ACPI 6.2, sec 6.4.3.5.1, 2, 3, 4:
    QWord/DWord/Word Address Space Descriptor (.1, .2, .3)
      General Flags: Bit [0] Ignored

    Extended Address Space Descriptor (.4)
      General Flags: Bit [0] Consumer/Producer:

        * 1 – This device consumes this resource
        * 0 – This device produces and consumes this resource

[5] ACPI 6.2, sec 19.6.43:
    ResourceUsage specifies whether the Memory range is consumed by
    this device (ResourceConsumer) or passed on to child devices
    (ResourceProducer).  If nothing is specified, then
    ResourceConsumer is assumed.

[6] PCI Firmware 3.2, sec 4.1.2:
    If the operating system does not natively comprehend reserving the
    MMCFG region, the MMCFG region must be reserved by firmware.  The
    address range reported in the MCFG table or by _CBA method (see Section
    4.1.3) must be reserved by declaring a motherboard resource.  For most
    systems, the motherboard resource would appear at the root of the ACPI
    namespace (under \_SB) in a node with a _HID of EISAID (PNP0C02), and
    the resources in this case should not be claimed in the root PCI bus's
    _CRS.  The resources can optionally be returned in Int15 E820 or
    EFIGetMemoryMap as reserved memory but must always be reported through
    ACPI as a motherboard resource.

[7] PCI Express 4.0, sec 7.2.2:
    For systems that are PC-compatible, or that do not implement a
    processor-architecture-specific firmware interface standard that allows
    access to the Configuration Space, the ECAM is required as defined in
    this section.

[8] PCI Firmware 3.2, sec 4.1.2:
    The MCFG table is an ACPI table that is used to communicate the base
    addresses corresponding to the non-hot removable PCI Segment Groups
    range within a PCI Segment Group available to the operating system at
    boot.  This is required for the PC-compatible systems.

    The MCFG table is only used to communicate the base addresses
    corresponding to the PCI Segment Groups available to the system at
    boot.

[9] PCI Firmware 3.2, sec 4.1.3:
    The _CBA (Memory mapped Configuration Base Address) control method is
    an optional ACPI object that returns the 64-bit memory mapped
    configuration base address for the hot plug capable host bridge.  The
    base address returned by _CBA is processor-relative address.  The _CBA
    control method evaluates to an Integer.

    This control method appears under a host bridge object.  When the _CBA
    method appears under an active host bridge object, the operating system
    evaluates this structure to identify the memory mapped configuration
    base address corresponding to the PCI Segment Group for the bus number
    range specified in _CRS method.  An ACPI name space object that contains
    the _CBA method must also contain a corresponding _SEG method.
@ -1,187 +0,0 @@
|
|||||||
ACPI considerations for PCI host bridges
|
|
||||||
|
|
||||||
The general rule is that the ACPI namespace should describe everything the
|
|
||||||
OS might use unless there's another way for the OS to find it [1, 2].
|
|
||||||
|
|
||||||
For example, there's no standard hardware mechanism for enumerating PCI
|
|
||||||
host bridges, so the ACPI namespace must describe each host bridge, the
|
|
||||||
method for accessing PCI config space below it, the address space windows
|
|
||||||
the host bridge forwards to PCI (using _CRS), and the routing of legacy
|
|
||||||
INTx interrupts (using _PRT).
|
|
||||||
|
|
||||||
PCI devices, which are below the host bridge, generally do not need to be
|
|
||||||
described via ACPI. The OS can discover them via the standard PCI
|
|
||||||
enumeration mechanism, using config accesses to discover and identify
|
|
||||||
devices and read and size their BARs. However, ACPI may describe PCI
|
|
||||||
devices if it provides power management or hotplug functionality for them
|
|
||||||
or if the device has INTx interrupts connected by platform interrupt
|
|
||||||
controllers and a _PRT is needed to describe those connections.
|
|
||||||
|
|
||||||
ACPI resource description is done via _CRS objects of devices in the ACPI
|
|
||||||
namespace [2]. The _CRS is like a generalized PCI BAR: the OS can read
|
|
||||||
_CRS and figure out what resource is being consumed even if it doesn't have
|
|
||||||
a driver for the device [3]. That's important because it means an old OS
|
|
||||||
can work correctly even on a system with new devices unknown to the OS.
|
|
||||||
The new devices might not do anything, but the OS can at least make sure no
|
|
||||||
resources conflict with them.
|
|
||||||
|
|
||||||
Static tables like MCFG, HPET, ECDT, etc., are *not* mechanisms for
|
|
||||||
reserving address space. The static tables are for things the OS needs to
|
|
||||||
know early in boot, before it can parse the ACPI namespace. If a new table
|
|
||||||
is defined, an old OS needs to operate correctly even though it ignores the
|
|
||||||
table. _CRS allows that because it is generic and understood by the old
|
|
||||||
OS; a static table does not.
|
|
||||||
|
|
||||||
If the OS is expected to manage a non-discoverable device described via
|
|
||||||
ACPI, that device will have a specific _HID/_CID that tells the OS what
|
|
||||||
driver to bind to it, and the _CRS tells the OS and the driver where the
|
|
||||||
device's registers are.
|
|
||||||
|
|
||||||
PCI host bridges are PNP0A03 or PNP0A08 devices. Their _CRS should
|
|
||||||
describe all the address space they consume. This includes all the windows
|
|
||||||
they forward down to the PCI bus, as well as registers of the host bridge
|
|
||||||
itself that are not forwarded to PCI. The host bridge registers include
|
|
||||||
things like secondary/subordinate bus registers that determine the bus
|
|
||||||
range below the bridge, window registers that describe the apertures, etc.
|
|
||||||
These are all device-specific, non-architected things, so the only way a
|
|
||||||
PNP0A03/PNP0A08 driver can manage them is via _PRS/_CRS/_SRS, which contain
|
|
||||||
the device-specific details. The host bridge registers also include ECAM
|
|
||||||
space, since it is consumed by the host bridge.
|
|
||||||
|
|
||||||
ACPI defines a Consumer/Producer bit to distinguish the bridge registers
|
|
||||||
("Consumer") from the bridge apertures ("Producer") [4, 5], but early
|
|
||||||
BIOSes didn't use that bit correctly. The result is that the current ACPI
|
|
||||||
spec defines Consumer/Producer only for the Extended Address Space
|
|
||||||
descriptors; the bit should be ignored in the older QWord/DWord/Word
|
|
||||||
Address Space descriptors. Consequently, OSes have to assume all
|
|
||||||
QWord/DWord/Word descriptors are windows.
|
|
||||||
|
|
||||||
Prior to the addition of Extended Address Space descriptors, the failure of
|
|
||||||
Consumer/Producer meant there was no way to describe bridge registers in
|
|
||||||
the PNP0A03/PNP0A08 device itself. The workaround was to describe the
|
|
||||||
bridge registers (including ECAM space) in PNP0C02 catch-all devices [6].
|
|
||||||
With the exception of ECAM, the bridge register space is device-specific
anyway, so the generic PNP0A03/PNP0A08 driver (pci_root.c) has no need to
know about it.

New architectures should be able to use "Consumer" Extended Address Space
descriptors in the PNP0A03 device for bridge registers, including ECAM,
although a strict interpretation of [6] might prohibit this. Old x86 and
ia64 kernels assume all address space descriptors, including "Consumer"
Extended Address Space ones, are windows, so it would not be safe to
describe bridge registers this way on those architectures.

PNP0C02 "motherboard" devices are basically a catch-all. There's no
programming model for them other than "don't use these resources for
anything else." So a PNP0C02 _CRS should claim any address space that is
(1) not claimed by _CRS under any other device object in the ACPI namespace
and (2) should not be assigned by the OS to something else.

The PCIe spec requires the Enhanced Configuration Access Method (ECAM)
unless there's a standard firmware interface for config access, e.g., the
ia64 SAL interface [7]. A host bridge consumes ECAM memory address space
and converts memory accesses into PCI configuration accesses. The spec
defines the ECAM address space layout and functionality; only the base of
the address space is device-specific. An ACPI OS learns the base address
from either the static MCFG table or a _CBA method in the PNP0A03 device.

The MCFG table must describe the ECAM space of non-hot pluggable host
bridges [8]. Since MCFG is a static table and can't be updated by hotplug,
a _CBA method in the PNP0A03 device describes the ECAM space of a
hot-pluggable host bridge [9]. Note that for both MCFG and _CBA, the base
address always corresponds to bus 0, even if the bus range below the bridge
(which is reported via _CRS) doesn't start at 0.
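Because the ECAM layout is fixed, only the base address needs to be
discovered (via MCFG or _CBA); the config address of any function is then
pure arithmetic. A minimal sketch, using a helper name of our own (the bit
positions come from the PCIe spec's ECAM definition):

```c
#include <stdint.h>

/* ECAM gives every PCI function a fixed 4 KiB configuration window at
 * base + offset, where offset packs bus[27:20], device[19:15],
 * function[14:12] and register[11:0]. Only `base` is device-specific. */
uint64_t ecam_addr(uint64_t base, uint8_t bus, uint8_t dev,
                   uint8_t fn, uint16_t reg)
{
	uint64_t offset = ((uint64_t)bus << 20) |
			  ((uint64_t)(dev & 0x1f) << 15) |
			  ((uint64_t)(fn & 0x7) << 12) |
			  (reg & 0xfff);
	return base + offset;
}
```

One consequence of this layout is that each bus occupies exactly 1 MiB of
ECAM space (32 devices x 8 functions x 4 KiB).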
[1] ACPI 6.2, sec 6.1:
    For any device that is on a non-enumerable type of bus (for example, an
    ISA bus), OSPM enumerates the devices' identifier(s) and the ACPI
    system firmware must supply an _HID object ... for each device to
    enable OSPM to do that.

[2] ACPI 6.2, sec 3.7:
    The OS enumerates motherboard devices simply by reading through the
    ACPI Namespace looking for devices with hardware IDs.

    Each device enumerated by ACPI includes ACPI-defined objects in the
    ACPI Namespace that report the hardware resources the device could
    occupy [_PRS], an object that reports the resources that are currently
    used by the device [_CRS], and objects for configuring those resources
    [_SRS]. The information is used by the Plug and Play OS (OSPM) to
    configure the devices.

[3] ACPI 6.2, sec 6.2:
    OSPM uses device configuration objects to configure hardware resources
    for devices enumerated via ACPI. Device configuration objects provide
    information about current and possible resource requirements, the
    relationship between shared resources, and methods for configuring
    hardware resources.

    When OSPM enumerates a device, it calls _PRS to determine the resource
    requirements of the device. It may also call _CRS to find the current
    resource settings for the device. Using this information, the Plug and
    Play system determines what resources the device should consume and
    sets those resources by calling the device's _SRS control method.

    In ACPI, devices can consume resources (for example, legacy keyboards),
    provide resources (for example, a proprietary PCI bridge), or do both.
    Unless otherwise specified, resources for a device are assumed to be
    taken from the nearest matching resource above the device in the device
    hierarchy.

[4] ACPI 6.2, sec 6.4.3.5.1, 2, 3, 4:
    QWord/DWord/Word Address Space Descriptor (.1, .2, .3)
      General Flags: Bit [0] Ignored

    Extended Address Space Descriptor (.4)
      General Flags: Bit [0] Consumer/Producer:
        1 - This device consumes this resource
        0 - This device produces and consumes this resource

[5] ACPI 6.2, sec 19.6.43:
    ResourceUsage specifies whether the Memory range is consumed by
    this device (ResourceConsumer) or passed on to child devices
    (ResourceProducer). If nothing is specified, then
    ResourceConsumer is assumed.

[6] PCI Firmware 3.2, sec 4.1.2:
    If the operating system does not natively comprehend reserving the
    MMCFG region, the MMCFG region must be reserved by firmware. The
    address range reported in the MCFG table or by _CBA method (see Section
    4.1.3) must be reserved by declaring a motherboard resource. For most
    systems, the motherboard resource would appear at the root of the ACPI
    namespace (under \_SB) in a node with a _HID of EISAID (PNP0C02), and
    the resources in this case should not be claimed in the root PCI bus's
    _CRS. The resources can optionally be returned in Int15 E820 or
    EFIGetMemoryMap as reserved memory but must always be reported through
    ACPI as a motherboard resource.

[7] PCI Express 4.0, sec 7.2.2:
    For systems that are PC-compatible, or that do not implement a
    processor-architecture-specific firmware interface standard that allows
    access to the Configuration Space, the ECAM is required as defined in
    this section.

[8] PCI Firmware 3.2, sec 4.1.2:
    The MCFG table is an ACPI table that is used to communicate the base
    addresses corresponding to the non-hot removable PCI Segment Groups
    range within a PCI Segment Group available to the operating system at
    boot. This is required for the PC-compatible systems.

    The MCFG table is only used to communicate the base addresses
    corresponding to the PCI Segment Groups available to the system at
    boot.

[9] PCI Firmware 3.2, sec 4.1.3:
    The _CBA (Memory mapped Configuration Base Address) control method is
    an optional ACPI object that returns the 64-bit memory mapped
    configuration base address for the hot plug capable host bridge. The
    base address returned by _CBA is processor-relative address. The _CBA
    control method evaluates to an Integer.

    This control method appears under a host bridge object. When the _CBA
    method appears under an active host bridge object, the operating system
    evaluates this structure to identify the memory mapped configuration
    base address corresponding to the PCI Segment Group for the bus number
    range specified in _CRS method. An ACPI name space object that contains
    the _CBA method must also contain a corresponding _SEG method.
13  Documentation/PCI/endpoint/index.rst  (new file)
@@ -0,0 +1,13 @@
.. SPDX-License-Identifier: GPL-2.0

======================
PCI Endpoint Framework
======================

.. toctree::
   :maxdepth: 2

   pci-endpoint
   pci-endpoint-cfs
   pci-test-function
   pci-test-howto
118  Documentation/PCI/endpoint/pci-endpoint-cfs.rst  (new file)
@@ -0,0 +1,118 @@
.. SPDX-License-Identifier: GPL-2.0

=======================================
Configuring PCI Endpoint Using CONFIGFS
=======================================

:Author: Kishon Vijay Abraham I <kishon@ti.com>

The PCI Endpoint Core exposes a configfs entry (pci_ep) to configure the
PCI endpoint function and to bind the endpoint function with the endpoint
controller. (For other mechanisms to configure the PCI Endpoint Function,
refer to [1].)

Mounting configfs
=================

The PCI Endpoint Core layer creates the pci_ep directory in the mounted
configfs directory. configfs can be mounted using the following command::

	mount -t configfs none /sys/kernel/config

Directory Structure
===================

The pci_ep configfs has two directories at its root: controllers and
functions. Every EPC device present in the system will have an entry in
the *controllers* directory and every EPF driver present in the system
will have an entry in the *functions* directory.

::

	/sys/kernel/config/pci_ep/
		.. controllers/
		.. functions/

Creating EPF Device
===================

Every registered EPF driver will be listed in the *functions* directory. The
entries corresponding to the EPF driver will be created by the EPF core.

::

	/sys/kernel/config/pci_ep/functions/
	.. <EPF Driver1>/
		... <EPF Device 11>/
		... <EPF Device 21>/
	.. <EPF Driver2>/
		... <EPF Device 12>/
		... <EPF Device 22>/

In order to create a <EPF device> of the type probed by <EPF Driver>, the
user has to create a directory inside <EPF DriverN>.

Every <EPF device> directory consists of the following entries that can be
used to configure the standard configuration header of the endpoint function.
(These entries are created by the framework when any new <EPF Device> is
created.)

::

	.. <EPF Driver1>/
		... <EPF Device 11>/
			... vendorid
			... deviceid
			... revid
			... progif_code
			... subclass_code
			... baseclass_code
			... cache_line_size
			... subsys_vendor_id
			... subsys_id
			... interrupt_pin
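Writing to these entries ultimately fills in the fields of the function's
standard configuration header. As an illustration only, here is a stand-in
struct defined for this example (not the kernel's real definition) whose
field names mirror the configfs entries above, populated with example
values:

```c
#include <stdint.h>

/* Stand-in mirroring the configfs header entries above; field widths
 * follow the standard PCI configuration header layout. */
struct epf_header_example {
	uint16_t vendorid;
	uint16_t deviceid;
	uint8_t  revid;
	uint8_t  progif_code;
	uint8_t  subclass_code;
	uint8_t  baseclass_code;
	uint8_t  cache_line_size;
	uint16_t subsys_vendor_id;
	uint16_t subsys_id;
	uint8_t  interrupt_pin;
};

struct epf_header_example make_example_header(void)
{
	struct epf_header_example hdr = {
		.vendorid       = 0x104c, /* example vendor ID (TI) */
		.deviceid       = 0xb500, /* example device ID */
		.baseclass_code = 0xff,   /* "unassigned" class, example only */
		.interrupt_pin  = 1,      /* INTA */
	};
	return hdr;
}
```

Echoing the corresponding values into vendorid, deviceid, etc. under the
<EPF Device> directory achieves the same effect from user space.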
EPC Device
==========

Every registered EPC device will be listed in the *controllers* directory.
The entries corresponding to the EPC device will be created by the EPC core.

::

	/sys/kernel/config/pci_ep/controllers/
	.. <EPC Device1>/
		... <Symlink EPF Device11>/
		... <Symlink EPF Device12>/
		... start
	.. <EPC Device2>/
		... <Symlink EPF Device21>/
		... <Symlink EPF Device22>/
		... start

The <EPC Device> directory will have a list of symbolic links to
<EPF Device>. These symbolic links should be created by the user to
represent the functions present in the endpoint device.

The <EPC Device> directory will also have a *start* field. Once
"1" is written to this field, the endpoint device will be ready to
establish the link with the host. This is usually done after
all the EPF devices are created and linked with the EPC device.

::

	| controllers/
	|	<Directory: EPC name>/
	|		<Symbolic Link: Function>
	|		start
	| functions/
	|	<Directory: EPF driver>/
	|		<Directory: EPF device>/
	|			vendorid
	|			deviceid
	|			revid
	|			progif_code
	|			subclass_code
	|			baseclass_code
	|			cache_line_size
	|			subsys_vendor_id
	|			subsys_id
	|			interrupt_pin
	|			function

[1] :doc:`pci-endpoint`
@@ -1,105 +0,0 @@
CONFIGURING PCI ENDPOINT USING CONFIGFS
		Kishon Vijay Abraham I <kishon@ti.com>
231  Documentation/PCI/endpoint/pci-endpoint.rst  (new file)
@@ -0,0 +1,231 @@
.. SPDX-License-Identifier: GPL-2.0

======================
PCI Endpoint Framework
======================

:Author: Kishon Vijay Abraham I <kishon@ti.com>

This document is a guide to using the PCI Endpoint Framework to create an
endpoint controller driver and an endpoint function driver, and to using the
configfs interface to bind the function driver to the controller driver.

Introduction
============

Linux has a comprehensive PCI subsystem to support PCI controllers that
operate in Root Complex mode. The subsystem can scan the PCI bus, assign
memory and IRQ resources, load PCI drivers (based on vendor ID and device
ID), and support other services like hot-plug, power management, advanced
error reporting and virtual channels.

However, the PCI controller IP integrated in some SoCs is capable of
operating either in Root Complex mode or in Endpoint mode. The PCI Endpoint
Framework adds endpoint mode support to Linux. This helps to run Linux in an
EP system, which can have a wide variety of use cases, from testing and
validation to co-processor accelerators.

PCI Endpoint Core
=================

The PCI Endpoint Core layer comprises 3 components: the Endpoint Controller
library, the Endpoint Function library, and the configfs layer to bind the
endpoint function with the endpoint controller.

PCI Endpoint Controller (EPC) Library
-------------------------------------

The EPC library provides APIs to be used by a controller that can operate
in endpoint mode. It also provides APIs to be used by the function
driver/library in order to implement a particular endpoint function.

APIs for the PCI Controller Driver
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This section lists the APIs that the PCI Endpoint core provides to be used
by the PCI controller driver.

* devm_pci_epc_create()/pci_epc_create()

  The PCI controller driver should implement the following ops:

	* write_header: ops to populate the configuration space header
	* set_bar: ops to configure the BAR
	* clear_bar: ops to reset the BAR
	* alloc_addr_space: ops to allocate in PCI controller address space
	* free_addr_space: ops to free the allocated address space
	* raise_irq: ops to raise a legacy, MSI or MSI-X interrupt
	* start: ops to start the PCI link
	* stop: ops to stop the PCI link

  The PCI controller driver can then create a new EPC device by invoking
  devm_pci_epc_create()/pci_epc_create().
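The ops above amount to a table of callbacks that the controller driver
fills in and hands to the core at create time. A simplified stand-in sketch
of that shape (all type and function names here are ours, not the kernel's
real definitions):

```c
#include <stddef.h>

/* Simplified stand-ins for the EPC ops table described above. */
struct epc;

struct epc_ops {
	int  (*write_header)(struct epc *epc);
	int  (*set_bar)(struct epc *epc, int bar, size_t size);
	void (*clear_bar)(struct epc *epc, int bar);
	int  (*raise_irq)(struct epc *epc, int irq_type, int irq_num);
	int  (*start)(struct epc *epc);
	void (*stop)(struct epc *epc);
};

/* A hypothetical controller driver implements the ops for its hardware... */
static int demo_start(struct epc *epc) { (void)epc; return 0; }
static void demo_stop(struct epc *epc) { (void)epc; }

/* ...and provides one ops table describing the controller; in the real
 * framework this table is passed to devm_pci_epc_create(). */
const struct epc_ops demo_epc_ops = {
	.start = demo_start,
	.stop  = demo_stop,
	/* write_header, set_bar, etc. omitted in this sketch */
};
```

The core then dispatches function-driver requests (write header, set BAR,
raise IRQ) through whichever callbacks the controller filled in.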
* devm_pci_epc_destroy()/pci_epc_destroy()

  The PCI controller driver can destroy the EPC device created by either
  devm_pci_epc_create() or pci_epc_create() using devm_pci_epc_destroy() or
  pci_epc_destroy().

* pci_epc_linkup()

  In order to notify all the function devices that the EPC device to which
  they are linked has established a link with the host, the PCI controller
  driver should invoke pci_epc_linkup().

* pci_epc_mem_init()

  Initialize the pci_epc_mem structure used for allocating EPC addr space.

* pci_epc_mem_exit()

  Clean up the pci_epc_mem structure allocated during pci_epc_mem_init().

APIs for the PCI Endpoint Function Driver
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This section lists the APIs that the PCI Endpoint core provides to be used
by the PCI endpoint function driver.

* pci_epc_write_header()

  The PCI endpoint function driver should use pci_epc_write_header() to
  write the standard configuration header to the endpoint controller.

* pci_epc_set_bar()

  The PCI endpoint function driver should use pci_epc_set_bar() to configure
  the Base Address Register in order for the host to assign PCI addr space.
  Register space of the function driver is usually configured
  using this API.

* pci_epc_clear_bar()

  The PCI endpoint function driver should use pci_epc_clear_bar() to reset
  the BAR.

* pci_epc_raise_irq()

  The PCI endpoint function driver should use pci_epc_raise_irq() to raise
  a legacy interrupt, MSI or MSI-X interrupt.

* pci_epc_mem_alloc_addr()

  The PCI endpoint function driver should use pci_epc_mem_alloc_addr() to
  allocate a memory address from EPC addr space, which is required to access
  the RC's buffer.

* pci_epc_mem_free_addr()

  The PCI endpoint function driver should use pci_epc_mem_free_addr() to
  free the memory space allocated using pci_epc_mem_alloc_addr().

Other APIs
~~~~~~~~~~

There are other APIs provided by the EPC library. These are used for binding
the EPF device with the EPC device. pci-ep-cfs.c can be used as a reference
for using these APIs.

* pci_epc_get()

  Get a reference to the PCI endpoint controller based on the device name of
  the controller.

* pci_epc_put()

  Release the reference to the PCI endpoint controller obtained using
  pci_epc_get().

* pci_epc_add_epf()

  Add a PCI endpoint function to a PCI endpoint controller. A PCIe device
  can have up to 8 functions according to the specification.

* pci_epc_remove_epf()

  Remove the PCI endpoint function from the PCI endpoint controller.

* pci_epc_start()

  The PCI endpoint function driver should invoke pci_epc_start() once it
  has configured the endpoint function and wants to start the PCI link.

* pci_epc_stop()

  The PCI endpoint function driver should invoke pci_epc_stop() to stop
  the PCI link.

PCI Endpoint Function (EPF) Library
-----------------------------------

The EPF library provides APIs to be used by the function driver and the EPC
library to provide endpoint mode functionality.

APIs for the PCI Endpoint Function Driver
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This section lists the APIs that the PCI Endpoint core provides to be used
by the PCI endpoint function driver.

* pci_epf_register_driver()

  The PCI Endpoint Function driver should implement the following ops:

	* bind: ops to perform when an EPC device has been bound to the
	  EPF device
	* unbind: ops to perform when the binding between the EPC device
	  and the EPF device is lost
	* linkup: ops to perform when the EPC device has established a
	  connection with a host system

  The PCI Function driver can then register the PCI EPF driver by using
  pci_epf_register_driver().

* pci_epf_unregister_driver()

  The PCI Function driver can unregister the PCI EPF driver by using
  pci_epf_unregister_driver().

* pci_epf_alloc_space()

  The PCI Function driver can allocate space for a particular BAR using
  pci_epf_alloc_space().

* pci_epf_free_space()

  The PCI Function driver can free the allocated space
  (using pci_epf_alloc_space) by invoking pci_epf_free_space().

APIs for the PCI Endpoint Controller Library
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This section lists the APIs that the PCI Endpoint core provides to be used
by the PCI endpoint controller library.

* pci_epf_linkup()

  The PCI endpoint controller library invokes pci_epf_linkup() when the
  EPC device has established the connection to the host.

Other APIs
~~~~~~~~~~

There are other APIs provided by the EPF library. These are used to notify
the function driver when the EPF device is bound to the EPC device.
pci-ep-cfs.c can be used as a reference for using these APIs.

* pci_epf_create()

  Create a new PCI EPF device by passing the name of the PCI EPF device.
  This name will be used to bind the EPF device to an EPF driver.

* pci_epf_destroy()

  Destroy the created PCI EPF device.

* pci_epf_bind()

  pci_epf_bind() should be invoked when the EPF device has been bound to
  an EPC device.

* pci_epf_unbind()

  pci_epf_unbind() should be invoked when the binding between the EPC device
  and the EPF device is lost.
@ -1,215 +0,0 @@
|
|||||||
PCI ENDPOINT FRAMEWORK
|
|
||||||
Kishon Vijay Abraham I <kishon@ti.com>
|
|
||||||
|
|
||||||
This document is a guide to use the PCI Endpoint Framework in order to create
|
|
||||||
endpoint controller driver, endpoint function driver, and using configfs
|
|
||||||
interface to bind the function driver to the controller driver.
|
|
||||||
|
|
||||||
1. Introduction
|
|
||||||
|
|
||||||
Linux has a comprehensive PCI subsystem to support PCI controllers that
|
|
||||||
operates in Root Complex mode. The subsystem has capability to scan PCI bus,
|
|
||||||
assign memory resources and IRQ resources, load PCI driver (based on
|
|
||||||
vendor ID, device ID), support other services like hot-plug, power management,
|
|
||||||
advanced error reporting and virtual channels.
|
|
||||||
|
|
||||||
However the PCI controller IP integrated in some SoCs is capable of operating
|
|
||||||
either in Root Complex mode or Endpoint mode. PCI Endpoint Framework will
|
|
||||||
add endpoint mode support in Linux. This will help to run Linux in an
|
|
||||||
EP system which can have a wide variety of use cases from testing or
|
|
||||||
validation, co-processor accelerator, etc.
|
|
||||||
|
|
||||||
2. PCI Endpoint Core
|
|
||||||
|
|
||||||
The PCI Endpoint Core layer comprises 3 components: the Endpoint Controller
|
|
||||||
library, the Endpoint Function library, and the configfs layer to bind the
|
|
||||||
endpoint function with the endpoint controller.
|
|
||||||
|
|
||||||
2.1 PCI Endpoint Controller(EPC) Library
|
|
||||||
|
|
||||||
The EPC library provides APIs to be used by the controller that can operate
|
|
||||||
in endpoint mode. It also provides APIs to be used by function driver/library
|
|
||||||
in order to implement a particular endpoint function.
|
|
||||||
|
|
||||||
2.1.1 APIs for the PCI controller Driver
|
|
||||||
|
|
||||||
This section lists the APIs that the PCI Endpoint core provides to be used
|
|
||||||
by the PCI controller driver.
|
|
||||||
|
|
||||||
*) devm_pci_epc_create()/pci_epc_create()
|
|
||||||
|
|
||||||
The PCI controller driver should implement the following ops:
|
|
||||||
* write_header: ops to populate configuration space header
|
|
||||||
* set_bar: ops to configure the BAR
|
|
||||||
* clear_bar: ops to reset the BAR
|
|
||||||
* alloc_addr_space: ops to allocate in PCI controller address space
|
|
||||||
* free_addr_space: ops to free the allocated address space
|
|
||||||
* raise_irq: ops to raise a legacy, MSI or MSI-X interrupt
|
|
||||||
* start: ops to start the PCI link
|
|
||||||
* stop: ops to stop the PCI link
|
|
||||||
|
|
||||||
The PCI controller driver can then create a new EPC device by invoking
|
|
||||||
devm_pci_epc_create()/pci_epc_create().
|
|
||||||
|
|
||||||
*) devm_pci_epc_destroy()/pci_epc_destroy()
|
|
||||||
|
|
||||||
The PCI controller driver can destroy the EPC device created by either
|
|
||||||
devm_pci_epc_create() or pci_epc_create() using devm_pci_epc_destroy() or
|
|
||||||
pci_epc_destroy().
|
|
||||||
|
|
||||||
*) pci_epc_linkup()
|
|
||||||
|
|
||||||
In order to notify all the function devices that the EPC device to which
|
|
||||||
they are linked has established a link with the host, the PCI controller
|
|
||||||
driver should invoke pci_epc_linkup().
|
|
||||||
|
|
||||||
*) pci_epc_mem_init()
|
|
||||||
|
|
||||||
Initialize the pci_epc_mem structure used for allocating EPC addr space.
|
|
||||||
|
|
||||||
*) pci_epc_mem_exit()
|
|
||||||
|
|
||||||
Cleanup the pci_epc_mem structure allocated during pci_epc_mem_init().
|
|
||||||
|
|
||||||
2.1.2 APIs for the PCI Endpoint Function Driver
|
|
||||||
|
|
||||||
This section lists the APIs that the PCI Endpoint core provides to be used
|
|
||||||
by the PCI endpoint function driver.
|
|
||||||
|
|
||||||
*) pci_epc_write_header()
|
|
||||||
|
|
||||||
The PCI endpoint function driver should use pci_epc_write_header() to
|
|
||||||
write the standard configuration header to the endpoint controller.
|
|
||||||
|
|
||||||
*) pci_epc_set_bar()
|
|
||||||
|
|
||||||
The PCI endpoint function driver should use pci_epc_set_bar() to configure
|
|
||||||
the Base Address Register in order for the host to assign PCI addr space.
|
|
||||||
Register space of the function driver is usually configured
|
|
||||||
using this API.
|
|
||||||
|
|
||||||
*) pci_epc_clear_bar()
|
|
||||||
|
|
||||||
The PCI endpoint function driver should use pci_epc_clear_bar() to reset
|
|
||||||
the BAR.
|
|
||||||
|
|
||||||
*) pci_epc_raise_irq()
|
|
||||||
|
|
||||||
The PCI endpoint function driver should use pci_epc_raise_irq() to raise
|
|
||||||
Legacy Interrupt, MSI or MSI-X Interrupt.
|
|
||||||
|
|
||||||
*) pci_epc_mem_alloc_addr()
|
|
||||||
|
|
||||||
The PCI endpoint function driver should use pci_epc_mem_alloc_addr(), to
|
|
||||||
allocate memory address from EPC addr space which is required to access
|
|
||||||
RC's buffer
|
|
||||||
|
|
||||||
*) pci_epc_mem_free_addr()
|
|
||||||
|
|
||||||
The PCI endpoint function driver should use pci_epc_mem_free_addr() to
|
|
||||||
free the memory space allocated using pci_epc_mem_alloc_addr().
|
|
||||||
|
|
||||||
2.1.3 Other APIs
|
|
||||||
|
|
||||||
There are other APIs provided by the EPC library. These are used for binding
|
|
||||||
the EPF device with EPC device. pci-ep-cfs.c can be used as reference for
|
|
||||||
using these APIs.
|
|
||||||
|
|
||||||
*) pci_epc_get()
|
|
||||||
|
|
||||||
Get a reference to the PCI endpoint controller based on the device name of
|
|
||||||
the controller.
|
|
||||||
|
|
||||||
*) pci_epc_put()
|
|
||||||
|
|
||||||
Release the reference to the PCI endpoint controller obtained using
|
|
||||||
pci_epc_get()
|
|
||||||
|
|
||||||
*) pci_epc_add_epf()
|
|
||||||
|
|
||||||
Add a PCI endpoint function to a PCI endpoint controller. A PCIe device
|
|
||||||
can have up to 8 functions according to the specification.
|
|
||||||
|
|
||||||
*) pci_epc_remove_epf()
|
|
||||||
|
|
||||||
Remove the PCI endpoint function from PCI endpoint controller.
|
|
||||||
|
|
||||||
*) pci_epc_start()
|
|
||||||
|
|
||||||
The PCI endpoint function driver should invoke pci_epc_start() once it
|
|
||||||
has configured the endpoint function and wants to start the PCI link.
|
|
||||||
|
|
||||||
*) pci_epc_stop()
|
|
||||||
|
|
||||||
The PCI endpoint function driver should invoke pci_epc_stop() to stop
|
|
||||||
the PCI LINK.
|
|
||||||
|
|
||||||
2.2 PCI Endpoint Function(EPF) Library

The EPF library provides APIs to be used by the function driver and the EPC
library to provide endpoint mode functionality.

2.2.1 APIs for the PCI Endpoint Function Driver

This section lists the APIs that the PCI Endpoint core provides to be used
by the PCI endpoint function driver.

*) pci_epf_register_driver()

   The PCI Endpoint Function driver should implement the following ops:
	 * bind: ops to perform when an EPC device has been bound to an EPF
	   device
	 * unbind: ops to perform when a binding has been lost between an EPC
	   device and an EPF device
	 * linkup: ops to perform when the EPC device has established a
	   connection with a host system

   The PCI Function driver can then register the PCI EPF driver by using
   pci_epf_register_driver().

*) pci_epf_unregister_driver()

   The PCI Function driver can unregister the PCI EPF driver by using
   pci_epf_unregister_driver().

*) pci_epf_alloc_space()

   The PCI Function driver can allocate space for a particular BAR using
   pci_epf_alloc_space().

*) pci_epf_free_space()

   The PCI Function driver can free the allocated space
   (using pci_epf_alloc_space) by invoking pci_epf_free_space().
2.2.2 APIs for the PCI Endpoint Controller Library

This section lists the APIs that the PCI Endpoint core provides to be used
by the PCI endpoint controller library.

*) pci_epf_linkup()

   The PCI endpoint controller library invokes pci_epf_linkup() when the
   EPC device has established the connection to the host.

2.2.3 Other APIs

There are other APIs provided by the EPF library. These are used to notify
the function driver when the EPF device is bound to the EPC device.
pci-ep-cfs.c can be used as a reference for using these APIs.

*) pci_epf_create()

   Create a new PCI EPF device by passing the name of the PCI EPF device.
   This name will be used to bind the EPF device to an EPF driver.

*) pci_epf_destroy()

   Destroy the created PCI EPF device.

*) pci_epf_bind()

   pci_epf_bind() should be invoked when the EPF device has been bound to
   an EPC device.

*) pci_epf_unbind()

   pci_epf_unbind() should be invoked when the binding between the EPC
   device and the EPF device is lost.
Documentation/PCI/endpoint/pci-test-function.rst (new file, +103 lines)
.. SPDX-License-Identifier: GPL-2.0

=================
PCI Test Function
=================

:Author: Kishon Vijay Abraham I <kishon@ti.com>

Traditionally, a PCI RC (Root Complex) has been validated by using standard
PCI cards such as Ethernet, USB, or SATA cards. However, with the addition
of EP-core in the Linux kernel, it is possible to configure a PCI controller
that can operate in EP mode to work as a test device.

The PCI endpoint test device is a virtual device (defined in software)
used to test the endpoint functionality and serve as a sample driver
for other PCI endpoint devices (to use the EP framework).

The PCI endpoint test device has the following registers:

1) PCI_ENDPOINT_TEST_MAGIC
2) PCI_ENDPOINT_TEST_COMMAND
3) PCI_ENDPOINT_TEST_STATUS
4) PCI_ENDPOINT_TEST_SRC_ADDR
5) PCI_ENDPOINT_TEST_DST_ADDR
6) PCI_ENDPOINT_TEST_SIZE
7) PCI_ENDPOINT_TEST_CHECKSUM
8) PCI_ENDPOINT_TEST_IRQ_TYPE
9) PCI_ENDPOINT_TEST_IRQ_NUMBER

* PCI_ENDPOINT_TEST_MAGIC

This register will be used to test BAR0. A known pattern will be written
and read back from the MAGIC register to verify BAR0.

* PCI_ENDPOINT_TEST_COMMAND

This register will be used by the host driver to indicate the function
that the endpoint device must perform.
======== ================================================================
Bitfield Description
======== ================================================================
Bit 0    raise legacy IRQ
Bit 1    raise MSI IRQ
Bit 2    raise MSI-X IRQ
Bit 3    read command (read data from RC buffer)
Bit 4    write command (write data to RC buffer)
Bit 5    copy command (copy data from one RC buffer to another RC buffer)
======== ================================================================
* PCI_ENDPOINT_TEST_STATUS

This register reflects the status of the PCI endpoint device.

======== ==============================
Bitfield Description
======== ==============================
Bit 0    read success
Bit 1    read fail
Bit 2    write success
Bit 3    write fail
Bit 4    copy success
Bit 5    copy fail
Bit 6    IRQ raised
Bit 7    source address is invalid
Bit 8    destination address is invalid
======== ==============================
* PCI_ENDPOINT_TEST_SRC_ADDR

This register contains the source address (RC buffer address) for the
COPY/READ command.

* PCI_ENDPOINT_TEST_DST_ADDR

This register contains the destination address (RC buffer address) for
the COPY/WRITE command.

* PCI_ENDPOINT_TEST_IRQ_TYPE

This register contains the interrupt type (Legacy/MSI/MSI-X) triggered
for the READ/WRITE/COPY and raise IRQ (Legacy/MSI/MSI-X) commands.

Possible types:

====== ==
Legacy 0
MSI    1
MSI-X  2
====== ==

* PCI_ENDPOINT_TEST_IRQ_NUMBER

This register contains the ID of the interrupt to be triggered.

Admissible values:

====== ===========
Legacy 0
MSI    [1 .. 32]
MSI-X  [1 .. 2048]
====== ===========
Documentation/PCI/endpoint/pci-test-howto.rst (new file, +235 lines)
.. SPDX-License-Identifier: GPL-2.0

===================
PCI Test User Guide
===================

:Author: Kishon Vijay Abraham I <kishon@ti.com>

This document is a guide to help users use the pci-epf-test function driver
and the pci_endpoint_test host driver for testing PCI. The lists of steps to
be followed on the host side and the EP side are given below.

Endpoint Device
===============

Endpoint Controller Devices
---------------------------

To find the list of endpoint controller devices in the system::

    # ls /sys/class/pci_epc/
      51000000.pcie_ep

If PCI_ENDPOINT_CONFIGFS is enabled::

    # ls /sys/kernel/config/pci_ep/controllers
      51000000.pcie_ep


Endpoint Function Drivers
-------------------------

To find the list of endpoint function drivers in the system::

    # ls /sys/bus/pci-epf/drivers
      pci_epf_test

If PCI_ENDPOINT_CONFIGFS is enabled::

    # ls /sys/kernel/config/pci_ep/functions
      pci_epf_test
Creating pci-epf-test Device
----------------------------

A PCI endpoint function device can be created using configfs. To create the
pci-epf-test device, the following commands can be used::

    # mount -t configfs none /sys/kernel/config
    # cd /sys/kernel/config/pci_ep/
    # mkdir functions/pci_epf_test/func1

The "mkdir func1" above creates the pci-epf-test function device that will
be probed by the pci_epf_test driver.

The PCI endpoint framework populates the directory with the following
configurable fields::

    # ls functions/pci_epf_test/func1
      baseclass_code    interrupt_pin   progif_code     subsys_id
      cache_line_size   msi_interrupts  revid           subsys_vendorid
      deviceid          msix_interrupts subclass_code   vendorid

The PCI endpoint function driver populates these entries with default values
when the device is bound to the driver. The pci-epf-test driver populates
vendorid with 0xffff and interrupt_pin with 0x0001::

    # cat functions/pci_epf_test/func1/vendorid
    0xffff
    # cat functions/pci_epf_test/func1/interrupt_pin
    0x0001


Configuring pci-epf-test Device
-------------------------------

The user can configure the pci-epf-test device using its configfs entries. In
order to change the vendorid and the number of MSI interrupts used by the
function device, the following commands can be used::

    # echo 0x104c > functions/pci_epf_test/func1/vendorid
    # echo 0xb500 > functions/pci_epf_test/func1/deviceid
    # echo 16 > functions/pci_epf_test/func1/msi_interrupts
    # echo 8 > functions/pci_epf_test/func1/msix_interrupts
Binding pci-epf-test Device to EP Controller
--------------------------------------------

In order for the endpoint function device to be useful, it has to be bound to
a PCI endpoint controller driver. Use configfs to bind the function device to
one of the controller drivers present in the system::

    # ln -s functions/pci_epf_test/func1 controllers/51000000.pcie_ep/

Once the above step is completed, the PCI endpoint is ready to establish a
link with the host.


Start the Link
--------------

In order for the endpoint device to establish a link with the host, the
*start* field should be populated with '1'::

    # echo 1 > controllers/51000000.pcie_ep/start
RootComplex Device
==================

lspci Output
------------

Note that the devices listed here correspond to the values populated in the
"Configuring pci-epf-test Device" section above::

    00:00.0 PCI bridge: Texas Instruments Device 8888 (rev 01)
    01:00.0 Unassigned class [ff00]: Texas Instruments Device b500


Using Endpoint Test function Device
-----------------------------------

pcitest.sh added in tools/pci/ can be used to run all the default PCI endpoint
tests. To compile this tool, the following commands should be used::

    # cd <kernel-dir>
    # make -C tools/pci

or, if you desire to compile and install it in your system::

    # cd <kernel-dir>
    # make -C tools/pci install

The tool and script will be located in <rootfs>/usr/bin/


pcitest.sh Output
~~~~~~~~~~~~~~~~~
::
    # pcitest.sh
    BAR tests

    BAR0: OKAY
    BAR1: OKAY
    BAR2: OKAY
    BAR3: OKAY
    BAR4: NOT OKAY
    BAR5: NOT OKAY

    Interrupt tests

    SET IRQ TYPE TO LEGACY: OKAY
    LEGACY IRQ: NOT OKAY
    SET IRQ TYPE TO MSI: OKAY
    MSI1: OKAY
    MSI2: OKAY
    MSI3: OKAY
    MSI4: OKAY
    MSI5: OKAY
    MSI6: OKAY
    MSI7: OKAY
    MSI8: OKAY
    MSI9: OKAY
    MSI10: OKAY
    MSI11: OKAY
    MSI12: OKAY
    MSI13: OKAY
    MSI14: OKAY
    MSI15: OKAY
    MSI16: OKAY
    MSI17: NOT OKAY
    MSI18: NOT OKAY
    MSI19: NOT OKAY
    MSI20: NOT OKAY
    MSI21: NOT OKAY
    MSI22: NOT OKAY
    MSI23: NOT OKAY
    MSI24: NOT OKAY
    MSI25: NOT OKAY
    MSI26: NOT OKAY
    MSI27: NOT OKAY
    MSI28: NOT OKAY
    MSI29: NOT OKAY
    MSI30: NOT OKAY
    MSI31: NOT OKAY
    MSI32: NOT OKAY
    SET IRQ TYPE TO MSI-X: OKAY
    MSI-X1: OKAY
    MSI-X2: OKAY
    MSI-X3: OKAY
    MSI-X4: OKAY
    MSI-X5: OKAY
    MSI-X6: OKAY
    MSI-X7: OKAY
    MSI-X8: OKAY
    MSI-X9: NOT OKAY
    MSI-X10: NOT OKAY
    MSI-X11: NOT OKAY
    MSI-X12: NOT OKAY
    MSI-X13: NOT OKAY
    MSI-X14: NOT OKAY
    MSI-X15: NOT OKAY
    MSI-X16: NOT OKAY
    [...]
    MSI-X2047: NOT OKAY
    MSI-X2048: NOT OKAY

    Read Tests

    SET IRQ TYPE TO MSI: OKAY
    READ ( 1 bytes): OKAY
    READ ( 1024 bytes): OKAY
    READ ( 1025 bytes): OKAY
    READ (1024000 bytes): OKAY
    READ (1024001 bytes): OKAY

    Write Tests

    WRITE ( 1 bytes): OKAY
    WRITE ( 1024 bytes): OKAY
    WRITE ( 1025 bytes): OKAY
    WRITE (1024000 bytes): OKAY
    WRITE (1024001 bytes): OKAY

    Copy Tests

    COPY ( 1 bytes): OKAY
    COPY ( 1024 bytes): OKAY
    COPY ( 1025 bytes): OKAY
    COPY (1024000 bytes): OKAY
    COPY (1024001 bytes): OKAY
Documentation/PCI/index.rst (new file, +18 lines)
.. SPDX-License-Identifier: GPL-2.0

=======================
Linux PCI Bus Subsystem
=======================

.. toctree::
   :maxdepth: 2
   :numbered:

   pci
   picebus-howto
   pci-iov-howto
   msi-howto
   acpi-info
   pci-error-recovery
   pcieaer-howto
   endpoint/index
Documentation/PCI/msi-howto.rst (new file, +287 lines)
.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

==========================
The MSI Driver Guide HOWTO
==========================

:Authors: Tom L Nguyen; Martine Silbermann; Matthew Wilcox

:Copyright: 2003, 2008 Intel Corporation

About this guide
================

This guide describes the basics of Message Signaled Interrupts (MSIs),
the advantages of using MSI over traditional interrupt mechanisms, how
to change your driver to use MSI or MSI-X and some basic diagnostics to
try if a device doesn't support MSIs.


What are MSIs?
==============

A Message Signaled Interrupt is a write from the device to a special
address which causes an interrupt to be received by the CPU.

The MSI capability was first specified in PCI 2.2 and was later enhanced
in PCI 3.0 to allow each interrupt to be masked individually. The MSI-X
capability was also introduced with PCI 3.0. It supports more interrupts
per device than MSI and allows interrupts to be independently configured.

Devices may support both MSI and MSI-X, but only one can be enabled at
a time.
Why use MSIs?
=============

There are three reasons why using MSIs can give an advantage over
traditional pin-based interrupts.

Pin-based PCI interrupts are often shared amongst several devices.
To support this, the kernel must call each interrupt handler associated
with an interrupt, which leads to reduced performance for the system as
a whole. MSIs are never shared, so this problem cannot arise.

When a device writes data to memory, then raises a pin-based interrupt,
it is possible that the interrupt may arrive before all the data has
arrived in memory (this becomes more likely with devices behind PCI-PCI
bridges). In order to ensure that all the data has arrived in memory,
the interrupt handler must read a register on the device which raised
the interrupt. PCI transaction ordering rules require that all the data
arrive in memory before the value may be returned from the register.
Using MSIs avoids this problem as the interrupt-generating write cannot
pass the data writes, so by the time the interrupt is raised, the driver
knows that all the data has arrived in memory.

PCI devices can only support a single pin-based interrupt per function.
Often drivers have to query the device to find out what event has
occurred, slowing down interrupt handling for the common case. With
MSIs, a device can support more interrupts, allowing each interrupt
to be specialised to a different purpose. One possible design gives
infrequent conditions (such as errors) their own interrupt which allows
the driver to handle the normal interrupt handling path more efficiently.
Other possible designs include giving one interrupt to each packet queue
in a network card or each port in a storage controller.
How to use MSIs
===============

PCI devices are initialised to use pin-based interrupts. The device
driver has to set up the device to use MSI or MSI-X. Not all machines
support MSIs correctly, and for those machines, the APIs described below
will simply fail and the device will continue to use pin-based interrupts.

Include kernel support for MSIs
-------------------------------

To support MSI or MSI-X, the kernel must be built with the CONFIG_PCI_MSI
option enabled. This option is only available on some architectures,
and it may depend on some other options also being set. For example,
on x86, you must also enable X86_UP_APIC or SMP in order to see the
CONFIG_PCI_MSI option.
Using MSI
---------

Most of the hard work is done for the driver in the PCI layer. The driver
simply has to request that the PCI layer set up the MSI capability for this
device.

To automatically use MSI or MSI-X interrupt vectors, use the following
function::

  int pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
		unsigned int max_vecs, unsigned int flags);

which allocates up to max_vecs interrupt vectors for a PCI device. It
returns the number of vectors allocated or a negative error. If the device
has a requirement for a minimum number of vectors the driver can pass a
min_vecs argument set to this limit, and the PCI core will return -ENOSPC
if it can't meet the minimum number of vectors.
The flags argument is used to specify which type of interrupt can be used
by the device and the driver (PCI_IRQ_LEGACY, PCI_IRQ_MSI, PCI_IRQ_MSIX).
A convenient short-hand (PCI_IRQ_ALL_TYPES) is also available to ask for
any possible kind of interrupt. If the PCI_IRQ_AFFINITY flag is set,
pci_alloc_irq_vectors() will spread the interrupts around the available CPUs.

To get the Linux IRQ numbers passed to request_irq() and free_irq() and the
vectors, use the following function::

  int pci_irq_vector(struct pci_dev *dev, unsigned int nr);

Any allocated resources should be freed before removing the device using
the following function::

  void pci_free_irq_vectors(struct pci_dev *dev);

If a device supports both MSI-X and MSI capabilities, this API will use the
MSI-X facilities in preference to the MSI facilities. MSI-X supports any
number of interrupts between 1 and 2048. In contrast, MSI is restricted to
a maximum of 32 interrupts (and must be a power of two). In addition, the
MSI interrupt vectors must be allocated consecutively, so the system might
not be able to allocate as many vectors for MSI as it could for MSI-X. On
some platforms, MSI interrupts must all be targeted at the same set of CPUs
whereas MSI-X interrupts can all be targeted at different CPUs.

If a device supports neither MSI-X nor MSI it will fall back to a single
legacy IRQ vector.
The typical usage of MSI or MSI-X interrupts is to allocate as many vectors
|
||||||
|
as possible, likely up to the limit supported by the device. If nvec is
|
||||||
|
larger than the number supported by the device it will automatically be
|
||||||
|
capped to the supported limit, so there is no need to query the number of
|
||||||
|
vectors supported beforehand::
|
||||||
|
|
||||||
|
nvec = pci_alloc_irq_vectors(pdev, 1, nvec, PCI_IRQ_ALL_TYPES)
|
||||||
|
if (nvec < 0)
|
||||||
|
goto out_err;
|
||||||
|
|
||||||
|
If a driver is unable or unwilling to deal with a variable number of MSI
|
||||||
|
interrupts it can request a particular number of interrupts by passing that
|
||||||
|
number to pci_alloc_irq_vectors() function as both 'min_vecs' and
|
||||||
|
'max_vecs' parameters::
|
||||||
|
|
||||||
|
ret = pci_alloc_irq_vectors(pdev, nvec, nvec, PCI_IRQ_ALL_TYPES);
|
||||||
|
if (ret < 0)
|
||||||
|
goto out_err;
|
||||||
|
|
||||||
|
The most notorious example of the request type described above is enabling
|
||||||
|
the single MSI mode for a device. It could be done by passing two 1s as
|
||||||
|
'min_vecs' and 'max_vecs'::
|
||||||
|
|
||||||
|
ret = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_ALL_TYPES);
|
||||||
|
if (ret < 0)
|
||||||
|
goto out_err;
|
||||||
|
|
||||||
|
Some devices might not support using legacy line interrupts, in which case
|
||||||
|
the driver can specify that only MSI or MSI-X is acceptable::
|
||||||
|
|
||||||
|
nvec = pci_alloc_irq_vectors(pdev, 1, nvec, PCI_IRQ_MSI | PCI_IRQ_MSIX);
|
||||||
|
if (nvec < 0)
|
||||||
|
goto out_err;
|
||||||
|
|
||||||
|
Legacy APIs
-----------

The following old APIs to enable and disable MSI or MSI-X interrupts should
not be used in new code::

  pci_enable_msi()              /* deprecated */
  pci_disable_msi()             /* deprecated */
  pci_enable_msix_range()       /* deprecated */
  pci_enable_msix_exact()       /* deprecated */
  pci_disable_msix()            /* deprecated */

Additionally there are APIs to provide the number of supported MSI or MSI-X
vectors: pci_msi_vec_count() and pci_msix_vec_count().  In general these
should be avoided in favor of letting pci_alloc_irq_vectors() cap the
number of vectors.  If you have a legitimate special use case for the count
of vectors we might have to revisit that decision and add a
pci_nr_irq_vectors() helper that handles MSI and MSI-X transparently.
Considerations when using MSIs
------------------------------

Spinlocks
~~~~~~~~~

Most device drivers have a per-device spinlock which is taken in the
interrupt handler.  With pin-based interrupts or a single MSI, it is not
necessary to disable interrupts (Linux guarantees the same interrupt will
not be re-entered).  If a device uses multiple interrupts, the driver
must disable interrupts while the lock is held.  If the device sends
a different interrupt, the driver will deadlock trying to recursively
acquire the spinlock.  Such deadlocks can be avoided by using
spin_lock_irqsave() or spin_lock_irq() which disable local interrupts
and acquire the lock (see Documentation/kernel-hacking/locking.rst).
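The deadlock scenario above can be sketched as follows; ``struct foo_dev``
and both functions are hypothetical, and the sketch only shows the locking
discipline, not a complete driver:

```c
/*
 * Hedged sketch: one per-device lock shared between an interrupt
 * handler and process context in a driver using multiple MSI vectors.
 */
static irqreturn_t foo_irq(int irq, void *data)
{
	struct foo_dev *foo = data;

	spin_lock(&foo->lock);		/* interrupts already off here */
	/* ... acknowledge and process the event ... */
	spin_unlock(&foo->lock);
	return IRQ_HANDLED;
}

static void foo_submit(struct foo_dev *foo)
{
	unsigned long flags;

	/*
	 * Process context must disable local interrupts while holding
	 * the lock; otherwise a second vector's handler could interrupt
	 * this section and deadlock trying to take the same lock.
	 */
	spin_lock_irqsave(&foo->lock, flags);
	/* ... queue work to the device ... */
	spin_unlock_irqrestore(&foo->lock, flags);
}
```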
How to tell whether MSI/MSI-X is enabled on a device
----------------------------------------------------

Using 'lspci -v' (as root) may show some devices with "MSI", "Message
Signalled Interrupts" or "MSI-X" capabilities.  Each of these capabilities
has an 'Enable' flag which is followed with either "+" (enabled)
or "-" (disabled).

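The "+"/"-" convention is easy to check mechanically.  The sketch below
parses capability lines out of captured ``lspci -v`` text; the embedded
sample output is illustrative, not from a specific device:

```python
import re

def msi_capabilities(lspci_text):
    """Map each MSI-family capability found in captured `lspci -v`
    output to whether its Enable flag shows '+' (on) or '-' (off)."""
    caps = {}
    for m in re.finditer(r'(MSI-X|MSI): Enable([+-])', lspci_text):
        caps[m.group(1)] = m.group(2) == '+'
    return caps

# Illustrative capture, not from a specific device:
sample = """\
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [70] MSI-X: Enable+ Count=10 Masked-
"""
caps = msi_capabilities(sample)
```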
MSI quirks
==========

Several PCI chipsets or devices are known not to support MSIs.
The PCI stack provides three ways to disable MSIs:

1. globally
2. on all devices behind a specific bridge
3. on a single device

Disabling MSIs globally
-----------------------

Some host chipsets simply don't support MSIs properly.  If we're
lucky, the manufacturer knows this and has indicated it in the ACPI
FADT table.  In this case, Linux automatically disables MSIs.
Some boards don't include this information in the table and so we have
to detect them ourselves.  The complete list of these is found near the
quirk_disable_all_msi() function in drivers/pci/quirks.c.

If you have a board which has problems with MSIs, you can pass pci=nomsi
on the kernel command line to disable MSIs on all devices.  It would be
in your best interests to report the problem to linux-pci@vger.kernel.org
including a full 'lspci -v' so we can add the quirks to the kernel.

Disabling MSIs below a bridge
-----------------------------

Some PCI bridges are not able to route MSIs between busses properly.
In this case, MSIs must be disabled on all devices behind the bridge.

Some bridges allow you to enable MSIs by changing some bits in their
PCI configuration space (especially the Hypertransport chipsets such
as the nVidia nForce and Serverworks HT2000).  As with host chipsets,
Linux mostly knows about them and automatically enables MSIs if it can.
If you have a bridge unknown to Linux, you can enable
MSIs in configuration space using whatever method you know works, then
enable MSIs on that bridge by doing::

  echo 1 > /sys/bus/pci/devices/$bridge/msi_bus

where $bridge is the PCI address of the bridge you've enabled (e.g.
0000:00:0e.0).

To disable MSIs, echo 0 instead of 1.  Changing this value should be
done with caution as it could break interrupt handling for all devices
below this bridge.

Again, please notify linux-pci@vger.kernel.org of any bridges that need
special handling.

Disabling MSIs on a single device
---------------------------------

Some devices are known to have faulty MSI implementations.  Usually this
is handled in the individual device driver, but occasionally it's necessary
to handle this with a quirk.  Some drivers have an option to disable use
of MSI.  While this is a convenient workaround for the driver author,
it is not good practice, and should not be emulated.

Finding why MSIs are disabled on a device
-----------------------------------------

From the above three sections, you can see that there are many reasons
why MSIs may not be enabled for a given device.  Your first step should
be to examine your dmesg carefully to determine whether MSIs are enabled
for your machine.  You should also check your .config to be sure you
have enabled CONFIG_PCI_MSI.

Then, 'lspci -t' gives the list of bridges above a device.  Reading
`/sys/bus/pci/devices/*/msi_bus` will tell you whether MSIs are enabled (1)
or disabled (0).  If 0 is found in any of the msi_bus files belonging
to bridges between the PCI root and the device, MSIs are disabled.
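Expressed as a predicate, the rule above is simply a conjunction over the
path.  The helper below takes the msi_bus values already read from each
bridge between the root and the device (the sysfs reading itself is left
out); it is an illustrative sketch, not a kernel interface:

```python
def msi_path_allows_msi(msi_bus_values):
    """Return True when every bridge between the PCI root and the
    device has msi_bus set to 1; a single 0 anywhere on the path
    disables MSIs for the device."""
    return all(int(v) == 1 for v in msi_bus_values)

# Values as read from each bridge's msi_bus file, root first:
ok = msi_path_allows_msi(["1", "1", "1"])       # all bridges forward MSIs
blocked = msi_path_allows_msi(["1", "0", "1"])  # one bridge blocks them
```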

It is also worth checking the device driver to see whether it supports MSIs.
For example, it may contain calls to pci_alloc_irq_vectors() with the
PCI_IRQ_MSI or PCI_IRQ_MSIX flags.
Documentation/PCI/pci-error-recovery.rst

.. SPDX-License-Identifier: GPL-2.0

==================
PCI Error Recovery
==================


:Authors: - Linas Vepstas <linasvepstas@gmail.com>
          - Richard Lary <rlary@us.ibm.com>
          - Mike Mason <mmlnx@us.ibm.com>


Many PCI bus controllers are able to detect a variety of hardware
PCI errors on the bus, such as parity errors on the data and address
buses, as well as SERR and PERR errors.  Some of the more advanced
chipsets are able to deal with these errors; these include PCI-E chipsets,
and the PCI-host bridges found on IBM Power4, Power5 and Power6-based
pSeries boxes.  A typical action taken is to disconnect the affected device,
halting all I/O to it.  The goal of a disconnection is to avoid system
corruption; for example, to halt system memory corruption due to DMA's
to "wild" addresses.  Typically, a reconnection mechanism is also
offered, so that the affected PCI device(s) are reset and put back
into working condition.  The reset phase requires coordination
between the affected device drivers and the PCI controller chip.
This document describes a generic API for notifying device drivers
of a bus disconnection, and then performing error recovery.
This API is currently implemented in the 2.6.16 and later kernels.

Reporting and recovery is performed in several steps.  First, when
a PCI hardware error has resulted in a bus disconnect, that event
is reported as soon as possible to all affected device drivers,
including multiple instances of a device driver on multi-function
cards.  This allows device drivers to avoid deadlocking in spinloops,
waiting for some i/o-space register to change, when it never will.
It also gives the drivers a chance to defer incoming I/O as
needed.

Next, recovery is performed in several stages.  Most of the complexity
is forced by the need to handle multi-function devices, that is,
devices that have multiple device drivers associated with them.
In the first stage, each driver is allowed to indicate what type
of reset it desires, the choices being a simple re-enabling of I/O
or requesting a slot reset.

If any driver requests a slot reset, that is what will be done.

After a reset and/or a re-enabling of I/O, all drivers are
again notified, so that they may then perform any device setup/config
that may be required.  After these have all completed, a final
"resume normal operations" event is sent out.

The biggest reason for choosing a kernel-based implementation rather
than a user-space implementation was the need to deal with bus
disconnects of PCI devices attached to storage media, and, in particular,
disconnects from devices holding the root file system.  If the root
file system is disconnected, a user-space mechanism would have to go
through a large number of contortions to complete recovery.  Almost all
of the current Linux file systems are not tolerant of disconnection
from/reconnection to their underlying block device.  By contrast,
bus errors are easy to manage in the device driver.  Indeed, most
device drivers already handle very similar recovery procedures;
for example, the SCSI-generic layer already provides significant
mechanisms for dealing with SCSI bus errors and SCSI bus resets.


Detailed Design
===============

Design and implementation details below, based on a chain of
public email discussions with Ben Herrenschmidt, circa 5 April 2005.

The error recovery API support is exposed to the driver in the form of
a structure of function pointers pointed to by a new field in struct
pci_driver.  A driver that fails to provide the structure is "non-aware",
and the actual recovery steps taken are platform dependent.  The
arch/powerpc implementation will simulate a PCI hotplug remove/add.

This structure has the form::

  struct pci_error_handlers
  {
          int (*error_detected)(struct pci_dev *dev, enum pci_channel_state);
          int (*mmio_enabled)(struct pci_dev *dev);
          int (*slot_reset)(struct pci_dev *dev);
          void (*resume)(struct pci_dev *dev);
  };

The possible channel states are::

  enum pci_channel_state {
          pci_channel_io_normal,          /* I/O channel is in normal state */
          pci_channel_io_frozen,          /* I/O to channel is blocked */
          pci_channel_io_perm_failure,    /* PCI card is dead */
  };

Possible return values are::

  enum pci_ers_result {
          PCI_ERS_RESULT_NONE,        /* no result/none/not supported in device driver */
          PCI_ERS_RESULT_CAN_RECOVER, /* Device driver can recover without slot reset */
          PCI_ERS_RESULT_NEED_RESET,  /* Device driver wants slot to be reset. */
          PCI_ERS_RESULT_DISCONNECT,  /* Device has completely failed, is unrecoverable */
          PCI_ERS_RESULT_RECOVERED,   /* Device driver is fully recovered and operational */
  };
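A driver opts in by pointing the err_handler field of its struct pci_driver
at such a structure.  The sketch below shows only the wiring; all of the
"foo" names and the particular set of callbacks chosen are illustrative
assumptions, not taken from a real driver:

```c
/* Hedged sketch: hypothetical "foo" driver wiring up recovery callbacks. */
static const struct pci_error_handlers foo_err_handlers = {
	.error_detected	= foo_error_detected,
	.mmio_enabled	= foo_mmio_enabled,
	.slot_reset	= foo_slot_reset,
	.resume		= foo_resume,
};

static struct pci_driver foo_pci_driver = {
	.name		= "foo",
	.id_table	= foo_pci_ids,
	.probe		= foo_probe,
	.remove		= foo_remove,
	.err_handler	= &foo_err_handlers,
};
```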
A driver does not have to implement all of these callbacks; however,
if it implements any, it must implement error_detected().  If a callback
is not implemented, the corresponding feature is considered unsupported.
For example, if mmio_enabled() and resume() aren't there, then it
is assumed that the driver is not doing any direct recovery and requires
a slot reset.  Typically a driver will want to know about
a slot_reset().

The actual steps taken by a platform to recover from a PCI error
event will be platform-dependent, but will follow the general
sequence described below.

STEP 0: Error Event
-------------------
A PCI bus error is detected by the PCI hardware.  On powerpc, the slot
is isolated, in that all I/O is blocked: all reads return 0xffffffff,
all writes are ignored.


STEP 1: Notification
--------------------
Platform calls the error_detected() callback on every instance of
every driver affected by the error.

At this point, the device might not be accessible anymore, depending on
the platform (the slot will be isolated on powerpc).  The driver may
already have "noticed" the error because of a failing I/O, but this
is the proper "synchronization point", that is, it gives the driver
a chance to cleanup, waiting for pending stuff (timers, whatever, etc...)
to complete; it can take semaphores, schedule, etc... everything but
touch the device.  Within this function and after it returns, the driver
shouldn't do any new IOs.  Called in task context.  This is sort of a
"quiesce" point.  See note about interrupts at the end of this doc.

All drivers participating in this system must implement this call.
The driver must return one of the following result codes:

- PCI_ERS_RESULT_CAN_RECOVER
    Driver returns this if it thinks it might be able to recover
    the HW by just banging IOs or if it wants to be given
    a chance to extract some diagnostic information (see
    mmio_enable, below).
- PCI_ERS_RESULT_NEED_RESET
    Driver returns this if it can't recover without a
    slot reset.
- PCI_ERS_RESULT_DISCONNECT
    Driver returns this if it doesn't want to recover at all.
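A minimal error_detected() following the rules above might look like the
sketch below.  Everything named "foo" is hypothetical, and the quiescing
details are entirely driver-specific; the sketch simply asks for a slot
reset rather than attempting early recovery:

```c
/* Hedged sketch: quiesce without touching the isolated device. */
static pci_ers_result_t foo_error_detected(struct pci_dev *pdev,
					   pci_channel_state_t state)
{
	struct foo_dev *foo = pci_get_drvdata(pdev);

	if (state == pci_channel_io_perm_failure)
		return PCI_ERS_RESULT_DISCONNECT;

	/* Stop queueing new I/O; do not read or write the device. */
	foo_stop_submitting_io(foo);

	/* Ask for a slot reset rather than early (MMIO) recovery. */
	return PCI_ERS_RESULT_NEED_RESET;
}
```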
The next step taken will depend on the result codes returned by the
drivers.

If all drivers on the segment/slot return PCI_ERS_RESULT_CAN_RECOVER,
then the platform should re-enable IOs on the slot (or do nothing in
particular, if the platform doesn't isolate slots), and recovery
proceeds to STEP 2 (MMIO Enable).

If any driver requested a slot reset (by returning PCI_ERS_RESULT_NEED_RESET),
then recovery proceeds to STEP 4 (Slot Reset).

If the platform is unable to recover the slot, the next step
is STEP 6 (Permanent Failure).

.. note::

   The current powerpc implementation assumes that a device driver will
   *not* schedule or semaphore in this routine; the current powerpc
   implementation uses one kernel thread to notify all devices;
   thus, if one device sleeps/schedules, all devices are affected.
   Doing better requires complex multi-threaded logic in the error
   recovery implementation (e.g. waiting for all notification threads
   to "join" before proceeding with recovery.)  This seems excessively
   complex and not worth implementing.

   The current powerpc implementation doesn't much care if the device
   attempts I/O at this point, or not.  I/O's will fail, returning
   a value of 0xff on read, and writes will be dropped.  If more than
   EEH_MAX_FAILS I/O's are attempted to a frozen adapter, EEH
   assumes that the device driver has gone into an infinite loop
   and prints an error to syslog.  A reboot is then required to
   get the device working again.

STEP 2: MMIO Enabled
--------------------
The platform re-enables MMIO to the device (but typically not the
DMA), and then calls the mmio_enabled() callback on all affected
device drivers.

This is the "early recovery" call.  IOs are allowed again, but DMA is
not, with some restrictions.  This is NOT a callback for the driver to
start operations again, only to peek/poke at the device, extract diagnostic
information, if any, and eventually do things like trigger a device local
reset or some such, but not restart operations.  This callback is made if
all drivers on a segment agree that they can try to recover and if no automatic
link reset was performed by the HW.  If the platform can't just re-enable IOs
without a slot reset or a link reset, it will not call this callback, and
instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset).

.. note::

   The following is proposed; no platform implements this yet:
   Proposal: All I/O's should be done _synchronously_ from within
   this callback, errors triggered by them will be returned via
   the normal pci_check_whatever() API, no new error_detected()
   callback will be issued due to an error happening here.  However,
   such an error might cause IOs to be re-blocked for the whole
   segment, and thus invalidate the recovery that other devices
   on the same segment might have done, forcing the whole segment
   into one of the next states, that is, link reset or slot reset.

The driver should return one of the following result codes:

- PCI_ERS_RESULT_RECOVERED
    Driver returns this if it thinks the device is fully
    functional and thinks it is ready to start
    normal driver operations again.  There is no
    guarantee that the driver will actually be
    allowed to proceed, as another driver on the
    same segment might have failed and thus triggered a
    slot reset on platforms that support it.

- PCI_ERS_RESULT_NEED_RESET
    Driver returns this if it thinks the device is not
    recoverable in its current state and it needs a slot
    reset to proceed.

- PCI_ERS_RESULT_DISCONNECT
    Same as above.  Total failure, no recovery is expected even
    after reset; the driver should consider the device dead.
    (To be defined more precisely.)

The next step taken depends on the results returned by the drivers.
If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform
proceeds to either STEP 3 (Link Reset) or to STEP 5 (Resume Operations).

If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform
proceeds to STEP 4 (Slot Reset).

STEP 3: Link Reset
------------------
The platform resets the link.  This is a PCI-Express specific step
and is done whenever a fatal error has been detected that can be
"solved" by resetting the link.

STEP 4: Slot Reset
------------------

In response to a return value of PCI_ERS_RESULT_NEED_RESET, the
platform will perform a slot reset on the requesting PCI device(s).
The actual steps taken by a platform to perform a slot reset
will be platform-dependent.  Upon completion of slot reset, the
platform will call the device slot_reset() callback.

Powerpc platforms implement two levels of slot reset:
soft reset (default) and fundamental (optional) reset.

Powerpc soft reset consists of asserting the adapter #RST line and then
restoring the PCI BAR's and PCI configuration header to a state
that is equivalent to what it would be after a fresh system
power-on followed by power-on BIOS/system firmware initialization.
Soft reset is also known as hot-reset.

Powerpc fundamental reset is supported by PCI Express cards only
and resets the device's state machines, hardware logic, port states and
configuration registers to their default conditions.

For most PCI devices, a soft reset will be sufficient for recovery.
Optional fundamental reset is provided to support a limited number
of PCI Express devices for which a soft reset is not sufficient
for recovery.

If the platform supports PCI hotplug, then the reset might be
performed by toggling the slot electrical power off/on.

It is important for the platform to restore the PCI config space
to the "fresh poweron" state, rather than the "last state".  After
a slot reset, the device driver will almost always use its standard
device initialization routines, and an unusual config space setup
may result in hung devices, kernel panics, or silent data corruption.

This call gives drivers the chance to re-initialize the hardware
(re-download firmware, etc.).  At this point, the driver may assume
that the card is in a fresh state and is fully functional.  The slot
is unfrozen and the driver has full access to PCI config space,
memory mapped I/O space and DMA.  Interrupts (Legacy, MSI, or MSI-X)
will also be available.

Drivers should not restart normal I/O processing operations
at this point.  If all device drivers report success on this
callback, the platform will call resume() to complete the sequence,
and let the driver restart normal I/O processing.

A driver can still return a critical failure for this function if
it can't get the device operational after reset.  If the platform
previously tried a soft reset, it might now try a hard reset (power
cycle) and then call slot_reset() again.  If the device still can't
be recovered, there is nothing more that can be done; the platform
will typically report a "permanent failure" in such a case.  The
device will be considered "dead" in this case.

Drivers for multi-function cards will need to coordinate among
themselves as to which driver instance will perform any "one-shot"
or global device initialization.  For example, the Symbios sym53cxx2
driver performs device init only from PCI function 0::

  +       if (PCI_FUNC(pdev->devfn) == 0)
  +               sym_reset_scsi_bus(np, 0);

Result codes:
  - PCI_ERS_RESULT_DISCONNECT
    Same as above.

Drivers for PCI Express cards that require a fundamental reset must
set the needs_freset bit in the pci_dev structure in their probe function.
For example, the QLogic qla2xxx driver sets the needs_freset bit for certain
PCI card types::

  +       /* Set EEH reset type to fundamental if required by hba */
  +       if (IS_QLA24XX(ha) || IS_QLA25XX(ha) || IS_QLA81XX(ha))
  +               pdev->needs_freset = 1;
  +

Platform proceeds either to STEP 5 (Resume Operations) or STEP 6 (Permanent
Failure).

.. note::

   The current powerpc implementation does not try a power-cycle
   reset if the driver returned PCI_ERS_RESULT_DISCONNECT.
   However, it probably should.


STEP 5: Resume Operations
-------------------------
The platform will call the resume() callback on all affected device
drivers if all drivers on the segment have returned
PCI_ERS_RESULT_RECOVERED from one of the 3 previous callbacks.
The goal of this callback is to tell the driver to restart activity,
that everything is back and running.  This callback does not return
a result code.

At this point, if a new error happens, the platform will restart
a new error recovery sequence.

STEP 6: Permanent Failure
-------------------------
A "permanent failure" has occurred, and the platform cannot recover
the device.  The platform will call error_detected() with a
pci_channel_state value of pci_channel_io_perm_failure.

The device driver should, at this point, assume the worst.  It should
cancel all pending I/O, refuse all new I/O, returning -EIO to
higher layers.  The device driver should then clean up all of its
memory and remove itself from kernel operations, much as it would
during system shutdown.

The platform will typically notify the system operator of the
permanent failure in some way.  If the device is hotplug-capable,
the operator will probably want to remove and replace the device.
Note, however, not all failures are truly "permanent".  Some are
caused by over-heating, some by a poorly seated card.  Many
PCI error events are caused by software bugs, e.g. DMA's to
wild addresses or bogus split transactions due to programming
errors.  See the discussion in powerpc/eeh-pci-error-recovery.txt
for additional detail on real-life experience of the causes of
software errors.


Conclusion; General Remarks
---------------------------
The way the callbacks are called is platform policy.  A platform with
no slot reset capability may want to just "ignore" drivers that can't
recover (disconnect them) and try to let other cards on the same segment
recover.  Keep in mind that in most real life cases, though, there will
be only one driver per segment.

Now, a note about interrupts.  If you get an interrupt and your
device is dead or has been isolated, there is a problem :)
The current policy is to turn this into a platform policy.
That is, the recovery API only requires that:

- There is no guarantee that interrupt delivery can proceed from any
  device on the segment starting from the error detection and until the
  slot_reset callback is called, at which point interrupts are expected
  to be fully operational.

- There is no guarantee that interrupt delivery is stopped, that is,
  a driver that gets an interrupt after detecting an error, or that detects
  an error within the interrupt handler such that it prevents proper
  ack'ing of the interrupt (and thus removal of the source) should just
  return IRQ_NONE.  It's up to the platform to deal with that
  condition, typically by masking the IRQ source during the duration of
  the error handling.  It is expected that the platform "knows" which
  interrupts are routed to error-management capable slots and can deal
  with temporarily disabling that IRQ number during error processing (this
  isn't terribly complex).  That means some IRQ latency for other devices
  sharing the interrupt, but there is simply no other way.  High end
  platforms aren't supposed to share interrupts between many devices
  anyway :)

.. note::

   Implementation details for the powerpc platform are discussed in
   the file Documentation/powerpc/eeh-pci-error-recovery.txt

As of this writing, there is a growing list of device drivers with
patches implementing error recovery.  Not all of these patches are in
mainline yet.  These may be used as "examples":

- drivers/scsi/ipr
- drivers/scsi/sym53c8xx_2
- drivers/scsi/qla2xxx
- drivers/scsi/lpfc
- drivers/net/bnx2.c
- drivers/net/e100.c
- drivers/net/e1000
- drivers/net/e1000e
- drivers/net/ixgb
- drivers/net/ixgbe
- drivers/net/cxgb3
- drivers/net/s2io.c
- drivers/net/qlge
@ -1,413 +0,0 @@
PCI Error Recovery
------------------
February 2, 2006

Current document maintainer:
Linas Vepstas <linasvepstas@gmail.com>
updated by Richard Lary <rlary@us.ibm.com>
and Mike Mason <mmlnx@us.ibm.com> on 27-Jul-2009


Many PCI bus controllers are able to detect a variety of hardware
PCI errors on the bus, such as parity errors on the data and address
buses, as well as SERR and PERR errors. Some of the more advanced
chipsets are able to deal with these errors; these include PCI-E chipsets,
and the PCI-host bridges found on IBM Power4, Power5 and Power6-based
pSeries boxes. A typical action taken is to disconnect the affected device,
halting all I/O to it. The goal of a disconnection is to avoid system
corruption; for example, to halt system memory corruption due to DMA's
to "wild" addresses. Typically, a reconnection mechanism is also
offered, so that the affected PCI device(s) are reset and put back
into working condition. The reset phase requires coordination
between the affected device drivers and the PCI controller chip.
This document describes a generic API for notifying device drivers
of a bus disconnection, and then performing error recovery.
This API is currently implemented in the 2.6.16 and later kernels.

Reporting and recovery is performed in several steps. First, when
a PCI hardware error has resulted in a bus disconnect, that event
is reported as soon as possible to all affected device drivers,
including multiple instances of a device driver on multi-function
cards. This allows device drivers to avoid deadlocking in spinloops,
waiting for some i/o-space register to change, when it never will.
It also gives the drivers a chance to defer incoming I/O as
needed.

Next, recovery is performed in several stages. Most of the complexity
is forced by the need to handle multi-function devices, that is,
devices that have multiple device drivers associated with them.
In the first stage, each driver is allowed to indicate what type
of reset it desires, the choices being a simple re-enabling of I/O
or requesting a slot reset.

If any driver requests a slot reset, that is what will be done.

After a reset and/or a re-enabling of I/O, all drivers are
again notified, so that they may then perform any device setup/config
that may be required. After these have all completed, a final
"resume normal operations" event is sent out.

The biggest reason for choosing a kernel-based implementation rather
than a user-space implementation was the need to deal with bus
disconnects of PCI devices attached to storage media, and, in particular,
disconnects from devices holding the root file system. If the root
file system is disconnected, a user-space mechanism would have to go
through a large number of contortions to complete recovery. Almost all
of the current Linux file systems are not tolerant of disconnection
from/reconnection to their underlying block device. By contrast,
bus errors are easy to manage in the device driver. Indeed, most
device drivers already handle very similar recovery procedures;
for example, the SCSI-generic layer already provides significant
mechanisms for dealing with SCSI bus errors and SCSI bus resets.


Detailed Design
---------------
Design and implementation details below, based on a chain of
public email discussions with Ben Herrenschmidt, circa 5 April 2005.

The error recovery API support is exposed to the driver in the form of
a structure of function pointers pointed to by a new field in struct
pci_driver. A driver that fails to provide the structure is "non-aware",
and the actual recovery steps taken are platform dependent. The
arch/powerpc implementation will simulate a PCI hotplug remove/add.

This structure has the form:

	struct pci_error_handlers
	{
		int (*error_detected)(struct pci_dev *dev, enum pci_channel_state);
		int (*mmio_enabled)(struct pci_dev *dev);
		int (*slot_reset)(struct pci_dev *dev);
		void (*resume)(struct pci_dev *dev);
	};

The possible channel states are:

	enum pci_channel_state {
		pci_channel_io_normal,       /* I/O channel is in normal state */
		pci_channel_io_frozen,       /* I/O to channel is blocked */
		pci_channel_io_perm_failure, /* PCI card is dead */
	};

Possible return values are:

	enum pci_ers_result {
		PCI_ERS_RESULT_NONE,        /* no result/none/not supported in device driver */
		PCI_ERS_RESULT_CAN_RECOVER, /* Device driver can recover without slot reset */
		PCI_ERS_RESULT_NEED_RESET,  /* Device driver wants slot to be reset. */
		PCI_ERS_RESULT_DISCONNECT,  /* Device has completely failed, is unrecoverable */
		PCI_ERS_RESULT_RECOVERED,   /* Device driver is fully recovered and operational */
	};

A driver does not have to implement all of these callbacks; however,
if it implements any, it must implement error_detected(). If a callback
is not implemented, the corresponding feature is considered unsupported.
For example, if mmio_enabled() and resume() aren't there, then it
is assumed that the driver is not doing any direct recovery and requires
a slot reset. Typically a driver will want to know about
a slot_reset().
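
The callback table above can be exercised outside the kernel with a small
simulation. The sketch below is illustrative only: the `struct pci_dev`,
enums, and the `demo_*` driver are stand-ins defined locally (the real
definitions live in <linux/pci.h>), and the return type is shown as
`enum pci_ers_result` rather than the bare `int` in the listing above.

	#include <assert.h>
	#include <stdio.h>

	/* Stand-in types; NOT the real kernel definitions. */
	struct pci_dev { int dead; };
	enum pci_channel_state { pci_channel_io_normal, pci_channel_io_frozen,
	                         pci_channel_io_perm_failure };
	enum pci_ers_result { PCI_ERS_RESULT_NONE, PCI_ERS_RESULT_CAN_RECOVER,
	                      PCI_ERS_RESULT_NEED_RESET, PCI_ERS_RESULT_DISCONNECT,
	                      PCI_ERS_RESULT_RECOVERED };

	struct pci_error_handlers {
		enum pci_ers_result (*error_detected)(struct pci_dev *dev,
		                                      enum pci_channel_state state);
		enum pci_ers_result (*mmio_enabled)(struct pci_dev *dev);
		enum pci_ers_result (*slot_reset)(struct pci_dev *dev);
		void (*resume)(struct pci_dev *dev);
	};

	/* A hypothetical driver that always asks for a slot reset. */
	static enum pci_ers_result demo_error_detected(struct pci_dev *dev,
	                                               enum pci_channel_state state)
	{
		(void)dev;
		if (state == pci_channel_io_perm_failure)
			return PCI_ERS_RESULT_DISCONNECT;   /* STEP 6: give up */
		/* Quiesce here: stop new I/O, cancel timers, then ask for a reset. */
		return PCI_ERS_RESULT_NEED_RESET;
	}

	static enum pci_ers_result demo_slot_reset(struct pci_dev *dev)
	{
		/* Re-download firmware, re-init rings, then report the outcome. */
		return dev->dead ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_RECOVERED;
	}

	static void demo_resume(struct pci_dev *dev) { (void)dev; /* restart I/O */ }

	static const struct pci_error_handlers demo_err_handler = {
		.error_detected = demo_error_detected,
		.slot_reset     = demo_slot_reset, /* mmio_enabled omitted: no early recovery */
		.resume         = demo_resume,
	};

	int main(void)
	{
		struct pci_dev dev = { .dead = 0 };
		assert(demo_err_handler.error_detected(&dev, pci_channel_io_frozen)
		       == PCI_ERS_RESULT_NEED_RESET);
		assert(demo_err_handler.slot_reset(&dev) == PCI_ERS_RESULT_RECOVERED);
		printf("recovered\n");
		return 0;
	}

Because mmio_enabled() is left NULL, a platform following the rules above
would treat this driver as requiring a slot reset rather than early recovery.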

The actual steps taken by a platform to recover from a PCI error
event will be platform-dependent, but will follow the general
sequence described below.

STEP 0: Error Event
-------------------
A PCI bus error is detected by the PCI hardware. On powerpc, the slot
is isolated, in that all I/O is blocked: all reads return 0xffffffff,
all writes are ignored.


STEP 1: Notification
--------------------
Platform calls the error_detected() callback on every instance of
every driver affected by the error.

At this point, the device might not be accessible anymore, depending on
the platform (the slot will be isolated on powerpc). The driver may
already have "noticed" the error because of a failing I/O, but this
is the proper "synchronization point", that is, it gives the driver
a chance to cleanup, waiting for pending stuff (timers, whatever, etc...)
to complete; it can take semaphores, schedule, etc... everything but
touch the device. Within this function and after it returns, the driver
shouldn't do any new IOs. Called in task context. This is sort of a
"quiesce" point. See note about interrupts at the end of this doc.

All drivers participating in this system must implement this call.
The driver must return one of the following result codes:

 - PCI_ERS_RESULT_CAN_RECOVER:
   Driver returns this if it thinks it might be able to recover
   the HW by just banging IOs or if it wants to be given
   a chance to extract some diagnostic information (see
   mmio_enabled(), below).
 - PCI_ERS_RESULT_NEED_RESET:
   Driver returns this if it can't recover without a
   slot reset.
 - PCI_ERS_RESULT_DISCONNECT:
   Driver returns this if it doesn't want to recover at all.

The next step taken will depend on the result codes returned by the
drivers.

If all drivers on the segment/slot return PCI_ERS_RESULT_CAN_RECOVER,
then the platform should re-enable IOs on the slot (or do nothing in
particular, if the platform doesn't isolate slots), and recovery
proceeds to STEP 2 (MMIO Enable).

If any driver requested a slot reset (by returning PCI_ERS_RESULT_NEED_RESET),
then recovery proceeds to STEP 4 (Slot Reset).

If the platform is unable to recover the slot, the next step
is STEP 6 (Permanent Failure).
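
The branching policy just described can be condensed into a small, testable
merge function. This is an illustrative sketch of the policy, not the
platform's actual code; the enum is a local stand-in for the kernel's.

	#include <assert.h>
	#include <stdio.h>

	enum pci_ers_result { PCI_ERS_RESULT_NONE, PCI_ERS_RESULT_CAN_RECOVER,
	                      PCI_ERS_RESULT_NEED_RESET, PCI_ERS_RESULT_DISCONNECT };

	/* Merge per-driver answers as described above: any NEED_RESET wins
	 * over CAN_RECOVER, and DISCONNECT marks the slot unrecoverable. */
	static enum pci_ers_result merge_results(const enum pci_ers_result *r, int n)
	{
		enum pci_ers_result final = PCI_ERS_RESULT_CAN_RECOVER;
		for (int i = 0; i < n; i++) {
			if (r[i] == PCI_ERS_RESULT_DISCONNECT)
				return PCI_ERS_RESULT_DISCONNECT;   /* -> STEP 6 */
			if (r[i] == PCI_ERS_RESULT_NEED_RESET)
				final = PCI_ERS_RESULT_NEED_RESET;  /* -> STEP 4 */
		}
		return final;                                   /* -> STEP 2 */
	}

	int main(void)
	{
		enum pci_ers_result multi[] = { PCI_ERS_RESULT_CAN_RECOVER,
		                                PCI_ERS_RESULT_NEED_RESET };
		enum pci_ers_result ok[]    = { PCI_ERS_RESULT_CAN_RECOVER,
		                                PCI_ERS_RESULT_CAN_RECOVER };
		/* One NEED_RESET on a multi-function card forces a slot reset. */
		assert(merge_results(multi, 2) == PCI_ERS_RESULT_NEED_RESET);
		assert(merge_results(ok, 2) == PCI_ERS_RESULT_CAN_RECOVER);
		printf("policy ok\n");
		return 0;
	}

This is why multi-function devices dominate the design complexity: one
function's answer can override the recovery path of its siblings.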

>>> The current powerpc implementation assumes that a device driver will
>>> *not* schedule or semaphore in this routine; the current powerpc
>>> implementation uses one kernel thread to notify all devices;
>>> thus, if one device sleeps/schedules, all devices are affected.
>>> Doing better requires complex multi-threaded logic in the error
>>> recovery implementation (e.g. waiting for all notification threads
>>> to "join" before proceeding with recovery.) This seems excessively
>>> complex and not worth implementing.

>>> The current powerpc implementation doesn't much care if the device
>>> attempts I/O at this point, or not. I/O's will fail, returning
>>> a value of 0xff on read, and writes will be dropped. If more than
>>> EEH_MAX_FAILS I/O's are attempted to a frozen adapter, EEH
>>> assumes that the device driver has gone into an infinite loop
>>> and prints an error to syslog. A reboot is then required to
>>> get the device working again.

STEP 2: MMIO Enabled
--------------------
The platform re-enables MMIO to the device (but typically not the
DMA), and then calls the mmio_enabled() callback on all affected
device drivers.

This is the "early recovery" call. IOs are allowed again, but DMA is
not, with some restrictions. This is NOT a callback for the driver to
start operations again, only to peek/poke at the device, extract diagnostic
information, if any, and eventually do things like trigger a device local
reset or some such, but not restart operations. This callback is made if
all drivers on a segment agree that they can try to recover and if no automatic
link reset was performed by the HW. If the platform can't just re-enable IOs
without a slot reset or a link reset, it will not call this callback, and
instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset).

>>> The following is proposed; no platform implements this yet:
>>> Proposal: All I/O's should be done _synchronously_ from within
>>> this callback, errors triggered by them will be returned via
>>> the normal pci_check_whatever() API, no new error_detected()
>>> callback will be issued due to an error happening here. However,
>>> such an error might cause IOs to be re-blocked for the whole
>>> segment, and thus invalidate the recovery that other devices
>>> on the same segment might have done, forcing the whole segment
>>> into one of the next states, that is, link reset or slot reset.

The driver should return one of the following result codes:

 - PCI_ERS_RESULT_RECOVERED
   Driver returns this if it thinks the device is fully
   functional and thinks it is ready to start
   normal driver operations again. There is no
   guarantee that the driver will actually be
   allowed to proceed, as another driver on the
   same segment might have failed and thus triggered a
   slot reset on platforms that support it.
 - PCI_ERS_RESULT_NEED_RESET
   Driver returns this if it thinks the device is not
   recoverable in its current state and it needs a slot
   reset to proceed.
 - PCI_ERS_RESULT_DISCONNECT
   Same as above. Total failure, no recovery even after
   reset; driver dead. (To be defined more precisely.)

The next step taken depends on the results returned by the drivers.
If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform
proceeds to either STEP 3 (Link Reset) or to STEP 5 (Resume Operations).

If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform
proceeds to STEP 4 (Slot Reset).

STEP 3: Link Reset
------------------
The platform resets the link. This is a PCI-Express specific step
and is done whenever a fatal error has been detected that can be
"solved" by resetting the link.

STEP 4: Slot Reset
------------------

In response to a return value of PCI_ERS_RESULT_NEED_RESET, the
platform will perform a slot reset on the requesting PCI device(s).
The actual steps taken by a platform to perform a slot reset
will be platform-dependent. Upon completion of slot reset, the
platform will call the device slot_reset() callback.

Powerpc platforms implement two levels of slot reset:
soft reset (default) and fundamental (optional) reset.

Powerpc soft reset consists of asserting the adapter #RST line and then
restoring the PCI BAR's and PCI configuration header to a state
that is equivalent to what it would be after a fresh system
power-on followed by power-on BIOS/system firmware initialization.
Soft reset is also known as hot-reset.

Powerpc fundamental reset is supported by PCI Express cards only
and results in the device's state machines, hardware logic, port states and
configuration registers initializing to their default conditions.

For most PCI devices, a soft reset will be sufficient for recovery.
Optional fundamental reset is provided to support a limited number
of PCI Express devices for which a soft reset is not sufficient
for recovery.

If the platform supports PCI hotplug, then the reset might be
performed by toggling the slot electrical power off/on.

It is important for the platform to restore the PCI config space
to the "fresh poweron" state, rather than the "last state". After
a slot reset, the device driver will almost always use its standard
device initialization routines, and an unusual config space setup
may result in hung devices, kernel panics, or silent data corruption.

This call gives drivers the chance to re-initialize the hardware
(re-download firmware, etc.). At this point, the driver may assume
that the card is in a fresh state and is fully functional. The slot
is unfrozen and the driver has full access to PCI config space,
memory mapped I/O space and DMA. Interrupts (Legacy, MSI, or MSI-X)
will also be available.

Drivers should not restart normal I/O processing operations
at this point. If all device drivers report success on this
callback, the platform will call resume() to complete the sequence,
and let the driver restart normal I/O processing.

A driver can still return a critical failure for this function if
it can't get the device operational after reset. If the platform
previously tried a soft reset, it might now try a hard reset (power
cycle) and then call slot_reset() again. If the device still can't
be recovered, there is nothing more that can be done; the platform
will typically report a "permanent failure" in such a case. The
device will be considered "dead" in this case.

Drivers for multi-function cards will need to coordinate among
themselves as to which driver instance will perform any "one-shot"
or global device initialization. For example, the Symbios sym53cxx2
driver performs device init only from PCI function 0:

	+	if (PCI_FUNC(pdev->devfn) == 0)
	+		sym_reset_scsi_bus(np, 0);
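
The `PCI_FUNC(pdev->devfn) == 0` test above works because PCI packs the
device (slot) and function numbers into one byte. The sketch below mirrors
the kernel's PCI_SLOT()/PCI_FUNC() macros in a self-contained program; the
slot/function values are made up for illustration.

	#include <assert.h>
	#include <stdio.h>

	/* devfn = slot << 3 | func, as in the kernel's <linux/pci.h> macros. */
	#define PCI_SLOT(devfn) (((devfn) >> 3) & 0x1f)
	#define PCI_FUNC(devfn) ((devfn) & 0x07)

	int main(void)
	{
		unsigned int devfn = (4u << 3) | 2u;   /* slot 4, function 2 */
		assert(PCI_SLOT(devfn) == 4);
		assert(PCI_FUNC(devfn) == 2);
		/* A non-zero function skips the one-shot init in the example above. */
		if (PCI_FUNC(devfn) != 0)
			printf("func=%d skips one-shot init\n", PCI_FUNC(devfn));
		return 0;
	}

Each function of a multi-function card gets its own pci_dev and its own
callbacks, so exactly one instance must volunteer for global initialization.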

Result codes:
 - PCI_ERS_RESULT_DISCONNECT
   Same as above.

Drivers for PCI Express cards that require a fundamental reset must
set the needs_freset bit in the pci_dev structure in their probe function.
For example, the QLogic qla2xxx driver sets the needs_freset bit for certain
PCI card types:

	+	/* Set EEH reset type to fundamental if required by hba  */
	+	if (IS_QLA24XX(ha) || IS_QLA25XX(ha) || IS_QLA81XX(ha))
	+		pdev->needs_freset = 1;
	+

Platform proceeds either to STEP 5 (Resume Operations) or STEP 6 (Permanent
Failure).

>>> The current powerpc implementation does not try a power-cycle
>>> reset if the driver returned PCI_ERS_RESULT_DISCONNECT.
>>> However, it probably should.


STEP 5: Resume Operations
-------------------------
The platform will call the resume() callback on all affected device
drivers if all drivers on the segment have returned
PCI_ERS_RESULT_RECOVERED from one of the 3 previous callbacks.
The goal of this callback is to tell the driver to restart activity,
that everything is back and running. This callback does not return
a result code.

At this point, if a new error happens, the platform will restart
a new error recovery sequence.

STEP 6: Permanent Failure
-------------------------
A "permanent failure" has occurred, and the platform cannot recover
the device. The platform will call error_detected() with a
pci_channel_state value of pci_channel_io_perm_failure.

The device driver should, at this point, assume the worst. It should
cancel all pending I/O, refuse all new I/O, returning -EIO to
higher layers. The device driver should then clean up all of its
memory and remove itself from kernel operations, much as it would
during system shutdown.

The platform will typically notify the system operator of the
permanent failure in some way. If the device is hotplug-capable,
the operator will probably want to remove and replace the device.
Note, however, not all failures are truly "permanent". Some are
caused by over-heating, some by a poorly seated card. Many
PCI error events are caused by software bugs, e.g. DMA's to
wild addresses or bogus split transactions due to programming
errors. See the discussion in powerpc/eeh-pci-error-recovery.txt
for additional detail on real-life experience of the causes of
software errors.


Conclusion; General Remarks
---------------------------
The way the callbacks are called is platform policy. A platform with
no slot reset capability may want to just "ignore" drivers that can't
recover (disconnect them) and try to let other cards on the same segment
recover. Keep in mind that in most real life cases, though, there will
be only one driver per segment.

Now, a note about interrupts. If you get an interrupt and your
device is dead or has been isolated, there is a problem :)
The current policy is to turn this into a platform policy.
That is, the recovery API only requires that:

 - There is no guarantee that interrupt delivery can proceed from any
   device on the segment starting from the error detection and until the
   slot_reset callback is called, at which point interrupts are expected
   to be fully operational.

 - There is no guarantee that interrupt delivery is stopped, that is,
   a driver that gets an interrupt after detecting an error, or that detects
   an error within the interrupt handler such that it prevents proper
   ack'ing of the interrupt (and thus removal of the source) should just
   return IRQ_NOTHANDLED. It's up to the platform to deal with that
   condition, typically by masking the IRQ source during the duration of
   the error handling. It is expected that the platform "knows" which
   interrupts are routed to error-management capable slots and can deal
   with temporarily disabling that IRQ number during error processing (this
   isn't terribly complex). That means some IRQ latency for other devices
   sharing the interrupt, but there is simply no other way. High end
   platforms aren't supposed to share interrupts between many devices
   anyway :)

>>> Implementation details for the powerpc platform are discussed in
>>> the file Documentation/powerpc/eeh-pci-error-recovery.txt

>>> As of this writing, there is a growing list of device drivers with
>>> patches implementing error recovery. Not all of these patches are in
>>> mainline yet. These may be used as "examples":
>>>
>>> drivers/scsi/ipr
>>> drivers/scsi/sym53c8xx_2
>>> drivers/scsi/qla2xxx
>>> drivers/scsi/lpfc
>>> drivers/net/bnx2.c
>>> drivers/net/e100.c
>>> drivers/net/e1000
>>> drivers/net/e1000e
>>> drivers/net/ixgb
>>> drivers/net/ixgbe
>>> drivers/net/cxgb3
>>> drivers/net/s2io.c
>>> drivers/net/qlge

The End
-------
172	Documentation/PCI/pci-iov-howto.rst	Normal file
@ -0,0 +1,172 @@
.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

====================================
PCI Express I/O Virtualization Howto
====================================

:Copyright: |copy| 2009 Intel Corporation
:Authors: - Yu Zhao <yu.zhao@intel.com>
          - Donald Dutile <ddutile@redhat.com>

Overview
========

What is SR-IOV
--------------

Single Root I/O Virtualization (SR-IOV) is a PCI Express Extended
capability which makes one physical device appear as multiple virtual
devices. The physical device is referred to as a Physical Function (PF)
while the virtual devices are referred to as Virtual Functions (VFs).
Allocation of VFs can be dynamically controlled by the PF via
registers encapsulated in the capability. By default, this feature is
not enabled and the PF behaves as a traditional PCIe device. Once it's
turned on, each VF's PCI configuration space can be accessed by its own
Bus, Device and Function Number (Routing ID), and each VF also has PCI
Memory Space, which is used to map its register set. The VF device driver
operates on the register set so the VF can be functional and appear as a
real existing PCI device.
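
The Routing ID mentioned above is arithmetic: per the SR-IOV capability,
VF n (1-based) sits at the PF's Routing ID plus First VF Offset plus
(n - 1) times VF Stride. The sketch below illustrates this; the offset,
stride, and bus/device values are hypothetical (real ones are read from
the device's SR-IOV capability registers).

	#include <assert.h>
	#include <stdio.h>

	/* VF n's Routing ID = RID(PF) + FirstVFOffset + (n - 1) * VFStride. */
	static unsigned int vf_rid(unsigned int pf_rid, unsigned int first_off,
	                           unsigned int stride, unsigned int n)
	{
		return pf_rid + first_off + (n - 1) * stride;
	}

	int main(void)
	{
		/* Hypothetical PF at 0000:03:00.0 -> RID = bus << 8 | devfn. */
		unsigned int pf = (3u << 8) | 0u;
		assert(vf_rid(pf, 1, 1, 1) == 0x0301); /* VF1 at 03:00.1 */
		assert(vf_rid(pf, 1, 1, 8) == 0x0308); /* VF8 at 03:01.0 */
		printf("vf1 rid=0x%04x\n", vf_rid(pf, 1, 1, 1));
		return 0;
	}

Because the stride can carry a VF past function 7, VFs may land on bus/device
numbers where no physical device exists, which is why each VF needs its own
full Routing ID rather than just a function number.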

User Guide
==========

How can I enable SR-IOV capability
----------------------------------

Multiple methods are available for SR-IOV enablement.
In the first method, the device driver (PF driver) will control the
enabling and disabling of the capability via API provided by SR-IOV core.
If the hardware has SR-IOV capability, loading its PF driver would
enable it and all VFs associated with the PF. Some PF drivers require
a module parameter to be set to determine the number of VFs to enable.
In the second method, a write to the sysfs file sriov_numvfs will
enable and disable the VFs associated with a PCIe PF. This method
enables per-PF, VF enable/disable values versus the first method,
which applies to all PFs of the same device. Additionally, the
PCI SRIOV core support ensures that enable/disable operations are
valid to reduce duplication in multiple drivers for the same
checks, e.g., check numvfs == 0 if enabling VFs, ensure
numvfs <= totalvfs.
The second method is the recommended method for new/future VF devices.

How can I use the Virtual Functions
-----------------------------------

VFs are treated as hot-plugged PCI devices in the kernel, so they
should be able to work in the same way as real PCI devices. A VF
requires a device driver, the same as a normal PCI device does.

Developer Guide
===============

SR-IOV API
----------

To enable SR-IOV capability:

(a) For the first method, in the driver::

	int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);

'nr_virtfn' is number of VFs to be enabled.

(b) For the second method, from sysfs::

	echo 'nr_virtfn' > \
	        /sys/bus/pci/devices/<DOMAIN:BUS:DEVICE.FUNCTION>/sriov_numvfs

To disable SR-IOV capability:

(a) For the first method, in the driver::

	void pci_disable_sriov(struct pci_dev *dev);

(b) For the second method, from sysfs::

	echo 0 > \
	        /sys/bus/pci/devices/<DOMAIN:BUS:DEVICE.FUNCTION>/sriov_numvfs

To enable auto probing VFs by a compatible driver on the host, run
command below before enabling SR-IOV capabilities. This is the
default behavior.
::

	echo 1 > \
	        /sys/bus/pci/devices/<DOMAIN:BUS:DEVICE.FUNCTION>/sriov_drivers_autoprobe

To disable auto probing VFs by a compatible driver on the host, run
command below before enabling SR-IOV capabilities. Updating this
entry will not affect VFs which are already probed.
::

	echo 0 > \
	        /sys/bus/pci/devices/<DOMAIN:BUS:DEVICE.FUNCTION>/sriov_drivers_autoprobe

Usage example
-------------

The following piece of code illustrates the usage of the SR-IOV API.
::

	static int dev_probe(struct pci_dev *dev, const struct pci_device_id *id)
	{
		pci_enable_sriov(dev, NR_VIRTFN);

		...

		return 0;
	}

	static void dev_remove(struct pci_dev *dev)
	{
		pci_disable_sriov(dev);

		...
	}

	static int dev_suspend(struct pci_dev *dev, pm_message_t state)
	{
		...

		return 0;
	}

	static int dev_resume(struct pci_dev *dev)
	{
		...

		return 0;
	}

	static void dev_shutdown(struct pci_dev *dev)
	{
		...
	}

	static int dev_sriov_configure(struct pci_dev *dev, int numvfs)
	{
		if (numvfs > 0) {
			...
			pci_enable_sriov(dev, numvfs);
			...
			return numvfs;
		}
		if (numvfs == 0) {
			....
			pci_disable_sriov(dev);
			...
			return 0;
		}
	}

	static struct pci_driver dev_driver = {
		.name			= "SR-IOV Physical Function driver",
		.id_table		= dev_id_table,
		.probe			= dev_probe,
		.remove			= dev_remove,
		.suspend		= dev_suspend,
		.resume			= dev_resume,
		.shutdown		= dev_shutdown,
		.sriov_configure	= dev_sriov_configure,
	};
|
|
||||||
};
|
|
Documentation/PCI/pci.rst (new file, 578 lines)
.. SPDX-License-Identifier: GPL-2.0

==============================
How To Write Linux PCI Drivers
==============================

:Authors: - Martin Mares <mj@ucw.cz>
          - Grant Grundler <grundler@parisc-linux.org>

The world of PCI is vast and full of (mostly unpleasant) surprises.
Since each CPU architecture implements different chip-sets and PCI devices
have different requirements (erm, "features"), the result is the PCI support
in the Linux kernel is not as trivial as one would wish. This short paper
tries to introduce all potential driver authors to Linux APIs for
PCI device drivers.

A more complete resource is the third edition of "Linux Device Drivers"
by Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman.
LDD3 is available for free (under Creative Commons License) from:
http://lwn.net/Kernel/LDD3/.

However, keep in mind that all documents are subject to "bit rot".
Refer to the source code if things are not working as described here.

Please send questions/comments/patches about Linux PCI API to the
"Linux PCI" <linux-pci@atrey.karlin.mff.cuni.cz> mailing list.


Structure of PCI drivers
========================
PCI drivers "discover" PCI devices in a system via pci_register_driver().
Actually, it's the other way around. When the PCI generic code discovers
a new device, the driver with a matching "description" will be notified.
Details on this below.

pci_register_driver() leaves most of the probing for devices to
the PCI layer and supports online insertion/removal of devices [thus
supporting hot-pluggable PCI, CardBus, and Express-Card in a single driver].
The pci_register_driver() call requires passing in a table of function
pointers and thus dictates the high level structure of a driver.

Once the driver knows about a PCI device and takes ownership, the
driver generally needs to perform the following initialization:

  - Enable the device
  - Request MMIO/IOP resources
  - Set the DMA mask size (for both coherent and streaming DMA)
  - Allocate and initialize shared control data (pci_allocate_coherent())
  - Access device configuration space (if needed)
  - Register IRQ handler (request_irq())
  - Initialize non-PCI (i.e. LAN/SCSI/etc parts of the chip)
  - Enable DMA/processing engines

When done using the device, and perhaps the module needs to be unloaded,
the driver needs to take the following steps:

  - Disable the device from generating IRQs
  - Release the IRQ (free_irq())
  - Stop all DMA activity
  - Release DMA buffers (both streaming and coherent)
  - Unregister from other subsystems (e.g. scsi or netdev)
  - Release MMIO/IOP resources
  - Disable the device

Most of these topics are covered in the following sections.
For the rest look at LDD3 or <linux/pci.h>.

If the PCI subsystem is not configured (CONFIG_PCI is not set), most of
the PCI functions described below are defined as inline functions either
completely empty or just returning an appropriate error code to avoid
lots of ifdefs in the drivers.
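The initialization and teardown sequences above can be sketched as a minimal
driver skeleton. This is only an illustration: the driver name, the
``MY_VENDOR_ID``/``MY_DEVICE_ID`` macros, and the abbreviated error handling
are placeholders, not a real driver::

	/* Hypothetical minimal driver skeleton; MY_VENDOR_ID and
	 * MY_DEVICE_ID are placeholders, not real IDs. */
	#include <linux/module.h>
	#include <linux/pci.h>

	static const struct pci_device_id my_ids[] = {
		{ PCI_DEVICE(MY_VENDOR_ID, MY_DEVICE_ID) },
		{ 0, }
	};
	MODULE_DEVICE_TABLE(pci, my_ids);

	static int my_probe(struct pci_dev *pdev, const struct pci_device_id *id)
	{
		int rc;

		rc = pci_enable_device(pdev);		/* wake up the device */
		if (rc)
			return rc;
		rc = pci_request_regions(pdev, "my_drv");	/* claim the BARs */
		if (rc)
			goto err_disable;
		pci_set_master(pdev);			/* allow the device to DMA */
		/* ... set DMA mask, allocate control data, request_irq() ... */
		return 0;

	err_disable:
		pci_disable_device(pdev);
		return rc;
	}

	static void my_remove(struct pci_dev *pdev)
	{
		/* ... free_irq(), stop DMA, free buffers ... */
		pci_release_regions(pdev);
		pci_disable_device(pdev);
	}

	static struct pci_driver my_driver = {
		.name		= "my_drv",
		.id_table	= my_ids,
		.probe		= my_probe,
		.remove		= my_remove,
	};
	module_pci_driver(my_driver);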

pci_register_driver() call
==========================

PCI device drivers call ``pci_register_driver()`` during their
initialization with a pointer to a structure describing the driver
(``struct pci_driver``):

.. kernel-doc:: include/linux/pci.h
   :functions: pci_driver

The ID table is an array of ``struct pci_device_id`` entries ending with an
all-zero entry.  Definitions with static const are generally preferred.

.. kernel-doc:: include/linux/mod_devicetable.h
   :functions: pci_device_id

Most drivers only need ``PCI_DEVICE()`` or ``PCI_DEVICE_CLASS()`` to set up
a pci_device_id table.

New PCI IDs may be added to a device driver pci_ids table at runtime
as shown below::

  echo "vendor device subvendor subdevice class class_mask driver_data" > \
  /sys/bus/pci/drivers/{driver}/new_id

All fields are passed in as hexadecimal values (no leading 0x).
The vendor and device fields are mandatory, the others are optional.  Users
need to pass only as many optional fields as necessary:

  - subvendor and subdevice fields default to PCI_ANY_ID (FFFFFFFF)
  - class and classmask fields default to 0
  - driver_data defaults to 0UL.

Note that driver_data must match the value used by any of the pci_device_id
entries defined in the driver.  This makes the driver_data field mandatory
if all the pci_device_id entries have a non-zero driver_data value.

Once added, the driver probe routine will be invoked for any unclaimed
PCI devices listed in its (newly updated) pci_ids list.

When the driver exits, it just calls pci_unregister_driver() and the PCI layer
automatically calls the remove hook for all devices handled by the driver.
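As a sketch, an ID table matching one specific device plus every device of a
given class might look like the following (the vendor/device IDs shown are
made up for illustration)::

	/* Hypothetical ID table: one explicit vendor/device pair plus every
	 * Ethernet-class device; driver_data distinguishes the matches. */
	static const struct pci_device_id example_ids[] = {
		{ PCI_DEVICE(0x1234, 0x5678), .driver_data = 1 },
		{ PCI_DEVICE_CLASS(PCI_CLASS_NETWORK_ETHERNET << 8, 0xffff00),
		  .driver_data = 2 },
		{ 0, }	/* all-zero terminator */
	};
	MODULE_DEVICE_TABLE(pci, example_ids);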

"Attributes" for driver functions/data
--------------------------------------

Please mark the initialization and cleanup functions where appropriate
(the corresponding macros are defined in <linux/init.h>):

	====== =================================================
	__init Initialization code. Thrown away after the driver
	       initializes.
	__exit Exit code. Ignored for non-modular drivers.
	====== =================================================

Tips on when/where to use the above attributes:

  - The module_init()/module_exit() functions (and all
    initialization functions called _only_ from these)
    should be marked __init/__exit.

  - Do not mark the struct pci_driver.

  - Do NOT mark a function if you are not sure which mark to use.
    Better to not mark the function than mark the function wrong.

How to find PCI devices manually
================================

PCI drivers should have a really good reason for not using the
pci_register_driver() interface to search for PCI devices.
The main reason PCI devices are controlled by multiple drivers
is because one PCI device implements several different HW services.
E.g. combined serial/parallel port/floppy controller.

A manual search may be performed using the following constructs:

Searching by vendor and device ID::

	struct pci_dev *dev = NULL;
	while ((dev = pci_get_device(VENDOR_ID, DEVICE_ID, dev)))
		configure_device(dev);

Searching by class ID (iterate in a similar way)::

	pci_get_class(CLASS_ID, dev)

Searching by both vendor/device and subsystem vendor/device ID::

	pci_get_subsys(VENDOR_ID, DEVICE_ID, SUBSYS_VENDOR_ID, SUBSYS_DEVICE_ID, dev).

You can use the constant PCI_ANY_ID as a wildcard replacement for
VENDOR_ID or DEVICE_ID.  This allows searching for any device from a
specific vendor, for example.

These functions are hotplug-safe.  They increment the reference count on
the pci_dev that they return.  You must eventually (possibly at module unload)
decrement the reference count on these devices by calling pci_dev_put().

Device Initialization Steps
===========================

As noted in the introduction, most PCI drivers need the following steps
for device initialization:

  - Enable the device
  - Request MMIO/IOP resources
  - Set the DMA mask size (for both coherent and streaming DMA)
  - Allocate and initialize shared control data (pci_allocate_coherent())
  - Access device configuration space (if needed)
  - Register IRQ handler (request_irq())
  - Initialize non-PCI (i.e. LAN/SCSI/etc parts of the chip)
  - Enable DMA/processing engines.

The driver can access PCI config space registers at any time.
(Well, almost. When running BIST, config space can go away...but
that will just result in a PCI Bus Master Abort and config reads
will return garbage).

Enable the PCI device
---------------------
Before touching any device registers, the driver needs to enable
the PCI device by calling pci_enable_device(). This will:

  - wake up the device if it was in suspended state,
  - allocate I/O and memory regions of the device (if BIOS did not),
  - allocate an IRQ (if BIOS did not).

.. note::
   pci_enable_device() can fail! Check the return value.

.. warning::
   OS BUG: we don't check resource allocations before enabling those
   resources. The sequence would make more sense if we called
   pci_request_resources() before calling pci_enable_device().
   Currently, the device drivers can't detect the bug when two
   devices have been allocated the same range. This is not a common
   problem and unlikely to get fixed soon.

   This has been discussed before but not changed as of 2.6.19:
   http://lkml.org/lkml/2006/3/2/194

pci_set_master() will enable DMA by setting the bus master bit
in the PCI_COMMAND register. It also fixes the latency timer value if
it's set to something bogus by the BIOS. pci_clear_master() will
disable DMA by clearing the bus master bit.

If the PCI device can use the PCI Memory-Write-Invalidate transaction,
call pci_set_mwi(). This enables the PCI_COMMAND bit for Mem-Wr-Inval
and also ensures that the cache line size register is set correctly.
Check the return value of pci_set_mwi() as not all architectures
or chip-sets may support Memory-Write-Invalidate. Alternatively,
if Mem-Wr-Inval would be nice to have but is not required, call
pci_try_set_mwi() to have the system do its best effort at enabling
Mem-Wr-Inval.
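A sketch of the enable sequence, with the return values checked as the notes
above require (``pdev`` is assumed to be the device passed to the probe
callback)::

	/* Sketch: enable the device and bus mastering, checking for failure. */
	int rc = pci_enable_device(pdev);
	if (rc)
		return rc;		/* device could not be enabled */

	pci_set_master(pdev);		/* set the bus master bit in PCI_COMMAND */

	if (pci_try_set_mwi(pdev))	/* Mem-Wr-Inval is optional here */
		dev_info(&pdev->dev, "Memory-Write-Invalidate not available\n");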

Request MMIO/IOP resources
--------------------------
Memory (MMIO) and I/O port addresses should NOT be read directly
from the PCI device config space. Use the values in the pci_dev structure
as the PCI "bus address" might have been remapped to a "host physical"
address by the arch/chip-set specific kernel support.

See Documentation/io-mapping.txt for how to access device registers
or device memory.

The device driver needs to call pci_request_region() to verify
no other device is already using the same address resource.
Conversely, drivers should call pci_release_region() AFTER
calling pci_disable_device().
The idea is to prevent two devices colliding on the same address range.

.. tip::
   See OS BUG comment above. Currently (2.6.19), the driver can only
   determine MMIO and IO Port resource availability _after_ calling
   pci_enable_device().

Generic flavors of pci_request_region() are request_mem_region()
(for MMIO ranges) and request_region() (for IO Port ranges).
Use these for address resources that are not described by "normal" PCI
BARs.

Also see pci_request_selected_regions() below.
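For instance, claiming and mapping BAR 0 might be sketched as follows (the
BAR number and driver name are assumptions for illustration)::

	/* Sketch: claim BAR 0 and map it, taking the length from the
	 * pci_dev resource values rather than reading config space. */
	void __iomem *regs;
	int rc;

	rc = pci_request_region(pdev, 0, "my_drv");	/* reserve BAR 0 */
	if (rc)
		return rc;

	regs = pci_iomap(pdev, 0, pci_resource_len(pdev, 0));
	if (!regs) {
		pci_release_region(pdev, 0);
		return -ENOMEM;
	}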

Set the DMA mask size
---------------------
.. note::
   If anything below doesn't make sense, please refer to
   Documentation/DMA-API.txt. This section is just a reminder that
   drivers need to indicate DMA capabilities of the device and is not
   an authoritative source for DMA interfaces.

While all drivers should explicitly indicate the DMA capability
(e.g. 32 or 64 bit) of the PCI bus master, devices with more than
32-bit bus master capability for streaming data need the driver
to "register" this capability by calling pci_set_dma_mask() with
appropriate parameters.  In general this allows more efficient DMA
on systems where System RAM exists above 4G _physical_ address.

Drivers for all PCI-X and PCIe compliant devices must call
pci_set_dma_mask() as they are 64-bit DMA devices.

Similarly, drivers must also "register" this capability if the device
can directly address "consistent memory" in System RAM above 4G physical
address by calling pci_set_consistent_dma_mask().
Again, this includes drivers for all PCI-X and PCIe compliant devices.
Many 64-bit "PCI" devices (before PCI-X) and some PCI-X devices are
64-bit DMA capable for payload ("streaming") data but not control
("consistent") data.
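A common pattern, sketched here rather than prescribed, is to try a 64-bit
mask first and fall back to 32-bit::

	/* Sketch: prefer 64-bit DMA, fall back to 32-bit if unsupported. */
	if (pci_set_dma_mask(pdev, DMA_BIT_MASK(64)) == 0) {
		pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(64));
	} else if (pci_set_dma_mask(pdev, DMA_BIT_MASK(32)) == 0) {
		pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(32));
	} else {
		dev_err(&pdev->dev, "no usable DMA configuration\n");
		return -EIO;
	}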

Setup shared control data
-------------------------
Once the DMA masks are set, the driver can allocate "consistent" (a.k.a. shared)
memory.  See Documentation/DMA-API.txt for a full description of
the DMA APIs. This section is just a reminder that it needs to be done
before enabling DMA on the device.

Initialize device registers
---------------------------
Some drivers will need specific "capability" fields programmed
or other "vendor specific" registers initialized or reset.
E.g. clearing pending interrupts.

Register IRQ handler
--------------------
While calling request_irq() is the last step described here,
this is often just another intermediate step to initialize a device.
This step can often be deferred until the device is opened for use.

All interrupt handlers for IRQ lines should be registered with IRQF_SHARED
and use the devid to map IRQs to devices (remember that all PCI IRQ lines
can be shared).

request_irq() will associate an interrupt handler and device handle
with an interrupt number. Historically interrupt numbers represent
IRQ lines which run from the PCI device to the Interrupt controller.
With MSI and MSI-X (more below) the interrupt number is a CPU "vector".

request_irq() also enables the interrupt. Make sure the device is
quiesced and does not have any interrupts pending before registering
the interrupt handler.

MSI and MSI-X are PCI capabilities. Both are "Message Signaled Interrupts"
which deliver interrupts to the CPU via a DMA write to a Local APIC.
The fundamental difference between MSI and MSI-X is how multiple
"vectors" get allocated. MSI requires contiguous blocks of vectors
while MSI-X can allocate several individual ones.

MSI capability can be enabled by calling pci_alloc_irq_vectors() with the
PCI_IRQ_MSI and/or PCI_IRQ_MSIX flags before calling request_irq(). This
causes the PCI support to program CPU vector data into the PCI device
capability registers. Many architectures, chip-sets, or BIOSes do NOT
support MSI or MSI-X and a call to pci_alloc_irq_vectors with just
the PCI_IRQ_MSI and PCI_IRQ_MSIX flags will fail, so try to always
specify PCI_IRQ_LEGACY as well.

Drivers that have different interrupt handlers for MSI/MSI-X and
legacy INTx should choose the right one based on the msi_enabled
and msix_enabled flags in the pci_dev structure after calling
pci_alloc_irq_vectors.

There are (at least) two really good reasons for using MSI:

1) MSI is an exclusive interrupt vector by definition.
   This means the interrupt handler doesn't have to verify
   its device caused the interrupt.

2) MSI avoids DMA/IRQ race conditions. DMA to host memory is guaranteed
   to be visible to the host CPU(s) when the MSI is delivered. This
   is important for both data coherency and avoiding stale control data.
   This guarantee allows the driver to omit MMIO reads to flush
   the DMA stream.

See drivers/infiniband/hw/mthca/ or drivers/net/tg3.c for examples
of MSI/MSI-X usage.
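The allocation described above can be sketched as follows (the handler name,
driver name, and single-vector count are illustrative assumptions)::

	/* Sketch: ask for one vector, preferring MSI-X, then MSI, then
	 * legacy INTx; IRQF_SHARED covers the legacy INTx case. */
	int rc;
	int nvec = pci_alloc_irq_vectors(pdev, 1, 1,
			PCI_IRQ_MSIX | PCI_IRQ_MSI | PCI_IRQ_LEGACY);
	if (nvec < 0)
		return nvec;

	rc = request_irq(pci_irq_vector(pdev, 0), my_irq_handler,
			 IRQF_SHARED, "my_drv", pdev);
	if (rc)
		pci_free_irq_vectors(pdev);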

PCI device shutdown
===================

When a PCI device driver is being unloaded, most of the following
steps need to be performed:

  - Disable the device from generating IRQs
  - Release the IRQ (free_irq())
  - Stop all DMA activity
  - Release DMA buffers (both streaming and consistent)
  - Unregister from other subsystems (e.g. scsi or netdev)
  - Disable device from responding to MMIO/IO Port addresses
  - Release MMIO/IO Port resource(s)


Stop IRQs on the device
-----------------------
How to do this is chip/device specific. If it's not done, it opens
the possibility of a "screaming interrupt" if (and only if)
the IRQ is shared with another device.

When the shared IRQ handler is "unhooked", the remaining devices
using the same IRQ line will still need the IRQ enabled. Thus if the
"unhooked" device asserts the IRQ line, the system will respond assuming
it was one of the remaining devices that asserted the IRQ line. Since none
of the other devices will handle the IRQ, the system will "hang" until
it decides the IRQ isn't going to get handled and masks the IRQ (100,000
iterations later). Once the shared IRQ is masked, the remaining devices
will stop functioning properly. Not a nice situation.

This is another reason to use MSI or MSI-X if it's available.
MSI and MSI-X are defined to be exclusive interrupts and thus
are not susceptible to the "screaming interrupt" problem.


Release the IRQ
---------------
Once the device is quiesced (no more IRQs), one can call free_irq().
This function will return control once any pending IRQs are handled,
"unhook" the driver's IRQ handler from that IRQ, and finally release
the IRQ if no one else is using it.


Stop all DMA activity
---------------------
It's extremely important to stop all DMA operations BEFORE attempting
to deallocate DMA control data. Failure to do so can result in memory
corruption, hangs, and on some chip-sets a hard crash.

Stopping DMA after stopping the IRQs can avoid races where the
IRQ handler might restart DMA engines.

While this step sounds obvious and trivial, several "mature" drivers
didn't get this step right in the past.


Release DMA buffers
-------------------
Once DMA is stopped, clean up streaming DMA first.
I.e. unmap data buffers and return buffers to "upstream"
owners if there is one.

Then clean up "consistent" buffers which contain the control data.

See Documentation/DMA-API.txt for details on unmapping interfaces.


Unregister from other subsystems
--------------------------------
Most low level PCI device drivers support some other subsystem
like USB, ALSA, SCSI, NetDev, Infiniband, etc. Make sure your
driver isn't losing resources from that other subsystem.
If this happens, typically the symptom is an Oops (panic) when
the subsystem attempts to call into a driver that has been unloaded.


Disable Device from responding to MMIO/IO Port addresses
--------------------------------------------------------
iounmap() MMIO or IO Port resources and then call pci_disable_device().
This is the symmetric opposite of pci_enable_device().
Do not access device registers after calling pci_disable_device().


Release MMIO/IO Port Resource(s)
--------------------------------
Call pci_release_region() to mark the MMIO or IO Port range as available.
Failure to do so usually results in the inability to reload the driver.
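Put together, a remove callback following this ordering might be sketched as
below (the names are hypothetical; ``regs`` is assumed to be the mapping
saved at probe time, and the device-specific steps are only comments)::

	static void my_remove(struct pci_dev *pdev)
	{
		/* 1. device-specific: mask device IRQs, stop DMA engines */
		/* 2. release the IRQ and its vectors */
		free_irq(pci_irq_vector(pdev, 0), pdev);
		pci_free_irq_vectors(pdev);
		/* 3-4. unmap streaming buffers first, then free the
		 *      consistent control data */
		/* 5. unregister from netdev/scsi/etc. if applicable */
		/* 6. stop responding to MMIO/IO Port addresses */
		pci_iounmap(pdev, regs);
		pci_disable_device(pdev);
		/* 7. make the address ranges available again */
		pci_release_region(pdev, 0);
	}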

How to access PCI config space
==============================

You can use `pci_(read|write)_config_(byte|word|dword)` to access the config
space of a device represented by `struct pci_dev *`. All these functions return
0 when successful or an error code (`PCIBIOS_...`) which can be translated to a
text string by pcibios_strerror. Most drivers expect that accesses to valid PCI
devices don't fail.

If you don't have a struct pci_dev available, you can call
`pci_bus_(read|write)_config_(byte|word|dword)` to access a given device
and function on that bus.

If you access fields in the standard portion of the config header, please
use symbolic names of locations and bits declared in <linux/pci.h>.

If you need to access Extended PCI Capability registers, just call
pci_find_capability() for the particular capability and it will find the
corresponding register block for you.
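For example, reading the revision ID from the standard config header, with
the return value checked, could be sketched as::

	/* Sketch: read one byte from the config header, using the symbolic
	 * offset from <linux/pci.h> rather than a raw number. */
	u8 revision;
	int rc = pci_read_config_byte(pdev, PCI_REVISION_ID, &revision);
	if (rc)
		dev_warn(&pdev->dev, "config read failed: %d\n", rc);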

Other interesting functions
===========================

============================= ================================================
pci_get_domain_bus_and_slot() Find pci_dev corresponding to given domain,
                              bus, and slot number. If the device is
                              found, its reference count is increased.
pci_set_power_state()         Set PCI Power Management state (0=D0 ... 3=D3)
pci_find_capability()         Find specified capability in device's capability
                              list.
pci_resource_start()          Returns bus start address for a given PCI region
pci_resource_end()            Returns bus end address for a given PCI region
pci_resource_len()            Returns the byte length of a PCI region
pci_set_drvdata()             Set private driver data pointer for a pci_dev
pci_get_drvdata()             Return private driver data pointer for a pci_dev
pci_set_mwi()                 Enable Memory-Write-Invalidate transactions.
pci_clear_mwi()               Disable Memory-Write-Invalidate transactions.
============================= ================================================

Miscellaneous hints
===================

When displaying PCI device names to the user (for example when a driver wants
to tell the user what card it has found), please use pci_name(pci_dev).

Always refer to the PCI devices by a pointer to the pci_dev structure.
All PCI layer functions use this identification and it's the only
reasonable one. Don't use bus/slot/function numbers except for very
special purposes -- on systems with multiple primary buses their semantics
can be pretty complex.

Don't try to turn on Fast Back to Back writes in your driver. All devices
on the bus need to be capable of doing it, so this is something which needs
to be handled by platform and generic code, not individual drivers.

Vendor and device identifications
=================================

Do not add new device or vendor IDs to include/linux/pci_ids.h unless they
are shared across multiple drivers. You can add private definitions in
your driver if they're helpful, or just use plain hex constants.

The device IDs are arbitrary hex numbers (vendor controlled) and normally used
only in a single location, the pci_device_id table.

Please DO submit new vendor/device IDs to http://pci-ids.ucw.cz/.
There are mirrors of the pci.ids file at http://pciids.sourceforge.net/
and https://github.com/pciutils/pciids.


Obsolete functions
==================

There are several functions which you might come across when trying to
port an old driver to the new PCI interface. They are no longer present
in the kernel as they aren't compatible with hotplug or PCI domains or
having sane locking.

=================  ===========================================
pci_find_device()  Superseded by pci_get_device()
pci_find_subsys()  Superseded by pci_get_subsys()
pci_find_slot()    Superseded by pci_get_domain_bus_and_slot()
pci_get_slot()     Superseded by pci_get_domain_bus_and_slot()
=================  ===========================================

The alternative is the traditional PCI device driver that walks PCI
device lists. This is still possible but discouraged.

MMIO Space and "Write Posting"
==============================

Converting a driver from using I/O Port space to using MMIO space
often requires some additional changes. Specifically, "write posting"
needs to be handled. Many drivers (e.g. tg3, acenic, sym53c8xx_2)
already do this. I/O Port space guarantees write transactions reach the PCI
device before the CPU can continue. Writes to MMIO space allow the CPU
to continue before the transaction reaches the PCI device. HW weenies
call this "Write Posting" because the write completion is "posted" to
the CPU before the transaction has reached its destination.

Thus, timing sensitive code should add readl() where the CPU is
expected to wait before doing other work. The classic "bit banging"
sequence works fine for I/O Port space::

	for (i = 8; --i; val >>= 1) {
		outb(val & 1, ioport_reg); /* write bit */
		udelay(10);
	}

The same sequence for MMIO space should be::

	for (i = 8; --i; val >>= 1) {
		writeb(val & 1, mmio_reg); /* write bit */
		readb(safe_mmio_reg); /* flush posted write */
		udelay(10);
	}

It is important that "safe_mmio_reg" not have any side effects that
interfere with the correct operation of the device.

Another case to watch out for is when resetting a PCI device. Use PCI
Configuration space reads to flush the writel(). This will gracefully
handle the PCI master abort on all platforms if the PCI device is
expected to not respond to a readl(). Most x86 platforms will allow
MMIO reads to master abort (a.k.a. "Soft Fail") and return garbage
(e.g. ~0). But many RISC platforms will crash (a.k.a. "Hard Fail").
@ -1,636 +0,0 @@
|
|||||||
|
|
||||||
How To Write Linux PCI Drivers

	by Martin Mares <mj@ucw.cz> on 07-Feb-2000
	updated by Grant Grundler <grundler@parisc-linux.org> on 23-Dec-2006

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The world of PCI is vast and full of (mostly unpleasant) surprises.
Since each CPU architecture implements different chip-sets and PCI devices
have different requirements (erm, "features"), the result is that PCI support
in the Linux kernel is not as trivial as one would wish.  This short paper
tries to introduce all potential driver authors to Linux APIs for
PCI device drivers.

A more complete resource is the third edition of "Linux Device Drivers"
by Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman.
LDD3 is available for free (under Creative Commons License) from:

	http://lwn.net/Kernel/LDD3/

However, keep in mind that all documents are subject to "bit rot".
Refer to the source code if things are not working as described here.

Please send questions/comments/patches about Linux PCI API to the
"Linux PCI" <linux-pci@atrey.karlin.mff.cuni.cz> mailing list.
0. Structure of PCI drivers
~~~~~~~~~~~~~~~~~~~~~~~~~~~
PCI drivers "discover" PCI devices in a system via pci_register_driver().
Actually, it's the other way around.  When the PCI generic code discovers
a new device, the driver with a matching "description" will be notified.
Details on this below.

pci_register_driver() leaves most of the probing for devices to
the PCI layer and supports online insertion/removal of devices [thus
supporting hot-pluggable PCI, CardBus, and Express-Card in a single driver].
The pci_register_driver() call requires passing in a table of function
pointers and thus dictates the high level structure of a driver.

Once the driver knows about a PCI device and takes ownership, the
driver generally needs to perform the following initialization:

	Enable the device
	Request MMIO/IOP resources
	Set the DMA mask size (for both coherent and streaming DMA)
	Allocate and initialize shared control data (pci_allocate_coherent())
	Access device configuration space (if needed)
	Register IRQ handler (request_irq())
	Initialize non-PCI (i.e. LAN/SCSI/etc parts of the chip)
	Enable DMA/processing engines

When done using the device, and perhaps the module needs to be unloaded,
the driver needs to take the following steps:

	Disable the device from generating IRQs
	Release the IRQ (free_irq())
	Stop all DMA activity
	Release DMA buffers (both streaming and coherent)
	Unregister from other subsystems (e.g. scsi or netdev)
	Release MMIO/IOP resources
	Disable the device

Most of these topics are covered in the following sections.
For the rest look at LDD3 or <linux/pci.h>.

If the PCI subsystem is not configured (CONFIG_PCI is not set), most of
the PCI functions described below are defined as inline functions, either
completely empty or just returning an appropriate error code, to avoid
lots of ifdefs in the drivers.
1. pci_register_driver() call
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

PCI device drivers call pci_register_driver() during their
initialization with a pointer to a structure describing the driver
(struct pci_driver):

	field name	Description
	----------	------------------------------------------------------
	id_table	Pointer to table of device IDs the driver is
			interested in.  Most drivers should export this
			table using MODULE_DEVICE_TABLE(pci,...).

	probe		This probing function gets called (during execution
			of pci_register_driver() for already existing
			devices or later if a new device gets inserted) for
			all PCI devices which match the ID table and are not
			"owned" by other drivers yet.  This function gets
			passed a "struct pci_dev *" for each device whose
			entry in the ID table matches the device.  The probe
			function returns zero when the driver chooses to
			take "ownership" of the device or an error code
			(negative number) otherwise.
			The probe function always gets called from process
			context, so it can sleep.

	remove		The remove() function gets called whenever a device
			being handled by this driver is removed (either during
			deregistration of the driver or when it's manually
			pulled out of a hot-pluggable slot).
			The remove function always gets called from process
			context, so it can sleep.

	suspend		Put device into low power state.
	suspend_late	Put device into low power state.

	resume_early	Wake device from low power state.
	resume		Wake device from low power state.

		(Please see Documentation/power/pci.txt for descriptions
		of PCI Power Management and the related functions.)

	shutdown	Hook into reboot_notifier_list (kernel/sys.c).
			Intended to stop any idling DMA operations.
			Useful for enabling wake-on-lan (NIC) or changing
			the power state of a device before reboot.
			e.g. drivers/net/e100.c.

	err_handler	See Documentation/PCI/pci-error-recovery.txt


The ID table is an array of struct pci_device_id entries ending with an
all-zero entry.  Definitions with static const are generally preferred.

Each entry consists of:

	vendor, device	Vendor and device ID to match (or PCI_ANY_ID)

	subvendor,	Subsystem vendor and device ID to match (or PCI_ANY_ID)
	subdevice,

	class		Device class, subclass, and "interface" to match.
			See Appendix D of the PCI Local Bus Spec or
			include/linux/pci_ids.h for a full list of classes.
			Most drivers do not need to specify class/class_mask
			as vendor/device is normally sufficient.

	class_mask	Limit which sub-fields of the class field are compared.
			See drivers/scsi/sym53c8xx_2/ for an example of usage.

	driver_data	Data private to the driver.
			Most drivers don't need to use the driver_data field.
			Best practice is to use driver_data as an index
			into a static list of equivalent device types,
			instead of using it as a pointer.


Most drivers only need PCI_DEVICE() or PCI_DEVICE_CLASS() to set up
a pci_device_id table.

New PCI IDs may be added to a device driver pci_ids table at runtime
as shown below:

	echo "vendor device subvendor subdevice class class_mask driver_data" > \
	/sys/bus/pci/drivers/{driver}/new_id

All fields are passed in as hexadecimal values (no leading 0x).
The vendor and device fields are mandatory, the others are optional.  Users
need only pass as many optional fields as necessary:

	o subvendor and subdevice fields default to PCI_ANY_ID (FFFFFFFF)
	o class and class_mask fields default to 0
	o driver_data defaults to 0UL.

Note that driver_data must match the value used by any of the pci_device_id
entries defined in the driver.  This makes the driver_data field mandatory
if all the pci_device_id entries have a non-zero driver_data value.

Once added, the driver probe routine will be invoked for any unclaimed
PCI devices listed in its (newly updated) pci_ids list.

When the driver exits, it just calls pci_unregister_driver() and the PCI layer
automatically calls the remove hook for all devices handled by the driver.
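Putting the pieces above together, a minimal registration skeleton might look
like the sketch below.  The vendor/device IDs, the "ex_" names, and the
callback bodies are purely illustrative placeholders, not from this document:

```c
#include <linux/module.h>
#include <linux/pci.h>

/* Hypothetical IDs -- replace with your hardware's actual values. */
#define EX_VENDOR_ID	0x1234
#define EX_DEVICE_ID	0x5678

static const struct pci_device_id ex_ids[] = {
	{ PCI_DEVICE(EX_VENDOR_ID, EX_DEVICE_ID) },
	{ }	/* all-zero terminating entry */
};
MODULE_DEVICE_TABLE(pci, ex_ids);

static int ex_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	/* Take "ownership": enable the device, map resources, etc.
	 * Return 0 to claim the device, a negative errno otherwise. */
	return pci_enable_device(pdev);
}

static void ex_remove(struct pci_dev *pdev)
{
	pci_disable_device(pdev);
}

static struct pci_driver ex_driver = {
	.name		= "ex_pci",
	.id_table	= ex_ids,
	.probe		= ex_probe,
	.remove		= ex_remove,
};
/* Expands to module_init/module_exit calling
 * pci_register_driver()/pci_unregister_driver(). */
module_pci_driver(ex_driver);
MODULE_LICENSE("GPL");
```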
1.1 "Attributes" for driver functions/data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Please mark the initialization and cleanup functions where appropriate
(the corresponding macros are defined in <linux/init.h>):

	__init		Initialization code.  Thrown away after the driver
			initializes.
	__exit		Exit code.  Ignored for non-modular drivers.

Tips on when/where to use the above attributes:
	o The module_init()/module_exit() functions (and all
	  initialization functions called _only_ from these)
	  should be marked __init/__exit.

	o Do not mark the struct pci_driver.

	o Do NOT mark a function if you are not sure which mark to use.
	  Better to not mark the function than mark the function wrong.
2. How to find PCI devices manually
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

PCI drivers should have a really good reason for not using the
pci_register_driver() interface to search for PCI devices.
The main reason PCI devices are controlled by multiple drivers
is because one PCI device implements several different HW services.
E.g. combined serial/parallel port/floppy controller.

A manual search may be performed using the following constructs:

Searching by vendor and device ID:

	struct pci_dev *dev = NULL;
	while (dev = pci_get_device(VENDOR_ID, DEVICE_ID, dev))
		configure_device(dev);

Searching by class ID (iterate in a similar way):

	pci_get_class(CLASS_ID, dev)

Searching by both vendor/device and subsystem vendor/device ID:

	pci_get_subsys(VENDOR_ID, DEVICE_ID, SUBSYS_VENDOR_ID, SUBSYS_DEVICE_ID, dev).

You can use the constant PCI_ANY_ID as a wildcard replacement for
VENDOR_ID or DEVICE_ID.  This allows searching for any device from a
specific vendor, for example.

These functions are hotplug-safe.  They increment the reference count on
the pci_dev that they return.  You must eventually (possibly at module unload)
decrement the reference count on these devices by calling pci_dev_put().
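The reference-counting rule above matters most when a search loop exits
early.  A sketch (VENDOR_ID, DEVICE_ID, and configure_device() are
placeholders); the early-exit pci_dev_put() is the point of the example:

```c
	struct pci_dev *dev = NULL;

	/* pci_get_device() drops the reference on the device passed in
	 * and takes one on the device returned, so a loop that runs to
	 * completion is refcount-balanced on its own. */
	while ((dev = pci_get_device(VENDOR_ID, DEVICE_ID, dev)) != NULL) {
		if (configure_device(dev) < 0) {
			/* Bailing out early: release the reference
			 * we still hold on the current device. */
			pci_dev_put(dev);
			break;
		}
	}
```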
3. Device Initialization Steps
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As noted in the introduction, most PCI drivers need the following steps
for device initialization:

	Enable the device
	Request MMIO/IOP resources
	Set the DMA mask size (for both coherent and streaming DMA)
	Allocate and initialize shared control data (pci_allocate_coherent())
	Access device configuration space (if needed)
	Register IRQ handler (request_irq())
	Initialize non-PCI (i.e. LAN/SCSI/etc parts of the chip)
	Enable DMA/processing engines.

The driver can access PCI config space registers at any time.
(Well, almost.  When running BIST, config space can go away...but
that will just result in a PCI Bus Master Abort and config reads
will return garbage).
3.1 Enable the PCI device
~~~~~~~~~~~~~~~~~~~~~~~~~
Before touching any device registers, the driver needs to enable
the PCI device by calling pci_enable_device().  This will:

	o wake up the device if it was in suspended state,
	o allocate I/O and memory regions of the device (if BIOS did not),
	o allocate an IRQ (if BIOS did not).

NOTE: pci_enable_device() can fail!  Check the return value.

[ OS BUG: we don't check resource allocations before enabling those
  resources.  The sequence would make more sense if we called
  pci_request_resources() before calling pci_enable_device().
  Currently, the device drivers can't detect the bug when two
  devices have been allocated the same range.  This is not a common
  problem and unlikely to get fixed soon.

  This has been discussed before but not changed as of 2.6.19:
  http://lkml.org/lkml/2006/3/2/194
]

pci_set_master() will enable DMA by setting the bus master bit
in the PCI_COMMAND register.  It also fixes the latency timer value if
it's set to something bogus by the BIOS.  pci_clear_master() will
disable DMA by clearing the bus master bit.

If the PCI device can use the PCI Memory-Write-Invalidate transaction,
call pci_set_mwi().  This enables the PCI_COMMAND bit for Mem-Wr-Inval
and also ensures that the cache line size register is set correctly.
Check the return value of pci_set_mwi() as not all architectures
or chip-sets may support Memory-Write-Invalidate.  Alternatively,
if Mem-Wr-Inval would be nice to have but is not required, call
pci_try_set_mwi() to have the system do its best effort at enabling
Mem-Wr-Inval.
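A minimal sketch of the enable sequence in a probe routine, using the
hypothetical "ex_" naming; note the mandatory return-value check on
pci_enable_device():

```c
static int ex_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	int err;

	err = pci_enable_device(pdev);	/* can fail -- always check */
	if (err)
		return err;

	pci_set_master(pdev);		/* enable DMA (bus mastering) */

	/* Mem-Wr-Inval is nice to have but not required here, so use
	 * the best-effort variant and only report failure. */
	if (pci_try_set_mwi(pdev))
		dev_info(&pdev->dev, "Mem-Wr-Inval not enabled\n");

	return 0;
}
```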
3.2 Request MMIO/IOP resources
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Memory (MMIO) and I/O port addresses should NOT be read directly
from the PCI device config space.  Use the values in the pci_dev structure
as the PCI "bus address" might have been remapped to a "host physical"
address by the arch/chip-set specific kernel support.

See Documentation/io-mapping.txt for how to access device registers
or device memory.

The device driver needs to call pci_request_region() to verify
no other device is already using the same address resource.
Conversely, drivers should call pci_release_region() AFTER
calling pci_disable_device().
The idea is to prevent two devices colliding on the same address range.

[ See OS BUG comment above.  Currently (2.6.19), the driver can only
  determine MMIO and IO Port resource availability _after_ calling
  pci_enable_device(). ]

Generic flavors of pci_request_region() are request_mem_region()
(for MMIO ranges) and request_region() (for IO Port ranges).
Use these for address resources that are not described by "normal" PCI
BARs.

Also see pci_request_selected_regions() below.
3.3 Set the DMA mask size
~~~~~~~~~~~~~~~~~~~~~~~~~
[ If anything below doesn't make sense, please refer to
  Documentation/DMA-API.txt.  This section is just a reminder that
  drivers need to indicate DMA capabilities of the device and is not
  an authoritative source for DMA interfaces. ]

While all drivers should explicitly indicate the DMA capability
(e.g. 32 or 64 bit) of the PCI bus master, devices with more than
32-bit bus master capability for streaming data need the driver
to "register" this capability by calling pci_set_dma_mask() with
appropriate parameters.  In general this allows more efficient DMA
on systems where System RAM exists above 4G _physical_ address.

Drivers for all PCI-X and PCIe compliant devices must call
pci_set_dma_mask() as they are 64-bit DMA devices.

Similarly, drivers must also "register" this capability if the device
can directly address "consistent memory" in System RAM above 4G physical
address by calling pci_set_consistent_dma_mask().
Again, this includes drivers for all PCI-X and PCIe compliant devices.
Many 64-bit "PCI" devices (before PCI-X) and some PCI-X devices are
64-bit DMA capable for payload ("streaming") data but not control
("consistent") data.
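A common pattern for registering both masks, sketched with the interfaces
this document names (the 64-then-32 fallback and the "ex_" name are
illustrative, not mandated):

```c
static int ex_set_dma_masks(struct pci_dev *pdev)
{
	/* Try full 64-bit streaming DMA first, fall back to 32-bit. */
	if (pci_set_dma_mask(pdev, DMA_BIT_MASK(64)) &&
	    pci_set_dma_mask(pdev, DMA_BIT_MASK(32))) {
		dev_err(&pdev->dev, "no usable DMA configuration\n");
		return -EIO;
	}

	/* The "consistent" (control data) mask is registered
	 * separately and may be narrower than the streaming mask. */
	if (pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(64)))
		pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(32));

	return 0;
}
```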
3.4 Setup shared control data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Once the DMA masks are set, the driver can allocate "consistent" (a.k.a. shared)
memory.  See Documentation/DMA-API.txt for a full description of
the DMA APIs.  This section is just a reminder that it needs to be done
before enabling DMA on the device.


3.5 Initialize device registers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Some drivers will need specific "capability" fields programmed
or other "vendor specific" registers initialized or reset.
E.g. clearing pending interrupts.
3.6 Register IRQ handler
~~~~~~~~~~~~~~~~~~~~~~~~
While calling request_irq() is the last step described here,
this is often just another intermediate step to initialize a device.
This step can often be deferred until the device is opened for use.

All interrupt handlers for IRQ lines should be registered with IRQF_SHARED
and use the devid to map IRQs to devices (remember that all PCI IRQ lines
can be shared).

request_irq() will associate an interrupt handler and device handle
with an interrupt number.  Historically interrupt numbers represent
IRQ lines which run from the PCI device to the interrupt controller.
With MSI and MSI-X (more below) the interrupt number is a CPU "vector".

request_irq() also enables the interrupt.  Make sure the device is
quiesced and does not have any interrupts pending before registering
the interrupt handler.

MSI and MSI-X are PCI capabilities.  Both are "Message Signaled Interrupts"
which deliver interrupts to the CPU via a DMA write to a Local APIC.
The fundamental difference between MSI and MSI-X is how multiple
"vectors" get allocated.  MSI requires contiguous blocks of vectors
while MSI-X can allocate several individual ones.

MSI capability can be enabled by calling pci_alloc_irq_vectors() with the
PCI_IRQ_MSI and/or PCI_IRQ_MSIX flags before calling request_irq().  This
causes the PCI support to program CPU vector data into the PCI device
capability registers.  Many architectures, chip-sets, or BIOSes do NOT
support MSI or MSI-X and a call to pci_alloc_irq_vectors with just
the PCI_IRQ_MSI and PCI_IRQ_MSIX flags will fail, so try to always
specify PCI_IRQ_LEGACY as well.

Drivers that have different interrupt handlers for MSI/MSI-X and
legacy INTx should choose the right one based on the msi_enabled
and msix_enabled flags in the pci_dev structure after calling
pci_alloc_irq_vectors().

There are (at least) two really good reasons for using MSI:
1) MSI is an exclusive interrupt vector by definition.
   This means the interrupt handler doesn't have to verify
   its device caused the interrupt.

2) MSI avoids DMA/IRQ race conditions.  DMA to host memory is guaranteed
   to be visible to the host CPU(s) when the MSI is delivered.  This
   is important for both data coherency and avoiding stale control data.
   This guarantee allows the driver to omit MMIO reads to flush
   the DMA stream.

See drivers/infiniband/hw/mthca/ or drivers/net/tg3.c for examples
of MSI/MSI-X usage.
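The vector allocation advice above can be sketched as follows
(ex_irq_handler and the "ex_pci" name are hypothetical):

```c
static int ex_setup_irq(struct pci_dev *pdev)
{
	int nvec, err;

	/* Prefer MSI-X/MSI, but allow fallback to legacy INTx so the
	 * call cannot fail merely because the platform lacks MSI. */
	nvec = pci_alloc_irq_vectors(pdev, 1, 1,
			PCI_IRQ_MSIX | PCI_IRQ_MSI | PCI_IRQ_LEGACY);
	if (nvec < 0)
		return nvec;

	/* pci_irq_vector() maps a vector index to a Linux IRQ number.
	 * IRQF_SHARED because the legacy INTx fallback may be shared. */
	err = request_irq(pci_irq_vector(pdev, 0), ex_irq_handler,
			  IRQF_SHARED, "ex_pci", pdev);
	if (err)
		pci_free_irq_vectors(pdev);
	return err;
}
```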
4. PCI device shutdown
~~~~~~~~~~~~~~~~~~~~~~

When a PCI device driver is being unloaded, most of the following
steps need to be performed:

	Disable the device from generating IRQs
	Release the IRQ (free_irq())
	Stop all DMA activity
	Release DMA buffers (both streaming and consistent)
	Unregister from other subsystems (e.g. scsi or netdev)
	Disable device from responding to MMIO/IO Port addresses
	Release MMIO/IO Port resource(s)
4.1 Stop IRQs on the device
~~~~~~~~~~~~~~~~~~~~~~~~~~~
How to do this is chip/device specific.  If it's not done, it opens
the possibility of a "screaming interrupt" if (and only if)
the IRQ is shared with another device.

When the shared IRQ handler is "unhooked", the remaining devices
using the same IRQ line will still need the IRQ enabled.  Thus if the
"unhooked" device asserts the IRQ line, the system will respond assuming
it was one of the remaining devices that asserted the IRQ line.  Since none
of the other devices will handle the IRQ, the system will "hang" until
it decides the IRQ isn't going to get handled and masks the IRQ (100,000
iterations later).  Once the shared IRQ is masked, the remaining devices
will stop functioning properly.  Not a nice situation.

This is another reason to use MSI or MSI-X if it's available.
MSI and MSI-X are defined to be exclusive interrupts and thus
are not susceptible to the "screaming interrupt" problem.
4.2 Release the IRQ
~~~~~~~~~~~~~~~~~~~
Once the device is quiesced (no more IRQs), one can call free_irq().
This function will return control once any pending IRQs are handled,
"unhook" the driver's IRQ handler from that IRQ, and finally release
the IRQ if no one else is using it.


4.3 Stop all DMA activity
~~~~~~~~~~~~~~~~~~~~~~~~~
It's extremely important to stop all DMA operations BEFORE attempting
to deallocate DMA control data.  Failure to do so can result in memory
corruption, hangs, and on some chip-sets a hard crash.

Stopping DMA after stopping the IRQs can avoid races where the
IRQ handler might restart DMA engines.

While this step sounds obvious and trivial, several "mature" drivers
didn't get this step right in the past.


4.4 Release DMA buffers
~~~~~~~~~~~~~~~~~~~~~~~
Once DMA is stopped, clean up streaming DMA first.
I.e. unmap data buffers and return buffers to "upstream"
owners if there is one.

Then clean up "consistent" buffers which contain the control data.

See Documentation/DMA-API.txt for details on unmapping interfaces.
4.5 Unregister from other subsystems
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Most low level PCI device drivers support some other subsystem
like USB, ALSA, SCSI, NetDev, Infiniband, etc.  Make sure your
driver isn't losing resources from that other subsystem.
If this happens, typically the symptom is an Oops (panic) when
the subsystem attempts to call into a driver that has been unloaded.


4.6 Disable Device from responding to MMIO/IO Port addresses
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
iounmap() MMIO or IO Port resources and then call pci_disable_device().
This is the symmetric opposite of pci_enable_device().
Do not access device registers after calling pci_disable_device().


4.7 Release MMIO/IO Port Resource(s)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Call pci_release_region() to mark the MMIO or IO Port range as available.
Failure to do so usually results in the inability to reload the driver.
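The shutdown steps 4.1-4.7 can be sketched as a remove() routine.  All
"ex_" helpers and the priv structure are hypothetical stand-ins for
chip-specific code; only the ordering is the point:

```c
static void ex_remove(struct pci_dev *pdev)
{
	struct ex_priv *priv = pci_get_drvdata(pdev);

	ex_hw_disable_irqs(priv);		 /* 4.1: quiesce the device */
	free_irq(pci_irq_vector(pdev, 0), pdev); /* 4.2: release the IRQ */
	ex_hw_stop_dma(priv);			 /* 4.3: stop DMA engines */
	ex_free_dma_buffers(priv);		 /* 4.4: streaming first,
						  *	 then consistent */
	unregister_netdev(priv->netdev);	 /* 4.5: other subsystems */
	iounmap(priv->regs);			 /* 4.6: unmap ...          */
	pci_disable_device(pdev);		 /*      ... and disable    */
	pci_release_regions(pdev);		 /* 4.7: release resources  */
}
```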
5. How to access PCI config space
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can use pci_(read|write)_config_(byte|word|dword) to access the config
space of a device represented by struct pci_dev *.  All these functions
return 0 when successful or an error code (PCIBIOS_...) which can be
translated to a text string by pcibios_strerror.  Most drivers expect that
accesses to valid PCI devices don't fail.

If you don't have a struct pci_dev available, you can call
pci_bus_(read|write)_config_(byte|word|dword) to access a given device
and function on that bus.

If you access fields in the standard portion of the config header, please
use symbolic names of locations and bits declared in <linux/pci.h>.

If you need to access Extended PCI Capability registers, just call
pci_find_capability() for the particular capability and it will find the
corresponding register block for you.
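A short sketch of the config space accessors, using the symbolic offsets
from <linux/pci.h> (the "ex_" function name is a placeholder):

```c
static void ex_dump_ids(struct pci_dev *pdev)
{
	u16 vendor;
	u8 revision;
	int pm_off;

	/* Symbolic offsets from <linux/pci.h>, never magic numbers. */
	pci_read_config_word(pdev, PCI_VENDOR_ID, &vendor);
	pci_read_config_byte(pdev, PCI_REVISION_ID, &revision);

	/* Returns the capability's offset in config space, 0 if the
	 * device does not implement that capability. */
	pm_off = pci_find_capability(pdev, PCI_CAP_ID_PM);

	dev_info(&pdev->dev, "vendor %04x rev %02x PM cap @%d\n",
		 vendor, revision, pm_off);
}
```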
6. Other interesting functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

pci_get_domain_bus_and_slot()	Find pci_dev corresponding to given domain,
				bus, and slot number.  If the device is
				found, its reference count is increased.
pci_set_power_state()		Set PCI Power Management state (0=D0 ... 3=D3)
pci_find_capability()		Find specified capability in device's capability
				list.
pci_resource_start()		Returns bus start address for a given PCI region
pci_resource_end()		Returns bus end address for a given PCI region
pci_resource_len()		Returns the byte length of a PCI region
pci_set_drvdata()		Set private driver data pointer for a pci_dev
pci_get_drvdata()		Return private driver data pointer for a pci_dev
pci_set_mwi()			Enable Memory-Write-Invalidate transactions.
pci_clear_mwi()			Disable Memory-Write-Invalidate transactions.
7. Miscellaneous hints
~~~~~~~~~~~~~~~~~~~~~~

When displaying PCI device names to the user (for example when a driver wants
to tell the user what card it has found), please use pci_name(pci_dev).

Always refer to the PCI devices by a pointer to the pci_dev structure.
All PCI layer functions use this identification and it's the only
reasonable one.  Don't use bus/slot/function numbers except for very
special purposes -- on systems with multiple primary buses their semantics
can be pretty complex.

Don't try to turn on Fast Back to Back writes in your driver.  All devices
on the bus need to be capable of doing it, so this is something which needs
to be handled by platform and generic code, not individual drivers.
8. Vendor and device identifications
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Do not add new device or vendor IDs to include/linux/pci_ids.h unless they
are shared across multiple drivers.  You can add private definitions in
your driver if they're helpful, or just use plain hex constants.

The device IDs are arbitrary hex numbers (vendor controlled) and normally used
only in a single location, the pci_device_id table.

Please DO submit new vendor/device IDs to http://pci-ids.ucw.cz/.
There are mirrors of the pci.ids file at http://pciids.sourceforge.net/
and https://github.com/pciutils/pciids.
9. Obsolete functions
~~~~~~~~~~~~~~~~~~~~~

There are several functions which you might come across when trying to
port an old driver to the new PCI interface.  They are no longer present
in the kernel as they aren't compatible with hotplug or PCI domains or
having sane locking.

	pci_find_device()	Superseded by pci_get_device()
	pci_find_subsys()	Superseded by pci_get_subsys()
	pci_find_slot()		Superseded by pci_get_domain_bus_and_slot()
	pci_get_slot()		Superseded by pci_get_domain_bus_and_slot()

The alternative is the traditional PCI device driver that walks PCI
device lists.  This is still possible but discouraged.
10. MMIO Space and "Write Posting"
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Converting a driver from using I/O Port space to using MMIO space
often requires some additional changes.  Specifically, "write posting"
needs to be handled.  Many drivers (e.g. tg3, acenic, sym53c8xx_2)
already do this.  I/O Port space guarantees write transactions reach the PCI
device before the CPU can continue.  Writes to MMIO space allow the CPU
to continue before the transaction reaches the PCI device.  HW weenies
call this "Write Posting" because the write completion is "posted" to
the CPU before the transaction has reached its destination.

Thus, timing sensitive code should add readl() where the CPU is
expected to wait before doing other work.  The classic "bit banging"
sequence works fine for I/O Port space:

	for (i = 8; --i; val >>= 1) {
		outb(val & 1, ioport_reg);	/* write bit */
		udelay(10);
	}

The same sequence for MMIO space should be:

	for (i = 8; --i; val >>= 1) {
		writeb(val & 1, mmio_reg);	/* write bit */
		readb(safe_mmio_reg);		/* flush posted write */
		udelay(10);
	}

It is important that "safe_mmio_reg" not have any side effects that
interfere with the correct operation of the device.

Another case to watch out for is when resetting a PCI device.  Use PCI
Configuration space reads to flush the writel().  This will gracefully
handle the PCI master abort on all platforms if the PCI device is
expected to not respond to a readl().  Most x86 platforms will allow
MMIO reads to master abort (a.k.a. "Soft Fail") and return garbage
(e.g. ~0).  But many RISC platforms will crash (a.k.a. "Hard Fail").
311	Documentation/PCI/pcieaer-howto.rst	Normal file
@ -0,0 +1,311 @@
.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

===========================================================
The PCI Express Advanced Error Reporting Driver Guide HOWTO
===========================================================

:Authors: - T. Long Nguyen <tom.l.nguyen@intel.com>
          - Yanmin Zhang <yanmin.zhang@intel.com>

:Copyright: |copy| 2006 Intel Corporation

Overview
========

About this guide
----------------

This guide describes the basics of the PCI Express Advanced Error
Reporting (AER) driver and provides information on how to use it, as
well as how to enable the drivers of endpoint devices to conform with
the PCI Express AER driver.


What is the PCI Express AER Driver?
-----------------------------------

PCI Express error signaling can occur on the PCI Express link itself
or on behalf of transactions initiated on the link. PCI Express
defines two error reporting paradigms: the baseline capability and
the Advanced Error Reporting capability. The baseline capability is
required of all PCI Express components and provides a minimum defined
set of error reporting requirements. The Advanced Error Reporting
capability is implemented with a PCI Express advanced error reporting
extended capability structure and provides more robust error reporting.

The PCI Express AER driver provides the infrastructure to support the
PCI Express Advanced Error Reporting capability. The PCI Express AER
driver provides three basic functions:

  - Gathers comprehensive error information when errors occur.
  - Reports errors to the users.
  - Performs error recovery actions.

The AER driver only attaches to Root Ports that support the PCI
Express AER capability.

User Guide
==========

Include the PCI Express AER Root Driver into the Linux Kernel
-------------------------------------------------------------

The PCI Express AER Root driver is a Root Port service driver attached
to the PCI Express Port Bus driver. If a user wants to use it, the
driver has to be compiled. It is enabled by the CONFIG_PCIEAER option,
which depends on CONFIG_PCIEPORTBUS, so please set
CONFIG_PCIEPORTBUS=y and CONFIG_PCIEAER=y.

Load PCI Express AER Root Driver
--------------------------------

Some systems have AER support in firmware. Enabling Linux AER support at
the same time the firmware handles AER may result in unpredictable
behavior. Therefore, Linux does not handle AER events unless the firmware
grants AER control to the OS via the ACPI _OSC method. See the PCI FW 3.0
Specification for details regarding _OSC usage.

AER error output
----------------

When a PCIe AER error is captured, an error message is output to the
console. A correctable error is printed as a warning; otherwise it is
printed as an error, so users can choose a different log level to
filter out correctable error messages.

Below shows an example::

  0000:50:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0500(Requester ID)
  0000:50:00.0: device [8086:0329] error status/mask=00100000/00000000
  0000:50:00.0: [20] Unsupported Request (First)
  0000:50:00.0: TLP Header: 04000001 00200a03 05010000 00050100

In the example, 'Requester ID' is the ID of the device that sent the
error message to the Root Port. Please refer to the PCI Express
specifications for the other fields.

AER Statistics / Counters
-------------------------

When PCIe AER errors are captured, the counters / statistics are also
exposed in the form of sysfs attributes, which are documented at
Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats.

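For a quick look at these counters from user space, the attributes can
be read directly (the BDF below reuses the example device above;
attribute names follow sysfs-bus-pci-devices-aer_stats)::

	cat /sys/bus/pci/devices/0000:50:00.0/aer_dev_correctable
	cat /sys/bus/pci/devices/0000:50:00.0/aer_dev_nonfatal
	cat /sys/bus/pci/devices/0000:50:00.0/aer_dev_fatal
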
Developer Guide
===============

Enabling AER aware support requires a software driver to configure
the AER capability structure within its device and to provide callbacks.

To support AER well, developers first need to understand how AER
works.

PCI Express errors are classified into two types: correctable errors
and uncorrectable errors. This classification is based on the impact
of those errors, which may result in degraded performance or function
failure.

Correctable errors pose no impact on the functionality of the
interface. The PCI Express protocol can recover without any software
intervention or any loss of data. These errors are detected and
corrected by hardware. Unlike correctable errors, uncorrectable
errors impact functionality of the interface. Uncorrectable errors
can cause a particular transaction or a particular PCI Express link
to be unreliable. Depending on those error conditions, uncorrectable
errors are further classified into non-fatal errors and fatal errors.
Non-fatal errors cause the particular transaction to be unreliable,
but the PCI Express link itself is fully functional. Fatal errors, on
the other hand, cause the link to be unreliable.

When AER is enabled, a PCI Express device will automatically send an
error message to the PCIe Root Port above it when the device captures
an error. The Root Port, upon receiving an error reporting message,
internally processes and logs the error message in its PCI Express
capability structure. Error information being logged includes storing
the error reporting agent's requester ID into the Error Source
Identification Registers and setting the error bits of the Root Error
Status Register accordingly. If AER error reporting is enabled in the
Root Error Command Register, the Root Port generates an interrupt when
an error is detected.

Note that the errors as described above are related to the PCI Express
hierarchy and links. These errors do not include any device specific
errors because device specific errors are still sent directly to
the device driver.

Configure the AER capability structure
--------------------------------------

AER aware drivers of PCI Express components need to change the device
control registers to enable AER. They may also change AER registers,
including the mask and severity registers. The helper function
pci_enable_pcie_error_reporting() can be used to enable AER. See the
helper functions section below.

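A minimal sketch of enabling AER from a driver's probe routine (the
driver name and error handling are illustrative, not an in-tree
driver)::

	static int my_probe(struct pci_dev *pdev, const struct pci_device_id *id)
	{
		int err = pci_enable_device(pdev);

		if (err)
			return err;

		/* Ask the device to send AER error messages to the Root Port. */
		if (pci_enable_pcie_error_reporting(pdev))
			dev_info(&pdev->dev, "AER reporting not enabled\n");

		return 0;
	}
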
Provide callbacks
-----------------

callback reset_link to reset PCI Express link
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This callback is used to reset the PCI Express physical link when a
fatal error happens. The Root Port AER service driver provides a
default reset_link function, but different upstream ports might
have different requirements for resetting the PCI Express link, so
upstream ports should provide their own reset_link functions.

In struct pcie_port_service_driver, a new pointer, reset_link, is
added.
::

	pci_ers_result_t (*reset_link) (struct pci_dev *dev);

The section on non-correctable errors below provides more detailed
info on when to call reset_link.

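As a sketch, an upstream port AER service driver supplying its own
reset_link might look like this (names are hypothetical, and other
required fields of the structure are omitted)::

	static pci_ers_result_t my_reset_link(struct pci_dev *dev)
	{
		/* Device-specific link reset sequence would go here. */
		return PCI_ERS_RESULT_RECOVERED;
	}

	static struct pcie_port_service_driver my_aer_service = {
		.name		= "my_aer",
		.reset_link	= my_reset_link,
		/* port type, service id, probe, ... */
	};
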
PCI error-recovery callbacks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The PCI Express AER Root driver uses error callbacks to coordinate
with downstream device drivers associated with the hierarchy in
question when performing error recovery actions.

The data structure pci_driver has a pointer, err_handler, which points
to pci_error_handlers, a structure consisting of a couple of callback
function pointers. The AER driver follows the rules defined in
pci-error-recovery.txt except for PCI Express specific parts (e.g.
reset_link). Please refer to pci-error-recovery.txt for detailed
definitions of the callbacks.

The sections below specify when to call the error callback functions.

Correctable errors
~~~~~~~~~~~~~~~~~~

Correctable errors pose no impact on the functionality of
the interface. The PCI Express protocol can recover without any
software intervention or any loss of data. These errors do not
require any recovery actions. The AER driver clears the device's
correctable error status register accordingly and logs these errors.

Non-correctable (non-fatal and fatal) errors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If an error message indicates a non-fatal error, performing a link
reset at upstream is not required. The AER driver calls
error_detected(dev, pci_channel_io_normal) for all drivers associated
with the hierarchy in question. For example::

  EndPoint<==>DownstreamPort B<==>UpstreamPort A<==>RootPort

If Upstream Port A captures an AER error, the hierarchy consists of
Downstream Port B and the EndPoint.

A driver may return PCI_ERS_RESULT_CAN_RECOVER,
PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on
whether it can recover on its own or needs the AER driver to call
mmio_enabled next.

If an error message indicates a fatal error, the kernel broadcasts
error_detected(dev, pci_channel_io_frozen) to all drivers within
the hierarchy in question. Performing a link reset at upstream is then
necessary. As different kinds of devices might use different approaches
to reset a link, the AER port service driver is required to provide the
function to reset the link. First, the kernel checks whether the
upstream component has an AER driver. If it does, the kernel uses the
reset_link callback of that AER driver. If the upstream component has
no AER driver and the port is a Downstream Port, a hot reset is
performed as the default by setting the Secondary Bus Reset bit of the
Bridge Control register associated with the Downstream Port. Upstream
Ports, in contrast, should provide their own AER service drivers with a
reset_link function. If error_detected returns
PCI_ERS_RESULT_CAN_RECOVER and reset_link returns
PCI_ERS_RESULT_RECOVERED, the error handling goes to mmio_enabled.

helper functions
----------------
::

	int pci_enable_pcie_error_reporting(struct pci_dev *dev);

pci_enable_pcie_error_reporting enables the device to send error
messages to the Root Port when an error is detected. Note that devices
do not enable error reporting by default, so device drivers need to
call this function to enable it.

::

	int pci_disable_pcie_error_reporting(struct pci_dev *dev);

pci_disable_pcie_error_reporting stops the device from sending error
messages to the Root Port when an error is detected.

::

	int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev);

pci_cleanup_aer_uncorrect_error_status clears the device's
uncorrectable error status register.

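The helpers above are typically used together with the error-recovery
callbacks, for example clearing stale uncorrectable status after a
reset (a hypothetical driver fragment, not an in-tree driver)::

	static pci_ers_result_t my_slot_reset(struct pci_dev *pdev)
	{
		/* After the reset, clear stale uncorrectable error status. */
		pci_cleanup_aer_uncorrect_error_status(pdev);
		return PCI_ERS_RESULT_RECOVERED;
	}

	static const struct pci_error_handlers my_err_handler = {
		.slot_reset = my_slot_reset,
	};

	static struct pci_driver my_driver = {
		.name		= "my_driver",
		.err_handler	= &my_err_handler,
		/* id_table, probe, remove, ... */
	};
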
Frequently Asked Questions
--------------------------

Q:
  What happens if a PCI Express device driver does not provide an
  error recovery handler (pci_driver->err_handler is equal to NULL)?

A:
  The devices bound to that driver won't be recovered. If the
  error is fatal, the kernel will print warning messages. Please
  refer to the Developer Guide above for more information.

Q:
  What happens if an upstream port service driver does not provide
  the reset_link callback?

A:
  Fatal error recovery will fail if the errors are reported by
  upstream ports to which the service driver is attached.

Q:
  How does this infrastructure deal with a driver that is not PCI
  Express aware?

A:
  This infrastructure calls the error callback functions of the
  driver when an error happens. But if the driver is not aware of
  PCI Express, the device might not report its own errors to the
  Root Port.

Q:
  What modifications will such a driver need to make it compatible
  with the PCI Express AER Root driver?

A:
  It could call the helper functions to enable AER in devices and
  clear the uncorrectable status register. Please refer to the
  helper functions section above.


Software error injection
========================

Debugging PCIe AER error recovery code is quite difficult because it
is hard to trigger real hardware errors. Software based error
injection can be used to fake various kinds of PCIe errors.

First you should enable PCIe AER software error injection in the
kernel configuration, that is, the following item should be in your
.config:

  CONFIG_PCIEAER_INJECT=y or CONFIG_PCIEAER_INJECT=m

After rebooting with the new kernel or inserting the module, a device
file named /dev/aer_inject should be created.

Then, you need a user space tool named aer-inject, which can be
obtained from:

  https://git.kernel.org/cgit/linux/kernel/git/gong.chen/aer-inject.git/

More information about aer-inject can be found in the documentation
that comes with its source code.
@@ -1,267 +0,0 @@
220	Documentation/PCI/picebus-howto.rst	Normal file
@@ -0,0 +1,220 @@
.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

===========================================
The PCI Express Port Bus Driver Guide HOWTO
===========================================

:Author: Tom L Nguyen tom.l.nguyen@intel.com 11/03/2004
:Copyright: |copy| 2004 Intel Corporation

About this guide
================

This guide describes the basics of the PCI Express Port Bus driver
and provides information on how to enable the service drivers to
register/unregister with the PCI Express Port Bus Driver.


What is the PCI Express Port Bus Driver
=======================================

A PCI Express Port is a logical PCI-PCI Bridge structure. There
are two types of PCI Express Port: the Root Port and the Switch
Port. The Root Port originates a PCI Express link from a PCI Express
Root Complex and the Switch Port connects PCI Express links to
internal logical PCI buses. The Switch Port, which has its secondary
bus representing the switch's internal routing logic, is called the
switch's Upstream Port. The switch's Downstream Port bridges from the
switch's internal routing bus to a bus representing the downstream
PCI Express link from the PCI Express Switch.

A PCI Express Port can provide up to four distinct functions,
referred to in this document as services, depending on its port type.
A PCI Express Port's services include native hotplug support (HP),
power management event support (PME), advanced error reporting
support (AER), and virtual channel support (VC). These services may
be handled by a single complex driver or be individually distributed
and handled by corresponding service drivers.

Why use the PCI Express Port Bus Driver?
========================================

In existing Linux kernels, the Linux Device Driver Model allows a
physical device to be handled by only a single driver. The PCI
Express Port is a PCI-PCI Bridge device with multiple distinct
services. To maintain a clean and simple solution each service
may have its own software service driver. In this case several
service drivers will compete for a single PCI-PCI Bridge device.
For example, if the PCI Express Root Port native hotplug service
driver is loaded first, it claims a PCI-PCI Bridge Root Port. The
kernel therefore does not load other service drivers for that Root
Port. In other words, it is impossible to have multiple service
drivers load and run on a PCI-PCI Bridge device simultaneously
using the current driver model.

Enabling multiple service drivers to run simultaneously requires
a PCI Express Port Bus driver, which manages all populated
PCI Express Ports and distributes all provided service requests
to the corresponding service drivers as required. Some key
advantages of using the PCI Express Port Bus driver are listed below:

  - Allow multiple service drivers to run simultaneously on
    a PCI-PCI Bridge Port device.

  - Allow service drivers to be implemented in an independent
    staged approach.

  - Allow one service driver to run on multiple PCI-PCI Bridge
    Port devices.

  - Manage and distribute resources of a PCI-PCI Bridge Port
    device to requested service drivers.

Configuring the PCI Express Port Bus Driver vs. Service Drivers
===============================================================

Including the PCI Express Port Bus Driver Support into the Kernel
-----------------------------------------------------------------

Including the PCI Express Port Bus driver depends on whether PCI
Express support is included in the kernel config. The kernel will
automatically include the PCI Express Port Bus driver as a kernel
driver when PCI Express support is enabled in the kernel.

Enabling Service Driver Support
-------------------------------

PCI device drivers are implemented based on the Linux Device Driver
Model. All service drivers are PCI device drivers. As discussed
above, it is impossible to load any service driver once the kernel
has loaded the PCI Express Port Bus Driver. Meeting the PCI Express
Port Bus Driver Model requires some minimal changes to existing
service drivers; these changes have no impact on the functionality
of existing service drivers.

A service driver is required to use the two APIs shown below to
register its service with the PCI Express Port Bus driver (see
section 5.2.1 & 5.2.2). It is important that a service driver
initializes the pcie_port_service_driver data structure, included in
header file /include/linux/pcieport_if.h, before calling these APIs.
Failure to do so will result in an identity mismatch, which prevents
the PCI Express Port Bus driver from loading a service driver.

pcie_port_service_register
~~~~~~~~~~~~~~~~~~~~~~~~~~
::

  int pcie_port_service_register(struct pcie_port_service_driver *new)

This API replaces the Linux Driver Model's pci_register_driver API. A
service driver should always call pcie_port_service_register at
module init. Note that after a service driver is loaded, calls
such as pci_enable_device(dev) and pci_set_master(dev) are no longer
necessary since these calls are executed by the PCI Port Bus driver.

pcie_port_service_unregister
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
::

  void pcie_port_service_unregister(struct pcie_port_service_driver *new)

pcie_port_service_unregister replaces the Linux Driver Model's
pci_unregister_driver. It is always called by a service driver when a
module exits.

Sample Code
~~~~~~~~~~~

Below is sample service driver code to initialize the port service
driver data structure.
::

  static struct pcie_port_service_id service_id[] = { {
    .vendor = PCI_ANY_ID,
    .device = PCI_ANY_ID,
    .port_type = PCIE_RC_PORT,
    .service_type = PCIE_PORT_SERVICE_AER,
    }, { /* end: all zeroes */ }
  };

  static struct pcie_port_service_driver root_aerdrv = {
    .name = (char *)device_name,
    .id_table = &service_id[0],

    .probe = aerdrv_load,
    .remove = aerdrv_unload,

    .suspend = aerdrv_suspend,
    .resume = aerdrv_resume,
  };

Below is sample code for registering/unregistering a service
driver.
::

  static int __init aerdrv_service_init(void)
  {
    int retval = 0;

    retval = pcie_port_service_register(&root_aerdrv);
    if (!retval) {
      /*
       * FIX ME
       */
    }
    return retval;
  }

  static void __exit aerdrv_service_exit(void)
  {
    pcie_port_service_unregister(&root_aerdrv);
  }

  module_init(aerdrv_service_init);
  module_exit(aerdrv_service_exit);

Possible Resource Conflicts
===========================

Since all service drivers of a PCI-PCI Bridge Port device are
allowed to run simultaneously, a few possible resource conflicts are
listed below, together with proposed solutions.

MSI and MSI-X Vector Resource
-----------------------------

Once MSI or MSI-X interrupts are enabled on a device, it stays in this
mode until they are disabled again. Since service drivers of the same
PCI-PCI Bridge port share the same physical device, if an individual
service driver enables or disables MSI/MSI-X mode it may result in
unpredictable behavior.

To avoid this situation, service drivers are not permitted to
switch interrupt mode on their device. The PCI Express Port Bus driver
is responsible for determining the interrupt mode and this should be
transparent to service drivers. Service drivers need to know only
the vector IRQ assigned to the field irq of struct pcie_device, which
is passed in when the PCI Express Port Bus driver probes each service
driver. Service drivers should use (struct pcie_device*)dev->irq to
call request_irq/free_irq. In addition, the interrupt mode is stored
in the field interrupt_mode of struct pcie_device.

PCI Memory/IO Mapped Regions
----------------------------

Service drivers for PCI Express Power Management (PME), Advanced
Error Reporting (AER), Hot-Plug (HP) and Virtual Channel (VC) access
PCI configuration space on the PCI Express port. In all cases the
registers accessed are independent of each other. This patch assumes
that all service drivers will be well behaved and not overwrite
other service drivers' configuration settings.

PCI Config Registers
--------------------

Each service driver runs its PCI config operations on its own
capability structure except the PCI Express capability structure, in
which the Root Control register and Device Control register are shared
between PME and AER. This patch assumes that all service drivers
will be well behaved and not overwrite other service drivers'
configuration settings.

:orphan:

========================================================
OpenCAPI (Open Coherent Accelerator Processor Interface)
========================================================

OpenCAPI is an interface between processors and accelerators. It aims
at being low-latency and high-bandwidth. The specification is
developed by the `OpenCAPI Consortium <http://opencapi.org/>`_.

It allows an accelerator (which could be an FPGA, ASICs, ...) to access
the host memory coherently, using virtual addresses. An OpenCAPI
device can also host its own memory, which can be accessed from the
host.

OpenCAPI is known in Linux as 'ocxl', as the open, processor-agnostic
evolution of 'cxl' (the driver for the IBM CAPI interface for
powerpc), which was named that way to avoid confusion with the ISDN
CAPI subsystem.

High-level view
===============

OpenCAPI defines a Data Link Layer (DL) and Transaction Layer (TL), to
be implemented on top of a physical link. Any processor or device
implementing the DL and TL can start sharing memory.

::

    +-----------+                         +-------------+
    |           |                         |             |
    |           |                         | Accelerated |
    | Processor |                         |  Function   |
    |           |  +--------+             |    Unit     |  +--------+
    |           |--| Memory |             |    (AFU)    |--| Memory |
    |           |  +--------+             |             |  +--------+
    +-----------+                         +-------------+
         |                                       |
    +-----------+                         +-------------+
    |    TL     |                         |    TLX      |
    +-----------+                         +-------------+
         |                                       |
    +-----------+                         +-------------+
    |    DL     |                         |    DLX      |
    +-----------+                         +-------------+
         |                                       |
         |                   PHY                 |
         +---------------------------------------+

Device discovery
================

OpenCAPI relies on a PCI-like configuration space, implemented on the
device, so the host can discover AFUs by querying the config space.

OpenCAPI devices in Linux are treated like PCI devices (with a few
caveats). The firmware is expected to abstract the hardware as if it
were a PCI link. A lot of the existing PCI infrastructure is reused:
devices are scanned and BARs are assigned during the standard PCI
enumeration. Commands like 'lspci' can therefore be used to see what
devices are available.

The configuration space defines the AFU(s) that can be found on the
physical adapter, such as its name, how many memory contexts it can
work with, the size of its MMIO areas, ...

MMIO
====

OpenCAPI defines two MMIO areas for each AFU:

* the global MMIO area, with registers pertinent to the whole AFU.
* a per-process MMIO area, which has a fixed size for each context.

AFU interrupts
==============

OpenCAPI includes the possibility for an AFU to send an interrupt to a
host process. It is done through an 'intrp_req' defined in the
Transaction Layer, specifying a 64-bit object handle which defines the
interrupt.

The driver allows a process to allocate an interrupt and obtain its
64-bit object handle, which can be passed to the AFU.

char devices
============

The driver creates one char device per AFU found on the physical
device. A physical device may have multiple functions and each
function can have multiple AFUs. At the time of this writing though,
it has only been tested with devices exporting only one AFU.

Char devices can be found in /dev/ocxl/ and are named as:
/dev/ocxl/<AFU name>.<location>.<index>

where <AFU name> is a max 20-character long name, as found in the
config space of the AFU.
<location> is added by the driver and can help distinguish devices
when a system has more than one instance of the same OpenCAPI device.
<index> is also to help distinguish AFUs in the unlikely case where a
device carries multiple copies of the same AFU.
Sysfs class
===========

An ocxl class is added for the devices representing the AFUs. See
/sys/class/ocxl. The layout is described in
Documentation/ABI/testing/sysfs-class-ocxl

User API
========

open
----

Based on the AFU definition found in the config space, an AFU may
support working with more than one memory context, in which case the
associated char device may be opened multiple times by different
processes.

ioctl
-----

OCXL_IOCTL_ATTACH:

  Attach the memory context of the calling process to the AFU so that
  the AFU can access its memory.

OCXL_IOCTL_IRQ_ALLOC:

  Allocate an AFU interrupt and return an identifier.

OCXL_IOCTL_IRQ_FREE:

  Free a previously allocated AFU interrupt.

OCXL_IOCTL_IRQ_SET_FD:

  Associate an event fd to an AFU interrupt so that the user process
  can be notified when the AFU sends an interrupt.

OCXL_IOCTL_GET_METADATA:

  Obtains configuration information from the card, such as the size of
  MMIO areas, the AFU version, and the PASID for the current context.

OCXL_IOCTL_ENABLE_P9_WAIT:

  Allows the AFU to wake a userspace thread executing 'wait'. Returns
  information to userspace to allow it to configure the AFU. Note that
  this is only available on POWER9.

OCXL_IOCTL_GET_FEATURES:

  Reports which of the CPU features that affect OpenCAPI are usable
  from userspace.

mmap
----

A process can mmap the per-process MMIO area for interactions with the
AFU.

==================
Control Groupstats
==================

Control Groupstats is inspired by the discussion at
http://lkml.org/lkml/2007/4/11/187 and implements per-cgroup statistics as
suggested by Andrew Morton in http://lkml.org/lkml/2007/4/11/263.

The per-cgroup statistics infrastructure re-uses code from the taskstats
interface. A new set of cgroup operations is registered with commands
and attributes specific to cgroups. It should be very easy to
extend per-cgroup statistics, by adding members to the cgroupstats
structure.

The current model for cgroupstats is a pull; a push model (to post
statistics on interesting events) should be very easy to add. Currently
user space requests statistics by passing the cgroup path.
Statistics about the state of all the tasks in the cgroup are returned
to user space.

NOTE: We currently rely on delay accounting for extracting information
about tasks blocked on I/O. If CONFIG_TASK_DELAY_ACCT is disabled, this
information will not be available.

To extract cgroup statistics a utility very similar to getdelays.c
has been developed; the sample output of the utility is shown below::

	~/balbir/cgroupstats # ./getdelays -C "/sys/fs/cgroup/a"
	sleeping 1, blocked 0, running 1, stopped 0, uninterruptible 0
	~/balbir/cgroupstats # ./getdelays -C "/sys/fs/cgroup"
	sleeping 155, blocked 0, running 1, stopped 0, uninterruptible 2

================
Delay accounting
================

Tasks encounter delays in execution when they wait
for some kernel resource to become available, e.g. a
runnable task may wait for a free CPU to run on.

The per-task delay accounting functionality measures
the delays experienced by a task while

a) waiting for a CPU (while being runnable)
b) completion of synchronous block I/O initiated by the task
c) swapping in pages
d) memory reclaim

and makes these statistics available to userspace through
the taskstats interface.

Such delays provide feedback for setting a task's cpu priority,
io priority and rss limit values appropriately. Long delays for
important tasks could be a trigger for raising their corresponding
priority.

The functionality, through its use of the taskstats interface, also provides
delay statistics aggregated for all tasks (or threads) belonging to a
thread group (corresponding to a traditional Unix process). This is a commonly
needed aggregation that is more efficiently done by the kernel.

Userspace utilities, particularly resource management applications, can also
aggregate delay statistics into arbitrary groups. To enable this, delay
statistics of a task are available both during its lifetime as well as on its
exit, ensuring continuous and complete monitoring can be done.


Interface
---------

Delay accounting uses the taskstats interface which is described
in detail in a separate document in this directory. Taskstats returns a
generic data structure to userspace corresponding to per-pid and per-tgid
statistics. The delay accounting functionality populates specific fields of
this structure. See

     include/linux/taskstats.h

for a description of the fields pertaining to delay accounting.
They will generally be in the form of counters returning the cumulative
delay seen for cpu, sync block I/O, swapin, memory reclaim etc.

Taking the difference of two successive readings of a given
counter (say cpu_delay_total) for a task will give the delay
experienced by the task waiting for the corresponding resource
in that interval.

When a task exits, records containing the per-task statistics
are sent to userspace without requiring a command. If it is the last exiting
task of a thread group, the per-tgid statistics are also sent. More details
are given in the taskstats interface description.

The getdelays.c userspace utility in the tools/accounting directory allows
simple commands to be run and the corresponding delay statistics to be
displayed. It also serves as an example of using the taskstats interface.

Usage
-----

Compile the kernel with::

	CONFIG_TASK_DELAY_ACCT=y
	CONFIG_TASKSTATS=y

Delay accounting is enabled by default at boot up.
To disable, add::

	nodelayacct

to the kernel boot options. The rest of the instructions
below assume this has not been done.

After the system has booted up, use a utility
similar to getdelays.c to access the delays
seen by a given task or a task group (tgid).
The utility also allows a given command to be
executed and the corresponding delays to be
seen.

General format of the getdelays command::

	getdelays [-t tgid] [-p pid] [-c cmd...]


Get delays, since system boot, for pid 10::

	# ./getdelays -p 10
	(output similar to next case)

Get sum of delays, since system boot, for all pids with tgid 5::

	# ./getdelays -t 5


	CPU      count   real total   virtual total   delay total
	         7876    92005750     100000000       24001500
	IO       count   delay total
	         0       0
	SWAP     count   delay total
	         0       0
	RECLAIM  count   delay total
	         0       0

Get delays seen in executing a given simple command::

	# ./getdelays -c ls /

	bin   data1  data3  data5  dev  home  media  opt   root  srv        sys  usr
	boot  data2  data4  data6  etc  lib   mnt    proc  sbin  subdomain  tmp  var


	CPU      count   real total   virtual total   delay total
	         6       4000250      4000000         0
	IO       count   delay total
	         0       0
	SWAP     count   delay total
	         0       0
	RECLAIM  count   delay total
	         0       0

.. SPDX-License-Identifier: GPL-2.0

==========
Accounting
==========

.. toctree::
   :maxdepth: 1

   cgroupstats
   delay-accounting
   psi
   taskstats
   taskstats-struct

================================
PSI - Pressure Stall Information
================================

:Date: April, 2018
:Author: Johannes Weiner <hannes@cmpxchg.org>

When CPU, memory or IO devices are contended, workloads experience
latency spikes, throughput losses, and run the risk of OOM kills.

Without an accurate measure of such contention, users are forced to
either play it safe and under-utilize their hardware resources, or
roll the dice and frequently suffer the disruptions resulting from
excessive overcommit.

The psi feature identifies and quantifies the disruptions caused by
such resource crunches and the time impact they have on complex workloads
or even entire systems.

Having an accurate measure of productivity losses caused by resource
scarcity aids users in sizing workloads to hardware--or provisioning
hardware according to workload demand.

As psi aggregates this information in realtime, systems can be managed
dynamically using techniques such as load shedding, migrating jobs to
other systems or data centers, or strategically pausing or killing low
priority or restartable batch jobs.

This allows maximizing hardware utilization without sacrificing
workload health or risking major disruptions such as OOM kills.

Pressure interface
==================

Pressure information for each resource is exported through the
respective file in /proc/pressure/ -- cpu, memory, and io.

The format for CPU is as such::

	some avg10=0.00 avg60=0.00 avg300=0.00 total=0

and for memory and IO::

	some avg10=0.00 avg60=0.00 avg300=0.00 total=0
	full avg10=0.00 avg60=0.00 avg300=0.00 total=0

The "some" line indicates the share of time in which at least some
tasks are stalled on a given resource.

The "full" line indicates the share of time in which all non-idle
tasks are stalled on a given resource simultaneously. In this state
actual CPU cycles are going to waste, and a workload that spends
extended time in this state is considered to be thrashing. This has
severe impact on performance, and it's useful to distinguish this
situation from a state where some tasks are stalled but the CPU is
still doing productive work. As such, time spent in this subset of the
stall state is tracked separately and exported in the "full" averages.

The ratios (in %) are tracked as recent trends over ten-, sixty-, and
three-hundred-second windows, which gives insight into short term events
as well as medium and long term trends. The total absolute stall time
(in us) is tracked and exported as well, to allow detection of latency
spikes which wouldn't necessarily make a dent in the time averages,
or to average trends over custom time frames.

Monitoring for pressure thresholds
==================================

Users can register triggers and use poll() to be woken up when resource
pressure exceeds certain thresholds.

A trigger describes the maximum cumulative stall time over a specific
time window, e.g. 100ms of total stall time within any 500ms window to
generate a wakeup event.

To register a trigger, a user has to open the psi interface file under
/proc/pressure/ representing the resource to be monitored and write the
desired threshold and time window. The open file descriptor should be
used to wait for trigger events using select(), poll() or epoll().
The following format is used::

	<some|full> <stall amount in us> <time window in us>

For example writing "some 150000 1000000" into /proc/pressure/memory
would add a 150ms threshold for partial memory stall measured within
a 1sec time window. Writing "full 50000 1000000" into /proc/pressure/io
would add a 50ms threshold for full io stall measured within a 1sec time
window.

Triggers can be set on more than one psi metric and more than one trigger
for the same psi metric can be specified. However, for each trigger a separate
file descriptor is required to be able to poll it separately from others;
therefore for each trigger a separate open() syscall should be made even
when opening the same psi interface file.
|
||||||
|
|
||||||
|
Monitors activate only when system enters stall state for the monitored
|
||||||
|
psi metric and deactivates upon exit from the stall state. While system is
|
||||||
|
in the stall state psi signal growth is monitored at a rate of 10 times per
|
||||||
|
tracking window.
|
||||||
|
|
||||||
|
The kernel accepts window sizes ranging from 500ms to 10s, therefore min
|
||||||
|
monitoring update interval is 50ms and max is 1s. Min limit is set to
|
||||||
|
prevent overly frequent polling. Max limit is chosen as a high enough number
|
||||||
|
after which monitors are most likely not needed and psi averages can be used
|
||||||
|
instead.
|
||||||
|
|
||||||
|
When activated, psi monitor stays active for at least the duration of one
|
||||||
|
tracking window to avoid repeated activations/deactivations when system is
|
||||||
|
bouncing in and out of the stall state.
|
||||||
|
|
||||||
|
Notifications to the userspace are rate-limited to one per tracking window.
|
||||||
|
|
||||||
|
The trigger will de-register when the file descriptor used to define the
|
||||||
|
trigger is closed.
|
||||||
|
|
||||||
|
Userspace monitor usage example
===============================

::

  #include <errno.h>
  #include <fcntl.h>
  #include <stdio.h>
  #include <poll.h>
  #include <string.h>
  #include <unistd.h>

  /*
   * Monitor memory partial stall with 1s tracking window size
   * and 150ms threshold.
   */
  int main() {
	const char trig[] = "some 150000 1000000";
	struct pollfd fds;
	int n;

	fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
	if (fds.fd < 0) {
		printf("/proc/pressure/memory open error: %s\n",
			strerror(errno));
		return 1;
	}
	fds.events = POLLPRI;

	if (write(fds.fd, trig, strlen(trig) + 1) < 0) {
		printf("/proc/pressure/memory write error: %s\n",
			strerror(errno));
		return 1;
	}

	printf("waiting for events...\n");
	while (1) {
		n = poll(&fds, 1, -1);
		if (n < 0) {
			printf("poll error: %s\n", strerror(errno));
			return 1;
		}
		if (fds.revents & POLLERR) {
			printf("got POLLERR, event source is gone\n");
			return 0;
		}
		if (fds.revents & POLLPRI) {
			printf("event triggered!\n");
		} else {
			printf("unknown event received: 0x%x\n", fds.revents);
			return 1;
		}
	}

	return 0;
  }
Cgroup2 interface
=================

In a system with a CONFIG_CGROUP=y kernel and the cgroup2 filesystem
mounted, pressure stall information is also tracked for tasks grouped
into cgroups. Each subdirectory in the cgroupfs mountpoint contains
cpu.pressure, memory.pressure, and io.pressure files; the format is
the same as the /proc/pressure/ files.

Per-cgroup psi monitors can be specified and used the same way as
system-wide ones.
Documentation/accounting/taskstats-struct.rst (new file, 199 lines)
====================
The struct taskstats
====================

This document contains an explanation of the struct taskstats fields.

There are several different groups of fields in the struct taskstats:

1) Common and basic accounting fields
	If CONFIG_TASKSTATS is set, the taskstats interface is enabled and
	the common fields and basic accounting fields are collected for
	delivery at do_exit() of a task.
2) Delay accounting fields
	These fields are placed between::

		/* Delay accounting fields start */

	and::

		/* Delay accounting fields end */

	Their values are collected if CONFIG_TASK_DELAY_ACCT is set.
3) Extended accounting fields
	These fields are placed between::

		/* Extended accounting fields start */

	and::

		/* Extended accounting fields end */

	Their values are collected if CONFIG_TASK_XACCT is set.

4) Per-task and per-thread context switch count statistics

5) Time accounting for SMT machines

6) Extended delay accounting fields for memory reclaim

Future extensions should add fields to the end of the taskstats struct, and
should not change the relative position of each field within the struct.

::

  struct taskstats {

1) Common and basic accounting fields::

	/* The version number of this struct. This field is always set to
	 * TASKSTATS_VERSION, which is defined in <linux/taskstats.h>.
	 * Each time the struct is changed, the value should be incremented.
	 */
	__u16	version;

	/* The exit code of a task. */
	__u32	ac_exitcode;		/* Exit status */

	/* The accounting flags of a task as defined in <linux/acct.h>
	 * Defined values are AFORK, ASU, ACOMPAT, ACORE, and AXSIG.
	 */
	__u8	ac_flag;		/* Record flags */

	/* The value of task_nice() of a task. */
	__u8	ac_nice;		/* task_nice */

	/* The name of the command that started this task. */
	char	ac_comm[TS_COMM_LEN];	/* Command name */

	/* The scheduling discipline as set in task->policy field. */
	__u8	ac_sched;		/* Scheduling discipline */

	__u8	ac_pad[3];
	__u32	ac_uid;			/* User ID */
	__u32	ac_gid;			/* Group ID */
	__u32	ac_pid;			/* Process ID */
	__u32	ac_ppid;		/* Parent process ID */

	/* The time when a task begins, in [secs] since 1970. */
	__u32	ac_btime;		/* Begin time [sec since 1970] */

	/* The elapsed time of a task, in [usec]. */
	__u64	ac_etime;		/* Elapsed time [usec] */

	/* The user CPU time of a task, in [usec]. */
	__u64	ac_utime;		/* User CPU time [usec] */

	/* The system CPU time of a task, in [usec]. */
	__u64	ac_stime;		/* System CPU time [usec] */

	/* The minor page fault count of a task, as set in task->min_flt. */
	__u64	ac_minflt;		/* Minor Page Fault Count */

	/* The major page fault count of a task, as set in task->maj_flt. */
	__u64	ac_majflt;		/* Major Page Fault Count */


2) Delay accounting fields::

	/* Delay accounting fields start
	 *
	 * All values, until the comment "Delay accounting fields end" are
	 * available only if delay accounting is enabled, even though the last
	 * few fields are not delays
	 *
	 * xxx_count is the number of delay values recorded
	 * xxx_delay_total is the corresponding cumulative delay in nanoseconds
	 *
	 * xxx_delay_total wraps around to zero on overflow
	 * xxx_count incremented regardless of overflow
	 */

	/* Delay waiting for cpu, while runnable
	 * count, delay_total NOT updated atomically
	 */
	__u64	cpu_count;
	__u64	cpu_delay_total;

	/* Following four fields atomically updated using task->delays->lock */

	/* Delay waiting for synchronous block I/O to complete
	 * does not account for delays in I/O submission
	 */
	__u64	blkio_count;
	__u64	blkio_delay_total;

	/* Delay waiting for page fault I/O (swap in only) */
	__u64	swapin_count;
	__u64	swapin_delay_total;

	/* cpu "wall-clock" running time
	 * On some architectures, value will adjust for cpu time stolen
	 * from the kernel in involuntary waits due to virtualization.
	 * Value is cumulative, in nanoseconds, without a corresponding count
	 * and wraps around to zero silently on overflow
	 */
	__u64	cpu_run_real_total;

	/* cpu "virtual" running time
	 * Uses time intervals seen by the kernel i.e. no adjustment
	 * for kernel's involuntary waits due to virtualization.
	 * Value is cumulative, in nanoseconds, without a corresponding count
	 * and wraps around to zero silently on overflow
	 */
	__u64	cpu_run_virtual_total;
	/* Delay accounting fields end */
	/* version 1 ends here */


3) Extended accounting fields::

	/* Extended accounting fields start */

	/* Accumulated RSS usage in duration of a task, in MBytes-usecs.
	 * The current rss usage is added to this counter every time
	 * a tick is charged to a task's system time. So, at the end we
	 * will have memory usage multiplied by system time. Thus an
	 * average usage per system time unit can be calculated.
	 */
	__u64	coremem;		/* accumulated RSS usage in MB-usec */

	/* Accumulated virtual memory usage in duration of a task.
	 * Same as acct_rss_mem1 above except that we keep track of VM usage.
	 */
	__u64	virtmem;		/* accumulated VM usage in MB-usec */

	/* High watermark of RSS usage in duration of a task, in KBytes. */
	__u64	hiwater_rss;		/* High-watermark of RSS usage */

	/* High watermark of VM usage in duration of a task, in KBytes. */
	__u64	hiwater_vm;		/* High-water virtual memory usage */

	/* The following four fields are I/O statistics of a task. */
	__u64	read_char;		/* bytes read */
	__u64	write_char;		/* bytes written */
	__u64	read_syscalls;		/* read syscalls */
	__u64	write_syscalls;		/* write syscalls */

	/* Extended accounting fields end */

4) Per-task and per-thread statistics::

	__u64	nvcsw;			/* Context voluntary switch counter */
	__u64	nivcsw;			/* Context involuntary switch counter */

5) Time accounting for SMT machines::

	__u64	ac_utimescaled;		/* utime scaled on frequency etc */
	__u64	ac_stimescaled;		/* stime scaled on frequency etc */
	__u64	cpu_scaled_run_real_total; /* scaled cpu_run_real_total */

6) Extended delay accounting fields for memory reclaim::

	/* Delay waiting for memory reclaim */
	__u64	freepages_count;
	__u64	freepages_delay_total;

::

  }
Documentation/accounting/taskstats.rst (new file, 180 lines)
=============================
Per-task statistics interface
=============================


Taskstats is a netlink-based interface for sending per-task and
per-process statistics from the kernel to userspace.

Taskstats was designed for the following benefits:

- efficiently provide statistics during the lifetime of a task and on its exit
- unified interface for multiple accounting subsystems
- extensibility for use by future accounting patches

Terminology
-----------

"pid", "tid" and "task" are used interchangeably and refer to the standard
Linux task defined by struct task_struct. Per-pid stats are the same as
per-task stats.

"tgid", "process" and "thread group" are used interchangeably and refer to the
tasks that share an mm_struct, i.e. the traditional Unix process. Despite the
use of tgid, there is no special treatment for the task that is thread group
leader - a process is deemed alive as long as it has any task belonging to it.

Usage
-----

To get statistics during a task's lifetime, userspace opens a unicast netlink
socket (NETLINK_GENERIC family) and sends commands specifying a pid or a tgid.
The response contains statistics for a task (if pid is specified) or the sum of
statistics for all tasks of the process (if tgid is specified).

To obtain statistics for tasks which are exiting, the userspace listener
sends a register command and specifies a cpumask. Whenever a task exits on
one of the cpus in the cpumask, its per-pid statistics are sent to the
registered listener. Using cpumasks limits the data received by a single
listener and assists in flow control over the netlink interface; this is
explained in more detail below.

If the exiting task is the last thread exiting its thread group,
an additional record containing the per-tgid stats is also sent to userspace.
The latter contains the sum of per-pid stats for all threads in the thread
group, both past and present.

getdelays.c is a simple utility demonstrating usage of the taskstats interface
for reporting delay accounting statistics. Users can register cpumasks,
send commands and process responses, listen for per-tid/tgid exit data,
write the data received to a file and do basic flow control by increasing
receive buffer sizes.
Interface
|
||||||
|
---------
|
||||||
|
|
||||||
|
The user-kernel interface is encapsulated in include/linux/taskstats.h
|
||||||
|
|
||||||
|
To avoid this documentation becoming obsolete as the interface evolves, only
|
||||||
|
an outline of the current version is given. taskstats.h always overrides the
|
||||||
|
description here.
|
||||||
|
|
||||||
|
struct taskstats is the common accounting structure for both per-pid and
|
||||||
|
per-tgid data. It is versioned and can be extended by each accounting subsystem
|
||||||
|
that is added to the kernel. The fields and their semantics are defined in the
|
||||||
|
taskstats.h file.
|
||||||
|
|
||||||
|
The data exchanged between user and kernel space is a netlink message belonging
|
||||||
|
to the NETLINK_GENERIC family and using the netlink attributes interface.
|
||||||
|
The messages are in the format::
|
||||||
|
|
||||||
|
+----------+- - -+-------------+-------------------+
|
||||||
|
| nlmsghdr | Pad | genlmsghdr | taskstats payload |
|
||||||
|
+----------+- - -+-------------+-------------------+
|
||||||
|
|
||||||
|
|
||||||
|
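As a purely illustrative sketch of this layout, the two headers can be packed in front of an attribute payload as below. The field order follows struct nlmsghdr and struct genlmsghdr from the uapi headers, but the family id, command number, and payload bytes here are placeholders, not real taskstats values:

```python
import struct

NLMSG_HDRLEN = 16   # struct nlmsghdr: u32 len, u16 type, u16 flags, u32 seq, u32 pid
GENL_HDRLEN = 4     # struct genlmsghdr: u8 cmd, u8 version, u16 reserved

def build_genl_msg(family_id, cmd, version, payload):
    """Prepend a genlmsghdr and an nlmsghdr to an attribute payload."""
    genl = struct.pack("BBH", cmd, version, 0)           # genlmsghdr
    total = NLMSG_HDRLEN + len(genl) + len(payload)      # nlmsg_len counts everything
    nl = struct.pack("IHHII", total, family_id, 1, 0, 0) # nlmsghdr (flags=NLM_F_REQUEST)
    return nl + genl + payload

# Placeholder family id and an 8-byte dummy payload, just to show the framing.
msg = build_genl_msg(0x10, 1, 1, b"\x00" * 8)
assert len(msg) == NLMSG_HDRLEN + GENL_HDRLEN + 8
```

In a real listener the family id is not a constant; it must first be resolved by querying the generic netlink controller for the "TASKSTATS" family name.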
The taskstats payload is one of the following three kinds:

1. Commands: Sent from user to kernel. Commands to get data on
a pid/tgid consist of one attribute, of type TASKSTATS_CMD_ATTR_PID/TGID,
containing a u32 pid or tgid in the attribute payload. The pid/tgid denotes
the task/process for which userspace wants statistics.

Commands to register/deregister interest in exit data from a set of cpus
consist of one attribute, of type
TASKSTATS_CMD_ATTR_REGISTER/DEREGISTER_CPUMASK and contain a cpumask in the
attribute payload. The cpumask is specified as an ascii string of
comma-separated cpu ranges e.g. to listen to exit data from cpus 1,2,3,5,7,8
the cpumask would be "1-3,5,7-8". If userspace forgets to deregister interest
in cpus before closing the listening socket, the kernel cleans up its interest
set over time. However, for the sake of efficiency, an explicit deregistration
is advisable.

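The range syntax is easy to expand on the userspace side. A minimal sketch (the kernel does its own parsing; this helper is only for illustration):

```python
def parse_cpumask(spec):
    """Expand a cpumask string like "1-3,5,7-8" into a set of cpu numbers."""
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))  # ranges are inclusive
        else:
            cpus.add(int(part))
    return cpus

assert parse_cpumask("1-3,5,7-8") == {1, 2, 3, 5, 7, 8}
```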
2. Response for a command: sent from the kernel in response to a userspace
command. The payload is a series of three attributes of type:

a) TASKSTATS_TYPE_AGGR_PID/TGID: attribute containing no payload but indicating
that a pid/tgid will be followed by some stats.

b) TASKSTATS_TYPE_PID/TGID: attribute whose payload is the pid/tgid whose stats
are being returned.

c) TASKSTATS_TYPE_STATS: attribute with a struct taskstats as payload. The
same structure is used for both per-pid and per-tgid stats.

3. New message sent by kernel whenever a task exits. The payload consists of a
series of attributes of the following type:

a) TASKSTATS_TYPE_AGGR_PID: indicates next two attributes will be pid+stats
b) TASKSTATS_TYPE_PID: contains exiting task's pid
c) TASKSTATS_TYPE_STATS: contains the exiting task's per-pid stats
d) TASKSTATS_TYPE_AGGR_TGID: indicates next two attributes will be tgid+stats
e) TASKSTATS_TYPE_TGID: contains tgid of process to which task belongs
f) TASKSTATS_TYPE_STATS: contains the per-tgid stats for exiting task's process


per-tgid stats
--------------

Taskstats provides per-process stats, in addition to per-task stats, since
resource management is often done at a process granularity and aggregating task
stats in userspace alone is inefficient and potentially inaccurate (due to lack
of atomicity).

However, maintaining per-process, in addition to per-task, stats within the
kernel has space and time overheads. To address this, the taskstats code
accumulates each exiting task's statistics into a process-wide data structure.
When the last task of a process exits, the accumulated process-level data is
also sent to userspace (along with the per-task data).

When a user queries per-tgid data, the stats of all live threads in the group
are summed and added to the accumulated total for previously exited threads of
the same thread group.

Extending taskstats
-------------------

There are two ways to extend the taskstats interface to export more
per-task/process stats as patches to collect them get added to the kernel
in future:

1. Adding more fields to the end of the existing struct taskstats. Backward
compatibility is ensured by the version number within the
structure. Userspace will use only the fields of the struct that correspond
to the version it is using.

2. Defining separate statistic structs and using the netlink attributes
interface to return them. Since userspace processes each netlink attribute
independently, it can always ignore attributes whose type it does not
understand (because it is using an older version of the interface).

Choosing between 1. and 2. is a matter of trading off flexibility and
overhead. If only a few fields need to be added, then 1. is the preferable
path since the kernel and userspace don't need to incur the overhead of
processing new netlink attributes. But if the new fields expand the existing
struct too much, requiring disparate userspace accounting utilities to
unnecessarily receive large structures whose fields are of no interest, then
extending the attributes structure would be worthwhile.

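A toy sketch of why approach 1 stays backward compatible: an old reader simply stops at the fields it knows, so appending fields never breaks it. The field names and two layouts below are invented for the example; the real fields live in taskstats.h:

```python
import struct

# Invented layouts: v1 has (version, count, delay); v2 appends one more field.
V1_FORMAT = "HxxII"
V2_FORMAT = "HxxIII"

def read_stats(buf):
    """Unpack only the fields this reader's version knows about."""
    (version,) = struct.unpack_from("H", buf, 0)
    fmt = V2_FORMAT if version >= 2 else V1_FORMAT
    return struct.unpack_from(fmt, buf, 0)

v2_buf = struct.pack(V2_FORMAT, 2, 4, 100, 7)
assert read_stats(v2_buf) == (2, 4, 100, 7)
# A v1-only reader ignores the trailing bytes of a v2 buffer:
assert struct.unpack_from(V1_FORMAT, v2_buf, 0) == (2, 4, 100)
```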
Flow control for taskstats
--------------------------

When the rate of task exits becomes large, a listener may not be able to keep
up with the kernel's rate of sending per-tid/tgid exit data, leading to data
loss. This possibility gets compounded when the taskstats structure gets
extended and the number of cpus grows large.

To avoid losing statistics, userspace should do one or more of the following:

- increase the receive buffer sizes for the netlink sockets opened by
  listeners to receive exit data.

- create more listeners and reduce the number of cpus being listened to by
  each listener. In the extreme case, there could be one listener for each cpu.
  Users may also consider setting the cpu affinity of the listener to the subset
  of cpus to which it listens, especially if they are listening to just one cpu.

Despite these measures, if the userspace listener receives ENOBUFS error
messages indicating overflow of receive buffers, it should take measures to
handle the loss of data.
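The first measure can be sketched as follows (Linux only, illustrative). Note that the kernel may silently cap the granted size at net.core.rmem_max; SO_RCVBUFFORCE can exceed the cap but requires CAP_NET_ADMIN:

```python
import socket

NETLINK_GENERIC = 16  # netlink protocol number for NETLINK_GENERIC

granted = 0
if hasattr(socket, "AF_NETLINK"):  # netlink sockets exist only on Linux
    sock = socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, NETLINK_GENERIC)
    # Ask for a 1 MiB receive buffer before registering for exit data.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1 << 20)
    granted = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    sock.close()
```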
@ -1,181 +0,0 @@
150
Documentation/admin-guide/aoe/aoe.rst
Normal file
150
Documentation/admin-guide/aoe/aoe.rst
Normal file
@ -0,0 +1,150 @@
Introduction
============

ATA over Ethernet is a network protocol that provides simple access to
block storage on the LAN.

  http://support.coraid.com/documents/AoEr11.txt

The EtherDrive (R) HOWTO for 2.6 and 3.x kernels is found at ...

  http://support.coraid.com/support/linux/EtherDrive-2.6-HOWTO.html

It has many tips and hints! Please see, especially, recommended
tunings for virtual memory:

  http://support.coraid.com/support/linux/EtherDrive-2.6-HOWTO-5.html#ss5.19

The aoetools are userland programs that are designed to work with this
driver. The aoetools are on sourceforge.

  http://aoetools.sourceforge.net/

The scripts in this Documentation/admin-guide/aoe directory are intended to
document the use of the driver and are not necessary if you install
the aoetools.


Creating Device Nodes
=====================

Users of udev should find the block device nodes created
automatically, but to create all the necessary device nodes, use the
udev configuration rules provided in udev.txt (in this directory).

There is a udev-install.sh script that shows how to install these
rules on your system.

There is also an autoload script that shows how to edit
/etc/modprobe.d/aoe.conf to ensure that the aoe module is loaded when
necessary. Preloading the aoe module is preferable to autoloading,
however, because AoE discovery takes a few seconds. It can be
confusing when an AoE device is not present the first time a
command is run but appears a second later.

Using Device Nodes
==================

"cat /dev/etherd/err" blocks, waiting for error diagnostic output,
like any retransmitted packets.

"echo eth2 eth4 > /dev/etherd/interfaces" tells the aoe driver to
limit ATA over Ethernet traffic to eth2 and eth4. AoE traffic from
untrusted networks should be ignored as a matter of security. See
also the aoe_iflist driver option described below.

"echo > /dev/etherd/discover" tells the driver to find out what AoE
devices are available.

In the future these character devices may disappear and be replaced
by sysfs counterparts. Using the commands in aoetools insulates
users from these implementation details.

The block devices are named like this::

	e{shelf}.{slot}
	e{shelf}.{slot}p{part}

... so that "e0.2" is the third blade from the left (slot 2) in the
first shelf (shelf address zero). That's the whole disk. The first
partition on that disk would be "e0.2p1".
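A small hypothetical helper mirroring this naming scheme (the driver composes the names itself; this function only illustrates the convention):

```python
def aoe_blkdev_name(shelf, slot, part=0):
    """Compose an aoe block device name from shelf, slot, and optional partition."""
    name = "e%d.%d" % (shelf, slot)
    return name + ("p%d" % part if part else "")

assert aoe_blkdev_name(0, 2) == "e0.2"        # whole disk: slot 2, shelf 0
assert aoe_blkdev_name(0, 2, 1) == "e0.2p1"   # its first partition
```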
Using sysfs
===========

Each aoe block device in /sys/block has the extra attributes of
state, mac, and netif. The state attribute is "up" when the device
is ready for I/O and "down" if detected but unusable. The
"down,closewait" state shows that the device is still open and
cannot come up again until it has been closed.

The mac attribute is the ethernet address of the remote AoE device.
The netif attribute is the network interface on the localhost
through which we are communicating with the remote AoE device.

There is a script in this directory that formats this information in
a convenient way. Users with aoetools should use the aoe-stat
command::

	root@makki root# sh Documentation/admin-guide/aoe/status.sh
	   e10.0            eth3              up
	   e10.1            eth3              up
	   e10.2            eth3              up
	   e10.3            eth3              up
	   e10.4            eth3              up
	   e10.5            eth3              up
	   e10.6            eth3              up
	   e10.7            eth3              up
	   e10.8            eth3              up
	   e10.9            eth3              up
	    e4.0            eth1              up
	    e4.1            eth1              up
	    e4.2            eth1              up
	    e4.3            eth1              up
	    e4.4            eth1              up
	    e4.5            eth1              up
	    e4.6            eth1              up
	    e4.7            eth1              up
	    e4.8            eth1              up
	    e4.9            eth1              up

Use /sys/module/aoe/parameters/aoe_iflist (or better, the driver
option discussed below) instead of /dev/etherd/interfaces to limit
AoE traffic to the network interfaces in the given
whitespace-separated list. Unlike the old character device, the
sysfs entry can be read from as well as written to.

It's helpful to trigger discovery after setting the list of allowed
interfaces. The aoetools package provides an aoe-discover script
for this purpose. You can also directly use the
/dev/etherd/discover special file described above.

Driver Options
==============

There is a boot option for the built-in aoe driver and a
corresponding module parameter, aoe_iflist. Without this option,
all network interfaces may be used for ATA over Ethernet. Here is a
usage example for the module parameter::

	modprobe aoe aoe_iflist="eth1 eth3"

The aoe_deadsecs module parameter determines the maximum number of
seconds that the driver will wait for an AoE device to provide a
response to an AoE command. After aoe_deadsecs seconds have
elapsed, the AoE device will be marked as "down". A value of zero
is supported for testing purposes and makes the aoe driver keep
trying AoE commands forever.

The aoe_maxout module parameter has a default of 128. This is the
maximum number of unresponded packets that will be sent to an AoE
target at one time.

The aoe_dyndevs module parameter defaults to 1, meaning that the
driver will assign a block device minor number to a discovered AoE
target based on the order of its discovery. With dynamic minor
device numbers in use, a greater range of AoE shelf and slot
addresses can be supported. Users with udev will never have to
think about minor numbers. Using aoe_dyndevs=0 allows device nodes
to be pre-created using a static minor-number scheme with the
aoe-mkshelf script in the aoetools.
17
Documentation/admin-guide/aoe/index.rst
Normal file
17
Documentation/admin-guide/aoe/index.rst
Normal file
@ -0,0 +1,17 @@
=======================
ATA over Ethernet (AoE)
=======================

.. toctree::
   :maxdepth: 1

   aoe
   todo
   examples

.. only::  subproject and html

   Indices
   =======

   * :ref:`genindex`
26
Documentation/admin-guide/aoe/udev.txt
Normal file
26
Documentation/admin-guide/aoe/udev.txt
Normal file
@ -0,0 +1,26 @@
# These rules tell udev what device nodes to create for aoe support.
# They may be installed along the following lines. Check the section
# 8 udev manpage to see whether your udev supports SUBSYSTEM, and
# whether it uses one or two equal signs for SUBSYSTEM and KERNEL.
#
#   ecashin@makki ~$ su
#   Password:
#   bash# find /etc -type f -name udev.conf
#   /etc/udev/udev.conf
#   bash# grep udev_rules= /etc/udev/udev.conf
#   udev_rules="/etc/udev/rules.d/"
#   bash# ls /etc/udev/rules.d/
#   10-wacom.rules  50-udev.rules
#   bash# cp /path/to/linux/Documentation/admin-guide/aoe/udev.txt \
#           /etc/udev/rules.d/60-aoe.rules
#

# aoe char devices
SUBSYSTEM=="aoe", KERNEL=="discover",   NAME="etherd/%k", GROUP="disk", MODE="0220"
SUBSYSTEM=="aoe", KERNEL=="err",        NAME="etherd/%k", GROUP="disk", MODE="0440"
SUBSYSTEM=="aoe", KERNEL=="interfaces", NAME="etherd/%k", GROUP="disk", MODE="0220"
SUBSYSTEM=="aoe", KERNEL=="revalidate", NAME="etherd/%k", GROUP="disk", MODE="0220"
SUBSYSTEM=="aoe", KERNEL=="flush",      NAME="etherd/%k", GROUP="disk", MODE="0220"

# aoe block devices
KERNEL=="etherd*", GROUP="disk"
@ -0,0 +1,42 @@
================================
kernel data structure for DRBD-9
================================

This describes the in kernel data structure for DRBD-9. Starting with
Linux v3.14 we are reorganizing DRBD to use this data structure.

Basic Data Structure
====================

A node has a number of DRBD resources. Each such resource has a number of
devices (aka volumes) and connections to other nodes ("peer nodes"). Each DRBD
device is represented by a block device locally.

The DRBD objects are interconnected to form a matrix as depicted below; a
drbd_peer_device object sits at each intersection between a drbd_device and a
drbd_connection::

  /--------------+---------------+.....+---------------\
  |   resource   |    device     |     |    device     |
  +--------------+---------------+.....+---------------+
  |  connection  |  peer_device  |     |  peer_device  |
  +--------------+---------------+.....+---------------+
  :              :               :     :               :
  :              :               :     :               :
  +--------------+---------------+.....+---------------+
  |  connection  |  peer_device  |     |  peer_device  |
  \--------------+---------------+.....+---------------/

In this table, horizontally, devices can be accessed from resources by their
volume number. Likewise, peer_devices can be accessed from connections by
their volume number. Objects in the vertical direction are connected by double
linked lists. There are back pointers from peer_devices to their connections and
devices, and from connections and devices to their resource.

All resources are in the drbd_resources double-linked list. In addition, all
devices can be accessed by their minor device number via the drbd_devices idr.

The drbd_resource, drbd_connection, and drbd_device objects are reference
counted. The peer_device objects only serve to establish the links between
devices and connections; their lifetime is determined by the lifetime of the
device and connection which they reference.
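The matrix can be modelled in a few lines of plain Python. This is a toy sketch of the object relationships only (the kernel uses C structs, linked lists, and an idr, not dictionaries):

```python
class Resource:
    def __init__(self):
        self.devices = {}      # volume number -> Device (horizontal access)
        self.connections = []  # one Connection per peer node

class Device:
    def __init__(self, resource, vnr):
        self.resource, self.vnr = resource, vnr  # back pointer to resource
        resource.devices[vnr] = self

class Connection:
    def __init__(self, resource):
        self.resource = resource                 # back pointer to resource
        self.peer_devices = {}                   # volume number -> PeerDevice
        resource.connections.append(self)

class PeerDevice:
    """Sits at the intersection of one Device and one Connection."""
    def __init__(self, connection, device):
        self.connection, self.device = connection, device
        connection.peer_devices[device.vnr] = self

res = Resource()
dev = Device(res, 0)
conn = Connection(res)
pd = PeerDevice(conn, dev)
assert conn.peer_devices[0].device is dev   # same volume number in both rows
assert pd.connection.resource is res        # back pointers reach the resource
```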
30
Documentation/admin-guide/blockdev/drbd/figures.rst
Normal file
30
Documentation/admin-guide/blockdev/drbd/figures.rst
Normal file
@ -0,0 +1,30 @@
.. SPDX-License-Identifier: GPL-2.0

.. The files included here are intended to help understand the implementation

Data flows that relate some functions, and write packets
========================================================

.. kernel-figure:: DRBD-8.3-data-packets.svg
    :alt:   DRBD-8.3-data-packets.svg
    :align: center

.. kernel-figure:: DRBD-data-packets.svg
    :alt:   DRBD-data-packets.svg
    :align: center


Sub graphs of DRBD's state transitions
======================================

.. kernel-figure:: conn-states-8.dot
    :alt:   conn-states-8.dot
    :align: center

.. kernel-figure:: disk-states-8.dot
    :alt:   disk-states-8.dot
    :align: center

.. kernel-figure:: node-states-8.dot
    :alt:   node-states-8.dot
    :align: center
19
Documentation/admin-guide/blockdev/drbd/index.rst
Normal file
19
Documentation/admin-guide/blockdev/drbd/index.rst
Normal file
@ -0,0 +1,19 @@
==========================================
Distributed Replicated Block Device - DRBD
==========================================

Description
===========

DRBD is a shared-nothing, synchronously replicated block device. It
is designed to serve as a building block for high availability
clusters and, in this context, is a "drop-in" replacement for shared
storage. Simplistically, you could see it as a network RAID 1.

Please visit http://www.drbd.org to find out more.

.. toctree::
   :maxdepth: 1

   data-structure-v9
   figures
13
Documentation/admin-guide/blockdev/drbd/node-states-8.dot
Normal file
13
Documentation/admin-guide/blockdev/drbd/node-states-8.dot
Normal file
@ -0,0 +1,13 @@
digraph node_states {
	Secondary -> Primary   [ label = "ioctl_set_state()" ]
	Primary   -> Secondary [ label = "ioctl_set_state()" ]
}

digraph peer_states {
	Secondary -> Primary   [ label = "recv state packet" ]
	Primary   -> Secondary [ label = "recv state packet" ]
	Primary   -> Unknown   [ label = "connection lost" ]
	Secondary -> Unknown   [ label = "connection lost" ]
	Unknown   -> Primary   [ label = "connected" ]
	Unknown   -> Secondary [ label = "connected" ]
}
255
Documentation/admin-guide/blockdev/floppy.rst
Normal file
255
Documentation/admin-guide/blockdev/floppy.rst
Normal file
@ -0,0 +1,255 @@
|
|||||||
|
=============
Floppy Driver
=============

FAQ list:
=========

A FAQ list may be found in the fdutils package (see below), and also
at <http://fdutils.linux.lu/faq.html>.


LILO configuration options (Thinkpad users, read this)
======================================================

The floppy driver is configured using the 'floppy=' option in
lilo. This option can be typed at the boot prompt, or entered in the
lilo configuration file.

Example: If your kernel is called linux-2.6.9, type the following line
at the lilo boot prompt (if you have a thinkpad)::

    linux-2.6.9 floppy=thinkpad

You may also enter the following line in /etc/lilo.conf, in the description
of linux-2.6.9::

    append = "floppy=thinkpad"

Several floppy related options may be given, example::

    linux-2.6.9 floppy=daring floppy=two_fdc
    append = "floppy=daring floppy=two_fdc"

If you give options both in the lilo config file and on the boot
prompt, the option strings of both places are concatenated, the boot
prompt options coming last. That's why there are also options to
restore the default behavior.


Module configuration options
============================

If you use the floppy driver as a module, use the following syntax::

    modprobe floppy floppy="<options>"

Example::

    modprobe floppy floppy="omnibook messages"

If you need certain options enabled every time you load the floppy driver,
you can put::

    options floppy floppy="omnibook messages"

in a configuration file in /etc/modprobe.d/.


The floppy driver related options are:

floppy=asus_pci
    Sets the bit mask to allow only units 0 and 1. (default)

floppy=daring
    Tells the floppy driver that you have a well behaved floppy controller.
    This allows more efficient and smoother operation, but may fail on
    certain controllers. This may speed up certain operations.

floppy=0,daring
    Tells the floppy driver that your floppy controller should be used
    with caution.

floppy=one_fdc
    Tells the floppy driver that you have only one floppy controller.
    (default)

floppy=two_fdc / floppy=<address>,two_fdc
    Tells the floppy driver that you have two floppy controllers.
    The second floppy controller is assumed to be at <address>.
    This option is not needed if the second controller is at address
    0x370, and if you use the 'cmos' option.

floppy=thinkpad
    Tells the floppy driver that you have a Thinkpad. Thinkpads use an
    inverted convention for the disk change line.

floppy=0,thinkpad
    Tells the floppy driver that you don't have a Thinkpad.

floppy=omnibook / floppy=nodma
    Tells the floppy driver not to use DMA for data transfers.
    This is needed on HP Omnibooks, which don't have a workable
    DMA channel for the floppy driver. This option is also useful
    if you frequently get "Unable to allocate DMA memory" messages.
    Indeed, DMA memory needs to be contiguous in physical memory,
    and is thus harder to find, whereas non-DMA buffers may be
    allocated in virtual memory. However, I advise against this if
    you have an FDC without a FIFO (8272A or 82072). 82072A and
    later are OK. You also need at least a 486 to use nodma.
    If you use nodma mode, I suggest you also set the FIFO
    threshold to 10 or lower, in order to limit the number of data
    transfer interrupts.

    If you have a FIFO-able FDC, the floppy driver automatically
    falls back on non-DMA mode if no DMA-able memory can be found.
    If you want to avoid this, explicitly ask for 'yesdma'.

floppy=yesdma
    Tells the floppy driver that a workable DMA channel is available.
    (default)

floppy=nofifo
    Disables the FIFO entirely. This is needed if you get "Bus
    master arbitration error" messages from your Ethernet card (or
    from other devices) while accessing the floppy.

floppy=usefifo
    Enables the FIFO. (default)

floppy=<threshold>,fifo_depth
    Sets the FIFO threshold. This is mostly relevant in DMA
    mode. If this is higher, the floppy driver tolerates more
    interrupt latency, but it triggers more interrupts (i.e. it
    imposes more load on the rest of the system). If this is
    lower, the interrupt latency should be lower too (faster
    processor). The benefit of a lower threshold is fewer
    interrupts.

    To tune the fifo threshold, switch on over/underrun messages
    using 'floppycontrol --messages'. Then access a floppy
    disk. If you get a huge amount of "Over/Underrun - retrying"
    messages, then the fifo threshold is too low. Try with a
    higher value, until you only get an occasional Over/Underrun.
    It is a good idea to compile the floppy driver as a module
    when doing this tuning. Indeed, it allows trying different
    fifo values without rebooting the machine for each test. Note
    that you need to do 'floppycontrol --messages' every time you
    re-insert the module.

    Usually, tuning the fifo threshold should not be needed, as
    the default (0xa) is reasonable.
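A tuning session might look like this (a sketch: the threshold values and
the transfer command are illustrative, and assume the driver is built as a
module and fdutils is installed)::

    modprobe floppy floppy="10,fifo_depth"
    floppycontrol --messages
    dd if=/dev/fd0 of=/dev/null bs=512 count=2880
    dmesg | grep -c "Over/Underrun"
    rmmod floppy
    modprobe floppy floppy="12,fifo_depth"
    floppycontrol --messages

When the grep count drops to an occasional hit, the threshold is high enough.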

floppy=<drive>,<type>,cmos
    Sets the CMOS type of <drive> to <type>. This is mandatory if
    you have more than two floppy drives (only two can be
    described in the physical CMOS), or if your BIOS uses
    non-standard CMOS types. The CMOS types are:

    == ==================================
     0 Use the value of the physical CMOS
     1 5 1/4 DD
     2 5 1/4 HD
     3 3 1/2 DD
     4 3 1/2 HD
     5 3 1/2 ED
     6 3 1/2 ED
    16 unknown or not installed
    == ==================================

    (Note: there are two valid types for ED drives. This is because 5 was
    initially chosen to represent floppy *tapes*, and 6 for ED drives.
    AMI ignored this, and used 5 for ED drives. That's why the floppy
    driver handles both.)
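For example, a hypothetical third drive (unit 2) that is a 3 1/2 HD could be
declared either at the boot prompt or in /etc/lilo.conf (the unit and type
values here are illustrative)::

    linux-2.6.9 floppy=2,4,cmos
    append = "floppy=2,4,cmos"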

floppy=unexpected_interrupts
    Print a warning message when an unexpected interrupt is received.
    (default)

floppy=no_unexpected_interrupts / floppy=L40SX
    Don't print a message when an unexpected interrupt is received. This
    is needed on IBM L40SX laptops in certain video modes. (There seems
    to be an interaction between video and floppy. The unexpected
    interrupts affect only performance, and can be safely ignored.)

floppy=broken_dcl
    Don't use the disk change line, but assume that the disk was
    changed whenever the device node is reopened. Needed on some
    boxes where the disk change line is broken or unsupported.
    This should be regarded as a stopgap measure, indeed it makes
    floppy operation less efficient due to unneeded cache
    flushings, and slightly more unreliable. Please verify your
    cable, connection and jumper settings if you have any DCL
    problems. However, some older drives, and also some laptops
    are known not to have a DCL.

floppy=debug
    Print debugging messages.

floppy=messages
    Print informational messages for some operations (disk change
    notifications, warnings about over and underruns, and about
    autodetection).

floppy=silent_dcl_clear
    Uses a less noisy way to clear the disk change line (which
    doesn't involve seeks). Implied by 'daring' option.

floppy=<nr>,irq
    Sets the floppy IRQ to <nr> instead of 6.

floppy=<nr>,dma
    Sets the floppy DMA channel to <nr> instead of 2.

floppy=slow
    Use PS/2 stepping rate::

        PS/2 floppies have much slower step rates than regular floppies.
        It's been recommended to use about 1/4 of the default speed
        in some more extreme cases.


Supporting utilities and additional documentation:
==================================================

Additional parameters of the floppy driver can be configured at
runtime. Utilities which do this can be found in the fdutils package.
This package also contains a new version of mtools which allows one to
access high capacity disks (up to 1992K on a high density 3 1/2 disk!).
It also contains additional documentation about the floppy driver.

The latest version can be found at the fdutils homepage:

    http://fdutils.linux.lu

The fdutils releases can be found at:

    http://fdutils.linux.lu/download.html

    http://www.tux.org/pub/knaff/fdutils/

    ftp://metalab.unc.edu/pub/Linux/utils/disk-management/

Reporting problems about the floppy driver
==========================================

If you have a question or a bug report about the floppy driver, mail
me at Alain.Knaff@poboxes.com . If you post to Usenet, preferably use
comp.os.linux.hardware. As the volume in these groups is rather high,
be sure to include the word "floppy" (or "FLOPPY") in the subject
line. If the reported problem happens when mounting floppy disks, be
sure to also mention the type of the filesystem in the subject line.

Be sure to read the FAQ before mailing/posting any bug reports!

Alain

Changelog
=========

10-30-2004 :
    Cleanup, updating, add reference to module configuration.
    James Nelson <james4765@gmail.com>

6-3-2000 :
    Original Document
16
Documentation/admin-guide/blockdev/index.rst
Normal file
16
Documentation/admin-guide/blockdev/index.rst
Normal file
@ -0,0 +1,16 @@
|
|||||||
|
.. SPDX-License-Identifier: GPL-2.0

=============
Block Devices
=============

.. toctree::
   :maxdepth: 1

   floppy
   nbd
   paride
   ramdisk
   zram

   drbd/index

Documentation/admin-guide/blockdev/nbd.rst (new file, 31 lines)
@@ -0,0 +1,31 @@
==================================
Network Block Device (TCP version)
==================================

1) Overview
-----------

What is it: With this compiled in the kernel (or as a module), Linux
can use a remote server as one of its block devices. So every time
the client computer wants to read, e.g., /dev/nb0, it sends a
request over TCP to the server, which will reply with the data read.
This can be used for stations with low disk space (or even diskless)
to borrow disk space from another computer.
Unlike NFS, it is possible to put any filesystem on it, etc.

For more information, or to download the nbd-client and nbd-server
tools, go to http://nbd.sf.net/.

The nbd kernel module need only be installed on the client
system, as the nbd-server is completely in userspace. In fact,
the nbd-server has been successfully ported to other operating
systems, including Windows.

A) NBD parameters
-----------------

max_part
    Number of partitions per device (default: 0).

nbds_max
    Number of block devices that should be initialized (default: 16).
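For example, to load the module with eight devices, each allowing up to 15
partitions (the values here are illustrative), one might use::

    modprobe nbd nbds_max=8 max_part=15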

Documentation/admin-guide/blockdev/paride.rst (new file, 439 lines)
@@ -0,0 +1,439 @@
===================================
Linux and parallel port IDE devices
===================================

PARIDE v1.03 (c) 1997-8 Grant Guenther <grant@torque.net>

1. Introduction
===============

Owing to the simplicity and near universality of the parallel port interface
to personal computers, many external devices such as portable hard-disk,
CD-ROM, LS-120 and tape drives use the parallel port to connect to their
host computer. While some devices (notably scanners) use ad-hoc methods
to pass commands and data through the parallel port interface, most
external devices are actually identical to an internal model, but with
a parallel-port adapter chip added in. Some of the original parallel port
adapters were little more than mechanisms for multiplexing a SCSI bus.
(The Iomega PPA-3 adapter used in the ZIP drives is an example of this
approach). Most current designs, however, take a different approach.
The adapter chip reproduces a small ISA or IDE bus in the external device
and the communication protocol provides operations for reading and writing
device registers, as well as data block transfer functions. Sometimes,
the device being addressed via the parallel cable is a standard SCSI
controller like an NCR 5380. The "ditto" family of external tape
drives use the ISA replicator to interface a floppy disk controller,
which is then connected to a floppy-tape mechanism. The vast majority
of external parallel port devices, however, are now based on standard
IDE type devices, which require no intermediate controller. If one
were to open up a parallel port CD-ROM drive, for instance, one would
find a standard ATAPI CD-ROM drive, a power supply, and a single adapter
that interconnected a standard PC parallel port cable and a standard
IDE cable. It is usually possible to exchange the CD-ROM device with
any other device using the IDE interface.

This document describes the support in Linux for parallel port IDE
devices. It does not cover parallel port SCSI devices, "ditto" tape
drives or scanners. Many different devices are supported by the
parallel port IDE subsystem, including:

- MicroSolutions backpack CD-ROM
- MicroSolutions backpack PD/CD
- MicroSolutions backpack hard-drives
- MicroSolutions backpack 8000t tape drive
- SyQuest EZ-135, EZ-230 & SparQ drives
- Avatar Shark
- Imation Superdisk LS-120
- Maxell Superdisk LS-120
- FreeCom Power CD
- Hewlett-Packard 5GB and 8GB tape drives
- Hewlett-Packard 7100 and 7200 CD-RW drives

as well as most of the clone and no-name products on the market.

To support such a wide range of devices, PARIDE, the parallel port IDE
subsystem, is actually structured in three parts. There is a base
paride module which provides a registry and some common methods for
accessing the parallel ports. The second component is a set of
high-level drivers for each of the different types of supported devices:

    === =============
    pd  IDE disk
    pcd ATAPI CD-ROM
    pf  ATAPI disk
    pt  ATAPI tape
    pg  ATAPI generic
    === =============

(Currently, the pg driver is only used with CD-R drives).

The high-level drivers function according to the relevant standards.
The third component of PARIDE is a set of low-level protocol drivers
for each of the parallel port IDE adapter chips. Thanks to the interest
and encouragement of Linux users from many parts of the world,
support is available for almost all known adapter protocols:

    ==== ====================================== ====
    aten ATEN EH-100                            (HK)
    bpck MicroSolutions backpack                (US)
    comm DataStor (old-type) "commuter" adapter (TW)
    dstr DataStor EP-2000                       (TW)
    epat Shuttle EPAT                           (UK)
    epia Shuttle EPIA                           (UK)
    fit2 FIT TD-2000                            (US)
    fit3 FIT TD-3000                            (US)
    friq Freecom IQ cable                       (DE)
    frpw Freecom Power                          (DE)
    kbic KingByte KBIC-951A and KBIC-971A       (TW)
    ktti KT Technology PHd adapter              (SG)
    on20 OnSpec 90c20                           (US)
    on26 OnSpec 90c26                           (US)
    ==== ====================================== ====


2. Using the PARIDE subsystem
=============================

While configuring the Linux kernel, you may choose either to build
the PARIDE drivers into your kernel, or to build them as modules.

In either case, you will need to select "Parallel port IDE device support"
as well as at least one of the high-level drivers and at least one
of the parallel port communication protocols. If you do not know
what kind of parallel port adapter is used in your drive, you could
begin by checking the file names and any text files on your DOS
installation floppy. Alternatively, you can look at the markings on
the adapter chip itself. That's usually sufficient to identify the
correct device.

You can actually select all the protocol modules, and allow the PARIDE
subsystem to try them all for you.

For the "brand-name" products listed above, here are the protocol
and high-level drivers that you would use:

    ================ ============ ====== ========
    Manufacturer     Model        Driver Protocol
    ================ ============ ====== ========
    MicroSolutions   CD-ROM       pcd    bpck
    MicroSolutions   PD drive     pf     bpck
    MicroSolutions   hard-drive   pd     bpck
    MicroSolutions   8000t tape   pt     bpck
    SyQuest          EZ, SparQ    pd     epat
    Imation          Superdisk    pf     epat
    Maxell           Superdisk    pf     friq
    Avatar           Shark        pd     epat
    FreeCom          CD-ROM       pcd    frpw
    Hewlett-Packard  5GB Tape     pt     epat
    Hewlett-Packard  7200e (CD)   pcd    epat
    Hewlett-Packard  7200e (CD-R) pg     epat
    ================ ============ ====== ========

2.1 Configuring built-in drivers
--------------------------------

We recommend that you get to know how the drivers work and how to
configure them as loadable modules, before attempting to compile a
kernel with the drivers built-in.

If you built all of your PARIDE support directly into your kernel,
and you have just a single parallel port IDE device, your kernel should
locate it automatically for you. If you have more than one device,
you may need to give some command line options to your bootloader
(e.g. LILO); how to do that is beyond the scope of this document.

The high-level drivers accept a number of command line parameters, all
of which are documented in the source files in linux/drivers/block/paride.
By default, each driver will automatically try all parallel ports it
can find, and all protocol types that have been installed, until it finds
a parallel port IDE adapter. Once it finds one, the probe stops. So,
if you have more than one device, you will need to tell the drivers
how to identify them. This requires specifying the port address, the
protocol identification number and, for some devices, the drive's
chain ID. While your system is booting, a number of messages are
displayed on the console. Like all such messages, they can be
reviewed with the 'dmesg' command. Among those messages will be
some lines like::

    paride: bpck registered as protocol 0
    paride: epat registered as protocol 1

The numbers will always be the same until you build a new kernel with
different protocol selections. You should note these numbers as you
will need them to identify the devices.

If you happen to be using a MicroSolutions backpack device, you will
also need to know the unit ID number for each drive. This is usually
the last two digits of the drive's serial number (but read MicroSolutions'
documentation about this).

As an example, let's assume that you have a MicroSolutions PD/CD drive
with unit ID number 36 connected to the parallel port at 0x378, a SyQuest
EZ-135 connected to the chained port on the PD/CD drive and also an
Imation Superdisk connected to port 0x278. You could give the following
options on your boot command::

    pd.drive0=0x378,1 pf.drive0=0x278,1 pf.drive1=0x378,0,36

In the last option, pf.drive1 configures device /dev/pf1, the 0x378
is the parallel port base address, the 0 is the protocol registration
number and 36 is the chain ID.

Please note: while PARIDE will work both with and without the
PARPORT parallel port sharing system that is included by the
"Parallel port support" option, PARPORT must be included and enabled
if you want to use chains of devices on the same parallel port.

2.2 Loading and configuring PARIDE as modules
---------------------------------------------

It is much faster and simpler to get to understand the PARIDE drivers
if you use them as loadable kernel modules.

Note 1:
    using these drivers with the "kerneld" automatic module loading
    system is not recommended for beginners, and is not documented here.

Note 2:
    if you build PARPORT support as a loadable module, PARIDE must
    also be built as loadable modules, and PARPORT must be loaded before
    the PARIDE modules.

To use PARIDE, you must begin by::

    insmod paride

This loads a base module which provides a registry for the protocols,
among other tasks.

Then, load as many of the protocol modules as you think you might need.
As you load each module, it will register the protocols that it supports,
and print a log message to your kernel log file and your console. For
example::

    # insmod epat
    paride: epat registered as protocol 0
    # insmod kbic
    paride: k951 registered as protocol 1
    paride: k971 registered as protocol 2

Finally, you can load high-level drivers for each kind of device that
you have connected. By default, each driver will autoprobe for a single
device, but you can support up to four similar devices by giving their
individual co-ordinates when you load the driver.

For example, if you had two no-name CD-ROM drives both using the
KingByte KBIC-951A adapter, one on port 0x378 and the other on 0x3bc,
you could give the following command::

    # insmod pcd drive0=0x378,1 drive1=0x3bc,1

For most adapters, giving a port address and protocol number is sufficient,
but check the source files in linux/drivers/block/paride for more
information. (Hopefully someone will write some man pages one day!)

As another example, here's what happens when PARPORT is installed, and
a SyQuest EZ-135 is attached to port 0x378::

    # insmod paride
    paride: version 1.0 installed
    # insmod epat
    paride: epat registered as protocol 0
    # insmod pd
    pd: pd version 1.0, major 45, cluster 64, nice 0
    pda: Sharing parport1 at 0x378
    pda: epat 1.0, Shuttle EPAT chip c3 at 0x378, mode 5 (EPP-32), delay 1
    pda: SyQuest EZ135A, 262144 blocks [128M], (512/16/32), removable media
    pda: pda1

Note that the last line is the output from the generic partition table
scanner - in this case it reports that it has found a disk with one partition.

2.3 Using a PARIDE device
-------------------------

Once the drivers have been loaded, you can access PARIDE devices in the
same way as their traditional counterparts. You will probably need to
create the device "special files". Here is a simple script that you can
cut to a file and execute::

    #!/bin/bash
    #
    # mkd -- a script to create the device special files for the PARIDE subsystem
    #
    function mkdev {
        mknod $1 $2 $3 $4 ; chmod 0660 $1 ; chown root:disk $1
    }
    #
    function pd {
        D=$( printf \\$( printf "x%03x" $[ $1 + 97 ] ) )
        mkdev pd$D b 45 $[ $1 * 16 ]
        for P in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
        do mkdev pd$D$P b 45 $[ $1 * 16 + $P ]
        done
    }
    #
    cd /dev
    #
    for u in 0 1 2 3 ; do pd $u ; done
    for u in 0 1 2 3 ; do mkdev pcd$u b 46 $u ; done
    for u in 0 1 2 3 ; do mkdev pf$u b 47 $u ; done
    for u in 0 1 2 3 ; do mkdev pt$u c 96 $u ; done
    for u in 0 1 2 3 ; do mkdev npt$u c 96 $[ $u + 128 ] ; done
    for u in 0 1 2 3 ; do mkdev pg$u c 97 $u ; done
    #
    # end of mkd

With the device files and drivers in place, you can access PARIDE devices
like any other Linux device. For example, to mount a CD-ROM in pcd0, use::

    mount /dev/pcd0 /cdrom

If you have a fresh Avatar Shark cartridge, and the drive is pda, you
might do something like::

    fdisk /dev/pda      -- make a new partition table with
                           partition 1 of type 83

    mke2fs /dev/pda1    -- to build the file system

    mkdir /shark        -- make a place to mount the disk

    mount /dev/pda1 /shark

Devices like the Imation superdisk work in the same way, except that
they do not have a partition table. For example, to make a 120MB
floppy that you could share with a DOS system::

    mkdosfs /dev/pf0
    mount /dev/pf0 /mnt


2.4 The pf driver
-----------------

The pf driver is intended for use with parallel port ATAPI disk
devices. The most common devices in this category are PD drives
and LS-120 drives. Traditionally, media for these devices are not
partitioned. Consequently, the pf driver does not support partitioned
media. This may be changed in a future version of the driver.

2.5 Using the pt driver
-----------------------

The pt driver for parallel port ATAPI tape drives is a minimal driver.
It does not yet support many of the standard tape ioctl operations.
For best performance, a block size of 32KB should be used. You will
probably want to set the parallel port delay to 0, if you can.
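For instance, a backup to the first parallel-port tape using the recommended
32KB block size might look like this (the archived directory is illustrative)::

    tar -cf - /home | dd of=/dev/pt0 bs=32k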

2.6 Using the pg driver
-----------------------

The pg driver can be used in conjunction with the cdrecord program
to create CD-ROMs. Please get cdrecord version 1.6.1 or later
from ftp://ftp.fokus.gmd.de/pub/unix/cdrecord/ . To record CD-R media
your parallel port should ideally be set to EPP mode, and the "port delay"
should be set to 0. With those settings it is possible to record at 2x
speed without any buffer underruns. If you cannot get the driver to work
in EPP mode, try to use "bidirectional" or "PS/2" mode and 1x speeds only.


3. Troubleshooting
==================

3.1 Use EPP mode if you can
---------------------------

The most common problems that people report with the PARIDE drivers
concern the parallel port CMOS settings. At this time, none of the
PARIDE protocol modules support ECP mode, or any ECP combination modes.
If you are able to do so, please set your parallel port into EPP mode
using your CMOS setup procedure.

3.2 Check the port delay
------------------------

Some parallel ports cannot reliably transfer data at full speed. To
offset the errors, the PARIDE protocol modules introduce a "port
delay" between each access to the i/o ports. Each protocol sets
a default value for this delay. In most cases, the user can override
the default and set it to 0 - resulting in somewhat higher transfer
rates. In some rare cases (especially with older 486 systems) the
default delays are not long enough. If you experience corrupt data
transfers, or unexpected failures, you may wish to increase the
port delay. The delay can be programmed using the "driveN" parameters
to each of the high-level drivers. Please see the notes above, or
read the comments at the beginning of the driver source files in
linux/drivers/block/paride.
|
||||||
|
|
||||||
|
3.3 Some drives need a printer reset
|
||||||
|
-------------------------------------
|
||||||
|
|
||||||
|
There appear to be a number of "noname" external drives on the market
|
||||||
|
that do not always power up correctly. We have noticed this with some
|
||||||
|
drives based on OnSpec and older Freecom adapters. In these rare cases,
|
||||||
|
the adapter can often be reinitialised by issuing a "printer reset" on
|
||||||
|
the parallel port. As the reset operation is potentially disruptive in
|
||||||
|
multiple device environments, the PARIDE drivers will not do it
|
||||||
|
automatically. You can however, force a printer reset by doing::
|
||||||
|
|
||||||
|
insmod lp reset=1
|
||||||
|
rmmod lp
|
||||||
|
|
||||||
|
If you have one of these marginal cases, you should probably build
|
||||||
|
your paride drivers as modules, and arrange to do the printer reset
|
||||||
|
before loading the PARIDE drivers.
|
||||||
|
|
||||||
|
3.4 Use the verbose option and dmesg if you need help
|
||||||
|
------------------------------------------------------
|
||||||
|
|
||||||
|
While a lot of testing has gone into these drivers to make them work
|
||||||
|
as smoothly as possible, problems will arise. If you do have problems,
|
||||||
|
please check all the obvious things first: does the drive work in
|
||||||
|
DOS with the manufacturer's drivers ? If that doesn't yield any useful
|
||||||
|
clues, then please make sure that only one drive is hooked to your system,
|
||||||
|
and that either (a) PARPORT is enabled or (b) no other device driver
|
||||||
|
is using your parallel port (check in /proc/ioports). Then, load the
|
||||||
|
appropriate drivers (you can load several protocol modules if you want)
|
||||||
|
as in::
|
||||||
|
|
||||||
|
# insmod paride
|
||||||
|
# insmod epat
|
||||||
|
# insmod bpck
|
||||||
|
# insmod kbic
|
||||||
|
...
|
||||||
|
# insmod pd verbose=1
|
||||||
|
|
||||||
|
(using the correct driver for the type of device you have, of course).
|
||||||
|
The verbose=1 parameter will cause the drivers to log a trace of their
|
||||||
|
activity as they attempt to locate your drive.
|
||||||
|
|
||||||
|
Use 'dmesg' to capture a log of all the PARIDE messages (any messages
|
||||||
|
beginning with paride:, a protocol module's name or a driver's name) and
|
||||||
|
include that with your bug report. You can submit a bug report in one
|
||||||
|
of two ways. Either send it directly to the author of the PARIDE suite,
|
||||||
|
by e-mail to grant@torque.net, or join the linux-parport mailing list
|
||||||
|
and post your report there.
|
||||||
|
|
||||||
|
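For quickly prefiltering a saved log before attaching it to a report, a grep
over the message prefixes just described works. This is only a sketch: the
sample log lines and the exact module list are made up for illustration.

```shell
# Made-up sample of a mixed kernel log (not real driver output)
log='paride: version 1.06 installed
usb 1-1: new full-speed USB device
pd: pd version 1.05, major 45, cluster 64
epat: epat version 1.02'

# Keep only PARIDE-related lines: "paride:" plus driver/protocol prefixes
printf '%s\n' "$log" | grep -E '^(paride|pd|epat|bpck|kbic):'
```

Adjust the prefix list to whichever protocol and high-level driver modules
you actually loaded.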
3.5 For more information or help
--------------------------------

You can join the linux-parport mailing list by sending a mail message
to:

    linux-parport-request@torque.net

with the single word::

    subscribe

in the body of the mail message (not in the subject line). Please be
sure that your mail program is correctly set up when you do this, as
the list manager is a robot that will subscribe you using the reply
address in your mail headers. REMOVE any anti-spam gimmicks you may
have in your mail headers, when sending mail to the list server.

You might also find some useful information on the linux-parport
web pages (although they are not always up to date) at

    http://web.archive.org/web/%2E/http://www.torque.net/parport/

Documentation/admin-guide/blockdev/ramdisk.rst (new file, 177 lines)

==========================================
Using the RAM disk block device with Linux
==========================================

.. Contents:

    1) Overview
    2) Kernel Command Line Parameters
    3) Using "rdev -r"
    4) An Example of Creating a Compressed RAM Disk


1) Overview
-----------

The RAM disk driver is a way to use main system memory as a block device. It
is required for initrd, an initial filesystem used if you need to load modules
in order to access the root filesystem (see Documentation/admin-guide/initrd.rst). It can
also be used for a temporary filesystem for crypto work, since the contents
are erased on reboot.

The RAM disk dynamically grows as more space is required. It does this by using
RAM from the buffer cache. The driver marks the buffers it is using as dirty
so that the VM subsystem does not try to reclaim them later.

The RAM disk supports up to 16 RAM disks by default, and can be reconfigured
to support an unlimited number of RAM disks (at your own risk). Just change
the configuration symbol BLK_DEV_RAM_COUNT in the Block drivers config menu
and (re)build the kernel.

To use RAM disk support with your system, run './MAKEDEV ram' from the /dev
directory. RAM disks are all major number 1, and start with minor number 0
for /dev/ram0, etc. If used, modern kernels use /dev/ram0 for an initrd.

The new RAM disk also has the ability to load compressed RAM disk images,
allowing one to squeeze more programs onto an average installation or
rescue floppy disk.


2) Parameters
-------------

2a) Kernel Command Line Parameters

    ramdisk_size=N
        Size of the ramdisk.

This parameter tells the RAM disk driver to set up RAM disks of N k size. The
default is 4096 (4 MB).

2b) Module parameters

    rd_nr
        /dev/ramX devices created.
    max_part
        Maximum partition number.
    rd_size
        See ramdisk_size.

3) Using "rdev -r"
------------------

The usage of the word (two bytes) that "rdev -r" sets in the kernel image is
as follows. The low 11 bits (0 -> 10) specify an offset (in 1 k blocks) of up
to 2 MB (2^11) of where to find the RAM disk (this used to be the size). Bit
14 indicates that a RAM disk is to be loaded, and bit 15 indicates whether a
prompt/wait sequence is to be given before trying to read the RAM disk. Since
the RAM disk dynamically grows as data is being written into it, a size field
is not required. Bits 11 to 13 are not currently used and may as well be zero.
These numbers are no magical secrets, as seen below::

    ./arch/x86/kernel/setup.c:#define RAMDISK_IMAGE_START_MASK 0x07FF
    ./arch/x86/kernel/setup.c:#define RAMDISK_PROMPT_FLAG      0x8000
    ./arch/x86/kernel/setup.c:#define RAMDISK_LOAD_FLAG        0x4000

Consider a typical two floppy disk setup, where you will have the
kernel on disk one, and have already put a RAM disk image onto disk #2.

Hence you want to set bits 0 to 13 as 0, meaning that your RAM disk
starts at an offset of 0 kB from the beginning of the floppy.
The command line equivalent is: "ramdisk_start=0"

You want bit 14 as one, indicating that a RAM disk is to be loaded.
The command line equivalent is: "load_ramdisk=1"

You want bit 15 as one, indicating that you want a prompt/keypress
sequence so that you have a chance to switch floppy disks.
The command line equivalent is: "prompt_ramdisk=1"

Putting that together gives 2^15 + 2^14 + 0 = 49152 for an rdev word.
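The flag arithmetic can be reproduced with plain shell arithmetic. A small
sketch; the names PROMPT_FLAG and LOAD_FLAG are chosen here, mirroring the
RAMDISK_* defines quoted from setup.c:

```shell
PROMPT_FLAG=$(( 1 << 15 ))   # 0x8000: prompt before loading the RAM disk
LOAD_FLAG=$(( 1 << 14 ))     # 0x4000: a RAM disk is to be loaded
START=0                      # bits 0-10: offset in 1 kB blocks

word=$(( PROMPT_FLAG | LOAD_FLAG | START ))
echo "$word"   # 49152
```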
So to create disk one of the set, you would do::

    /usr/src/linux# cat arch/x86/boot/zImage > /dev/fd0
    /usr/src/linux# rdev /dev/fd0 /dev/fd0
    /usr/src/linux# rdev -r /dev/fd0 49152

If you make a boot disk that has LILO, then for the above, you would use::

    append = "ramdisk_start=0 load_ramdisk=1 prompt_ramdisk=1"

Since the default start = 0 and the default prompt = 1, you could use::

    append = "load_ramdisk=1"


4) An Example of Creating a Compressed RAM Disk
-----------------------------------------------

To create a RAM disk image, you will need a spare block device to
construct it on. This can be the RAM disk device itself, or an
unused disk partition (such as an unmounted swap partition). For this
example, we will use the RAM disk device, "/dev/ram0".

Note: This technique should not be done on a machine with less than 8 MB
of RAM. If using a spare disk partition instead of /dev/ram0, then this
restriction does not apply.

a) Decide on the RAM disk size that you want. Say 2 MB for this example.
   Create it by writing to the RAM disk device. (This step is not currently
   required, but may be in the future.) It is wise to zero out the
   area (esp. for disks) so that maximal compression is achieved for
   the unused blocks of the image that you are about to create::

    dd if=/dev/zero of=/dev/ram0 bs=1k count=2048

b) Make a filesystem on it. Say ext2fs for this example::

    mke2fs -vm0 /dev/ram0 2048

c) Mount it, copy the files you want to it (eg: /etc/* /dev/* ...)
   and unmount it again.

d) Compress the contents of the RAM disk. The level of compression
   will be approximately 50% of the space used by the files. Unused
   space on the RAM disk will compress to almost nothing::

    dd if=/dev/ram0 bs=1k count=2048 | gzip -v9 > /tmp/ram_image.gz

e) Put the kernel onto the floppy::

    dd if=zImage of=/dev/fd0 bs=1k

f) Put the RAM disk image onto the floppy, after the kernel. Use an offset
   that is slightly larger than the kernel, so that you can put another
   (possibly larger) kernel onto the same floppy later without overlapping
   the RAM disk image. An offset of 400 kB for kernels about 350 kB in
   size would be reasonable. Make sure offset+size of ram_image.gz is
   not larger than the total space on your floppy (usually 1440 kB)::

    dd if=/tmp/ram_image.gz of=/dev/fd0 bs=1k seek=400
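The capacity constraint in step (f) is simple arithmetic; a quick shell
sanity check using the numbers from the example (350 kB kernel, 400 kB
offset, 1440 kB floppy - all taken from the text above):

```shell
FLOPPY_KB=1440   # total space on the floppy
OFFSET_KB=400    # seek offset used for the RAM disk image
KERNEL_KB=350    # approximate kernel size from the example

# the kernel must fit below the offset...
[ "$KERNEL_KB" -le "$OFFSET_KB" ] && echo "kernel fits below the offset"

# ...and the compressed image in whatever remains
MAX_IMAGE_KB=$(( FLOPPY_KB - OFFSET_KB ))
echo "$MAX_IMAGE_KB"   # 1040 kB left for ram_image.gz
```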
g) Use "rdev" to set the boot device, RAM disk offset, prompt flag, etc.
   For prompt_ramdisk=1, load_ramdisk=1, ramdisk_start=400, one would
   have 2^15 + 2^14 + 400 = 49552::

    rdev /dev/fd0 /dev/fd0
    rdev -r /dev/fd0 49552

That is it. You now have your boot/root compressed RAM disk floppy. Some
users may wish to combine steps (d) and (f) by using a pipe.


Paul Gortmaker 12/95

Changelog:
----------

10-22-04 :
    Updated to reflect changes in command line options, remove
    obsolete references, general cleanup.
    James Nelson (james4765@gmail.com)

12-95 :
    Original Document

Documentation/admin-guide/blockdev/zram.rst (new file, 422 lines)

========================================
zram: Compressed RAM based block devices
========================================

Introduction
============

The zram module creates RAM based block devices named /dev/zram<id>
(<id> = 0, 1, ...). Pages written to these disks are compressed and stored
in memory itself. These disks allow very fast I/O and compression provides
good amounts of memory savings. Some of the usecases include /tmp storage,
use as swap disks, various caches under /var and maybe many more :)

Statistics for individual zram devices are exported through sysfs nodes at
/sys/block/zram<id>/

Usage
=====

There are several ways to configure and manage zram device(-s):

a) using zram and zram_control sysfs attributes
b) using zramctl utility, provided by util-linux (util-linux@vger.kernel.org).

In this document we will describe only 'manual' zram configuration steps,
IOW, zram and zram_control sysfs attributes.

In order to get a better idea about zramctl please consult util-linux
documentation, zramctl man-page or `zramctl --help`. Please be informed
that zram maintainers do not develop/maintain util-linux or zramctl; should
you have any questions please contact util-linux@vger.kernel.org

The following shows a typical sequence of steps for using zram.

WARNING
=======

For the sake of simplicity we skip error checking parts in most of the
examples below. However, it is your sole responsibility to handle errors.

zram sysfs attributes always return negative values in case of errors.
The list of possible return codes:

======== =============================================================
-EBUSY   an attempt to modify an attribute that cannot be changed once
         the device has been initialised. Please reset device first;
-ENOMEM  zram was not able to allocate enough memory to fulfil your
         needs;
-EINVAL  invalid input has been provided.
======== =============================================================

If you use 'echo', the return value is set by the 'echo' utility itself,
and, in the general case, something like::

    echo 3 > /sys/block/zram0/max_comp_streams
    if [ $? -ne 0 ]; then
        handle_error
    fi

should suffice.
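Since the examples below skip error checking, a tiny wrapper such as the
following can make failures visible; `set_attr` is a helper name invented
here for illustration, not part of zram or util-linux:

```shell
set_attr() {
    # write $2 into the sysfs attribute $1, reporting and failing on error
    if ! echo "$2" > "$1"; then
        echo "failed to write '$2' to $1" >&2
        return 1
    fi
}

# e.g.: set_attr /sys/block/zram0/max_comp_streams 3 || exit 1
```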
1) Load Module
==============

::

    modprobe zram num_devices=4

This creates 4 devices: /dev/zram{0,1,2,3}

The num_devices parameter is optional and tells zram how many devices should
be pre-created. Default: 1.

2) Set max number of compression streams
========================================

Regardless of the value passed to this attribute, ZRAM will always
allocate multiple compression streams - one per online CPU - thus
allowing several concurrent compression operations. The number of
allocated compression streams goes down when some of the CPUs
become offline. There is no single-compression-stream mode anymore,
unless you are running a UP system or have only 1 CPU online.

To find out how many streams are currently available::

    cat /sys/block/zram0/max_comp_streams

3) Select compression algorithm
===============================

Using the comp_algorithm device attribute one can see the available and
the currently selected (shown in square brackets) compression algorithms,
and change the selected compression algorithm (once the device is
initialised there is no way to change the compression algorithm).

Examples::

    #show supported compression algorithms
    cat /sys/block/zram0/comp_algorithm
    lzo [lz4]

    #select lzo compression algorithm
    echo lzo > /sys/block/zram0/comp_algorithm

For the time being, the `comp_algorithm` content does not necessarily
show every compression algorithm supported by the kernel. We keep this
list primarily to simplify device configuration, and one can configure
a new device with a compression algorithm that is not listed in
`comp_algorithm`. The thing is that, internally, ZRAM uses the Crypto API
and, if some of the algorithms were built as modules, it's impossible
to list all of them using, for instance, /proc/crypto or any other
method. This, however, has the advantage of permitting the usage of
custom crypto compression modules (implementing S/W or H/W compression).
4) Set Disksize
===============

Set disk size by writing the value to sysfs node 'disksize'.
The value can be either in bytes or you can use mem suffixes.
Examples::

    # Initialize /dev/zram0 with 50MB disksize
    echo $((50*1024*1024)) > /sys/block/zram0/disksize

    # Using mem suffixes
    echo 256K > /sys/block/zram0/disksize
    echo 512M > /sys/block/zram0/disksize
    echo 1G > /sys/block/zram0/disksize

Note:
There is little point creating a zram of greater than twice the size of memory
since we expect a 2:1 compression ratio. Note that zram uses about 0.1% of the
size of the disk when not in use so a huge zram is wasteful.
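The mem suffixes accepted by 'disksize' are the usual powers-of-two
multipliers; the byte values they expand to can be reproduced with shell
arithmetic (a sketch for illustration only):

```shell
K=1024
M=$(( K * 1024 ))
G=$(( M * 1024 ))

echo $(( 256 * K ))   # 262144, what "256K" denotes
echo $(( 512 * M ))   # 536870912, "512M"
echo "$G"             # 1073741824, "1G"
```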
5) Set memory limit: Optional
=============================

Set the memory limit by writing the value to sysfs node 'mem_limit'.
The value can be either in bytes or you can use mem suffixes.
In addition, you can change the value at runtime.
Examples::

    # limit /dev/zram0 with 50MB memory
    echo $((50*1024*1024)) > /sys/block/zram0/mem_limit

    # Using mem suffixes
    echo 256K > /sys/block/zram0/mem_limit
    echo 512M > /sys/block/zram0/mem_limit
    echo 1G > /sys/block/zram0/mem_limit

    # To disable memory limit
    echo 0 > /sys/block/zram0/mem_limit

6) Activate
===========

::

    mkswap /dev/zram0
    swapon /dev/zram0

    mkfs.ext4 /dev/zram1
    mount /dev/zram1 /tmp
7) Add/remove zram devices
==========================

zram provides a control interface, which enables dynamic (on-demand) device
addition and removal.

In order to add a new /dev/zramX device, perform a read operation on the
hot_add attribute. This will return either the new device's device id
(meaning that you can use /dev/zram<id>) or an error code.

Example::

    cat /sys/class/zram-control/hot_add
    1

To remove the existing /dev/zramX device (where X is a device id)
execute::

    echo X > /sys/class/zram-control/hot_remove
8) Stats
========

Per-device statistics are exported as various nodes under /sys/block/zram<id>/

A brief description of exported device attributes follows. For more details
please read Documentation/ABI/testing/sysfs-block-zram.

====================== ====== ===============================================
Name                   access description
====================== ====== ===============================================
disksize               RW     show and set the device's disk size
initstate              RO     shows the initialization state of the device
reset                  WO     trigger device reset
mem_used_max           WO     reset the `mem_used_max` counter (see later)
mem_limit              WO     specifies the maximum amount of memory ZRAM can
                              use to store the compressed data
writeback_limit        WO     specifies the maximum amount of write IO zram
                              can write out to backing device as 4KB unit
writeback_limit_enable RW     show and set writeback_limit feature
max_comp_streams       RW     the number of possible concurrent compress
                              operations
comp_algorithm         RW     show and change the compression algorithm
compact                WO     trigger memory compaction
debug_stat             RO     this file is used for zram debugging purposes
backing_dev            RW     set up backend storage for zram to write out
idle                   WO     mark allocated slot as idle
====================== ====== ===============================================
User space is advised to use the following files to read the device statistics.

File /sys/block/zram<id>/stat

Represents block layer statistics. Read Documentation/block/stat.rst for
details.

File /sys/block/zram<id>/io_stat

The stat file represents the device's I/O statistics not accounted by the
block layer and, thus, not available in the zram<id>/stat file. It consists
of a single line of text and contains the following stats separated by
whitespace:

============= =============================================================
failed_reads  The number of failed reads
failed_writes The number of failed writes
invalid_io    The number of non-page-size-aligned I/O requests
notify_free   Depending on device usage scenario it may account

              a) the number of pages freed because of swap slot free
                 notifications
              b) the number of pages freed because of
                 REQ_OP_DISCARD requests sent by bio. The former ones are
                 sent to a swap block device when a swap slot is freed,
                 which implies that this disk is being used as a swap disk.

              The latter ones are sent by a filesystem mounted with the
              discard option, whenever some data blocks are getting
              discarded.
============= =============================================================
File /sys/block/zram<id>/mm_stat

The stat file represents the device's mm statistics. It consists of a single
line of text and contains the following stats separated by whitespace:

================ =============================================================
orig_data_size   uncompressed size of data stored in this disk.
                 This excludes same-element-filled pages (same_pages) since
                 no memory is allocated for them.
                 Unit: bytes
compr_data_size  compressed size of data stored in this disk
mem_used_total   the amount of memory allocated for this disk. This
                 includes allocator fragmentation and metadata overhead,
                 allocated for this disk. So, allocator space efficiency
                 can be calculated using compr_data_size and this statistic.
                 Unit: bytes
mem_limit        the maximum amount of memory ZRAM can use to store
                 the compressed data
mem_used_max     the maximum amount of memory zram has consumed to
                 store the data
same_pages       the number of same element filled pages written to this disk.
                 No memory is allocated for such pages.
pages_compacted  the number of pages freed during compaction
huge_pages       the number of incompressible pages
================ =============================================================
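From a script, reading mm_stat is just a matter of splitting that single
line; a sketch (the sample line below is made-up data, not real device
output):

```shell
# Field order: orig_data_size compr_data_size mem_used_total mem_limit
#              mem_used_max same_pages pages_compacted huge_pages
mm_stat_line="8388608 2097152 2359296 0 2359296 10 0 1"
set -- $mm_stat_line      # intentionally unquoted: split on whitespace
orig=$1 compr=$2 used=$3

# integer compression ratio scaled by 100 (8 MB -> 2 MB gives 400)
echo $(( orig * 100 / compr ))
```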
File /sys/block/zram<id>/bd_stat

The stat file represents the device's backing device statistics. It consists
of a single line of text and contains the following stats separated by
whitespace:

============== =============================================================
bd_count       size of data written in backing device.
               Unit: 4K bytes
bd_reads       the number of reads from backing device
               Unit: 4K bytes
bd_writes      the number of writes to backing device
               Unit: 4K bytes
============== =============================================================
9) Deactivate
=============

::

    swapoff /dev/zram0
    umount /dev/zram1

10) Reset
=========

Write any positive value to the 'reset' sysfs node::

    echo 1 > /sys/block/zram0/reset
    echo 1 > /sys/block/zram1/reset

This frees all the memory allocated for the given device and
resets the disksize to zero. You must set the disksize again
before reusing the device.
Optional Feature
================

writeback
---------

With CONFIG_ZRAM_WRITEBACK, zram can write idle/incompressible pages
to backing storage rather than keeping them in memory.
To use the feature, admin should set up the backing device via::

    echo /dev/sda5 > /sys/block/zramX/backing_dev

before setting the disksize. Only a partition is supported at this moment.
If admin wants to use incompressible page writeback, they could do it via::

    echo huge > /sys/block/zramX/writeback

To use idle page writeback, first, the user needs to declare zram pages
as idle::

    echo all > /sys/block/zramX/idle

From now on, any pages on zram are idle pages. The idle mark
will be removed when someone requests access to the block.
IOW, unless there is an access request, those pages remain idle pages.

Admin can request writeback of those idle pages at the right timing via::

    echo idle > /sys/block/zramX/writeback

With the command, zram writes back idle pages from memory to the storage.

If there is lots of write IO with a flash device, it potentially has a
flash wearout problem, so admin needs to design a write limitation
to guarantee storage health for the entire product life.

To overcome the concern, zram supports the "writeback_limit" feature.
The default value of "writeback_limit_enable" is 0, so it doesn't limit
any writeback. IOW, if admin wants to apply a writeback budget, he should
enable writeback_limit_enable via::

    $ echo 1 > /sys/block/zramX/writeback_limit_enable

Once writeback_limit_enable is set, zram doesn't allow any writeback
until admin sets the budget via /sys/block/zramX/writeback_limit.

(If admin doesn't enable writeback_limit_enable, writeback_limit's value
assigned via /sys/block/zramX/writeback_limit is meaningless.)

If admin wants to limit writeback to 400M per day, he could do it
like below::

    $ MB_SHIFT=20
    $ FOURK_SHIFT=12
    $ echo $((400<<MB_SHIFT>>FOURK_SHIFT)) > \
        /sys/block/zram0/writeback_limit
    $ echo 1 > /sys/block/zram0/writeback_limit_enable

If admin wants to allow further writes again once the budget is exhausted,
he could do it like below::

    $ echo $((400<<MB_SHIFT>>FOURK_SHIFT)) > \
        /sys/block/zram0/writeback_limit

If admin wants to see the remaining writeback budget since he set it::

    $ cat /sys/block/zramX/writeback_limit

If admin wants to disable the writeback limit, he could do::

    $ echo 0 > /sys/block/zramX/writeback_limit_enable

The writeback_limit count will reset whenever you reset zram (e.g.,
system reboot, echo 1 > /sys/block/zramX/reset), so keeping track of how
much writeback has happened until you reset the zram, in order to allocate
an extra writeback budget at the next setting, is the user's job.

If admin wants to measure the writeback count in a certain period, he can
read it from the 3rd column of /sys/block/zram0/bd_stat.
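The shift arithmetic in the writeback_limit example converts megabytes into
4 KB units. A quick sketch verifying the 400 MB budget (variable names are
chosen here to be valid shell identifiers):

```shell
MB_SHIFT=20      # bytes per MB = 1 << 20
FOURK_SHIFT=12   # bytes per 4 KB page = 1 << 12

# 400 MB expressed in 4 KB units, the value written into writeback_limit
budget=$(( 400 << MB_SHIFT >> FOURK_SHIFT ))
echo "$budget"   # 102400
```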
memory tracking
===============

With CONFIG_ZRAM_MEMORY_TRACKING, the user can know information about each
zram block. It can be useful to catch cold or incompressible
pages of a process with *pagemap*.

If you enable the feature, you can see the block state via
/sys/kernel/debug/zram/zram0/block_state. The output is as follows::

      300    75.033841 .wh.
      301    63.806904 s...
      302    63.806919 ..hi

First column
    zram's block index.
Second column
    access time since the system was booted
Third column
    state of the block:

    s:
        same page
    w:
        written page to backing store
    h:
        huge page
    i:
        idle page

The first line of the above example says the 300th block was accessed at
75.033841 sec and the block's state is huge so it was written back to the
backing storage. It's a debugging feature so anyone shouldn't rely on it
to work properly.
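The four state letters can be decoded mechanically; a sketch that classifies
one line of the sample output above (the line is the made-up example data,
and the messages are invented here):

```shell
line="300 75.033841 .wh."
set -- $line              # intentionally unquoted: split into columns
index=$1 atime=$2 state=$3

case $state in
    *w*) echo "block $index was written back to the backing store" ;;
esac
case $state in
    *h*) echo "block $index holds a huge (incompressible) page" ;;
esac
```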

Nitin Gupta
ngupta@vflare.org
@@ -90,9 +90,9 @@ the disk is not available then you have three options:
     run a null modem to a second machine and capture the output there
     using your favourite communication program. Minicom works well.

-(3) Use Kdump (see Documentation/kdump/kdump.rst),
+(3) Use Kdump (see Documentation/admin-guide/kdump/kdump.rst),
     extract the kernel ring buffer from old memory with using dmesg
-    gdbmacro in Documentation/kdump/gdbmacros.txt.
+    gdbmacro in Documentation/admin-guide/kdump/gdbmacros.txt.

 Finding the bug's location
 --------------------------

Documentation/admin-guide/cgroup-v1/cgroups.rst (new file, 695 lines)

|
|||||||

==============
Control Groups
==============

Written by Paul Menage <menage@google.com> based on
Documentation/admin-guide/cgroup-v1/cpusets.rst

Original copyright statements from cpusets.txt:

Portions Copyright (C) 2004 BULL SA.

Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.

Modified by Paul Jackson <pj@sgi.com>

Modified by Christoph Lameter <cl@linux.com>

.. CONTENTS:

   1. Control Groups
     1.1 What are cgroups ?
     1.2 Why are cgroups needed ?
     1.3 How are cgroups implemented ?
     1.4 What does notify_on_release do ?
     1.5 What does clone_children do ?
     1.6 How do I use cgroups ?
   2. Usage Examples and Syntax
     2.1 Basic Usage
     2.2 Attaching processes
     2.3 Mounting hierarchies by name
   3. Kernel API
     3.1 Overview
     3.2 Synchronization
     3.3 Subsystem API
   4. Extended attribute usage
   5. Questions

1. Control Groups
=================

1.1 What are cgroups ?
----------------------

Control Groups provide a mechanism for aggregating/partitioning sets of
tasks, and all their future children, into hierarchical groups with
specialized behaviour.

Definitions:

A *cgroup* associates a set of tasks with a set of parameters for one
or more subsystems.

A *subsystem* is a module that makes use of the task grouping
facilities provided by cgroups to treat groups of tasks in
particular ways. A subsystem is typically a "resource controller" that
schedules a resource or applies per-cgroup limits, but it may be
anything that wants to act on a group of processes, e.g. a
virtualization subsystem.

A *hierarchy* is a set of cgroups arranged in a tree, such that
every task in the system is in exactly one of the cgroups in the
hierarchy, and a set of subsystems; each subsystem has system-specific
state attached to each cgroup in the hierarchy. Each hierarchy has
an instance of the cgroup virtual filesystem associated with it.

At any one time there may be multiple active hierarchies of task
cgroups. Each hierarchy is a partition of all tasks in the system.

User-level code may create and destroy cgroups by name in an
instance of the cgroup virtual file system, specify and query to
which cgroup a task is assigned, and list the task PIDs assigned to
a cgroup. Those creations and assignments only affect the hierarchy
associated with that instance of the cgroup file system.

On their own, the only use for cgroups is for simple job
tracking. The intention is that other subsystems hook into the generic
cgroup support to provide new attributes for cgroups, such as
accounting/limiting the resources which processes in a cgroup can
access. For example, cpusets (see Documentation/admin-guide/cgroup-v1/cpusets.rst) allow
you to associate a set of CPUs and a set of memory nodes with the
tasks in each cgroup.

1.2 Why are cgroups needed ?
----------------------------

There are multiple efforts to provide process aggregations in the
Linux kernel, mainly for resource-tracking purposes. Such efforts
include cpusets, CKRM/ResGroups, UserBeanCounters, and virtual server
namespaces. These all require the basic notion of a
grouping/partitioning of processes, with newly forked processes ending
up in the same group (cgroup) as their parent process.

The kernel cgroup patch provides the minimum essential kernel
mechanisms required to efficiently implement such groups. It has
minimal impact on the system fast paths, and provides hooks for
specific subsystems such as cpusets to provide additional behaviour as
desired.

Multiple hierarchy support is provided to allow for situations where
the division of tasks into cgroups is distinctly different for
different subsystems - having parallel hierarchies allows each
hierarchy to be a natural division of tasks, without having to handle
complex combinations of tasks that would be present if several
unrelated subsystems needed to be forced into the same tree of
cgroups.

At one extreme, each resource controller or subsystem could be in a
separate hierarchy; at the other extreme, all subsystems
would be attached to the same hierarchy.

As an example of a scenario (originally proposed by vatsa@in.ibm.com)
that can benefit from multiple hierarchies, consider a large
university server with various users - students, professors, system
tasks etc. The resource planning for this server could be along the
following lines::

   CPU :          "Top cpuset"
                  /          \
          CPUSet1            CPUSet2
             |                  |
       (Professors)        (Students)

   In addition (system tasks) are attached to topcpuset (so
   that they can run anywhere) with a limit of 20%

   Memory : Professors (50%), Students (30%), system (20%)

   Disk : Professors (50%), Students (30%), system (20%)

   Network : WWW browsing (20%), Network File System (60%), others (20%)
                   / \
   Professors (15%)   students (5%)

Browsers like Firefox/Lynx go into the WWW network class, while (k)nfsd goes
into the NFS network class.

At the same time Firefox/Lynx will share an appropriate CPU/Memory class
depending on who launched it (prof/student).

With the ability to classify tasks differently for different resources
(by putting those resource subsystems in different hierarchies),
the admin can easily set up a script which receives exec notifications
and, depending on who is launching the browser, can run::

   # echo browser_pid > /sys/fs/cgroup/<restype>/<userclass>/tasks

With only a single hierarchy, the admin would potentially have to create
a separate cgroup for every browser launched and associate it with
the appropriate network and other resource classes. This may lead to
proliferation of such cgroups.

Also, let's say that the administrator would like to give enhanced network
access temporarily to a student's browser (since it is night and the user
wants to do online gaming :)) OR give one of the student's simulation
apps enhanced CPU power.

With the ability to write PIDs directly to resource classes, it's just a
matter of::

   # echo pid > /sys/fs/cgroup/network/<new_class>/tasks
   (after some time)
   # echo pid > /sys/fs/cgroup/network/<orig_class>/tasks

Without this ability, the administrator would have to split the cgroup into
multiple separate ones and then associate the new cgroups with the
new resource classes.


1.3 How are cgroups implemented ?
---------------------------------

Control Groups extend the kernel as follows:

 - Each task in the system has a reference-counted pointer to a
   css_set.

 - A css_set contains a set of reference-counted pointers to
   cgroup_subsys_state objects, one for each cgroup subsystem
   registered in the system. There is no direct link from a task to
   the cgroup of which it's a member in each hierarchy, but this
   can be determined by following pointers through the
   cgroup_subsys_state objects. This is because accessing the
   subsystem state is something that's expected to happen frequently
   and in performance-critical code, whereas operations that require a
   task's actual cgroup assignments (in particular, moving between
   cgroups) are less common. A linked list runs through the cg_list
   field of each task_struct using the css_set, anchored at
   css_set->tasks.

 - A cgroup hierarchy filesystem can be mounted for browsing and
   manipulation from user space.

 - You can list all the tasks (by PID) attached to any cgroup.

The implementation of cgroups requires a few simple hooks
into the rest of the kernel, none in performance-critical paths:

 - in init/main.c, to initialize the root cgroups and initial
   css_set at system boot.

 - in fork and exit, to attach and detach a task from its css_set.

In addition, a new file system of type "cgroup" may be mounted, to
enable browsing and modifying the cgroups presently known to the
kernel. When mounting a cgroup hierarchy, you may specify a
comma-separated list of subsystems to mount as the filesystem mount
options. By default, mounting the cgroup filesystem attempts to
mount a hierarchy containing all registered subsystems.

If an active hierarchy with exactly the same set of subsystems already
exists, it will be reused for the new mount. If no existing hierarchy
matches, and any of the requested subsystems are in use in an existing
hierarchy, the mount will fail with -EBUSY. Otherwise, a new hierarchy
is activated, associated with the requested subsystems.

It's not currently possible to bind a new subsystem to an active
cgroup hierarchy, or to unbind a subsystem from an active cgroup
hierarchy. This may be possible in the future, but is fraught with nasty
error-recovery issues.

When a cgroup filesystem is unmounted, if there are any
child cgroups created below the top-level cgroup, that hierarchy
will remain active even though unmounted; if there are no
child cgroups then the hierarchy will be deactivated.

No new system calls are added for cgroups - all support for
querying and modifying cgroups is via this cgroup file system.

Each task under /proc has an added file named 'cgroup' displaying,
for each active hierarchy, the subsystem names and the cgroup name
as the path relative to the root of the cgroup file system.
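The colon-separated records in /proc/<pid>/cgroup can be split with plain shell. A minimal sketch, run against a sample line rather than a live file (the hierarchy ID, subsystem list, and path below are made up for illustration):

```shell
# Each /proc/<pid>/cgroup line has the form "hierarchy-id:subsystems:path".
line='4:cpuset:/Charlie'

# Split the three fields on ':' (IFS applies to this read only).
IFS=: read -r hier subsys path <<EOF
$line
EOF

echo "hierarchy=$hier subsystems=$subsys cgroup=$path"
```

On the sample line this prints `hierarchy=4 subsystems=cpuset cgroup=/Charlie`.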

Each cgroup is represented by a directory in the cgroup file system
containing the following files describing that cgroup:

 - tasks: list of tasks (by PID) attached to that cgroup. This list
   is not guaranteed to be sorted. Writing a thread ID into this file
   moves the thread into this cgroup.
 - cgroup.procs: list of thread group IDs in the cgroup. This list is
   not guaranteed to be sorted or free of duplicate TGIDs, and userspace
   should sort/uniquify the list if this property is required.
   Writing a thread group ID into this file moves all threads in that
   group into this cgroup.
 - notify_on_release flag: run the release agent on exit?
 - release_agent: the path to use for release notifications (this file
   exists in the top cgroup only)

Other subsystems such as cpusets may add additional files in each
cgroup dir.

New cgroups are created using the mkdir system call or shell
command. The properties of a cgroup, such as its flags, are
modified by writing to the appropriate file in that cgroup's
directory, as listed above.

The named hierarchical structure of nested cgroups allows partitioning
a large system into nested, dynamically changeable, "soft-partitions".

The attachment of each task, automatically inherited at fork by any
children of that task, to a cgroup allows organizing the work load
on a system into related sets of tasks. A task may be re-attached to
any other cgroup, if allowed by the permissions on the necessary
cgroup file system directories.

When a task is moved from one cgroup to another, it gets a new
css_set pointer - if there's an already existing css_set with the
desired collection of cgroups then that group is reused, otherwise a new
css_set is allocated. The appropriate existing css_set is located by
looking into a hash table.

To allow access from a cgroup to the css_sets (and hence tasks)
that comprise it, a set of cg_cgroup_link objects form a lattice;
each cg_cgroup_link is linked into a list of cg_cgroup_links for
a single cgroup on its cgrp_link_list field, and a list of
cg_cgroup_links for a single css_set on its cg_link_list.

Thus the set of tasks in a cgroup can be listed by iterating over
each css_set that references the cgroup, and sub-iterating over
each css_set's task set.

The use of a Linux virtual file system (vfs) to represent the
cgroup hierarchy provides for a familiar permission and name space
for cgroups, with a minimum of additional kernel code.

1.4 What does notify_on_release do ?
------------------------------------

If the notify_on_release flag is enabled (1) in a cgroup, then
whenever the last task in the cgroup leaves (exits or attaches to
some other cgroup) and the last child cgroup of that cgroup
is removed, then the kernel runs the command specified by the contents
of the "release_agent" file in that hierarchy's root directory,
supplying the pathname (relative to the mount point of the cgroup
file system) of the abandoned cgroup. This enables automatic
removal of abandoned cgroups. The default value of
notify_on_release in the root cgroup at system boot is disabled
(0). The default value of other cgroups at creation is the current
value of their parent's notify_on_release setting. The default value of
a cgroup hierarchy's release_agent path is empty.
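What such a release agent typically does can be sketched as a tiny shell function. The mount-point parameter is an assumption for illustration only: a real agent receives just the relative cgroup path as its single argument, with the mount point known in advance.

```shell
# Sketch of a release agent: the kernel invokes the configured binary
# with the abandoned cgroup's path relative to the hierarchy's mount
# point; the agent removes the now-empty cgroup directory.
release_agent() {
    mountpoint=$1 relpath=$2
    rmdir "$mountpoint$relpath"    # remove the abandoned cgroup dir
}

# Exercised here against a scratch tree standing in for a hierarchy:
demo=$(mktemp -d)
mkdir "$demo/Charlie"
release_agent "$demo" /Charlie
```

A real agent would be the single path installed in the hierarchy's release_agent file and would typically be a script wrapping exactly such an rmdir.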

1.5 What does clone_children do ?
---------------------------------

This flag only affects the cpuset controller. If the clone_children
flag is enabled (1) in a cgroup, a new cpuset cgroup will copy its
configuration from the parent during initialization.

1.6 How do I use cgroups ?
--------------------------

To start a new job that is to be contained within a cgroup, using
the "cpuset" cgroup subsystem, the steps are something like::

 1) mount -t tmpfs cgroup_root /sys/fs/cgroup
 2) mkdir /sys/fs/cgroup/cpuset
 3) mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
 4) Create the new cgroup by doing mkdir's and write's (or echo's) in
    the /sys/fs/cgroup/cpuset virtual file system.
 5) Start a task that will be the "founding father" of the new job.
 6) Attach that task to the new cgroup by writing its PID to the
    /sys/fs/cgroup/cpuset tasks file for that cgroup.
 7) fork, exec or clone the job tasks from this founding father task.

For example, the following sequence of commands will set up a cgroup
named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
and then start a subshell 'sh' in that cgroup::

  mount -t tmpfs cgroup_root /sys/fs/cgroup
  mkdir /sys/fs/cgroup/cpuset
  mount -t cgroup cpuset -ocpuset /sys/fs/cgroup/cpuset
  cd /sys/fs/cgroup/cpuset
  mkdir Charlie
  cd Charlie
  /bin/echo 2-3 > cpuset.cpus
  /bin/echo 1 > cpuset.mems
  /bin/echo $$ > tasks
  sh
  # The subshell 'sh' is now running in cgroup Charlie
  # The next line should display '/Charlie'
  cat /proc/self/cgroup

2. Usage Examples and Syntax
============================

2.1 Basic Usage
---------------

Creating, modifying, and using cgroups can be done through the cgroup
virtual filesystem.

To mount a cgroup hierarchy with all available subsystems, type::

  # mount -t cgroup xxx /sys/fs/cgroup

The "xxx" is not interpreted by the cgroup code, but will appear in
/proc/mounts so may be any useful identifying string that you like.
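Since both the identifying string and the subsystem options end up in /proc/mounts, mounted hierarchies can be listed by filtering on the filesystem-type field. A sketch against a sample /proc/mounts entry (the device name hier1 and mount point are assumed for illustration; on a live system the same awk filter would read /proc/mounts directly):

```shell
# /proc/mounts fields: device mountpoint fstype options dump pass.
sample='hier1 /sys/fs/cgroup/rg1 cgroup rw,cpuset,memory 0 0'

# Keep only cgroup mounts; show where each hierarchy is mounted and
# which subsystems it carries.
printf '%s\n' "$sample" | awk '$3 == "cgroup" {print $2 ": " $4}'
```

On the sample entry this prints `/sys/fs/cgroup/rg1: rw,cpuset,memory`.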

Note: Some subsystems do not work without some user input first. For instance,
if cpusets are enabled the user will have to populate the cpus and mems files
for each new cgroup created before that group can be used.

As explained in section `1.2 Why are cgroups needed?` you should create
different hierarchies of cgroups for each single resource or group of
resources you want to control. Therefore, you should mount a tmpfs on
/sys/fs/cgroup and create directories for each cgroup resource or resource
group::

  # mount -t tmpfs cgroup_root /sys/fs/cgroup
  # mkdir /sys/fs/cgroup/rg1

To mount a cgroup hierarchy with just the cpuset and memory
subsystems, type::

  # mount -t cgroup -o cpuset,memory hier1 /sys/fs/cgroup/rg1

While remounting cgroups is currently supported, it is not recommended
to use it. Remounting allows changing bound subsystems and
release_agent. Rebinding is hardly useful as it only works when the
hierarchy is empty, and release_agent itself should be replaced with
conventional fsnotify. The support for remounting will be removed in
the future.

To specify a hierarchy's release_agent::

  # mount -t cgroup -o cpuset,release_agent="/sbin/cpuset_release_agent" \
    xxx /sys/fs/cgroup/rg1

Note that specifying 'release_agent' more than once will return failure.

Note that changing the set of subsystems is currently only supported
when the hierarchy consists of a single (root) cgroup. Supporting
the ability to arbitrarily bind/unbind subsystems from an existing
cgroup hierarchy is intended to be implemented in the future.

Then under /sys/fs/cgroup/rg1 you can find a tree that corresponds to the
tree of the cgroups in the system. For instance, /sys/fs/cgroup/rg1
is the cgroup that holds the whole system.
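Because every cgroup is just a directory, the tree of cgroups under a mount point can be enumerated with find. Demonstrated here on a scratch directory laid out like a small hierarchy (the cgroup names Charlie, sub1, and Daemons are invented for the example):

```shell
# Lay out a scratch tree standing in for a mounted hierarchy; on a
# real system "root" would be the mount point, e.g. /sys/fs/cgroup/rg1.
root=$(mktemp -d)
mkdir -p "$root/Charlie/sub1" "$root/Daemons"

# Each directory under the mount point is one cgroup.
find "$root" -type d | sed "s|^$root|<root>|" | sort
```

This prints the four cgroups `<root>`, `<root>/Charlie`, `<root>/Charlie/sub1`, and `<root>/Daemons`.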

If you want to change the value of release_agent::

  # echo "/sbin/new_release_agent" > /sys/fs/cgroup/rg1/release_agent

It can also be changed via remount.

If you want to create a new cgroup under /sys/fs/cgroup/rg1::

  # cd /sys/fs/cgroup/rg1
  # mkdir my_cgroup

Now you want to do something with this cgroup::

  # cd my_cgroup

In this directory you can find several files::

  # ls
  cgroup.procs notify_on_release tasks
  (plus whatever files are added by the attached subsystems)

Now attach your shell to this cgroup::

  # /bin/echo $$ > tasks

You can also create cgroups inside your cgroup by using mkdir in this
directory::

  # mkdir my_sub_cs

To remove a cgroup, just use rmdir::

  # rmdir my_sub_cs

This will fail if the cgroup is in use (has cgroups inside, or
has processes attached, or is held alive by another subsystem-specific
reference).

2.2 Attaching processes
-----------------------

::

  # /bin/echo PID > tasks

Note that it is PID, not PIDs. You can only attach ONE task at a time.
If you have several tasks to attach, you have to do it one after another::

  # /bin/echo PID1 > tasks
  # /bin/echo PID2 > tasks
  ...
  # /bin/echo PIDn > tasks

You can attach the current shell task by echoing 0::

  # echo 0 > tasks

You can use the cgroup.procs file instead of the tasks file to move all
threads in a threadgroup at once. Echoing the PID of any task in a
threadgroup to cgroup.procs causes all tasks in that threadgroup to be
attached to the cgroup. Writing 0 to cgroup.procs moves all tasks
in the writing task's threadgroup.

Note: Since every task is always a member of exactly one cgroup in each
mounted hierarchy, to remove a task from its current cgroup you must
move it into a new cgroup (possibly the root cgroup) by writing to the
new cgroup's tasks file.

Note: Due to some restrictions enforced by some cgroup subsystems, moving
a process to another cgroup can fail.
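The one-PID-per-write rule can be wrapped in a small helper loop. The tasks-file path is a parameter here because a real tasks file only exists under a mounted hierarchy; the loop below is exercised against an ordinary file, which (unlike a real tasks file) is overwritten by each write rather than accumulating members.

```shell
# attach_all TASKS_FILE PID... : one write() per PID, as the tasks
# file requires, reporting each failed attach individually.
attach_all() {
    tasks=$1; shift
    for pid in "$@"; do
        /bin/echo "$pid" > "$tasks" || echo "failed to attach $pid" >&2
    done
}

# Exercised against a plain scratch file standing in for a tasks file:
scratch=$(mktemp -d)
: > "$scratch/tasks"
attach_all "$scratch/tasks" 101 102 103
```

Against a live hierarchy the first argument would be something like /sys/fs/cgroup/cpuset/Charlie/tasks, and each write would move one task into the cgroup.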
|
||||||
|
|
||||||
|
2.3 Mounting hierarchies by name
|
||||||
|
--------------------------------
|
||||||
|
|
||||||
|
Passing the name=<x> option when mounting a cgroups hierarchy
|
||||||
|
associates the given name with the hierarchy. This can be used when
|
||||||
|
mounting a pre-existing hierarchy, in order to refer to it by name
|
||||||
|
rather than by its set of active subsystems. Each hierarchy is either
|
||||||
|
nameless, or has a unique name.
|
||||||
|
|
||||||
|
The name should match [\w.-]+
|
||||||
|
|
||||||
|
When passing a name=<x> option for a new hierarchy, you need to
|
||||||
|
specify subsystems manually; the legacy behaviour of mounting all
|
||||||
|
subsystems when none are explicitly specified is not supported when
|
||||||
|
you give a subsystem a name.
|
||||||
|
|
||||||
|
The name of the subsystem appears as part of the hierarchy description
|
||||||
|
in /proc/mounts and /proc/<pid>/cgroups.
|
||||||
|
|
||||||
|
|
||||||
|
3. Kernel API
|
||||||
|
=============
|
||||||
|
|
||||||
|
3.1 Overview
|
||||||
|
------------
|
||||||
|
|
||||||
|
Each kernel subsystem that wants to hook into the generic cgroup
|
||||||
|
system needs to create a cgroup_subsys object. This contains
|
||||||
|
various methods, which are callbacks from the cgroup system, along
|
||||||
|
with a subsystem ID which will be assigned by the cgroup system.
|
||||||
|
|
||||||
|
Other fields in the cgroup_subsys object include:
|
||||||
|
|
||||||
|
- subsys_id: a unique array index for the subsystem, indicating which
|
||||||
|
entry in cgroup->subsys[] this subsystem should be managing.
|
||||||
|
|
||||||
|
- name: should be initialized to a unique subsystem name. Should be
|
||||||
|
no longer than MAX_CGROUP_TYPE_NAMELEN.
|
||||||
|
|
||||||
|
- early_init: indicate if the subsystem needs early initialization
|
||||||
|
at system boot.
|
||||||
|
|
||||||
|
Each cgroup object created by the system has an array of pointers,
|
||||||
|
indexed by subsystem ID; this pointer is entirely managed by the
|
||||||
|
subsystem; the generic cgroup code will never touch this pointer.
|
||||||
|
|
||||||
|
3.2 Synchronization
|
||||||
|
-------------------
|
||||||
|
|
||||||
|
There is a global mutex, cgroup_mutex, used by the cgroup
|
||||||
|
system. This should be taken by anything that wants to modify a
|
||||||
|
cgroup. It may also be taken to prevent cgroups from being
|
||||||
|
modified, but more specific locks may be more appropriate in that
|
||||||
|
situation.
|
||||||
|
|
||||||
|
See kernel/cgroup.c for more details.
|
||||||
|
|
||||||
|
Subsystems can take/release the cgroup_mutex via the functions
|
||||||
|
cgroup_lock()/cgroup_unlock().
|
||||||
|
|
||||||
|
Accessing a task's cgroup pointer may be done in the following ways:
|
||||||
|
- while holding cgroup_mutex
|
||||||
|
- while holding the task's alloc_lock (via task_lock())
|
||||||
|
- inside an rcu_read_lock() section via rcu_dereference()
|
||||||
|
|
||||||
|
3.3 Subsystem API
|
||||||
|
-----------------
|
||||||
|
|
||||||
|
Each subsystem should:
|
||||||
|
|
||||||
|
- add an entry in linux/cgroup_subsys.h
|
||||||
|
- define a cgroup_subsys object called <name>_cgrp_subsys
|
||||||
|
|
||||||
|
Each subsystem may export the following methods. The only mandatory
|
||||||
|
methods are css_alloc/free. Any others that are null are presumed to
|
||||||
|
be successful no-ops.
|
||||||
|
|
||||||
|
``struct cgroup_subsys_state *css_alloc(struct cgroup *cgrp)``
|
||||||
|
(cgroup_mutex held by caller)
|
||||||
|
|
||||||
|
Called to allocate a subsystem state object for a cgroup. The
|
||||||
|
subsystem should allocate its subsystem state object for the passed
|
||||||
|
cgroup, returning a pointer to the new object on success or a
|
||||||
|
ERR_PTR() value. On success, the subsystem pointer should point to
|
||||||
|
a structure of type cgroup_subsys_state (typically embedded in a
|
||||||
|
larger subsystem-specific object), which will be initialized by the
|
||||||
|
cgroup system. Note that this will be called at initialization to
|
||||||
|
create the root subsystem state for this subsystem; this case can be
|
||||||
|
identified by the passed cgroup object having a NULL parent (since
|
||||||
|
it's the root of the hierarchy) and may be an appropriate place for
|
||||||
|
initialization code.
|
||||||
|
|
||||||
|
``int css_online(struct cgroup *cgrp)``
|
||||||
|
(cgroup_mutex held by caller)
|
||||||
|
|
||||||
|
Called after @cgrp successfully completed all allocations and made
|
||||||
|
visible to cgroup_for_each_child/descendant_*() iterators. The
|
||||||
|
subsystem may choose to fail creation by returning -errno. This
|
||||||
|
callback can be used to implement reliable state sharing and
|
||||||
|
propagation along the hierarchy. See the comment on
|
||||||
|
cgroup_for_each_descendant_pre() for details.
|
||||||
|
|
||||||
|
``void css_offline(struct cgroup *cgrp);``
|
||||||
|
(cgroup_mutex held by caller)
|
||||||
|
|
||||||
|
This is the counterpart of css_online() and called iff css_online()
|
||||||
|
has succeeded on @cgrp. This signifies the beginning of the end of
|
||||||
|
@cgrp. @cgrp is being removed and the subsystem should start dropping
|
||||||
|
all references it's holding on @cgrp. When all references are dropped,
|
||||||
|
cgroup removal will proceed to the next step - css_free(). After this
|
||||||
|
callback, @cgrp should be considered dead to the subsystem.
|
||||||
|
|
||||||
|
``void css_free(struct cgroup *cgrp)``
|
||||||
|
(cgroup_mutex held by caller)
|
||||||
|
|
||||||
|
The cgroup system is about to free @cgrp; the subsystem should free
|
||||||
|
its subsystem state object. By the time this method is called, @cgrp
|
||||||
|
is completely unused; @cgrp->parent is still valid. (Note - can also
|
||||||
|
be called for a newly-created cgroup if an error occurs after this
|
||||||
|
subsystem's create() method has been called for the new cgroup).
|
||||||
|
|
||||||
|
``int can_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)``
|
||||||
|
(cgroup_mutex held by caller)
|
||||||
|
|
||||||
|
Called prior to moving one or more tasks into a cgroup; if the
|
||||||
|
subsystem returns an error, this will abort the attach operation.
|
||||||
|
@tset contains the tasks to be attached and is guaranteed to have at
|
||||||
|
least one task in it.
|
||||||
|
|
||||||
|
If there are multiple tasks in the taskset, then:
|
||||||
|
- it's guaranteed that all are from the same thread group
|
||||||
|
- @tset contains all tasks from the thread group whether or not
|
||||||
|
they're switching cgroups
|
||||||
|
- the first task is the leader
|
||||||
|
|
||||||
|
Each @tset entry also contains the task's old cgroup and tasks which
|
||||||
|
aren't switching cgroup can be skipped easily using the
|
||||||
|
cgroup_taskset_for_each() iterator. Note that this isn't called on a
|
||||||
|
fork. If this method returns 0 (success) then this should remain valid
|
||||||
|
while the caller holds cgroup_mutex and it is ensured that either
|
||||||
|
attach() or cancel_attach() will be called in future.
|
||||||
|
|
||||||
|
``void css_reset(struct cgroup_subsys_state *css)``
|
||||||
|
(cgroup_mutex held by caller)
|
||||||
|
|
||||||
|
An optional operation which should restore @css's configuration to the
|
||||||
|
initial state. This is currently only used on the unified hierarchy
|
||||||
|
when a subsystem is disabled on a cgroup through
|
||||||
|
"cgroup.subtree_control" but should remain enabled because other
|
||||||
|
subsystems depend on it. cgroup core makes such a css invisible by
|
||||||
|
removing the associated interface files and invokes this callback so
|
||||||
|
that the hidden subsystem can return to the initial neutral state.
|
||||||
|
This prevents unexpected resource control from a hidden css and
|
||||||
|
ensures that the configuration is in the initial state when it is made
|
||||||
|
visible again later.
|
||||||
|
|
||||||
|
``void cancel_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)``
|
||||||
|
(cgroup_mutex held by caller)
|
||||||
|
|
||||||
|
Called when a task attach operation has failed after can_attach() has succeeded.
|
||||||
|
A subsystem whose can_attach() has some side-effects should provide this
|
||||||
|
function, so that the subsystem can implement a rollback. If not, not necessary.
|
||||||
|
This will be called only about subsystems whose can_attach() operation have
|
||||||
|
succeeded. The parameters are identical to can_attach().
|
||||||
|
|
||||||
|
``void attach(struct cgroup *cgrp, struct cgroup_taskset *tset)``
|
||||||
|
(cgroup_mutex held by caller)

Called after the task has been attached to the cgroup, to allow any
post-attachment activity that requires memory allocations or blocking.
The parameters are identical to can_attach().

``void fork(struct task_struct *task)``

Called when a task is forked into a cgroup.

``void exit(struct task_struct *task)``

Called during task exit.

``void free(struct task_struct *task)``

Called when the task_struct is freed.

``void bind(struct cgroup *root)``

(cgroup_mutex held by caller)

Called when a cgroup subsystem is rebound to a different hierarchy
and root cgroup. Currently this will only involve movement between
the default hierarchy (which never has sub-cgroups) and a hierarchy
that is being created/destroyed (and hence has no sub-cgroups).

4. Extended attribute usage
===========================

The cgroup filesystem supports certain types of extended attributes in its
directories and files. The currently supported types are:

	- Trusted (XATTR_TRUSTED)
	- Security (XATTR_SECURITY)

Both require the CAP_SYS_ADMIN capability to set.

As in tmpfs, the extended attributes in the cgroup filesystem are stored
using kernel memory, and it is advised to keep their usage to a minimum.
This is why user-defined extended attributes are not supported: any user
could set them, and there is no limit on the value size.

The currently known users of this feature are SELinux, to limit cgroup
usage in containers, and systemd, for assorted metadata such as the main
PID in a cgroup (systemd creates a cgroup per service).

5. Questions
============

::

  Q: what's up with this '/bin/echo' ?
  A: bash's builtin 'echo' command does not check calls to write() against
     errors. If you use it in the cgroup file system, you won't be
     able to tell whether a command succeeded or failed.

  Q: When I attach processes, only the first of the line gets really attached !
  A: We can only return one error code per call to write(). So you should also
     put only ONE PID.
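The last answer can be followed mechanically. As a sketch (the helper name and the cgroup path are illustrative, not part of the kernel interface), issue one write(2) per PID, using /bin/echo so each attachment reports its own error:

```shell
# attach_tasks DIR PID...: attach each PID to the cgroup at DIR,
# one write() per PID so every failure is reported individually.
attach_tasks() {
    dir=$1
    shift
    for pid in "$@"; do
        /bin/echo "$pid" > "$dir/tasks" || echo "failed to attach $pid" >&2
    done
}
```

Typical use would be something like ``attach_tasks /sys/fs/cgroup/cpuset/myset 1234 1235`` (a hypothetical mount point).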
866
Documentation/admin-guide/cgroup-v1/cpusets.rst
Normal file
@@ -0,0 +1,866 @@
=======
CPUSETS
=======

Copyright (C) 2004 BULL SA.

Written by Simon.Derr@bull.net

- Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
- Modified by Paul Jackson <pj@sgi.com>
- Modified by Christoph Lameter <cl@linux.com>
- Modified by Paul Menage <menage@google.com>
- Modified by Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

.. CONTENTS:

   1. Cpusets
     1.1 What are cpusets ?
     1.2 Why are cpusets needed ?
     1.3 How are cpusets implemented ?
     1.4 What are exclusive cpusets ?
     1.5 What is memory_pressure ?
     1.6 What is memory spread ?
     1.7 What is sched_load_balance ?
     1.8 What is sched_relax_domain_level ?
     1.9 How do I use cpusets ?
   2. Usage Examples and Syntax
     2.1 Basic Usage
     2.2 Adding/removing cpus
     2.3 Setting flags
     2.4 Attaching processes
   3. Questions
   4. Contact
1. Cpusets
==========

1.1 What are cpusets ?
----------------------

Cpusets provide a mechanism for assigning a set of CPUs and Memory
Nodes to a set of tasks.  In this document "Memory Node" refers to
an on-line node that contains memory.

Cpusets constrain the CPU and Memory placement of tasks to only
the resources within a task's current cpuset.  They form a nested
hierarchy visible in a virtual file system.  These are the essential
hooks, beyond what is already present, required to manage dynamic
job placement on large systems.

Cpusets use the generic cgroup subsystem described in
Documentation/admin-guide/cgroup-v1/cgroups.rst.

Requests by a task, using the sched_setaffinity(2) system call to
include CPUs in its CPU affinity mask, and using the mbind(2) and
set_mempolicy(2) system calls to include Memory Nodes in its memory
policy, are both filtered through that task's cpuset, filtering out any
CPUs or Memory Nodes not in that cpuset.  The scheduler will not
schedule a task on a CPU that is not allowed in its cpus_allowed
vector, and the kernel page allocator will not allocate a page on a
node that is not allowed in the requesting task's mems_allowed vector.

User level code may create and destroy cpusets by name in the cgroup
virtual file system, manage the attributes and permissions of these
cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
specify and query to which cpuset a task is assigned, and list the
task pids assigned to a cpuset.
1.2 Why are cpusets needed ?
----------------------------

The management of large computer systems, with many processors (CPUs),
complex memory cache hierarchies and multiple Memory Nodes having
non-uniform access times (NUMA) presents additional challenges for
the efficient scheduling and memory placement of processes.

Frequently, more modest sized systems can be operated with adequate
efficiency just by letting the operating system automatically share
the available CPU and Memory resources amongst the requesting tasks.

But larger systems, which benefit more from careful processor and
memory placement to reduce memory access times and contention,
and which typically represent a larger investment for the customer,
can benefit from explicitly placing jobs on properly sized subsets of
the system.

This can be especially valuable on:

    * Web Servers running multiple instances of the same web application,
    * Servers running different applications (for instance, a web server
      and a database), or
    * NUMA systems running large HPC applications with demanding
      performance characteristics.

These subsets, or "soft partitions", must be able to be dynamically
adjusted, as the job mix changes, without impacting other concurrently
executing jobs.  The location of the running jobs' pages may also be
moved when the memory locations are changed.

The kernel cpuset patch provides the minimum essential kernel
mechanisms required to efficiently implement such subsets.  It
leverages existing CPU and Memory Placement facilities in the Linux
kernel to avoid any additional impact on the critical scheduler or
memory allocator code.
1.3 How are cpusets implemented ?
---------------------------------

Cpusets provide a Linux kernel mechanism to constrain which CPUs and
Memory Nodes are used by a process or set of processes.

The Linux kernel already has a pair of mechanisms to specify on which
CPUs a task may be scheduled (sched_setaffinity) and on which Memory
Nodes it may obtain memory (mbind, set_mempolicy).

Cpusets extend these two mechanisms as follows:

 - Cpusets are sets of allowed CPUs and Memory Nodes, known to the
   kernel.
 - Each task in the system is attached to a cpuset, via a pointer
   in the task structure to a reference counted cgroup structure.
 - Calls to sched_setaffinity are filtered to just those CPUs
   allowed in that task's cpuset.
 - Calls to mbind and set_mempolicy are filtered to just
   those Memory Nodes allowed in that task's cpuset.
 - The root cpuset contains all the system's CPUs and Memory
   Nodes.
 - For any cpuset, one can define child cpusets containing a subset
   of the parent's CPU and Memory Node resources.
 - The hierarchy of cpusets can be mounted at /dev/cpuset, for
   browsing and manipulation from user space.
 - A cpuset may be marked exclusive, which ensures that no other
   cpuset (except direct ancestors and descendants) may contain
   any overlapping CPUs or Memory Nodes.
 - You can list all the tasks (by pid) attached to any cpuset.

The implementation of cpusets requires a few, simple hooks
into the rest of the kernel, none in performance critical paths:

 - in init/main.c, to initialize the root cpuset at system boot.
 - in fork and exit, to attach and detach a task from its cpuset.
 - in sched_setaffinity, to mask the requested CPUs by what's
   allowed in that task's cpuset.
 - in sched.c migrate_live_tasks(), to keep migrating tasks within
   the CPUs allowed by their cpuset, if possible.
 - in the mbind and set_mempolicy system calls, to mask the requested
   Memory Nodes by what's allowed in that task's cpuset.
 - in page_alloc.c, to restrict memory to allowed nodes.
 - in vmscan.c, to restrict page recovery to the current cpuset.

You should mount the "cgroup" filesystem type in order to enable
browsing and modifying the cpusets presently known to the kernel.  No
new system calls are added for cpusets - all support for querying and
modifying cpusets is via this cpuset file system.
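A minimal administrative sketch of that file-system interface follows (the mount point /dev/cpuset is conventional rather than mandated, the cpuset name and resource lists are placeholders, and root privileges are required):

```
# Mount the cpuset hierarchy and create one child cpuset.
mkdir -p /dev/cpuset
mount -t cgroup -o cpuset cpuset /dev/cpuset

mkdir /dev/cpuset/my_cpuset
/bin/echo 2-3 > /dev/cpuset/my_cpuset/cpuset.cpus   # CPUs 2 and 3
/bin/echo 0   > /dev/cpuset/my_cpuset/cpuset.mems   # Memory Node 0
```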
The /proc/<pid>/status file for each task has four added lines,
displaying the task's cpus_allowed (on which CPUs it may be scheduled)
and mems_allowed (on which Memory Nodes it may obtain memory),
in the two formats seen in the following example::

  Cpus_allowed:   ffffffff,ffffffff,ffffffff,ffffffff
  Cpus_allowed_list:      0-127
  Mems_allowed:   ffffffff,ffffffff
  Mems_allowed_list:      0-63
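The hex masks above can be decoded without any kernel help.  As a sketch (the helper name is made up for illustration), the following counts the CPUs set in a Cpus_allowed-style mask of comma-separated hex words:

```shell
# count_cpus_in_mask MASK: print how many bits are set in a
# Cpus_allowed/Mems_allowed hex mask such as "ffffffff,000000ff".
count_cpus_in_mask() {
    bits=0
    for d in $(printf '%s' "$1" | tr -d ',' | tr 'A-F' 'a-f' | fold -w1); do
        case $d in                      # popcount of one hex digit
            0) n=0 ;;
            1|2|4|8) n=1 ;;
            3|5|6|9|a|c) n=2 ;;
            7|b|d|e) n=3 ;;
            f) n=4 ;;
        esac
        bits=$((bits + n))
    done
    echo "$bits"
}
```

For the example above, ``count_cpus_in_mask ffffffff,ffffffff,ffffffff,ffffffff`` agrees with the ``Cpus_allowed_list: 0-127`` line.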
Each cpuset is represented by a directory in the cgroup file system
containing (on top of the standard cgroup files) the following
files describing that cpuset:

 - cpuset.cpus: list of CPUs in that cpuset
 - cpuset.mems: list of Memory Nodes in that cpuset
 - cpuset.memory_migrate flag: if set, move pages to cpuset's nodes
 - cpuset.cpu_exclusive flag: is cpu placement exclusive?
 - cpuset.mem_exclusive flag: is memory placement exclusive?
 - cpuset.mem_hardwall flag: is memory allocation hardwalled
 - cpuset.memory_pressure: measure of how much paging pressure in cpuset
 - cpuset.memory_spread_page flag: if set, spread page cache evenly on allowed nodes
 - cpuset.memory_spread_slab flag: if set, spread slab cache evenly on allowed nodes
 - cpuset.sched_load_balance flag: if set, load balance within CPUs on that cpuset
 - cpuset.sched_relax_domain_level: the searching range when migrating tasks

In addition, only the root cpuset has the following file:

 - cpuset.memory_pressure_enabled flag: compute memory_pressure?

New cpusets are created using the mkdir system call or shell
command.  The properties of a cpuset, such as its flags, allowed
CPUs and Memory Nodes, and attached tasks, are modified by writing
to the appropriate file in that cpuset's directory, as listed above.

The named hierarchical structure of nested cpusets allows partitioning
a large system into nested, dynamically changeable, "soft-partitions".

The attachment of each task, automatically inherited at fork by any
children of that task, to a cpuset allows organizing the work load
on a system into related sets of tasks such that each set is constrained
to using the CPUs and Memory Nodes of a particular cpuset.  A task
may be re-attached to any other cpuset, if allowed by the permissions
on the necessary cpuset file system directories.

Such management of a system "in the large" integrates smoothly with
the detailed placement done on individual tasks and memory regions
using the sched_setaffinity, mbind and set_mempolicy system calls.

The following rules apply to each cpuset:

 - Its CPUs and Memory Nodes must be a subset of its parent's.
 - It can't be marked exclusive unless its parent is.
 - If its cpu or memory is exclusive, they may not overlap any sibling.

These rules, and the natural hierarchy of cpusets, enable efficient
enforcement of the exclusive guarantee, without having to scan all
cpusets every time any of them change to ensure nothing overlaps an
exclusive cpuset.  Also, the use of a Linux virtual file system (vfs)
to represent the cpuset hierarchy provides for a familiar permission
and name space for cpusets, with a minimum of additional kernel code.

The cpus and mems files in the root (top_cpuset) cpuset are
read-only.  The cpus file automatically tracks the value of
cpu_online_mask using a CPU hotplug notifier, and the mems file
automatically tracks the value of node_states[N_MEMORY]--i.e.,
nodes with memory--using the cpuset_track_online_nodes() hook.
1.4 What are exclusive cpusets ?
--------------------------------

If a cpuset is cpu or mem exclusive, no other cpuset, other than
a direct ancestor or descendant, may share any of the same CPUs or
Memory Nodes.

A cpuset that is cpuset.mem_exclusive *or* cpuset.mem_hardwall is "hardwalled",
i.e. it restricts kernel allocations for page, buffer and other data
commonly shared by the kernel across multiple users.  All cpusets,
whether hardwalled or not, restrict allocations of memory for user
space.  This enables configuring a system so that several independent
jobs can share common kernel data, such as file system pages, while
isolating each job's user allocation in its own cpuset.  To do this,
construct a large mem_exclusive cpuset to hold all the jobs, and
construct child, non-mem_exclusive cpusets for each individual job.
Only a small amount of typical kernel memory, such as requests from
interrupt handlers, is allowed to be taken outside even a
mem_exclusive cpuset.
1.5 What is memory_pressure ?
-----------------------------

The memory_pressure of a cpuset provides a simple per-cpuset metric
of the rate that the tasks in a cpuset are attempting to free up in
use memory on the nodes of the cpuset to satisfy additional memory
requests.

This enables batch managers monitoring jobs running in dedicated
cpusets to efficiently detect what level of memory pressure that job
is causing.

This is useful both on tightly managed systems running a wide mix of
submitted jobs, which may choose to terminate or re-prioritize jobs that
are trying to use more memory than allowed on the nodes assigned to them,
and with tightly coupled, long running, massively parallel scientific
computing jobs that will dramatically fail to meet required performance
goals if they start to use more memory than allowed to them.

This mechanism provides a very economical way for the batch manager
to monitor a cpuset for signs of memory pressure.  It's up to the
batch manager or other user code to decide what to do about it and
take action.

==>
    Unless this feature is enabled by writing "1" to the special file
    /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
    code of __alloc_pages() for this metric reduces to simply noticing
    that the cpuset_memory_pressure_enabled flag is zero.  So only
    systems that enable this feature will compute the metric.

Why a per-cpuset, running average:

    Because this meter is per-cpuset, rather than per-task or mm,
    the system load imposed by a batch scheduler monitoring this
    metric is sharply reduced on large systems, because a scan of
    the tasklist can be avoided on each set of queries.

    Because this meter is a running average, instead of an accumulating
    counter, a batch scheduler can detect memory pressure with a
    single read, instead of having to read and accumulate results
    for a period of time.

    Because this meter is per-cpuset rather than per-task or mm,
    the batch scheduler can obtain the key information, memory
    pressure in a cpuset, with a single read, rather than having to
    query and accumulate results over all the (dynamically changing)
    set of tasks in the cpuset.

A per-cpuset simple digital filter (requires a spinlock and 3 words
of data per-cpuset) is kept, and updated by any task attached to that
cpuset, if it enters the synchronous (direct) page reclaim code.

A per-cpuset file provides an integer number representing the recent
(half-life of 10 seconds) rate of direct page reclaims caused by
the tasks in the cpuset, in units of reclaims attempted per second,
times 1000.
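The scaling above can be reversed when reading the file.  As a sketch (the helper name is illustrative), a monitoring script can reduce a raw cpuset.memory_pressure reading to whole reclaims per second:

```shell
# pressure_per_sec VALUE: cpuset.memory_pressure reports attempted
# direct reclaims per second, times 1000; print the integer part
# of the rate in reclaims per second.
pressure_per_sec() {
    echo $(( $1 / 1000 ))
}
```

For example, a reading of 2500 corresponds to roughly 2.5 attempted reclaims per second.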
1.6 What is memory spread ?
---------------------------

There are two boolean flag files per cpuset that control where the
kernel allocates pages for the file system buffers and related in
kernel data structures.  They are called 'cpuset.memory_spread_page' and
'cpuset.memory_spread_slab'.

If the per-cpuset boolean flag file 'cpuset.memory_spread_page' is set, then
the kernel will spread the file system buffers (page cache) evenly
over all the nodes that the faulting task is allowed to use, instead
of preferring to put those pages on the node where the task is running.

If the per-cpuset boolean flag file 'cpuset.memory_spread_slab' is set,
then the kernel will spread some file system related slab caches,
such as for inodes and dentries, evenly over all the nodes that the
faulting task is allowed to use, instead of preferring to put those
pages on the node where the task is running.

The setting of these flags does not affect anonymous data segment or
stack segment pages of a task.

By default, both kinds of memory spreading are off, and memory
pages are allocated on the node local to where the task is running,
except perhaps as modified by the task's NUMA mempolicy or cpuset
configuration, so long as sufficient free memory pages are available.

When new cpusets are created, they inherit the memory spread settings
of their parent.

Setting memory spreading causes allocations for the affected page
or slab caches to ignore the task's NUMA mempolicy and be spread
instead.  Tasks using mbind() or set_mempolicy() calls to set NUMA
mempolicies will not notice any change in these calls as a result of
their containing task's memory spread settings.  If memory spreading
is turned off, then the currently specified NUMA mempolicy once again
applies to memory page allocations.

Both 'cpuset.memory_spread_page' and 'cpuset.memory_spread_slab' are boolean flag
files.  By default they contain "0", meaning that the feature is off
for that cpuset.  If a "1" is written to that file, then that turns
the named feature on.
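Turning both flags on or off together is a common administrative step.  A small sketch (the helper name is made up; the directory argument stands in for a real cpuset directory such as /dev/cpuset/my_cpuset):

```shell
# set_spread DIR 0|1: write the given value into both memory
# spread flag files of the cpuset directory DIR.
set_spread() {
    /bin/echo "$2" > "$1/cpuset.memory_spread_page"
    /bin/echo "$2" > "$1/cpuset.memory_spread_slab"
}
```

For example, ``set_spread /dev/cpuset/my_cpuset 1`` would enable both kinds of spreading for that cpuset.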
The implementation is simple.

Setting the flag 'cpuset.memory_spread_page' turns on a per-process flag
PFA_SPREAD_PAGE for each task that is in that cpuset or subsequently
joins that cpuset.  The page allocation calls for the page cache
are modified to perform an inline check for this PFA_SPREAD_PAGE task
flag, and if set, a call to a new routine cpuset_mem_spread_node()
returns the node to prefer for the allocation.

Similarly, setting 'cpuset.memory_spread_slab' turns on the flag
PFA_SPREAD_SLAB, and appropriately marked slab caches will allocate
pages from the node returned by cpuset_mem_spread_node().

The cpuset_mem_spread_node() routine is also simple.  It uses the
value of a per-task rotor cpuset_mem_spread_rotor to select the next
node in the current task's mems_allowed to prefer for the allocation.

This memory placement policy is also known (in other contexts) as
round-robin or interleave.

This policy can provide substantial improvements for jobs that need
to place thread local data on the corresponding node, but that need
to access large file system data sets that need to be spread across
the several nodes in the job's cpuset in order to fit.  Without this
policy, especially for jobs that might have one thread reading in the
data set, the memory allocation across the nodes in the job's cpuset
can become very uneven.
1.7 What is sched_load_balance ?
--------------------------------

The kernel scheduler (kernel/sched/core.c) automatically load balances
tasks.  If one CPU is underutilized, kernel code running on that
CPU will look for tasks on other more overloaded CPUs and move those
tasks to itself, within the constraints of such placement mechanisms
as cpusets and sched_setaffinity.

The algorithmic cost of load balancing and its impact on key shared
kernel data structures such as the task list increases more than
linearly with the number of CPUs being balanced.  So the scheduler
has support to partition the system's CPUs into a number of sched
domains such that it only load balances within each sched domain.
Each sched domain covers some subset of the CPUs in the system;
no two sched domains overlap; some CPUs might not be in any sched
domain and hence won't be load balanced.

Put simply, it costs less to balance between two smaller sched domains
than one big one, but doing so means that overloads in one of the
two domains won't be load balanced to the other one.

By default, there is one sched domain covering all CPUs, including those
marked isolated using the kernel boot time "isolcpus=" argument.  However,
the isolated CPUs will not participate in load balancing, and will not
have tasks running on them unless explicitly assigned.

This default load balancing across all CPUs is not well suited for
the following two situations:

 1) On large systems, load balancing across many CPUs is expensive.
    If the system is managed using cpusets to place independent jobs
    on separate sets of CPUs, full load balancing is unnecessary.
 2) Systems supporting realtime on some CPUs need to minimize
    system overhead on those CPUs, including avoiding task load
    balancing if that is not needed.

When the per-cpuset flag "cpuset.sched_load_balance" is enabled (the default
setting), it requests that all the CPUs in that cpuset's allowed 'cpuset.cpus'
be contained in a single sched domain, ensuring that load balancing
can move a task (not otherwise pinned, as by sched_setaffinity)
from any CPU in that cpuset to any other.

When the per-cpuset flag "cpuset.sched_load_balance" is disabled, then the
scheduler will avoid load balancing across the CPUs in that cpuset,
--except-- in so far as is necessary because some overlapping cpuset
has "sched_load_balance" enabled.

So, for example, if the top cpuset has the flag "cpuset.sched_load_balance"
enabled, then the scheduler will have one sched domain covering all
CPUs, and the setting of the "cpuset.sched_load_balance" flag in any other
cpusets won't matter, as we're already fully load balancing.

Therefore in the above two situations, the top cpuset flag
"cpuset.sched_load_balance" should be disabled, and only some of the smaller,
child cpusets should have this flag enabled.
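That arrangement can be sketched as follows (the mount point and the child cpuset names "batch" and "rt" are assumptions for illustration; a real system may lay the hierarchy out differently):

```
# Top cpuset: no load balancing across the whole machine.
/bin/echo 0 > /dev/cpuset/cpuset.sched_load_balance

# Each child job cpuset balances only within its own CPUs.
/bin/echo 1 > /dev/cpuset/batch/cpuset.sched_load_balance
/bin/echo 1 > /dev/cpuset/rt/cpuset.sched_load_balance
```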
When doing this, you don't usually want to leave any unpinned tasks in
the top cpuset that might use non-trivial amounts of CPU, as such tasks
may be artificially constrained to some subset of CPUs, depending on
the particulars of this flag setting in descendant cpusets.  Even if
such a task could use spare CPU cycles in some other CPUs, the kernel
scheduler might not consider the possibility of load balancing that
task to that underused CPU.

Of course, tasks pinned to a particular CPU can be left in a cpuset
that disables "cpuset.sched_load_balance" as those tasks aren't going anywhere
else anyway.

There is an impedance mismatch here, between cpusets and sched domains.
Cpusets are hierarchical and nest.  Sched domains are flat; they don't
overlap and each CPU is in at most one sched domain.

It is necessary for sched domains to be flat because load balancing
across partially overlapping sets of CPUs would risk unstable dynamics
that would be beyond our understanding.  So if each of two partially
overlapping cpusets enables the flag 'cpuset.sched_load_balance', then we
form a single sched domain that is a superset of both.  We won't move
a task to a CPU outside its cpuset, but the scheduler load balancing
code might waste some compute cycles considering that possibility.

This mismatch is why there is not a simple one-to-one relation
between which cpusets have the flag "cpuset.sched_load_balance" enabled,
and the sched domain configuration.  If a cpuset enables the flag, it
will get balancing across all its CPUs, but if it disables the flag,
it will only be assured of no load balancing if no other overlapping
cpuset enables the flag.

If two cpusets have partially overlapping 'cpuset.cpus' allowed, and only
one of them has this flag enabled, then the other may find its
tasks only partially load balanced, just on the overlapping CPUs.
This is just the general case of the top_cpuset example given a few
paragraphs above.  In the general case, as in the top cpuset case,
don't leave tasks that might use non-trivial amounts of CPU in
such partially load balanced cpusets, as they may be artificially
constrained to some subset of the CPUs allowed to them, for lack of
load balancing to the other CPUs.

CPUs in "cpuset.isolcpus" were excluded from load balancing by the
isolcpus= kernel boot option, and will never be load balanced regardless
of the value of "cpuset.sched_load_balance" in any cpuset.
1.7.1 sched_load_balance implementation details.
------------------------------------------------

The per-cpuset flag 'cpuset.sched_load_balance' defaults to enabled (contrary
to most cpuset flags.)  When enabled for a cpuset, the kernel will
ensure that it can load balance across all the CPUs in that cpuset
(makes sure that all the CPUs in the cpus_allowed of that cpuset are
in the same sched domain.)

If two overlapping cpusets both have 'cpuset.sched_load_balance' enabled,
then they will be (must be) both in the same sched domain.

If, as is the default, the top cpuset has 'cpuset.sched_load_balance' enabled,
then by the above that means there is a single sched domain covering
the whole system, regardless of any other cpuset settings.

The kernel commits to user space that it will avoid load balancing
where it can.  It will pick as fine a granularity partition of sched
domains as it can while still providing load balancing for any set
of CPUs allowed to a cpuset having 'cpuset.sched_load_balance' enabled.

The internal kernel cpuset to scheduler interface passes from the
cpuset code to the scheduler code a partition of the load balanced
CPUs in the system.  This partition is a set of subsets (represented
as an array of struct cpumask) of CPUs, pairwise disjoint, that cover
all the CPUs that must be load balanced.

The cpuset code builds a new such partition and passes it to the
scheduler sched domain setup code, to have the sched domains rebuilt
as necessary, whenever:

 - the 'cpuset.sched_load_balance' flag of a cpuset with non-empty CPUs changes,
 - or CPUs come or go from a cpuset with this flag enabled,
 - or the 'cpuset.sched_relax_domain_level' value of a cpuset with non-empty
   CPUs and with this flag enabled changes,
 - or a cpuset with non-empty CPUs and with this flag enabled is removed,
 - or a cpu is offlined/onlined.

This partition exactly defines what sched domains the scheduler should
setup - one sched domain for each element (struct cpumask) in the
partition.

The scheduler remembers the currently active sched domain partitions.
When the scheduler routine partition_sched_domains() is invoked from
the cpuset code to update these sched domains, it compares the new
partition requested with the current, and updates its sched domains,
removing the old and adding the new, for each change.
1.8 What is sched_relax_domain_level ?
--------------------------------------

Within a sched domain, the scheduler migrates tasks in two ways: periodic
load balancing on the tick, and at the time of certain schedule events.

When a task is woken up, the scheduler tries to move it to an idle CPU.
For example, if a task A running on CPU X activates another task B
on the same CPU X, and if CPU Y is X's sibling and is idle,
then the scheduler migrates task B to CPU Y so that task B can start on
CPU Y without waiting for task A on CPU X.

And if a CPU runs out of tasks in its runqueue, it tries to pull
extra tasks from other busy CPUs to help them before it goes idle.

Of course it takes some search cost to find movable tasks and/or
idle CPUs, so the scheduler might not search all CPUs in the domain
every time.  In fact, on some architectures, the search range on
these events is limited to the same socket or node as the CPU,
while the load balance on tick searches all CPUs.

For example, assume CPU Z is relatively far from CPU X.  Even if CPU Z
is idle while CPU X and its siblings are busy, the scheduler can't migrate
the woken task B from X to Z since Z is out of its search range.
As a result, task B on CPU X needs to wait for task A, or for the load
balance on the next tick.  For some applications in special situations,
waiting one tick may be too long.

The 'cpuset.sched_relax_domain_level' file allows you to request changing
this search range as you like.  This file takes an int value which,
ideally, indicates the size of the search range in levels, as follows;
otherwise the initial value -1 indicates that the cpuset has no request.

====== ===========================================================
  -1   no request. use system default or follow request of others.
   0   no search.
   1   search siblings (hyperthreads in a core).
   2   search cores in a package.
   3   search cpus in a node [= system wide on non-NUMA system]
   4   search nodes in a chunk of node [on NUMA system]
   5   search system wide [on NUMA system]
====== ===========================================================

The system default is architecture dependent.  The system default
can be changed using the relax_domain_level= boot parameter.

This file is per-cpuset and affects the sched domain to which the cpuset
belongs.  Therefore if the flag 'cpuset.sched_load_balance' of a cpuset
is disabled, then 'cpuset.sched_relax_domain_level' has no effect since
there is no sched domain belonging to the cpuset.

If multiple cpusets are overlapping and hence form a single sched
domain, the largest value among them is used.  Be careful: if one
cpuset requests 0 and the others request -1, then 0 is used.

Note that modifying this file will have both good and bad effects,
and whether it is acceptable or not depends on your situation.
Don't modify this file if you are not sure.
|
||||||
|
If your situation is:
|
||||||
|
|
||||||
|
- The migration costs between each cpu can be assumed considerably
|
||||||
|
small(for you) due to your special application's behavior or
|
||||||
|
special hardware support for CPU cache etc.
|
||||||
|
- The searching cost doesn't have impact(for you) or you can make
|
||||||
|
the searching cost enough small by managing cpuset to compact etc.
|
||||||
|
- The latency is required even it sacrifices cache hit rate etc.
|
||||||
|
then increasing 'sched_relax_domain_level' would benefit you.
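
As a minimal sketch (assuming the v1 cpuset hierarchy is mounted at
/sys/fs/cgroup/cpuset and a cpuset named "big_job" already exists;
both names are only illustrative), the request is a plain write, and
the value 5 asks for a system-wide search range on a NUMA system::

  # cd /sys/fs/cgroup/cpuset/big_job
  # cat cpuset.sched_relax_domain_level
  -1
  # /bin/echo 5 > cpuset.sched_relax_domain_level

Remember that the request only takes effect while the cpuset is
covered by a sched domain, i.e. while 'cpuset.sched_load_balance'
applies to it.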

1.9 How do I use cpusets ?
--------------------------

In order to minimize the impact of cpusets on critical kernel code,
such as the scheduler, and due to the fact that the kernel does not
support one task updating the memory placement of another task
directly, the impact on a task of changing its cpuset CPU or Memory
Node placement, or of changing to which cpuset a task is attached, is
subtle.

If a cpuset has its Memory Nodes modified, then for each task attached
to that cpuset, the next time that the kernel attempts to allocate a
page of memory for that task, the kernel will notice the change in the
task's cpuset, and update its per-task memory placement to remain
within the new cpuset's memory placement.  If the task was using
mempolicy MPOL_BIND, and the nodes to which it was bound overlap with
its new cpuset, then the task will continue to use whatever subset of
MPOL_BIND nodes are still allowed in the new cpuset.  If the task was
using MPOL_BIND and now none of its MPOL_BIND nodes are allowed in the
new cpuset, then the task will be essentially treated as if it was
MPOL_BIND bound to the new cpuset (even though its NUMA placement, as
queried by get_mempolicy(), doesn't change).  If a task is moved from
one cpuset to another, then the kernel will adjust the task's memory
placement, as above, the next time that the kernel attempts to
allocate a page of memory for that task.

If a cpuset has its 'cpuset.cpus' modified, then each task in that
cpuset will have its allowed CPU placement changed immediately.
Similarly, if a task's pid is written to another cpuset's 'tasks'
file, then its allowed CPU placement is changed immediately.  If such
a task had been bound to some subset of its cpuset using the
sched_setaffinity() call, the task will be allowed to run on any CPU
allowed in its new cpuset, negating the effect of the prior
sched_setaffinity() call.

In summary, the memory placement of a task whose cpuset is changed is
updated by the kernel, on the next allocation of a page for that task,
and the processor placement is updated immediately.

Normally, once a page is allocated (given a physical page of main
memory) then that page stays on whatever node it was allocated, so
long as it remains allocated, even if the cpuset's memory placement
policy 'cpuset.mems' subsequently changes.  If the cpuset flag file
'cpuset.memory_migrate' is set true, then when tasks are attached to
that cpuset, any pages that task had allocated to it on nodes in its
previous cpuset are migrated to the task's new cpuset.  The relative
placement of the page within the cpuset is preserved during these
migration operations if possible.  For example, if the page was on the
second valid node of the prior cpuset, then the page will be placed on
the second valid node of the new cpuset.

Also if 'cpuset.memory_migrate' is set true, then if that cpuset's
'cpuset.mems' file is modified, pages allocated to tasks in that
cpuset, that were on nodes in the previous setting of 'cpuset.mems',
will be moved to nodes in the new setting of 'mems.'  Pages that were
not in the task's prior cpuset, or in the cpuset's prior
'cpuset.mems' setting, will not be moved.

There is an exception to the above.  If hotplug functionality is used
to remove all the CPUs that are currently assigned to a cpuset, then
all the tasks in that cpuset will be moved to the nearest ancestor
with non-empty cpus.  But the moving of some (or all) tasks might fail
if the cpuset is bound to another cgroup subsystem which has some
restrictions on task attaching.  In this failing case, those tasks
will stay in the original cpuset, and the kernel will automatically
update their cpus_allowed to allow all online CPUs.  When memory
hotplug functionality for removing Memory Nodes is available, a
similar exception is expected to apply there as well.  In general,
the kernel prefers to violate cpuset placement over starving a task
that has had all its allowed CPUs or Memory Nodes taken offline.

There is a second exception to the above.  GFP_ATOMIC requests are
kernel internal allocations that must be satisfied immediately.  The
kernel may drop some requests, in rare cases even panic, if a
GFP_ATOMIC alloc fails.  If the request cannot be satisfied within the
current task's cpuset, then we relax the cpuset, and look for memory
anywhere we can find it.  It's better to violate the cpuset than
stress the kernel.

To start a new job that is to be contained within a cpuset, the steps are:

 1) mkdir /sys/fs/cgroup/cpuset
 2) mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
 3) Create the new cpuset by doing mkdir's and write's (or echo's) in
    the /sys/fs/cgroup/cpuset virtual file system.
 4) Start a task that will be the "founding father" of the new job.
 5) Attach that task to the new cpuset by writing its pid to the
    /sys/fs/cgroup/cpuset tasks file for that cpuset.
 6) fork, exec or clone the job tasks from this founding father task.

For example, the following sequence of commands will setup a cpuset
named "Charlie", containing just CPUs 2 and 3, and Memory Node 1, and
then start a subshell 'sh' in that cpuset::

  mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
  cd /sys/fs/cgroup/cpuset
  mkdir Charlie
  cd Charlie
  /bin/echo 2-3 > cpuset.cpus
  /bin/echo 1 > cpuset.mems
  /bin/echo $$ > tasks
  sh
  # The subshell 'sh' is now running in cpuset Charlie
  # The next line should display '/Charlie'
  cat /proc/self/cpuset

There are ways to query or modify cpusets:

- via the cpuset file system directly, using the various cd, mkdir,
  echo, cat, rmdir commands from the shell, or their equivalent from C.
- via the C library libcpuset.
- via the C library libcgroup.
  (http://sourceforge.net/projects/libcg/)
- via the python application cset.
  (http://code.google.com/p/cpuset/)

The sched_setaffinity calls can also be done at the shell prompt using
SGI's runon or Robert Love's taskset.  The mbind and set_mempolicy
calls can be done at the shell prompt using the numactl command
(part of Andi Kleen's numa package).
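
For instance (a hedged sketch; numactl must be installed and Memory
Node 1 must exist on the system), an mbind-style policy can be applied
from the shell with::

  numactl --membind=1 ./my_program

where ./my_program stands for any command whose allocations you want
restricted to node 1.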

2. Usage Examples and Syntax
============================

2.1 Basic Usage
---------------

Creating, modifying and using cpusets can be done through the cpuset
virtual filesystem.

To mount it, type::

  # mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset

Then under /sys/fs/cgroup/cpuset you can find a tree that corresponds
to the tree of the cpusets in the system.  For instance,
/sys/fs/cgroup/cpuset is the cpuset that holds the whole system.

If you want to create a new cpuset under /sys/fs/cgroup/cpuset::

  # cd /sys/fs/cgroup/cpuset
  # mkdir my_cpuset

Now you want to do something with this cpuset::

  # cd my_cpuset

In this directory you can find several files::

  # ls
  cgroup.clone_children  cpuset.memory_pressure
  cgroup.event_control   cpuset.memory_spread_page
  cgroup.procs           cpuset.memory_spread_slab
  cpuset.cpu_exclusive   cpuset.mems
  cpuset.cpus            cpuset.sched_load_balance
  cpuset.mem_exclusive   cpuset.sched_relax_domain_level
  cpuset.mem_hardwall    notify_on_release
  cpuset.memory_migrate  tasks
Reading them will give you information about the state of this cpuset:
the CPUs and Memory Nodes it can use, the processes that are using it,
its properties.  By writing to these files you can manipulate the
cpuset.

Set some flags::

  # /bin/echo 1 > cpuset.cpu_exclusive

Add some cpus::

  # /bin/echo 0-7 > cpuset.cpus

Add some mems::

  # /bin/echo 0-7 > cpuset.mems

Now attach your shell to this cpuset::

  # /bin/echo $$ > tasks

You can also create cpusets inside your cpuset by using mkdir in this
directory::

  # mkdir my_sub_cs

To remove a cpuset, just use rmdir::

  # rmdir my_sub_cs

This will fail if the cpuset is in use (has cpusets inside, or has
processes attached).

Note that for legacy reasons, the "cpuset" filesystem exists as a
wrapper around the cgroup filesystem.

The command::

  mount -t cpuset X /sys/fs/cgroup/cpuset

is equivalent to::

  mount -t cgroup -ocpuset,noprefix X /sys/fs/cgroup/cpuset
  echo "/sbin/cpuset_release_agent" > /sys/fs/cgroup/cpuset/release_agent

2.2 Adding/removing cpus
------------------------

This is the syntax to use when writing in the cpus or mems files
in cpuset directories::

  # /bin/echo 1-4 > cpuset.cpus      -> set cpus list to cpus 1,2,3,4
  # /bin/echo 1,2,3,4 > cpuset.cpus  -> set cpus list to cpus 1,2,3,4

To add a CPU to a cpuset, write the new list of CPUs including the
CPU to be added.  To add 6 to the above cpuset::

  # /bin/echo 1-4,6 > cpuset.cpus    -> set cpus list to cpus 1,2,3,4,6

Similarly to remove a CPU from a cpuset, write the new list of CPUs
without the CPU to be removed.

To remove all the CPUs::

  # /bin/echo "" > cpuset.cpus       -> clear cpus list

2.3 Setting flags
-----------------

The syntax is very simple::

  # /bin/echo 1 > cpuset.cpu_exclusive  -> set flag 'cpuset.cpu_exclusive'
  # /bin/echo 0 > cpuset.cpu_exclusive  -> unset flag 'cpuset.cpu_exclusive'
|
||||||
|
2.4 Attaching processes
|
||||||
|
-----------------------
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
# /bin/echo PID > tasks
|
||||||
|
|
||||||
|
Note that it is PID, not PIDs. You can only attach ONE task at a time.
|
||||||
|
If you have several tasks to attach, you have to do it one after another::
|
||||||
|
|
||||||
|
# /bin/echo PID1 > tasks
|
||||||
|
# /bin/echo PID2 > tasks
|
||||||
|
...
|
||||||
|
# /bin/echo PIDn > tasks
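
Such a sequence is easy to script; a minimal sketch (the helper name,
PID list and cpuset path are only illustrative, and each write can
still fail individually, e.g. if a task has already exited)::

  # attach_all "<space-separated pids>" <cpuset directory>
  attach_all()
  {
          for pid in $1
          do
                  /bin/echo $pid > $2/tasks
          done
  }

  attach_all "1234 1235 1236" /sys/fs/cgroup/cpuset/my_cpuset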

3. Questions
============

Q:
   what's up with this '/bin/echo' ?

A:
   bash's builtin 'echo' command does not check calls to write()
   against errors.  If you use it in the cpuset file system, you
   won't be able to tell whether a command succeeded or failed.

Q:
   When I attach processes, only the first of the line gets really
   attached !

A:
   We can only return one error code per call to write().  So you
   should also put only ONE pid.

4. Contact
==========

Web: http://www.bullopensource.org/cpuset
28
Documentation/admin-guide/cgroup-v1/index.rst
Normal file
@ -0,0 +1,28 @@
========================
Control Groups version 1
========================

.. toctree::
   :maxdepth: 1

   cgroups

   blkio-controller
   cpuacct
   cpusets
   devices
   freezer-subsystem
   hugetlb
   memcg_test
   memory
   net_cls
   net_prio
   pids
   rdma

.. only:: subproject and html

   Indices
   =======

   * :ref:`genindex`
355
Documentation/admin-guide/cgroup-v1/memcg_test.rst
Normal file
@ -0,0 +1,355 @@
=====================================================
Memory Resource Controller(Memcg) Implementation Memo
=====================================================

Last Updated: 2010/2

Base Kernel Version: based on 2.6.33-rc7-mm(candidate for 34).

Because the VM is getting complex (one of the reasons is memcg...),
memcg's behavior is complex too.  This is a document for memcg's
internal behavior.  Please note that implementation details can be
changed.

(*) Topics on the API should be in
    Documentation/admin-guide/cgroup-v1/memory.rst.

0. How to record usage ?
========================

2 objects are used.

page_cgroup ....an object per page.

	Allocated at boot or memory hotplug.  Freed at memory hot removal.

swap_cgroup ... an entry per swp_entry.

	Allocated at swapon().  Freed at swapoff().

The page_cgroup has a USED bit, and double counting against a
page_cgroup never occurs.  swap_cgroup is used only when a charged
page is swapped out.
1. Charge
=========

A page/swp_entry may be charged (usage += PAGE_SIZE) at

	mem_cgroup_try_charge()

2. Uncharge
===========

A page/swp_entry may be uncharged (usage -= PAGE_SIZE) by

	mem_cgroup_uncharge()
	  Called when a page's refcount goes down to 0.

	mem_cgroup_uncharge_swap()
	  Called when swp_entry's refcnt goes down to 0.  A charge
	  against swap disappears.

3. charge-commit-cancel
=======================

Memcg pages are charged in two steps:

	- mem_cgroup_try_charge()
	- mem_cgroup_commit_charge() or mem_cgroup_cancel_charge()

At try_charge(), there are no flags to say "this page is charged".
At this point, usage += PAGE_SIZE.

At commit(), the page is associated with the memcg.

At cancel(), simply usage -= PAGE_SIZE.

In the explanation below, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
4. Anonymous
============

An anonymous page is newly allocated at

	- page fault into a MAP_ANONYMOUS mapping.
	- Copy-On-Write.

4.1 Swap-in.
	At swap-in, the page is taken from swap-cache.  There are 2
	cases.

	(a) If the SwapCache is newly allocated and read, it has no
	    charges.
	(b) If the SwapCache has been mapped by processes, it has been
	    charged already.

4.2 Swap-out.
	At swap-out, the typical state transition is as below.

	(a) add to swap cache. (marked as SwapCache)
	    swp_entry's refcnt += 1.
	(b) fully unmapped.
	    swp_entry's refcnt += # of ptes.
	(c) write back to swap.
	(d) delete from swap cache. (remove from SwapCache)
	    swp_entry's refcnt -= 1.

Finally, at task exit,

	(e) zap_pte() is called and swp_entry's refcnt -= 1 -> 0.
5. Page Cache
=============

Page Cache is charged at

	- add_to_page_cache_locked().

The logic is very clear.  (About migration, see below.)

Note:
	__remove_from_page_cache() is called by remove_from_page_cache()
	and __remove_mapping().

6. Shmem(tmpfs) Page Cache
==========================

The best way to understand shmem's page state transition is to read
mm/shmem.c.

But a brief explanation of the behavior of memcg around shmem will be
helpful to understand the logic.

Shmem's page (just leaf page, not direct/indirect block) can be on

	- radix-tree of shmem's inode.
	- SwapCache.
	- Both on radix-tree and SwapCache.  This happens at swap-in
	  and swap-out.

It's charged when...

	- A new page is added to shmem's radix-tree.
	- A swp page is read. (move a charge from swap_cgroup to
	  page_cgroup)
7. Page Migration
=================

	mem_cgroup_migrate()

8. LRU
======

Each memcg has its own private LRU.  Now, its handling is under the
global VM's control (meaning that it's handled under the global
pgdat->lru_lock).  Almost all routines around memcg's LRU are called
by the global LRU's list management functions under pgdat->lru_lock.

A special function is mem_cgroup_isolate_pages().  This scans memcg's
private LRU and calls __isolate_lru_page() to extract a page from the
LRU.

(By __isolate_lru_page(), the page is removed from both the global
and the private LRU.)

9. Typical Tests.
=================

Tests for racy cases.

9.1 Small limit to memcg.
-------------------------

When testing racy cases, it is a good idea to set memcg's limit very
small, rather than in GB.  Many races were found in tests under xKB
or xxMB limits.

(Memory behavior under GB and memory behavior under MB show very
different situations.)
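
A rough sketch of such a set-up (the group name and the 4M value are
only illustrative; any xKB/xxMB-scale limit will do)::

	# mount -t cgroup none /cgroup -o memory
	# mkdir /cgroup/small
	# echo 4M > /cgroup/small/memory.limit_in_bytes
	# echo $$ > /cgroup/small/tasks

then run the racy workload from this shell.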

9.2 Shmem
---------

Historically, memcg's shmem handling was poor and we saw some amount
of trouble here.  This is because shmem is page cache but can be
SwapCache.  Testing with shmem/tmpfs is always a good test.

9.3 Migration
-------------

For NUMA, migration is another special case.  To make testing easy,
cpuset is useful.  The following is a sample script to do migration::
	mount -t cgroup -o cpuset none /opt/cpuset

	mkdir /opt/cpuset/01
	echo 1 > /opt/cpuset/01/cpuset.cpus
	echo 0 > /opt/cpuset/01/cpuset.mems
	echo 1 > /opt/cpuset/01/cpuset.memory_migrate
	mkdir /opt/cpuset/02
	echo 1 > /opt/cpuset/02/cpuset.cpus
	echo 1 > /opt/cpuset/02/cpuset.mems
	echo 1 > /opt/cpuset/02/cpuset.memory_migrate

In the above set-up, when you move a task from 01 to 02, page
migration from node 0 to node 1 will occur.  The following is a
script to migrate all tasks under a cpuset::

	--
	move_task()
	{
		for pid in $1
		do
			/bin/echo $pid >$2/tasks 2>/dev/null
			echo -n $pid
			echo -n " "
		done
		echo END
	}

	G1_TASK=`cat ${G1}/tasks`
	G2_TASK=`cat ${G2}/tasks`
	move_task "${G1_TASK}" ${G2} &
	--
9.4 Memory hotplug
------------------

Memory hotplug testing is one of the good tests.

To offline memory, do the following::

	# echo offline > /sys/devices/system/memory/memoryXXX/state

(XXX is the place of memory.)

This is an easy way to test page migration, too.

9.5 mkdir/rmdir
---------------

When using hierarchy, the mkdir/rmdir test should be done.
Use tests like the following::

	echo 1 >/opt/cgroup/01/memory/use_hierarchy
	mkdir /opt/cgroup/01/child_a
	mkdir /opt/cgroup/01/child_b

	set limit to 01.
	add limit to 01/child_b
	run jobs under child_a and child_b

Create/delete the following groups at random while jobs are running::

	/opt/cgroup/01/child_a/child_aa
	/opt/cgroup/01/child_b/child_bb
	/opt/cgroup/01/child_c

Running new jobs in a new group is also good.
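
The random create/delete step can be driven by a simple loop; a rough
sketch (the group names follow the list above, and the sleep interval
is arbitrary)::

	while true
	do
		mkdir /opt/cgroup/01/child_a/child_aa 2>/dev/null
		mkdir /opt/cgroup/01/child_c 2>/dev/null
		sleep 1
		rmdir /opt/cgroup/01/child_a/child_aa 2>/dev/null
		rmdir /opt/cgroup/01/child_c 2>/dev/null
	done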

9.6 Mount with other subsystems
-------------------------------

Mounting with other subsystems is a good test because there are races
and lock dependencies with other cgroup subsystems.

example::

	# mount -t cgroup none /cgroup -o cpuset,memory,cpu,devices

and do task move, mkdir, rmdir etc. under this.

9.7 swapoff
-----------

Besides the management of swap being one of the complicated parts of
memcg, the call path of swap-in at swapoff is not the same as the
usual swap-in path.  It is worth being tested explicitly.

For example, a test like the following is good:

(Shell-A)::

	# mount -t cgroup none /cgroup -o memory
	# mkdir /cgroup/test
	# echo 40M > /cgroup/test/memory.limit_in_bytes
	# echo 0 > /cgroup/test/tasks

Run a malloc(100M) program under this.  You'll see 60M of swaps.

(Shell-B)::

	# move all tasks in /cgroup/test to /cgroup
	# /sbin/swapoff -a
	# rmdir /cgroup/test
	# kill malloc task.

Of course, the tmpfs vs. swapoff test should be done, too.
9.8 OOM-Killer
--------------

Out-of-memory caused by memcg's limit will kill tasks under the
memcg.  When hierarchy is used, a task under the hierarchy will be
killed by the kernel.

In this case, panic_on_oom shouldn't be invoked and tasks in other
groups shouldn't be killed.

It's not difficult to cause OOM under memcg, as follows.

Case A) when you can swapoff::

	#swapoff -a
	#echo 50M > /memory.limit_in_bytes

run 51M of malloc

Case B) when you use mem+swap limitation::

	#echo 50M > memory.limit_in_bytes
	#echo 50M > memory.memsw.limit_in_bytes

run 51M of malloc
9.9 Move charges at task migration
----------------------------------

Charges associated with a task can be moved along with task migration.

(Shell-A)::

	#mkdir /cgroup/A
	#echo $$ >/cgroup/A/tasks

run some programs which use some amount of memory in /cgroup/A.

(Shell-B)::

	#mkdir /cgroup/B
	#echo 1 >/cgroup/B/memory.move_charge_at_immigrate
	#echo "pid of the program running in group A" >/cgroup/B/tasks

You can see that charges have been moved by reading
``*.usage_in_bytes`` or memory.stat of both A and B.

See 8.2 of Documentation/admin-guide/cgroup-v1/memory.rst for what
value should be written to move_charge_at_immigrate.
9.10 Memory thresholds
----------------------

The memory controller implements memory thresholds using the cgroups
notification API.  You can use tools/cgroup/cgroup_event_listener.c
to test it.

(Shell-A) Create cgroup and run event listener::

	# mkdir /cgroup/A
	# ./cgroup_event_listener /cgroup/A/memory.usage_in_bytes 5M

(Shell-B) Add task to cgroup and try to allocate and free memory::

	# echo $$ >/cgroup/A/tasks
	# a="$(dd if=/dev/zero bs=1M count=10)"
	# a=

You will see a message from cgroup_event_listener every time you
cross the thresholds.

Use /cgroup/A/memory.memsw.usage_in_bytes to test memsw thresholds.

It's a good idea to test the root cgroup as well.
@ -9,7 +9,7 @@ This is the authoritative documentation on the design, interface and
|
|||||||
conventions of cgroup v2. It describes all userland-visible aspects
|
conventions of cgroup v2. It describes all userland-visible aspects
|
||||||
of cgroup including core and specific controller behaviors. All
|
of cgroup including core and specific controller behaviors. All
|
||||||
future changes must be reflected in this document. Documentation for
|
future changes must be reflected in this document. Documentation for
|
||||||
v1 is available under Documentation/cgroup-v1/.
|
v1 is available under Documentation/admin-guide/cgroup-v1/.
|
||||||
|
|
||||||
.. CONTENTS
|
.. CONTENTS
|
||||||
|
|
||||||
@ -1014,7 +1014,7 @@ All time durations are in microseconds.
|
|||||||
A read-only nested-key file which exists on non-root cgroups.
|
A read-only nested-key file which exists on non-root cgroups.
|
||||||
|
|
||||||
Shows pressure stall information for CPU. See
|
Shows pressure stall information for CPU. See
|
||||||
Documentation/accounting/psi.txt for details.
|
Documentation/accounting/psi.rst for details.
|
||||||
|
|
||||||
|
|
||||||
Memory
|
Memory
|
||||||
@ -1355,7 +1355,7 @@ PAGE_SIZE multiple when read back.
|
|||||||
A read-only nested-key file which exists on non-root cgroups.
|
A read-only nested-key file which exists on non-root cgroups.
|
||||||
|
|
||||||
Shows pressure stall information for memory. See
|
Shows pressure stall information for memory. See
|
||||||
Documentation/accounting/psi.txt for details.
|
Documentation/accounting/psi.rst for details.
|
||||||
|
|
||||||
|
|
||||||
Usage Guidelines
|
Usage Guidelines
|
||||||
@ -1498,7 +1498,7 @@ IO Interface Files
|
|||||||
A read-only nested-key file which exists on non-root cgroups.
|
A read-only nested-key file which exists on non-root cgroups.
|
||||||
|
|
||||||
Shows pressure stall information for IO. See
|
Shows pressure stall information for IO. See
|
||||||
Documentation/accounting/psi.txt for details.
|
Documentation/accounting/psi.rst for details.
|
||||||
|
|
||||||
|
|
||||||
Writeback
|
Writeback
|
||||||
@ -2124,7 +2124,7 @@ following two functions.
|
|||||||
a queue (device) has been associated with the bio and
|
a queue (device) has been associated with the bio and
|
||||||
before submission.
|
before submission.
|
||||||
|
|
||||||
wbc_account_io(@wbc, @page, @bytes)
|
wbc_account_cgroup_owner(@wbc, @page, @bytes)
|
||||||
Should be called for each data segment being written out.
|
Should be called for each data segment being written out.
|
||||||
While this function doesn't care exactly when it's called
|
While this function doesn't care exactly when it's called
|
||||||
during the writeback session, it's the easiest and most
|
during the writeback session, it's the easiest and most
|
||||||
|
Some files were not shown because too many files have changed in this diff